CVS difference for ai05s/ai05-0137-1.txt

Differences between 1.1 and version 1.2
Log of other versions for file ai05s/ai05-0137-1.txt

--- ai05s/ai05-0137-1.txt	2009/02/13 06:09:31	1.1
+++ ai05s/ai05-0137-1.txt	2009/03/12 05:12:15	1.2
@@ -1,4 +1,4 @@
-!standard A.4.10                                       09-02-12  AI05-0137-1/01
+!standard A.4.10                                       09-03-03  AI05-0137-1/02
 !class Amendment 09-02-12
 !status work item 09-02-12
 !status received 09-02-11
@@ -45,6 +45,12 @@
    subtype Short_Encoding is Encoding_Scheme range UTF_8  .. UTF_16LE;
    subtype Long_Encoding  is Encoding_Scheme range UTF_16 .. UTF_16;
+   BOM_8    : constant String := Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
+   BOM_16BE : constant String := Character'Val (16#FE#) & Character'Val (16#FF#);
+   BOM_16LE : constant String := Character'Val (16#FF#) & Character'Val (16#FE#);
+   BOM_16   : constant Wide_String := (1 => Wide_Character'Val (16#FEFF#));
    function Encode (Item   : in Wide_String;      
                     Scheme : in Short_Encoding := UTF_8)
             return String;
@@ -95,6 +101,15 @@
 to a valid encoding (including overlong encoding for UTF_8). The lower
 bound of the returned string shall be 1.
+For each of the Decode functions whose Scheme parameter is of the
+Short_Encoding subtype, if the Item parameter starts with one of the
+BOM sequences then:
+- If this sequence identifies the same encoding as specified by the
+  Scheme parameter, the sequence is ignored;
+- Otherwise, Encoding_Error is raised.
+The Encode functions shall not put BOM sequences in the result.
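The BOM rule described above for the Decode functions can be sketched as follows (in Python, for illustration only; the scheme names and the Encoding_Error exception mirror the proposal but are hypothetical here):

```python
# BOM byte sequences of the proposal's Short_Encoding schemes.
BOMS = {
    "UTF_8":    b"\xEF\xBB\xBF",
    "UTF_16BE": b"\xFE\xFF",
    "UTF_16LE": b"\xFF\xFE",
}

class Encoding_Error(Exception):
    pass

def strip_bom(item: bytes, scheme: str) -> bytes:
    """Proposed rule: a BOM matching Scheme is ignored;
    any other recognized BOM raises Encoding_Error."""
    for name, bom in BOMS.items():
        if item.startswith(bom):
            if name == scheme:
                return item[len(bom):]   # matching BOM: ignore it
            raise Encoding_Error(f"BOM for {name}, expected {scheme}")
    return item                          # no BOM: decode as-is
```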
 Implementation permission
 An implementation is allowed to provide additional enumeration
@@ -103,13 +118,116 @@
 belong to the Short_Encoding subtype and literals corresponding to 16
 bits encoding schemes shall belong to the Long_Encoding subtype.
+If a BOM is defined for an implementation defined encoding scheme, a
+corresponding constant shall be added.
+Reminder of the issues:
+A character set is a set of abstract characters. An encoding assigns
+an integer value to each character; this value is called the
+code-point of the character. Normally, a character string should be
+represented as a sequence of code-points; however, this would waste a
+lot of space, since ISO 10646 defines 32 bit code-points. An encoding
+scheme is a representation of a string of characters, using a more
+economical representation. Typically, an encoding scheme uses a
+sequence of integer values, where each code-point is represented by
+one or several consecutive values. UTF-8 is an encoding scheme that uses 8
+bit values. In some cases, UTF-8 defines several possible encodings
+for a code-point; in this case, the shortest one should be used; other
+encodings are called overlong encodings. UTF-16 uses 16 bit
+values. UTF-32 uses 32 bit values, which is of little interest since
+nothing is gained compared to UCS-4 (the raw 32 bit encoding).
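The overlong encodings mentioned above can be demonstrated concretely (a Python sketch; the helper name is hypothetical). The code point 16#2F# ('/') has the one-byte encoding 16#2F#, but its bits also fit the two-byte UTF-8 pattern, giving the overlong form 16#C0# 16#AF#, which a strict decoder must reject:

```python
def utf8_two_byte(cp: int) -> bytes:
    """Pack a code point into the two-byte UTF-8 form 110xxxxx 10xxxxxx
    (only legitimate for 16#80# .. 16#7FF#; below that it is overlong)."""
    assert cp < 0x800
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

shortest = bytes([0x2F])           # the valid one-byte encoding of '/'
overlong = utf8_two_byte(0x2F)     # b"\xC0\xAF": overlong, invalid
```

Python's strict "utf-8" codec raises an error on the overlong form, which is the behavior the proposal's Encoding_Error describes.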
+There is no problem when using a String to encode UTF-8, or a
+Wide_String to encode UTF-16. However, it is sometimes useful to
+encode/decode a UTF-16 (or even UTF-32) encoded text into/from a
+String; in that case, characters must be paired to form 16 bit values
+(or 32 bit values). This can be done in two ways, Big-Endian (high
+order character first) or Little-Endian (low order character first). A
+special value, called BOM (Byte Order Mark, 16#FEFF#), can be used at
+the beginning of an encoded text (with two extra zero bytes for
+UTF-32). The BOM corresponds to no code-point, and is discarded when
+decoding, but it is used to recognize whether a stream of bytes is
+Big-Endian or Little-Endian UTF-16 or UTF-32. By extension, the
+sequence 16#EF# 16#BB# 16#BF# can be used as BOM to identify UTF-8
+text (although there is no byte order issue in UTF-8; actually, use of
+BOM for UTF-8 is discouraged).
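The pairing of characters into 16 bit values and the BOM-based order detection described above can be sketched as follows (Python, for illustration; function names are hypothetical):

```python
def pair_be(v: int) -> bytes:
    return bytes([v >> 8, v & 0xFF])    # high order byte first

def pair_le(v: int) -> bytes:
    return bytes([v & 0xFF, v >> 8])    # low order byte first

def byte_order(stream: bytes) -> str:
    """Guess the order of a UTF-16 byte stream from a leading BOM."""
    if stream.startswith(b"\xFE\xFF"):
        return "big-endian"
    if stream.startswith(b"\xFF\xFE"):
        return "little-endian"
    return "unknown"
```

Pairing the BOM code point 16#FEFF# itself yields the two distinguishing byte sequences: 16#FE# 16#FF# for Big-Endian and 16#FF# 16#FE# for Little-Endian.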
+Note that UTF-8 encoding could be used for file names that include
+characters that are not in ASCII. This package would allow adding an
+Implementation Advice (to Text_IO, Sequential_IO, and so on) to the
+effect that it is recommended to support file names encoded in UTF-8.
+Implementation choices:
+Strictly speaking, an encoded text should be an array of bytes, not of
+(Wide_)Characters. This proposal uses (Wide_)String, but the encoding
+is defined in terms of position values of characters rather than
+characters themselves. It could be argued that it should be defined in
+terms of internal representation of characters, but we know that they
+are the same as the position values for (Wide_)Character.
+We chose to have a parameter to specify the encoding rather than
+providing one function for each encoding scheme, because it makes
+things easier for the user when the encoding scheme is recognized
+dynamically. It also makes it easier to provide implementation-defined
+encoding schemes.
+There are many other possible encoding schemes, including UTF-EBCDIC,
+Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most
+useful ones, while leaving the possibility (through an implementation
+permission) to provide others.
+When reading a file, a BOM can be expected at the start of the first
+line of the file, but not of subsequent lines. The proposed handling
+of BOM assumes the following pattern:
+1) Read the first line. Check (with the help of provided constants) if
+   the first line starts with a BOM. If yes, initialize the encoding
+   scheme accordingly; if no, choose a default encoding scheme.
+2) Decode all lines (including the first one) with the chosen encoding
+   scheme. Since the BOM is ignored by Decode functions, it is not
+   necessary to slice the first line specially.
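The two-step pattern above can be sketched as follows (Python, using its codecs module as a stand-in for the proposed Decode functions; the scheme names are hypothetical):

```python
import codecs

SCHEME_OF_BOM = [
    (codecs.BOM_UTF8,     "UTF_8"),
    (codecs.BOM_UTF16_BE, "UTF_16BE"),
    (codecs.BOM_UTF16_LE, "UTF_16LE"),
]

def choose_scheme(first_line: bytes, default: str = "UTF_8") -> str:
    # Step 1: check the first line (only) for a BOM.
    for bom, scheme in SCHEME_OF_BOM:
        if first_line.startswith(bom):
            return scheme
    return default

def decode_lines(lines: list[bytes]) -> list[str]:
    # Step 2: decode every line, including the first, with the chosen
    # scheme; a BOM matching that scheme is simply skipped, so the
    # first line needs no special slicing by the caller.
    scheme = choose_scheme(lines[0]) if lines else "UTF_8"
    codec = {"UTF_8": "utf-8", "UTF_16BE": "utf-16-be",
             "UTF_16LE": "utf-16-le"}[scheme]
    out = []
    for i, line in enumerate(lines):
        if i == 0:
            for bom, name in SCHEME_OF_BOM:
                if name == scheme and line.startswith(bom):
                    line = line[len(bom):]
        out.append(line.decode(codec))
    return out
```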
+For encoding, it does not seem useful to have the BOM handled by the
+encoding functions, since it is easy to catenate the appropriate BOM
+constant in front of the result.
+Open issues:
+Would it be better to have longer names (e.g., UTF_16_Big_Endian) for
+encoding schemes?
+Should UTF_32BE and UTF_32LE be supported (as Short_Encoding, together
+with the corresponding BOM)?
+Should UTF_32 be supported (and Long_Long_Encoding added, together
+with the corresponding BOM)?
+Is it useful to support UTF_8 BOM, since its use is discouraged?
+Should overlong UTF_8 sequences be accepted on input, rather than
+raising Encoding_Error?
+Alternative designs: 
+Short_Encoding and Long_Encoding could be different types rather than
+subtypes of a same type. 
+Arrays of Unsigned_8 or Unsigned_16 could be used in place of
+(Wide_)String. That would enforce strong typing to differentiate
+between an Ada String and an encoded string. OTOH, it is likely to be
+more of a burden than a help to most casual users. Moreover, it would
+not allow keeping ASIS program text as a Wide_String.
-Strictly speaking, an encoded text should be an array of bytes, not a
-String. That would enforce strong typing to differentiate between an
-Ada String and an encoded string. OTOH, it is likely to be more of a
-burden than a help to most casual users. Moreover, it would not allow
-to keep ASIS program text as a Wide_String.
+Existing similar packages:
 Similar conversion functions are provided as part of xmlada and qtada.
 xmlada provides much more sophisticated services, such as supporting
@@ -117,8 +235,11 @@
 etc. However, it seems reasonable to provide only basic
 functionalities in the standard.
-Should UTF_32BE and UTF_32LE be supported (as Short_Encoding)?
-Should UTF_32 be supported (and Long_Long_Encoding added)?
+GNAT provides the package System.WCh_Con, but it converts only
+individual characters (not strings), does not support UTF-8, and is
+provided by generics that require a user-provided input/output formal
+function. Although more general, this solution would be too
+heavy-weight for the casual user.

Questions? Ask the ACAA Technical Agent