CVS difference for ai05s/ai05-0137-1.txt
--- ai05s/ai05-0137-1.txt 2009/02/13 06:09:31 1.1
+++ ai05s/ai05-0137-1.txt 2009/03/12 05:12:15 1.2
@@ -1,4 +1,4 @@
-!standard A.4.10 09-02-12 AI05-0137-1/01
+!standard A.4.10 09-03-03 AI05-0137-1/02
!class Amendment 09-02-12
!status work item 09-02-12
!status received 09-02-11
@@ -45,6 +45,12 @@
subtype Short_Encoding is Encoding_Scheme range UTF_8 .. UTF_16LE;
subtype Long_Encoding is Encoding_Scheme range UTF_16 .. UTF_16;
+ BOM_8 : constant String := Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
+ BOM_16BE : constant String := Character'Val (16#FE#) & Character'Val (16#FF#);
+ BOM_16LE : constant String := Character'Val (16#FF#) & Character'Val (16#FE#);
+
+ BOM_16 : constant Wide_String := (1 => Wide_Character'Val (16#FEFF#));
+
function Encode (Item : in Wide_String;
Scheme : in Short_Encoding := UTF_8)
return String;
@@ -95,6 +101,15 @@
to a valid encoding (including overlong encoding for UTF_8). The lower
bound of the returned string shall be 1.
+For each of the Decode functions whose Scheme parameter is of the
+Short_Encoding subtype, if the Item parameter starts with one of the
+BOM sequences then:
+- If this sequence identifies the same encoding as specified by the
+ Scheme parameter, the sequence is ignored;
+- Otherwise, Encoding_Error is raised.
+
+The Encode functions shall not put BOM sequences in the result.
+
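The BOM rule for Decode can be illustrated with a minimal sketch. It is
not the proposed implementation: the type, exception, and constants are
redeclared locally so that the example compiles on its own, and
Starts_With and Skip_BOM are hypothetical helpers that are not part of
the proposed package.

```ada
with Ada.Text_IO;

procedure Decode_BOM_Demo is
   --  Local stand-ins mirroring the draft declarations (assumption:
   --  only the three byte-oriented schemes are modelled here).
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   Encoding_Error : exception;

   BOM_8    : constant String :=
     Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
   BOM_16BE : constant String :=
     Character'Val (16#FE#) & Character'Val (16#FF#);
   BOM_16LE : constant String :=
     Character'Val (16#FF#) & Character'Val (16#FE#);

   --  Hypothetical helper: True if Item begins with Prefix.
   function Starts_With (Item, Prefix : String) return Boolean is
   begin
      return Item'Length >= Prefix'Length
        and then Item (Item'First .. Item'First + Prefix'Length - 1) = Prefix;
   end Starts_With;

   --  Hypothetical helper: index of the first character after a leading
   --  BOM. A BOM matching Scheme is skipped; a BOM of another scheme
   --  raises Encoding_Error; with no BOM, Item'First is returned.
   function Skip_BOM
     (Item : String; Scheme : Encoding_Scheme) return Positive
   is
      Found  : Encoding_Scheme;
      Length : Natural;
   begin
      if Starts_With (Item, BOM_8) then
         Found := UTF_8;     Length := BOM_8'Length;
      elsif Starts_With (Item, BOM_16BE) then
         Found := UTF_16BE;  Length := BOM_16BE'Length;
      elsif Starts_With (Item, BOM_16LE) then
         Found := UTF_16LE;  Length := BOM_16LE'Length;
      else
         return Item'First;
      end if;

      if Found /= Scheme then
         raise Encoding_Error;
      end if;
      return Item'First + Length;
   end Skip_BOM;

   Text : constant String := BOM_8 & "abc";
begin
   --  Matching BOM: it is ignored, so this prints "abc".
   Ada.Text_IO.Put_Line (Text (Skip_BOM (Text, UTF_8) .. Text'Last));

   --  Mismatched BOM: Encoding_Error is raised and reported.
   begin
      Ada.Text_IO.Put_Line (Text (Skip_BOM (Text, UTF_16LE) .. Text'Last));
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error");
   end;
end Decode_BOM_Demo;
```

A real Decode would go on to convert the remaining characters; only the
BOM check is shown here.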
Implementation permission
An implementation is allowed to provide additional enumeration
@@ -103,13 +118,116 @@
belong to the Short_Encoding subtype and literals corresponding to 16
bits encoding schemes shall belong to the Long_Encoding subtype.
+If a BOM is defined for an implementation-defined encoding scheme, a
+corresponding constant shall be added.
+
!discussion
+
+Reminder of the issues:
+
+A character set is a set of abstract characters. An encoding assigns
+an integer value to each character; this value is called the
+code-point of the character. Normally, a character string should be
+represented as a sequence of code-points; however, it would waste a
+lot of space, since ISO 10646 defines 32 bit code-points. An encoding
+scheme is a representation of a string of characters, using a more
+economical representation. Typically, an encoding scheme uses a sequence
+of integer values, where each code-point is represented by one or
+several consecutive values. UTF-8 is an encoding scheme that uses 8
+bit values. In some cases, UTF-8 defines several possible encodings
+for a code-point; in this case, the shortest one should be used; other
+encodings are called overlong encodings. UTF-16 uses 16 bit
+values. UTF-32 uses 32 bit values, which is of little interest since
+nothing is gained compared to UCS-4 (raw encoding).
+
+There is no problem when using a String to encode UTF-8, or a
+Wide_String to encode UTF-16. However, it is sometimes useful to
+encode/decode a UTF-16 (or even UTF-32) encoded text into/from a
+String; in that case, characters must be paired to form 16 bit values
+(or 32 bit values). This can be done in two ways, Big-Endian (high
+order character first) or Little-Endian (low order character first). A
+special value, called BOM (Byte Order Mark, 16#FEFF#), can be used at
+the beginning of an encoded text (with 4 leading zeroes for
+UTF-32). The BOM corresponds to no code-point, and is discarded when
+decoding, but it is used to recognize whether a stream of bytes is
+Big-Endian or Little-Endian UTF-16 or UTF-32. By extension, the
+sequence 16#EF# 16#BB# 16#BF# can be used as BOM to identify UTF-8
+text (although there is no byte order issue in UTF-8; actually, use of
+BOM for UTF-8 is discouraged).
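
As an illustration of the pairing described above, the following sketch
splits one 16 bit value into two characters in either byte order and
applies it to the BOM value. Byte_Order, To_Pair, and Show are
hypothetical names introduced for this example only; they are not part
of the proposed package.

```ada
with Ada.Text_IO;

procedure Byte_Order_Demo is
   --  Hypothetical type, not part of the proposed package.
   type Byte_Order is (Big_Endian, Little_Endian);

   --  Split one 16 bit value into two Characters in the requested order.
   function To_Pair
     (Value : Wide_Character; Order : Byte_Order) return String
   is
      Code : constant Natural   := Wide_Character'Pos (Value);
      High : constant Character := Character'Val (Code / 256);
      Low  : constant Character := Character'Val (Code mod 256);
   begin
      case Order is
         when Big_Endian    => return (1 => High, 2 => Low);
         when Little_Endian => return (1 => Low, 2 => High);
      end case;
   end To_Pair;

   BOM : constant Wide_Character := Wide_Character'Val (16#FEFF#);

   --  Print the position values of the two characters of Pair.
   procedure Show (Label : String; Pair : String) is
   begin
      Ada.Text_IO.Put_Line
        (Label
         & Natural'Image (Character'Pos (Pair (Pair'First)))
         & Natural'Image (Character'Pos (Pair (Pair'Last))));
   end Show;
begin
   Show ("BE:", To_Pair (BOM, Big_Endian));     --  BE: 254 255
   Show ("LE:", To_Pair (BOM, Little_Endian));  --  LE: 255 254
end Byte_Order_Demo;
```

The two results are the position values 16#FE# 16#FF# and 16#FF# 16#FE#,
which is how a reader of the byte stream can tell the two orders apart.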
+
+Note that UTF-8 encoding could be used for file names that include
+characters that are not in ASCII. This package would allow adding an
+Implementation Advice (to Text_IO, Sequential_IO, and so on) to the
+effect that it is recommended to support file names encoded in UTF-8.
+
+
+Implementation choices:
+
+Strictly speaking, an encoded text should be an array of bytes, not of
+(wide_)characters. This proposal uses (wide_)string, but the encoding
+is defined in terms of position values of characters rather than
+characters themselves. It could be argued that it should be defined in
+terms of internal representation of characters, but we know that they
+are the same as the position values for (Wide_)Character.
+
+We chose to have a parameter to specify the encoding rather than
+providing one function for each encoding scheme, because it makes
+things easier for the user when the encoding scheme is recognized
+dynamically. It also makes it easier to provide implementation-defined
+encoding schemes.
+
+There are many other possible encoding schemes, including UTF-EBCDIC,
+Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most
+useful ones, while leaving the possibility (through an implementation
+permission) to provide others.
+
+When reading a file, a BOM can be expected as starting the first line
+of the file, but not subsequent lines. The proposed handling of BOM
+assumes the following pattern:
+
+1) Read the first line. Check (with the help of provided constants) if
+ the first line starts with a BOM. If yes, initialize the encoding
+ scheme accordingly; if no, choose a default encoding scheme.
+
+2) Decode all lines (including the first one) with the chosen encoding
+ scheme. Since the BOM is ignored by Decode functions, it is not
+ necessary to slice the first line specially.
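
Step 1 of this pattern could be sketched as follows. This is only an
illustration under stated assumptions: the enumeration and constants are
redeclared locally so that the example is self-contained, and
Starts_With and Detect_Scheme are hypothetical helpers, not operations
proposed by this AI.

```ada
with Ada.Text_IO;

procedure BOM_Detect_Demo is
   --  Local stand-ins mirroring the draft declarations, so that the
   --  sketch compiles without the proposed package.
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

   BOM_8    : constant String :=
     Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
   BOM_16BE : constant String :=
     Character'Val (16#FE#) & Character'Val (16#FF#);
   BOM_16LE : constant String :=
     Character'Val (16#FF#) & Character'Val (16#FE#);

   --  Hypothetical helper: True if Line begins with Prefix.
   function Starts_With (Line, Prefix : String) return Boolean is
   begin
      return Line'Length >= Prefix'Length
        and then Line (Line'First .. Line'First + Prefix'Length - 1) = Prefix;
   end Starts_With;

   --  Hypothetical helper: choose the scheme named by a leading BOM,
   --  or fall back to Default when the line carries no BOM.
   function Detect_Scheme
     (First_Line : String;
      Default    : Encoding_Scheme := UTF_8) return Encoding_Scheme is
   begin
      if Starts_With (First_Line, BOM_8) then
         return UTF_8;
      elsif Starts_With (First_Line, BOM_16BE) then
         return UTF_16BE;
      elsif Starts_With (First_Line, BOM_16LE) then
         return UTF_16LE;
      else
         return Default;
      end if;
   end Detect_Scheme;

   --  'A' in little-endian UTF-16, preceded by the matching BOM.
   Sample : constant String := BOM_16LE & 'A' & Character'Val (0);
begin
   --  Prints UTF_16LE, then UTF_8 (the default, since there is no BOM).
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Detect_Scheme (Sample)));
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Detect_Scheme ("plain")));
end BOM_Detect_Demo;
```

The scheme found this way would then be passed unchanged to Decode for
every line, including the first one.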
+
+For encoding, it does not seem useful to have the BOM handled by the
+encoding functions, since it is easy to catenate the appropriate
+constant.
+
+
+Open issues:
+
+Would it be better to have longer names (e.g. UTF_16_Big_Endian) for
+encoding schemes?
+
+Should UTF_32BE and UTF_32LE be supported (as Short_Encoding, together
+with the corresponding BOM)?
+
+Should UTF_32 be supported (and Long_Long_Encoding added, together
+with the corresponding BOM)?
+
+Is it useful to support UTF_8 BOM, since its use is discouraged?
+
+Should overlong UTF_8 sequences be accepted on input, rather than
+raising Encoding_Error?
+
+
+Alternative designs:
+
+Short_Encoding and Long_Encoding could be different types rather than
+subtypes of the same type.
+
+Arrays of Unsigned_8 or Unsigned_16 could be used in place of
+(Wide_)String. That would enforce strong typing to differentiate
+between an Ada String and an encoded string. OTOH, it is likely to be
+more of a burden than a help to most casual users. Moreover, it would
+not allow ASIS program text to be kept as a Wide_String.
+
-Strictly speaking, an encoded text should be an array of bytes, not a
-String. That would enforce strong typing to differentiate between an
-Ada String and an encoded string. OTOH, it is likely to be more of a
-burden than a help to most casual users. Moreover, it would not allow
-to keep ASIS program text as a Wide_String.
+Existing similar packages:
Similar conversion functions are provided as part of xmlada and qtada.
xmlada provides much more sophisticated services, such as supporting
@@ -117,8 +235,11 @@
etc. However, it seems reasonable to provide only basic
functionalities in the standard.
-Should UTF_32BE and UTF_32LE be supported (as Short_Encoding)?
-Should UTF_32 be supported (and Long_Long_Encoding added)?
+GNAT provides the package System.WCh_Con, but it converts only
+individual characters (not strings), does not support UTF-8, and is
+provided by generics that require a user-provided input/output formal
+function. Although more general, this solution would be too
+heavy-weight for the casual user.
!appendix