CVS difference for ai05s/ai05-0137-1.txt
--- ai05s/ai05-0137-1.txt 2009/03/12 05:12:15 1.2
+++ ai05s/ai05-0137-1.txt 2009/07/11 03:06:22 1.3
@@ -1,10 +1,12 @@
-!standard A.4.10 09-03-03 AI05-0137-1/02
+!standard A.4.11 09-06-30 AI05-0137-1/03
!class Amendment 09-02-12
+!status Amendment 201Z 09-06-30
+!status ARG Approved 7-0-0 09-06-13
!status work item 09-02-12
!status received 09-02-11
!priority Medium
!difficulty Easy
-!subject New conversion package
+!subject String encoding package
!summary
@@ -28,24 +30,26 @@
!wording
-The following clause is added as A.4.10:
+The following clause is added as A.4.11:
-A.4.10 String encoding
-The language-defined package Strings.Encoding provides facilities for
+A.4.11 String encoding
+
+The language-defined package Strings.UTF_Encoding provides facilities for
encoding and decoding strings in various character encoding schemes.
Static Semantics
-The library package Strings.Encoding has the following declaration:
+The library package Strings.UTF_Encoding has the following declaration:
-package Ada.Strings.Encoding is
- pragma Pure (Encoding);
+package Ada.Strings.UTF_Encoding is
+ pragma Pure (UTF_Encoding);
- type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE, UTF_16);
+ type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);
subtype Short_Encoding is Encoding_Scheme range UTF_8 .. UTF_16LE;
subtype Long_Encoding is Encoding_Scheme range UTF_16 .. UTF_16;
- BOM_8 : constant String := Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
+ BOM_8 : constant String := Character'Val (16#EF#) & Character'Val (16#BB#) &
+ Character'Val (16#BF#);
BOM_16BE : constant String := Character'Val (16#FE#) & Character'Val (16#FF#);
BOM_16LE : constant String := Character'Val (16#FF#) & Character'Val (16#FE#);
@@ -70,17 +74,21 @@
function Decode (Item : in Wide_String;
Scheme : in Long_Encoding := UTF_16)
return Wide_Wide_String;
+
+ function Encoding (Item : in String) return Encoding_Scheme;
+
+ function Encoding (Item : in Wide_String) return Encoding_Scheme;
Encoding_Error : exception;
-end Ada.Strings.Encoding;
-The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds
-to the encoding scheme defined by RFC 3629; UTF_16BE corresponds to
-the UTF-16 encoding scheme of Unicode on 8 bits, big endian; UTF_16LE
-corresponds to the UTF-16 encoding scheme of Unicode on 8 bits, little
-endian; and UTF_16 corresponds to the UTF-16 encoding scheme of
-Unicode on 16 bits.
+end Ada.Strings.UTF_Encoding;
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8
+encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds to the
+UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 16 bits;
+UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits, big endian;
+and UTF_16LE corresponds to the UTF-16 encoding scheme stored in 8 bits, little endian.
+
The subtype Short_Encoding covers the values of Encoding_Scheme for
8-bit encoding schemes, and the subtype Long_Encoding covers the values of
Encoding_Scheme for 16-bit encoding schemes.
@@ -90,7 +98,7 @@
Wide_String) whose characters have position values that correspond to
the encoding of the Item parameter according to the encoding scheme
specified by the Scheme parameter. For UTF_8, no overlong encoding
-shall be returned. The lower bound of the returned string shall be 1.
+is returned. The lower bound of the returned string is 1.
Each of the Decode functions takes a String (respectively Wide_String)
Item parameter which is assumed to contain characters whose position
@@ -99,31 +107,36 @@
Wide_String (respectively Wide_Wide_String). The exception
Encoding_Error is propagated if the input string does not correspond
to a valid encoding (including overlong encoding for UTF_8). The lower
-bound of the returned string shall be 1.
+bound of the returned string is 1.
For each of the Decode functions whose Scheme parameter is of the
Short_Encoding subtype, if the Item parameter starts with one of the
BOM sequences then:
- If this sequence identifies the same encoding as specified by the
Scheme parameter, the sequence is ignored;
-- Otherwise, Encoding_Error is raised
+- Otherwise, Encoding_Error is raised.
+
+The Encode functions do not put BOM sequences in the result.
-The Encode functions shall not put BOM sequences in the result.
+For each of the Encoding functions, if the initial characters of Item match
+a BOM, the corresponding encoding is returned; otherwise, UTF_None is returned.
-Implementation permission
+Implementation advice
-An implementation is allowed to provide additional enumeration
-literals to the type Encoding_Scheme for other character encoding
-schemes. Literals corresponding to 8 bits encoding schemes shall
-belong to the Short_Encoding subtype and literals corresponding to 16
-bits encoding schemes shall belong to the Long_Encoding subtype.
+If an implementation supports other encoding schemes, another similar child
+of Ada.Strings should be defined.
-If a BOM is defined for an implementation defined encoding scheme, a
-corresponding constant shall be added.
+Note: A BOM can be included in a file or other entity to indicate the encoding;
+it is skipped when decoding. An explicit concatenation is needed to include a BOM
+in an encoded entity (it is not added automatically). Typically, only the first
+line of a file or other entity will contain a BOM. When decoding, the appropriate
+Encoding function can be used on the first line to determine the encoding; that
+encoding will then be used in subsequent calls to Decode to convert all of the
+lines to an internal format.
!discussion
-Reminder of the issues:
+Background on character encoding:
A character set is a set of abstract characters. An encoding assigns
an integer value to each character; this value is called the
@@ -179,13 +192,13 @@
There are many other possible encoding schemes, including UTF-EBCDIC,
Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most
useful ones, while leaving the possibility (through implementation
-permission) to provide others.
+advice) to provide others.
When reading a file, a BOM can be expected as starting the first line
of the file, but not subsequent lines. The proposed handling of BOM
assumes the following pattern:
-1) Read the first line. Check (with the help of provided constants) if
+1) Read the first line. Check (with function Encoding) if
the first line starts with a BOM. If yes, initialize the encoding
scheme accordingly; if no, choose a default encoding scheme.
@@ -198,23 +211,6 @@
constant.
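
Illustration only (not part of the proposed wording): step 1 of the pattern
above could be sketched as follows. So that the sketch is self-contained, the
Encoding_Scheme type and the BOM constants are local copies of the draft
declarations, and the String version of Encoding is a hypothetical body
written from the rule stated in the wording (return the scheme whose BOM
starts Item, otherwise UTF_None). Starts_With, First_Line, Detected and the
choice of UTF_8 as the default are assumptions made for the example.

```ada
with Ada.Text_IO;

procedure BOM_Sketch is

   --  Local stand-ins copied from the draft declaration (assumption:
   --  a real program would take these from the language-defined package).
   type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);

   BOM_8    : constant String :=
     Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
   BOM_16BE : constant String :=
     Character'Val (16#FE#) & Character'Val (16#FF#);
   BOM_16LE : constant String :=
     Character'Val (16#FF#) & Character'Val (16#FE#);

   --  Hypothetical helper: True if Item begins with Prefix.
   function Starts_With (Item, Prefix : String) return Boolean is
   begin
      return Item'Length >= Prefix'Length
        and then Item (Item'First .. Item'First + Prefix'Length - 1) = Prefix;
   end Starts_With;

   --  Hypothetical body for the String version of Encoding: the scheme
   --  whose BOM starts Item, or UTF_None if no BOM matches.
   function Encoding (Item : String) return Encoding_Scheme is
   begin
      if Starts_With (Item, BOM_8) then
         return UTF_8;
      elsif Starts_With (Item, BOM_16BE) then
         return UTF_16BE;
      elsif Starts_With (Item, BOM_16LE) then
         return UTF_16LE;
      else
         return UTF_None;
      end if;
   end Encoding;

   --  Stand-in for the first line read from a file; the BOM is added by
   --  explicit concatenation, since Encode never adds one.
   First_Line : constant String := BOM_8 & "hello";

   Detected : Encoding_Scheme := Encoding (First_Line);

begin
   if Detected = UTF_None then
      Detected := UTF_8;  --  No BOM: fall back to an application default.
   end if;

   --  Detected would then be passed as Scheme to each later Decode call.
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Detected));
end BOM_Sketch;
```

The scheme found on the first line is kept in a variable because later lines
carry no BOM; only the first call needs to inspect the input.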
-Open issues:
-
-Would it be better to have longer names (i.e. UTF_16_Big_Endian) for
-encoding schemes?
-
-Should UTF_32BE and UTF_32LE be supported (as Short_Encoding, together
-with the corresponding BOM)?
-
-Should UTF_32 be supported (and Long_Long_Encoding added, together
-with the corresponding BOM)?
-
-Is it useful to support UTF_8 BOM, since its use is discouraged?
-
-Should overlong UTF_8 sequences be accepted on input, rather than
-raising Encoding_Error?
-
-
Alternative designs:
Short_Encoding and Long_Encoding could be different types rather than
@@ -241,6 +237,210 @@
function. Although more general, this solution would be too
heavy-weight for the casual user.
+!corrigendum A.4.11(0)
+
+@dinsc
+
+The language-defined package Strings.UTF_Encoding provides facilities for
+encoding and decoding strings in various character encoding schemes.
+
+@s8<@i<Static Semantics>>
+
+The library package Strings.UTF_Encoding has the following declaration:
+
+@xcode<@b<package> Ada.Strings.UTF_Encoding @b<is>
+ @b<pragma> Pure (UTF_Encoding);
+
+ @b<type> Encoding_Scheme @b<is> (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);
+ @b<subtype> Short_Encoding @b<is> Encoding_Scheme @b<range> UTF_8 .. UTF_16LE;
+ @b<subtype> Long_Encoding @b<is> Encoding_Scheme @b<range> UTF_16 .. UTF_16;
+
+ BOM_8 : @b<constant> String := Character'Val (16#EF#) & Character'Val (16#BB#) &
+ Character'Val (16#BF#);
+ BOM_16BE : @b<constant> String := Character'Val (16#FE#) & Character'Val (16#FF#);
+ BOM_16LE : @b<constant> String := Character'Val (16#FF#) & Character'Val (16#FE#);
+
+ BOM_16 : @b<constant> Wide_String := (1 =@> Wide_Character'Val (16#FEFF#));
+
+ @b<function> Encode (Item : @b<in> Wide_String;
+ Scheme : @b<in> Short_Encoding := UTF_8)
+ @b<return> String;
+ @b<function> Encode (Item : @b<in> Wide_Wide_String;
+ Scheme : @b<in> Short_Encoding := UTF_8)
+ @b<return> String;
+ @b<function> Decode (Item : @b<in> String;
+ Scheme : @b<in> Short_Encoding := UTF_8)
+ @b<return> Wide_String;
+ @b<function> Decode (Item : @b<in> String;
+ Scheme : @b<in> Short_Encoding := UTF_8)
+ @b<return> Wide_Wide_String;
+
+ @b<function> Encode (Item : @b<in> Wide_Wide_String;
+ Scheme : @b<in> Long_Encoding := UTF_16)
+ @b<return> Wide_String;
+ @b<function> Decode (Item : @b<in> Wide_String;
+ Scheme : @b<in> Long_Encoding := UTF_16)
+ @b<return> Wide_Wide_String;
+
+ @b<function> Encoding (Item : @b<in> String) @b<return> Encoding_Scheme;
+
+ @b<function> Encoding (Item : @b<in> Wide_String) @b<return> Encoding_Scheme;
+
+ Encoding_Error : @b<exception>;
+
+@b<end> Ada.Strings.UTF_Encoding;>
+
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8
+encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds to the
+UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 16 bits;
+UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits, big endian;
+and UTF_16LE corresponds to the UTF-16 encoding scheme stored in 8 bits, little endian.
+
+The subtype Short_Encoding covers the values of Encoding_Scheme for
+8-bit encoding schemes, and the subtype Long_Encoding covers the values of
+Encoding_Scheme for 16-bit encoding schemes.
+
+Each of the Encode functions takes a Wide_String (respectively
+Wide_Wide_String) Item parameter and returns a String (respectively
+Wide_String) whose characters have position values that correspond to
+the encoding of the Item parameter according to the encoding scheme
+specified by the Scheme parameter. For UTF_8, no overlong encoding
+is returned. The lower bound of the returned string is 1.
+
+Each of the Decode functions takes a String (respectively Wide_String)
+Item parameter which is assumed to contain characters whose position
+values correspond to a valid encoding according to the encoding scheme
+specified by the Scheme parameter, and returns the corresponding
+Wide_String (respectively Wide_Wide_String). The exception
+Encoding_Error is propagated if the input string does not correspond
+to a valid encoding (including overlong encoding for UTF_8). The lower
+bound of the returned string is 1.
+
+For each of the Decode functions whose Scheme parameter is of the
+Short_Encoding subtype, if the Item parameter starts with one of the
+BOM sequences then:
+@xbullet<If this sequence identifies the same encoding as specified by the
+Scheme parameter, the sequence is ignored;>
+@xbullet<Otherwise, Encoding_Error is raised.>
+
+The Encode functions do not put BOM sequences in the result.
+
+For each of the Encoding functions, if the initial characters of Item match
+a BOM, the corresponding encoding is returned; otherwise, UTF_None is returned.
+
+@s8<@i<Implementation Advice>>
+
+If an implementation supports other encoding schemes, another similar child
+of Ada.Strings should be defined.
+
+NOTE@hr
+@s9<14 A BOM can be included in a file or other entity to indicate the encoding;
+it is skipped when decoding. An explicit concatenation is needed to include a BOM
+in an encoded entity (it is not added automatically). Typically, only the first
+line of a file or other entity will contain a BOM. When decoding, the appropriate
+Encoding function can be used on the first line to determine the encoding; that
+encoding will then be used in subsequent calls to Decode to convert all of the
+lines to an internal format.>
+
!appendix
+From: Randy Brukardt
+Sent: Wednesday, July 1, 2009 8:12 PM
+
+This AI (version /03) has the following paragraph:
+
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the encoding
+scheme defined by RFC 3629; UTF_16BE corresponds to the UTF-16 encoding scheme of
+Unicode on 8 bits, big endian; UTF_16LE corresponds to the UTF-16 encoding scheme
+of Unicode on 8 bits, little endian; and UTF_16 corresponds to the UTF-16 encoding
+scheme of Unicode on 16 bits.
+
+I don't think this would pass muster in that it references documents that aren't
+International Standards. Recall that we went to incredible lengths in Amendment 1
+to avoid ever mentioning the word "Unicode" in the normative text of the standard
+(even though we referenced it). I suppose something might have changed, but in the
+absence of such knowledge, I'd prefer to avoid it.
+
+And what the heck is RFC 3629 (from an ISO perspective)?
+
+So I wonder what this ought to say. Does ISO/IEC 10646:2003 (or newer) contain these
+encodings? If so, we ought to reference that. Is there some other International
+Standard that defines these encodings? Then we need a normative reference to that.
+If there isn't any such thing, we'll need to include full normative references to
+*something* defining these encodings in clause 1.2 (and we'll have to be prepared to
+drop this whole package if there are objections -- it would be silly to derail Ada
+over a UTF encoding package).
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, July 1, 2009 8:38 PM
+
+ISO 10646 is a freely available standard (incredible as that may seem). I am
+downloading a copy so I can try to answer your questions. I suspect that someone
+else might know the answers off the top of their head...
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, July 1, 2009 8:56 PM
+
+UTF-16 is defined in Annex C, and
+UTF-8 is defined in Annex D of ISO 10646:2003.
+Unicode 4.0 is referenced in a "Note"
+in Annex C, indicating that it defines
+UTF-16, UTF-16LE, and UTF-16BE.
+
+The Normative reference section of ISO-10646 lists the following:
+
+ Unicode Standard Annex, UAX#9, The Unicode Bidi-
+ rectional Algorithm, Version 4.0.0, 2003-04-17.
+
+So apparently one can refer to the Unicode Standard in an ISO document!
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, July 1, 2009 9:14 PM
+
+Thanks for looking this up. (If I had realized that 10646 was freely available,
+I would have done this myself.) Clearly, we ought to reference the appropriate
+annex of 10646 here - we might as well be consistent with the rest of our standard.
+
+So here's proposed wording. I assumed that the terms UTF-8 and UTF-16 are actually
+defined by 10646 (right?). I changed the order of the text to read better, meaning
+that the literals are not defined in order. Tough. :-)
+
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8
+encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds to the
+UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 16 bits;
+UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits, big endian;
+and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits, little endian.
+
+Is the above good enough, or do we need an AARM note like the following??
+
+AARM Note: How the UTF-16 encoding is stored in 8 and 16 bits is defined by
+reference to Unicode 4.0.0 in ISO/IEC 10646: "Unicode Standard Annex, UAX#9, The
+Unicode Bidirectional Algorithm, Version 4.0.0, 2003-04-17."
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, July 1, 2009 9:21 PM
+
+...
+> Is the above good enough, or do we need an AARM note like the following??
+
+The above is fine by me. Anyone who goes and looks up UTF-16 in ISO-10646 will find
+the reference to Unicode.
+
+> AARM Note: How the UTF-16 encoding is stored in 8 and 16 bits is
+> defined by reference to Unicode 4.0.0 in ISO/IEC 10646: "Unicode
+> Standard Annex, UAX#9, The Unicode Bidirectional Algorithm, Version 4.0.0, 2003-04-17."
+
+I don't think this is necessary, and in fact that isn't where UTF-16LE and UTF-16BE are
+defined. ISO-10646 doesn't identify exactly where they are defined. It merely says they
+are defined somewhere in Unicode 4.0. That reference I included was just to show you that
+they were willing to make a formal reference to the Unicode Standard.
+
****************************************************************