CVS difference for ai05s/ai05-0137-1.txt

Differences between 1.2 and version 1.3

--- ai05s/ai05-0137-1.txt	2009/03/12 05:12:15	1.2
+++ ai05s/ai05-0137-1.txt	2009/07/11 03:06:22	1.3
@@ -1,10 +1,12 @@
-!standard A.4.10                                       09-03-03  AI05-0137-1/02
+!standard A.4.11                                       09-06-30  AI05-0137-1/03
 !class Amendment 09-02-12
+!status Amendment 201Z 09-06-30
+!status ARG Approved  7-0-0  09-06-13
 !status work item 09-02-12
 !status received 09-02-11
 !priority Medium
 !difficulty Easy
-!subject New conversion package
+!subject String encoding package
 
 !summary
 
@@ -28,24 +30,26 @@
 
 !wording
 
-The following clause is added as A.4.10:
+The following clause is added as A.4.11:
 
-A.4.10 String encoding
-The language-defined package Strings.Encoding provides facilities for
+A.4.11 String encoding
+
+The language-defined package Strings.UTF_Encoding provides facilities for
 encoding and decoding strings in various character encoding schemes.
 
 Static Semantics
 
-The library package Strings.Encoding has the following declaration:
+The library package Strings.UTF_Encoding has the following declaration:
 
-package Ada.Strings.Encoding is
-   pragma Pure (Encoding);
+package Ada.Strings.UTF_Encoding is
+   pragma Pure (UTF_Encoding);
    
-   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE, UTF_16);
+   type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);
    subtype Short_Encoding is Encoding_Scheme range UTF_8  .. UTF_16LE;
    subtype Long_Encoding  is Encoding_Scheme range UTF_16 .. UTF_16;
    
-   BOM_8    : constant String := Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
+   BOM_8    : constant String := Character'Val (16#EF#) & Character'Val (16#BB#) &
+                 Character'Val (16#BF#);
    BOM_16BE : constant String := Character'Val (16#FE#) & Character'Val (16#FF#);
    BOM_16LE : constant String := Character'Val (16#FF#) & Character'Val (16#FE#);
 
@@ -70,17 +74,21 @@
    function Decode (Item   : in Wide_String;      
                     Scheme : in Long_Encoding := UTF_16) 
             return Wide_Wide_String;
+
+   function Encoding (Item : in String) return Encoding_Scheme;
+
+   function Encoding (Item : in Wide_String) return Encoding_Scheme;
    
    Encoding_Error : exception;
-end Ada.Strings.Encoding;
 
-The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds
-to the encoding scheme defined by RFC 3629; UTF_16BE corresponds to
-the UTF-16 encoding scheme of Unicode on 8 bits, big endian; UTF_16LE
-corresponds to the UTF-16 encoding scheme of Unicode on 8 bits, little
-endian; and UTF_16 corresponds to the UTF-16 encoding scheme of
-Unicode on 16 bits.
+end Ada.Strings.UTF_Encoding;
 
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8
+encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds to the
+UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 16 bits;
+UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits, big endian;
+and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits, little endian.
+
 The subtype Short_Encoding covers the values of Encoding_Scheme for 8
 bits encoding schemes, and the subtype Long_Encoding covers the values of
 Encoding_Scheme for 16 bits encoding schemes.
@@ -90,7 +98,7 @@
 Wide_String) whose characters have position values that correspond to
 the encoding of the Item parameter according to the encoding scheme
 specified by the Scheme parameter. For UTF_8, no overlong encoding
-shall be returned. The lower bound of the returned string shall be 1.
+is returned. The lower bound of the returned string shall be 1.
 
 Each of the Decode functions takes a String (respectively Wide_String)
 Item parameter which is assumed to contain characters whose position
@@ -99,31 +107,36 @@
 Wide_String (respectively Wide_Wide_String). The exception
 Encoding_Error is propagated if the input string does not correspond
 to a valid encoding (including overlong encoding for UTF_8). The lower
-bound of the returned string shall be 1.
+bound of the returned string is 1.
 
 For each of the Decode functions whose Scheme parameter is of the
 Short_Encoding subtype, if the Item parameter starts with one of the
 BOM sequences then:
 - If this sequence identifies the same encoding as specified by the
   Scheme parameter, the sequence is ignored;
-- Otherwise, Encoding_Error is raised
+- Otherwise, Encoding_Error is raised.
+
+The Encode functions do not put BOM sequences in the result.
 
-The Encode functions shall not put BOM sequences in the result.
+For each of the Encoding functions, if the initial characters of Item match
+a BOM, the corresponding encoding is returned; otherwise, UTF_None is returned.
 
-Implementation permission
+Implementation advice
 
-An implementation is allowed to provide additional enumeration
-literals to the type Encoding_Scheme for other character encoding
-schemes. Literals corresponding to 8 bits encoding schemes shall
-belong to the Short_Encoding subtype and literals corresponding to 16
-bits encoding schemes shall belong to the Long_Encoding subtype.
+If an implementation supports other encoding schemes, another similar child
+of Ada.Strings should be defined.
 
-If a BOM is defined for an implementation defined encoding scheme, a
-corresponding constant shall be added.
+Note: A BOM can be included in a file or other entity to indicate the encoding;
+it is skipped when decoding. An explicit concatenation is needed to include a BOM
+in an encoded entity (it is not added automatically). Typically, only the first
+line of a file or other entity will contain a BOM. When decoding, the appropriate
+Encoding function can be used on the first line to determine the encoding; that
+encoding will then be used in subsequent calls to Decode to convert all of the
+lines to an internal format.
 
 !discussion
 
-Reminder of the issues:
+Background on character encoding:
 
 A character set is a set of abstract characters. An encoding assigns
 an integer value to each character; this value is called the
@@ -179,13 +192,13 @@
 There are many other possible encoding schemes, including UTF-EBCDIC,
 Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most
 useful ones, while leaving the possibility (through an implementation
-permission) to provide others.
+advice) to provide others.
 
 When reading a file, a BOM can be expected as starting the first line
 of the file, but not subsequent lines. The proposed handling of BOM
 assumes the following pattern:
 
-1) Read the first line. Check (with the help of provided constants) if
+1) Read the first line. Check (with function Encoding) if
    the first line starts with a BOM. If yes, initialize the encoding
    scheme accordingly; if no, choose a default encoding scheme.
 
@@ -198,23 +211,6 @@
 constant.
 
 
-Open issues:
-
-Would it be better to have longer names (i.e. UTF_16_Big_Endian) for
-encoding schemes?
-
-Should UTF_32BE and UTF_32LE be supported (as Short_Encoding, together
-with the corresponding BOM)?
-
-Should UTF_32 be supported (and Long_Long_Encoding added, together
-with the corresponding BOM)?
-
-Is it useful to support UTF_8 BOM, since its use is discouraged?
-
-Should overlong UTF_8 sequences be accepted on input, rather than
-raising Encoding_Error?
-
-
 Alternative designs: 
 
 Short_Encoding and Long_Encoding could be different types rather than
@@ -241,6 +237,210 @@
 function. Although more general, this solution would be too
 heavy-weight for the casual user.
 
+!corrigendum A.4.11(0)
+
+@dinsc
+
+The language-defined package Strings.UTF_Encoding provides facilities for
+encoding and decoding strings in various character encoding schemes.
+
+@s8<@i<Static Semantics>>
+
+The library package Strings.UTF_Encoding has the following declaration:
+
+@xcode<@b<package> Ada.Strings.UTF_Encoding @b<is>
+   @b<pragma> Pure (UTF_Encoding);
+   
+   @b<type> Encoding_Scheme @b<is> (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);
+   @b<subtype> Short_Encoding @b<is> Encoding_Scheme @b<range> UTF_8  .. UTF_16LE;
+   @b<subtype> Long_Encoding  @b<is> Encoding_Scheme @b<range> UTF_16 .. UTF_16;
+   
+   BOM_8    : @b<constant> String := Character'Val (16#EF#) & Character'Val (16#BB#) &
+                 Character'Val (16#BF#);
+   BOM_16BE : @b<constant> String := Character'Val (16#FE#) & Character'Val (16#FF#);
+   BOM_16LE : @b<constant> String := Character'Val (16#FF#) & Character'Val (16#FE#);
+
+   BOM_16   : @b<constant> Wide_String := (1 =@> Wide_Character'Val (16#FEFF#));
+
+   @b<function> Encode (Item   : @b<in> Wide_String;      
+                    Scheme : @b<in> Short_Encoding := UTF_8)
+            @b<return> String;
+   @b<function> Encode (Item   : @b<in> Wide_Wide_String; 
+                    Scheme : @b<in> Short_Encoding := UTF_8) 
+            @b<return> String;
+   @b<function> Decode (Item   : @b<in> String;           
+                    Scheme : @b<in> Short_Encoding := UTF_8) 
+            @b<return> Wide_String;
+   @b<function> Decode (Item   : @b<in> String;           
+                    Scheme : @b<in> Short_Encoding := UTF_8) 
+            @b<return> Wide_Wide_String;
+
+   @b<function> Encode (Item   : @b<in> Wide_Wide_String; 
+                    Scheme : @b<in> Long_Encoding := UTF_16) 
+            @b<return> Wide_String;
+   @b<function> Decode (Item   : @b<in> Wide_String;      
+                    Scheme : @b<in> Long_Encoding := UTF_16) 
+            @b<return> Wide_Wide_String;
+
+   @b<function> Encoding (Item : @b<in> String) @b<return> Encoding_Scheme;
+
+   @b<function> Encoding (Item : @b<in> Wide_String) @b<return> Encoding_Scheme;
+   
+   Encoding_Error : @b<exception>;
+
+@b<end> Ada.Strings.UTF_Encoding;>
+
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8
+encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds to the
+UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 16 bits;
+UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits, big endian;
+and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits, little endian.
+
+The subtype Short_Encoding covers the values of Encoding_Scheme for 8
+bits encoding schemes, and the subtype Long_Encoding covers the values of
+Encoding_Scheme for 16 bits encoding schemes.
+
+Each of the Encode functions takes a Wide_String (respectively
+Wide_Wide_String) Item parameter and returns a String (respectively
+Wide_String) whose characters have position values that correspond to
+the encoding of the Item parameter according to the encoding scheme
+specified by the Scheme parameter. For UTF_8, no overlong encoding
+is returned. The lower bound of the returned string shall be 1.
+
+Each of the Decode functions takes a String (respectively Wide_String)
+Item parameter which is assumed to contain characters whose position
+values correspond to a valid encoding according to the encoding scheme
+specified by the Scheme parameter, and returns the corresponding
+Wide_String (respectively Wide_Wide_String). The exception
+Encoding_Error is propagated if the input string does not correspond
+to a valid encoding (including overlong encoding for UTF_8). The lower
+bound of the returned string is 1.
+
+For each of the Decode functions whose Scheme parameter is of the
+Short_Encoding subtype, if the Item parameter starts with one of the
+BOM sequences then:
+@xbullet<If this sequence identifies the same encoding as specified by the
+Scheme parameter, the sequence is ignored;>
+@xbullet<Otherwise, Encoding_Error is raised.>
+
+The Encode functions do not put BOM sequences in the result.
+
+For each of the Encoding functions, if the initial characters of Item match
+a BOM, the corresponding encoding is returned; otherwise, UTF_None is returned.
+
+@s8<@i<Implementation Advice>>
+
+If an implementation supports other encoding schemes, another similar child
+of Ada.Strings should be defined.
+
+NOTE@hr
+@s9<14  A BOM can be included in a file or other entity to indicate the encoding;
+it is skipped when decoding. An explicit concatenation is needed to include a BOM
+in an encoded entity (it is not added automatically). Typically, only the first
+line of a file or other entity will contain a BOM. When decoding, the appropriate
+Encoding function can be used on the first line to determine the encoding; that
+encoding will then be used in subsequent calls to Decode to convert all of the
+lines to an internal format.>
+
 !appendix
 
+From: Randy Brukardt
+Sent: Wednesday, July 1, 2009  8:12 PM
+
+This AI (version /03) has the following paragraph:
+
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the encoding
+scheme defined by RFC 3629; UTF_16BE corresponds to the UTF-16 encoding scheme of
+Unicode on 8 bits, big endian; UTF_16LE corresponds to the UTF-16 encoding scheme
+of Unicode on 8 bits, little endian; and UTF_16 corresponds to the UTF-16 encoding
+scheme of Unicode on 16 bits.
+
+I don't think this would pass muster in that it references documents that aren't
+International Standards. Recall that we went to incredible lengths in Amendment 1
+to avoid ever mentioning the word "Unicode" in the normative text of the standard
+(even though we referenced it). I suppose something might have changed, but in the
+absence of such knowledge, I'd prefer to avoid it.
+
+And what the heck is RFC 3629 (from an ISO perspective)?
+
+So I wonder what this ought to say. Does ISO/IEC 10646:2003 (or newer) contain these
+encodings? If so, we ought to reference that. Is there some other International
+Standard that defines these encodings? Then we need a normative reference to that.
+If there isn't any such thing, we'll need to include full normative references to
+*something* defining these encodings in clause 1.2 (and we'll have to be prepared to
+drop this whole package if there are objections -- it would be silly to derail Ada
+over a UTF encoding package).
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, July 1, 2009  8:38 PM
+
+ISO 10646 is a freely available standard (incredible as that may seem).  I am
+downloading a copy so I can try to answer your questions.  I suspect that someone
+else might know the answers off the top of their head...
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, July 1, 2009  8:56 PM
+
+UTF-16 is defined in Annex C, and
+UTF-8 is defined in Annex D of ISO 10646:2003.
+Unicode 4.0 is referenced in a "Note"
+in Annex C, indicating that it defines
+UTF-16, UTF-16LE, and UTF-16BE.
+
+The Normative reference section of ISO-10646 lists the following:
+
+   Unicode Standard Annex, UAX#9, The Unicode Bidirectional
+   Algorithm, Version 4.0.0, 2003-04-17.
+
+So apparently one can refer to the Unicode Standard in an ISO document!
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, July 1, 2009  9:14 PM
+
+Thanks for looking this up. (If I had realized that 10646 was freely available,
+I would have done this myself.) Clearly, we ought to reference the appropriate
+annex of 10646 here - we might as well be consistent with the rest of our standard.
+
+So here's proposed wording. I assumed that the terms UTF-8 and UTF-16 are actually
+defined by 10646 (right?). I changed the order of the text to read better, meaning
+that the literals are not defined in order. Tough. :-)
+
+The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8
+encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds to the
+UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 16 bits;
+UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits, big endian;
+and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits, little endian.
+
+Is the above good enough, or do we need an AARM note like the following??
+
+AARM Note: How the UTF-16 encoding is stored in 8 and 16 bits is defined by
+reference to Unicode 4.0.0 in ISO/IEC 10646: "Unicode Standard Annex, UAX#9, The
+Unicode Bidirectional Algorithm, Version 4.0.0, 2003-04-17."
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, July 1, 2009  9:21 PM
+
+...
+> Is the above good enough, or do we need an AARM note like the following??
+
+The above is fine by me.  Anyone who goes and looks up UTF-16 in ISO-10646 will find
+the reference to Unicode.
+
+> AARM Note: How the UTF-16 encoding is stored in 8 and 16 bits is 
+> defined by reference to Unicode 4.0.0 in ISO/IEC 10646: "Unicode 
+> Standard Annex, UAX#9, The Unicode Bidirectional Algorithm, Version 4.0.0, 2003-04-17."
+
+I don't think this is necessary, and in fact that isn't where UTF-16LE and UTF-16BE are
+defined.  ISO-10646 doesn't identify exactly where they are defined. It merely says they
+are defined somewhere in Unicode 4.0.  That reference I included was just to show you that
+they were willing to make a formal reference to the Unicode Standard.
+ 
 ****************************************************************
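
Illustration of the BOM handling described in the note of the proposed
wording: check whether the first line of some input starts with one of the
BOM constants, and pick an encoding scheme accordingly. This is a minimal,
self-contained sketch under stated assumptions, not the language-defined
package. The type, the BOM constants, and the String version of Encoding are
redeclared locally, following the declaration in this AI. Starts_With and
BOM_Demo are hypothetical names introduced only for the example. The package
as finally standardized may differ from this draft.

```ada
with Ada.Text_IO;

procedure BOM_Demo is
   --  Local stand-ins mirroring the declarations proposed in this AI;
   --  they are assumptions of this sketch, not the standard package.
   type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);

   BOM_8    : constant String :=
     Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
   BOM_16BE : constant String :=
     Character'Val (16#FE#) & Character'Val (16#FF#);
   BOM_16LE : constant String :=
     Character'Val (16#FF#) & Character'Val (16#FE#);

   --  Hypothetical helper: True if Item begins with Prefix.
   function Starts_With (Item, Prefix : String) return Boolean is
   begin
      return Item'Length >= Prefix'Length
        and then Item (Item'First .. Item'First + Prefix'Length - 1) = Prefix;
   end Starts_With;

   --  Returns the scheme whose BOM starts Item, or UTF_None if there is none.
   function Encoding (Item : String) return Encoding_Scheme is
   begin
      if Starts_With (Item, BOM_8) then
         return UTF_8;
      elsif Starts_With (Item, BOM_16BE) then
         return UTF_16BE;
      elsif Starts_With (Item, BOM_16LE) then
         return UTF_16LE;
      else
         return UTF_None;
      end if;
   end Encoding;

   --  A BOM is never added automatically; it is concatenated explicitly.
   First_Line : constant String := BOM_8 & "hello";
begin
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (First_Line)));
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding ("plain")));
end BOM_Demo;
```

Run as written, the sketch prints UTF_8 for the line that starts with BOM_8
and UTF_NONE for the line without a BOM. In the reading pattern described in
the discussion, a caller that gets UTF_None would fall back to a default
encoding scheme before decoding the remaining lines.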
