!standard A.4.11 09-10-12 AI05-0137-1/05
!class Amendment 09-02-12
!status Amendment 201Z 09-06-30
!status ARG Approved 7-0-0 09-06-13
!status work item 09-02-12
!status received 09-02-11
!priority Medium
!difficulty Easy
!subject String encoding package

!summary

A new child package of Ada.Strings is added to support conversions between
Wide_String/Wide_Wide_String and UTF_8/UTF_16 encoding.

!problem

SI99-0041 requires the adoption of UTF_16 for the encoding of program text in
ASIS. Similarly, many real-world applications use UTF-8 or UTF-16 encodings.
However, the Ada Standard provides no way to actually construct or use such
text strings.

It would be useful for ASIS users, but also for the Ada community at large to
define a package to handle encoding/decoding between
Wide_String/Wide_Wide_String and UTF_8/UTF_16.

!proposal

(See summary.)

!wording

The following clause is added as A.4.11:

A.4.11 String encoding

The language-defined package Strings.UTF_Encoding provides facilities for
encoding and decoding strings in various character encoding schemes.

Static Semantics

The library package Strings.UTF_Encoding has the following declaration:

   package Ada.Strings.UTF_Encoding is
      pragma Pure (UTF_Encoding);

      type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);

      subtype Short_Encoding is Encoding_Scheme range UTF_8 .. UTF_16LE;
      subtype Long_Encoding is Encoding_Scheme range UTF_16 ..
         UTF_16;

      BOM_8    : constant String := Character'Val(16#EF#) &
                                    Character'Val(16#BB#) &
                                    Character'Val(16#BF#);
      BOM_16BE : constant String := Character'Val(16#FE#) &
                                    Character'Val(16#FF#);
      BOM_16LE : constant String := Character'Val(16#FF#) &
                                    Character'Val(16#FE#);
      BOM_16   : constant Wide_String := (1 => Wide_Character'Val(16#FEFF#));

      function Encode (Item   : in Wide_String;
                       Scheme : in Short_Encoding := UTF_8)
         return String;

      function Encode (Item   : in Wide_Wide_String;
                       Scheme : in Short_Encoding := UTF_8)
         return String;

      function Decode (Item   : in String;
                       Scheme : in Short_Encoding := UTF_8)
         return Wide_String;

      function Decode (Item   : in String;
                       Scheme : in Short_Encoding := UTF_8)
         return Wide_Wide_String;

      function Encode (Item   : in Wide_Wide_String;
                       Scheme : in Long_Encoding := UTF_16)
         return Wide_String;

      function Decode (Item   : in Wide_String;
                       Scheme : in Long_Encoding := UTF_16)
         return Wide_Wide_String;

      function Encoding (Item : in String) return Encoding_Scheme;
      function Encoding (Item : in Wide_String) return Encoding_Scheme;

      Encoding_Error : exception;

   end Ada.Strings.UTF_Encoding;

The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the
UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds
to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in
16 bits; UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits,
big endian; and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits,
little endian.

The subtype Short_Encoding covers the values of Encoding_Scheme for 8-bit
encoding schemes, and the subtype Long_Encoding covers the values of
Encoding_Scheme for 16-bit encoding schemes.

Each of the Encode functions takes a Wide_String (respectively
Wide_Wide_String) Item parameter and returns a String (respectively
Wide_String) whose characters have position values that correspond to the
encoding of the Item parameter according to the encoding scheme specified by
the Scheme parameter.
For UTF_8, no overlong encoding is returned. The lower bound of the returned
string is 1.

Each of the Decode functions takes a String (respectively Wide_String) Item
parameter which is assumed to contain characters whose position values
correspond to a valid encoding according to the encoding scheme specified by
the Scheme parameter, and returns the corresponding Wide_String (respectively
Wide_Wide_String). The exception Encoding_Error is propagated if the input
string does not correspond to a valid encoding (including overlong encoding
for UTF_8). The lower bound of the returned string is 1.

For each of the Decode functions whose Scheme parameter is of the
Short_Encoding subtype, if the Item parameter starts with one of the BOM
sequences then:
  - If this sequence identifies the same encoding as specified by the Scheme
    parameter, the sequence is ignored;
  - Otherwise, Encoding_Error is raised.

The Encode functions do not put BOM sequences in the result.

For each of the Encoding functions, if the initial characters of Item match a
BOM, the corresponding encoding is returned; otherwise, UTF_None is returned.

Implementation Advice

If an implementation supports other encoding schemes, another similar child
of Ada.Strings should be defined.

Note:
A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or
other entity to indicate the encoding; it is skipped when decoding. An
explicit concatenation is needed to include a BOM in an encoded entity (it is
not added automatically). Typically, only the first line of a file or other
entity will contain a BOM. When decoding, the appropriate Encoding function
can be used on the first line to determine the encoding; that encoding will
then be used in subsequent calls to Decode to convert all of the lines to an
internal format.

!discussion

Background on character encoding:

A character set is a set of abstract characters.
An encoding assigns an integer value to each character; this value is called
the code-point of the character.

Normally, a character string should be represented as a sequence of
code-points; however, this would waste a lot of space, since ISO 10646
defines 32-bit code-points. An encoding scheme is a more economical
representation of a string of characters. Typically, an encoding scheme uses
a sequence of integer values, where each code-point is represented by one or
several consecutive values.

UTF-8 is an encoding scheme that uses 8-bit values. In some cases, UTF-8
defines several possible encodings for a code-point; in this case, the
shortest one should be used; other encodings are called overlong encodings.
UTF-16 uses 16-bit values. UTF-32 uses 32-bit values, which is of little
interest since nothing is gained compared to UCS-4 (raw encoding).

There is no problem when using a String to encode UTF-8, or a Wide_String to
encode UTF-16. However, it is sometimes useful to encode/decode a UTF-16 (or
even UTF-32) encoded text into/from a String; in that case, characters must
be paired to form 16-bit values (or 32-bit values). This can be done in two
ways, Big Endian (high order character first) or Little Endian (low order
character first).

A special value, called BOM (Byte Order Mark, 16#FEFF#), can be used at the
beginning of an encoded text (with 4 leading zeroes for UTF-32). The BOM
corresponds to no code-point, and is discarded when decoding, but it is used
to recognize whether a stream of bytes is Big Endian or Little Endian UTF-16
or UTF-32. By extension, the sequence 16#EF# 16#BB# 16#BF# can be used as BOM
to identify UTF-8 text (although there is no byte order issue in UTF-8;
actually, use of BOM for UTF-8 is discouraged).

Note that UTF-8 encoding could be used for file names that include characters
that are not in ASCII.
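As a rough illustration of the character pairing and byte-order mark described
above, the following self-contained sketch splits the 16-bit value 16#FEFF#
into two Characters in each order and then rebuilds it. The names
Byte_Order_Sketch, Byte_Order, Code_Unit, To_Pair and From_Pair are
hypothetical and are not part of the proposed package; only the resulting
sequences (16#FE# 16#FF# and 16#FF# 16#FE#) correspond to the BOM_16BE and
BOM_16LE constants of the !wording.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Byte_Order_Sketch is

   --  Hypothetical names for illustration only; none of them is
   --  declared by the proposed package.
   type Byte_Order is (Big_Endian, Little_Endian);
   subtype Code_Unit is Natural range 0 .. 16#FFFF#;

   BOM : constant Code_Unit := 16#FEFF#;

   --  Store one 16-bit value as two Characters in the requested order.
   function To_Pair (Value : Code_Unit; Order : Byte_Order) return String is
      High : constant Character := Character'Val (Value / 256);
      Low  : constant Character := Character'Val (Value mod 256);
   begin
      if Order = Big_Endian then
         return High & Low;   --  high order character first
      else
         return Low & High;   --  low order character first
      end if;
   end To_Pair;

   --  Rebuild the 16-bit value from a two-Character string.
   function From_Pair (Pair : String; Order : Byte_Order) return Code_Unit is
      B1 : constant Natural := Character'Pos (Pair (Pair'First));
      B2 : constant Natural := Character'Pos (Pair (Pair'First + 1));
   begin
      if Order = Big_Endian then
         return B1 * 256 + B2;
      else
         return B2 * 256 + B1;
      end if;
   end From_Pair;

   BE : constant String := To_Pair (BOM, Big_Endian);
   LE : constant String := To_Pair (BOM, Little_Endian);

begin
   --  The first Character of an encoded BOM reveals the byte order:
   --  16#FE# means big endian, 16#FF# means little endian.
   Put_Line ("BE first byte:" & Integer'Image (Character'Pos (BE (BE'First))));
   Put_Line ("LE first byte:" & Integer'Image (Character'Pos (LE (LE'First))));

   --  Reading the pair back with the matching order restores 16#FEFF#.
   Put_Line ("Round trip:" & Integer'Image (From_Pair (LE, Little_Endian)));
end Byte_Order_Sketch;
```

Under these assumptions the three lines should report 254, 255 and 65279
(16#FE#, 16#FF# and 16#FEFF#). Reading a pair back with the wrong order would
yield 16#FFFE# instead, which is how a BOM lets a reader recognize the byte
order of the text that follows.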
This package would allow adding an Implementation Advice (to Text_IO,
Sequential_IO, and so on) to the effect that it is recommended to support
file names encoded in UTF-8.

Implementation choices:

Strictly speaking, an encoded text should be an array of bytes, not of
(wide_)characters. This proposal uses (wide_)string, but the encoding is
defined in terms of position values of characters rather than characters
themselves. It could be argued that it should be defined in terms of internal
representation of characters, but we know that they are the same as the
position values for (Wide_)Character.

We chose to have a parameter to specify the encoding rather than providing
one function for each encoding scheme, because it makes things easier for the
user when the encoding scheme is recognized dynamically. It also makes it
easier to provide implementation-defined encoding schemes.

There are many other possible encoding schemes, including UTF-EBCDIC,
Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most useful
ones, while leaving open the possibility (through Implementation Advice) of
providing others.

When reading a file, a BOM can be expected as starting the first line of the
file, but not subsequent lines. The proposed handling of BOM assumes the
following pattern:
  1) Read the first line. Check (with function Encoding) if the first line
     starts with a BOM. If yes, initialize the encoding scheme accordingly;
     if no, choose a default encoding scheme.
  2) Decode all lines (including the first one) with the chosen encoding
     scheme. Since the BOM is ignored by Decode functions, it is not
     necessary to slice the first line specially.

For encoding, it does not seem useful to have the BOM handled by the encoding
functions, since it is easy to concatenate the appropriate constant.

Alternative designs:

Short_Encoding and Long_Encoding could be different types rather than
subtypes of the same type.

Arrays of Unsigned_8 or Unsigned_16 could be used in place of (Wide_)String.
That would enforce strong typing to differentiate between an Ada String and
an encoded string. On the other hand, it is likely to be more of a burden
than a help to most casual users. Moreover, it would not allow ASIS program
text to be kept as a Wide_String.

Existing similar packages:

Similar conversion functions are provided as part of xmlada and qtada. xmlada
provides much more sophisticated services, such as supporting conversions to
various ccs, converting in place in buffers, etc. However, it seems
reasonable to provide only basic functionalities in the standard.

GNAT provides the package System.WCh_Con, but it converts only individual
characters (not strings), does not support UTF-8, and is provided by generics
that require a user-provided input/output formal function. Although more
general, this solution would be too heavyweight for the casual user.

!corrigendum A.4.11(0)

@dinsc

The language-defined package Strings.UTF_Encoding provides facilities for
encoding and decoding strings in various character encoding schemes.

@s8<@i<Static Semantics>>

The library package Strings.UTF_Encoding has the following declaration:

@xcode<@b<package> Ada.Strings.UTF_Encoding @b<is>
   @b<pragma> Pure (UTF_Encoding);

   @b<type> Encoding_Scheme @b<is> (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);

   @b<subtype> Short_Encoding @b<is> Encoding_Scheme @b<range> UTF_8 .. UTF_16LE;
   @b<subtype> Long_Encoding @b<is> Encoding_Scheme @b<range> UTF_16 ..
      UTF_16;

   BOM_8    : @b<constant> String := Character'Val(16#EF#) &
                                 Character'Val(16#BB#) &
                                 Character'Val(16#BF#);
   BOM_16BE : @b<constant> String := Character'Val(16#FE#) &
                                 Character'Val(16#FF#);
   BOM_16LE : @b<constant> String := Character'Val(16#FF#) &
                                 Character'Val(16#FE#);
   BOM_16   : @b<constant> Wide_String := (1 =@> Wide_Character'Val(16#FEFF#));

   @b<function> Encode (Item   : @b<in> Wide_String;
                    Scheme : @b<in> Short_Encoding := UTF_8)
      @b<return> String;

   @b<function> Encode (Item   : @b<in> Wide_Wide_String;
                    Scheme : @b<in> Short_Encoding := UTF_8)
      @b<return> String;

   @b<function> Decode (Item   : @b<in> String;
                    Scheme : @b<in> Short_Encoding := UTF_8)
      @b<return> Wide_String;

   @b<function> Decode (Item   : @b<in> String;
                    Scheme : @b<in> Short_Encoding := UTF_8)
      @b<return> Wide_Wide_String;

   @b<function> Encode (Item   : @b<in> Wide_Wide_String;
                    Scheme : @b<in> Long_Encoding := UTF_16)
      @b<return> Wide_String;

   @b<function> Decode (Item   : @b<in> Wide_String;
                    Scheme : @b<in> Long_Encoding := UTF_16)
      @b<return> Wide_Wide_String;

   @b<function> Encoding (Item : @b<in> String) @b<return> Encoding_Scheme;
   @b<function> Encoding (Item : @b<in> Wide_String) @b<return> Encoding_Scheme;

   Encoding_Error : @b<exception>;

@b<end> Ada.Strings.UTF_Encoding;>

The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the
UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16 corresponds
to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in
16 bits; UTF_16BE corresponds to the UTF-16 encoding scheme stored in 8 bits,
big endian; and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits,
little endian.

The subtype Short_Encoding covers the values of Encoding_Scheme for 8-bit
encoding schemes, and the subtype Long_Encoding covers the values of
Encoding_Scheme for 16-bit encoding schemes.

Each of the Encode functions takes a Wide_String (respectively
Wide_Wide_String) Item parameter and returns a String (respectively
Wide_String) whose characters have position values that correspond to the
encoding of the Item parameter according to the encoding scheme specified by
the Scheme parameter. For UTF_8, no overlong encoding is returned. The lower
bound of the returned string is 1.

Each of the Decode functions takes a String (respectively Wide_String) Item
parameter which is assumed to contain characters whose position values
correspond to a valid encoding according to the encoding scheme specified by
the Scheme parameter, and returns the corresponding Wide_String (respectively
Wide_Wide_String). The exception Encoding_Error is propagated if the input
string does not correspond to a valid encoding (including overlong encoding
for UTF_8). The lower bound of the returned string is 1.

For each of the Decode functions whose Scheme parameter is of the
Short_Encoding subtype, if the Item parameter starts with one of the BOM
sequences then:

@xbullet<If this sequence identifies the same encoding as specified by the
Scheme parameter, the sequence is ignored;>

@xbullet<Otherwise, Encoding_Error is raised.>

The Encode functions do not put BOM sequences in the result.

For each of the Encoding functions, if the initial characters of Item match a
BOM, the corresponding encoding is returned; otherwise, UTF_None is returned.

@s8<@i<Implementation Advice>>

If an implementation supports other encoding schemes, another similar child
of Ada.Strings should be defined.

NOTE@hr
@s9<14 A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a
file or other entity to indicate the encoding; it is skipped when decoding.
An explicit concatenation is needed to include a BOM in an encoded entity (it
is not added automatically). Typically, only the first line of a file or
other entity will contain a BOM. When decoding, the appropriate Encoding
function can be used on the first line to determine the encoding; that
encoding will then be used in subsequent calls to Decode to convert all of
the lines to an internal format.>

!appendix

From: Randy Brukardt
Sent: Wednesday, July 1, 2009 8:12 PM

This AI (version /03) has the following paragraph:

   The type Encoding_Scheme defines encoding schemes.
   UTF_8 corresponds to the encoding scheme defined by RFC 3629; UTF_16BE
   corresponds to the UTF-16 encoding scheme of Unicode on 8 bits, big
   endian; UTF_16LE corresponds to the UTF-16 encoding scheme of Unicode on
   8 bits, little endian; and UTF_16 corresponds to the UTF-16 encoding
   scheme of Unicode on 16 bits.

I don't think this would pass muster in that it references documents that
aren't International Standards. Recall that we went to incredible lengths in
Amendment 1 to avoid ever mentioning the word "Unicode" in the normative text
of the standard (even though we referenced it). I suppose something might
have changed, but in the absence of such knowledge, I'd prefer to avoid it.
And what the heck is RFC 3629 (from an ISO perspective)?

So I wonder what this ought to say. Does ISO/IEC 10646:2003 (or newer)
contain these encodings? If so, we ought to reference that. Is there some
other International Standard that defines these encodings? Then we need a
normative reference to that. If there isn't any such thing, we'll need to
include full normative references to *something* defining these encodings in
clause 1.2 (and we'll have to be prepared to drop this whole package if there
are objections -- it would be silly to derail Ada over a UTF encoding
package).

****************************************************************

From: Tucker Taft
Sent: Wednesday, July 1, 2009 8:38 PM

ISO 10646 is a freely available standard (incredible as that may seem). I am
downloading a copy so I can try to answer your questions. I suspect that
someone else might know the answers off the top of their head...

****************************************************************

From: Tucker Taft
Sent: Wednesday, July 1, 2009 8:56 PM

UTF-16 is defined in Annex C, and UTF-8 is defined in Annex D of ISO
10646:2003. Unicode 4.0 is referenced in a "Note" in Annex C, indicating that
it defines UTF-16, UTF-16LE, and UTF-16BE.
The Normative reference section of ISO-10646 lists the following:

   Unicode Standard Annex, UAX#9, The Unicode Bidirectional Algorithm,
   Version 4.0.0, 2003-04-17.

So apparently one can refer to the Unicode Standard in an ISO document!

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 1, 2009 9:14 PM

Thanks for looking this up. (If I had realized that 10646 was freely
available, I would have done this myself.)

Clearly, we ought to reference the appropriate annex of 10646 here - we might
as well be consistent with the rest of our standard.

So here's proposed wording. I assumed that the terms UTF-8 and UTF-16 are
actually defined by 10646 (right?). I changed the order of the text to read
better, meaning that the literals are not defined in order. Tough. :-)

   The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to
   the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16
   corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC
   10646 stored in 16 bits; UTF_16BE corresponds to the UTF-16 encoding
   scheme stored in 8 bits, big endian; and UTF_16LE corresponds to the
   UTF-16 encoding scheme on 8 bits, little endian.

Is the above good enough, or do we need an AARM note like the following??

   AARM Note: How the UTF-16 encoding is stored in 8 and 16 bits is defined
   by reference to Unicode 4.0.0 in ISO/IEC 10646: "Unicode Standard Annex,
   UAX#9, The Unicode Bidirectional Algorithm, Version 4.0.0, 2003-04-17."

****************************************************************

From: Tucker Taft
Sent: Wednesday, July 1, 2009 9:21 PM

...
> Is the above good enough, or do we need an AARM note like the following??

The above is fine by me. Anyone who goes and looks up UTF-16 in ISO-10646
will find the reference to Unicode.

> AARM Note: How the UTF-16 encoding is stored in 8 and 16 bits is
> defined by reference to Unicode 4.0.0 in ISO/IEC 10646: "Unicode
> Standard Annex, UAX#9, The Unicode Bidirectional Algorithm, Version
> 4.0.0, 2003-04-17."

I don't think this is necessary, and in fact that isn't where UTF-16LE and
UTF-16BE are defined. ISO-10646 doesn't identify exactly where they are
defined. It merely says they are defined somewhere in Unicode 4.0. That
reference I included was just to show you that they were willing to make a
formal reference to the Unicode Standard.

****************************************************************