!standard A.4.10                                   09-02-12  AI05-0137-1/01
!class Amendment 09-02-12
!status work item 09-02-12
!status received 09-02-11
!priority Medium
!difficulty Easy
!subject New conversion package

!summary

A new child package of Ada.Strings is added to support conversions between
Wide_String/Wide_Wide_String and UTF_8/UTF_16 encoding.

!problem

SI99-0041 requires the adoption of UTF_16 for the encoding of program text in
ASIS. Similarly, many real-world applications use UTF-8 or UTF-16 encodings.
However, the Ada Standard provides no way to actually construct or use such
text strings.

It would be useful, for ASIS users but also for the Ada community at large,
to define a package that handles encoding/decoding between
Wide_String/Wide_Wide_String and UTF_8/UTF_16.

!proposal

(See summary.)

!wording

The following clause is added as A.4.10:

A.4.10 String encoding

The language-defined package Strings.Encoding provides facilities for encoding
and decoding strings in various character encoding schemes.

Static Semantics

The library package Strings.Encoding has the following declaration:

package Ada.Strings.Encoding is
   pragma Pure (Encoding);

   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE, UTF_16);
   subtype Short_Encoding is Encoding_Scheme range UTF_8 .. UTF_16LE;
   subtype Long_Encoding is Encoding_Scheme range UTF_16 ..
     UTF_16;

   function Encode (Item   : in Wide_String;
                    Scheme : in Short_Encoding := UTF_8)
      return String;
   function Encode (Item   : in Wide_Wide_String;
                    Scheme : in Short_Encoding := UTF_8)
      return String;

   function Decode (Item   : in String;
                    Scheme : in Short_Encoding := UTF_8)
      return Wide_String;
   function Decode (Item   : in String;
                    Scheme : in Short_Encoding := UTF_8)
      return Wide_Wide_String;

   function Encode (Item   : in Wide_Wide_String;
                    Scheme : in Long_Encoding := UTF_16)
      return Wide_String;
   function Decode (Item   : in Wide_String;
                    Scheme : in Long_Encoding := UTF_16)
      return Wide_Wide_String;

   Encoding_Error : exception;
end Ada.Strings.Encoding;

The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the
encoding scheme defined by RFC 3629; UTF_16BE corresponds to the UTF-16
encoding scheme of Unicode on 8 bits, big endian; UTF_16LE corresponds to the
UTF-16 encoding scheme of Unicode on 8 bits, little endian; and UTF_16
corresponds to the UTF-16 encoding scheme of Unicode on 16 bits.

The subtype Short_Encoding covers the values of Encoding_Scheme for 8-bit
encoding schemes, and the subtype Long_Encoding covers the values of
Encoding_Scheme for 16-bit encoding schemes.

Each of the Encode functions takes a Wide_String or Wide_Wide_String Item
parameter and returns a String (for a Short_Encoding scheme) or a Wide_String
(for a Long_Encoding scheme) whose characters have position values that
correspond to the encoding of the Item parameter according to the encoding
scheme specified by the Scheme parameter. For UTF_8, no overlong encoding
shall be returned. The lower bound of the returned string shall be 1.

Each of the Decode functions takes a String (for a Short_Encoding scheme) or
a Wide_String (for a Long_Encoding scheme) Item parameter, which is assumed
to contain characters whose position values correspond to a valid encoding
according to the encoding scheme specified by the Scheme parameter, and
returns the corresponding Wide_String or Wide_Wide_String.
The exception Encoding_Error is propagated if the input string does not
correspond to a valid encoding (including overlong encoding for UTF_8). The
lower bound of the returned string shall be 1.

Implementation Permission

An implementation is allowed to provide additional enumeration literals for
the type Encoding_Scheme for other character encoding schemes. Literals
corresponding to 8-bit encoding schemes shall belong to the Short_Encoding
subtype and literals corresponding to 16-bit encoding schemes shall belong to
the Long_Encoding subtype.

!discussion

Strictly speaking, an encoded text should be an array of bytes, not a String.
That would enforce strong typing to differentiate between an Ada String and
an encoded string. On the other hand, it is likely to be more of a burden
than a help to most casual users. Moreover, it would not allow ASIS program
text to be kept as a Wide_String.

Similar conversion functions are provided as part of xmlada and qtada. xmlada
provides much more sophisticated services, such as supporting conversions to
various ccs, converting in place in buffers, etc. However, it seems
reasonable to provide only basic functionality in the standard.

Should UTF_32BE and UTF_32LE be supported (as Short_Encoding)?
Should UTF_32 be supported (and Long_Long_Encoding added)?

!appendix

****************************************************************
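
As an illustration of the UTF_8 case described in the wording, the following
is a minimal sketch of what an Encode for Wide_Wide_String might do. It is
not part of the proposal: the names Encode_Sketch, Encode_UTF_8, Code_Point
and Append are hypothetical, Encoding_Error is a local stand-in for the
proposed exception, and raising it from an encoder for surrogates and for
values above 16#10_FFFF# is an assumption (the wording only specifies
Encoding_Error for Decode). Each character is encoded in the shortest form
allowed by RFC 3629, so no overlong encoding is produced, and the returned
slice has a lower bound of 1.

```ada
with Ada.Text_IO;

procedure Encode_Sketch is

   --  Hypothetical helper type, wide enough for Wide_Wide_Character'Pos.
   type Code_Point is range 0 .. 16#7FFF_FFFF#;

   --  Local stand-in for the proposed Ada.Strings.Encoding.Encoding_Error.
   Encoding_Error : exception;

   --  Hypothetical sketch of Encode (Item, Scheme => UTF_8).
   function Encode_UTF_8 (Item : Wide_Wide_String) return String is
      Result : String (1 .. 4 * Item'Length);  --  at most 4 bytes each
      Last   : Natural := 0;

      procedure Append (Byte : Code_Point) is
      begin
         Last := Last + 1;
         Result (Last) := Character'Val (Byte);
      end Append;

   begin
      for I in Item'Range loop
         declare
            Code : constant Code_Point :=
              Wide_Wide_Character'Pos (Item (I));
         begin
            if Code in 16#D800# .. 16#DFFF# or else Code > 16#10_FFFF# then
               --  Assumption: values RFC 3629 cannot represent are rejected.
               raise Encoding_Error;
            elsif Code < 16#80# then
               Append (Code);                               --  1 byte
            elsif Code < 16#800# then
               Append (16#C0# + Code / 16#40#);             --  2 bytes
               Append (16#80# + Code mod 16#40#);
            elsif Code < 16#1_0000# then
               Append (16#E0# + Code / 16#1000#);           --  3 bytes
               Append (16#80# + (Code / 16#40#) mod 16#40#);
               Append (16#80# + Code mod 16#40#);
            else
               Append (16#F0# + Code / 16#4_0000#);         --  4 bytes
               Append (16#80# + (Code / 16#1000#) mod 16#40#);
               Append (16#80# + (Code / 16#40#) mod 16#40#);
               Append (16#80# + Code mod 16#40#);
            end if;
         end;
      end loop;
      return Result (1 .. Last);  --  lower bound is 1
   end Encode_UTF_8;

   --  U+0041, U+00E9, U+20AC, and U+10348 (outside the BMP).
   Sample : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#41#),
      Wide_Wide_Character'Val (16#E9#),
      Wide_Wide_Character'Val (16#20AC#),
      Wide_Wide_Character'Val (16#1_0348#));

   Encoded : constant String := Encode_UTF_8 (Sample);

begin
   --  Print the position value of each encoded character in decimal.
   for I in Encoded'Range loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (Encoded (I))));
   end loop;
   Ada.Text_IO.New_Line;
end Encode_Sketch;
```

For the sample above the program prints the ten position values 65 195 169
226 130 172 240 144 141 136, that is, one, two, three and four bytes for the
four characters respectively.

****************************************************************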