Version 1.2 of ai05s/ai05-0137-1.txt


!standard A.4.10          09-03-03 AI05-0137-1/02
!class Amendment 09-02-12
!status work item 09-02-12
!status received 09-02-11
!priority Medium
!difficulty Easy
!subject New conversion package
A new child package of Ada.Strings is added to support conversions between Wide_String/Wide_Wide_String and UTF_8/UTF_16 encoding.
SI99-0041 requires the adoption of UTF_16 for the encoding of program text in ASIS. Similarly, many real-world applications use UTF-8 or UTF-16 encodings. However, the Ada Standard provides no way to actually construct or use such text strings.
Defining a package that handles encoding and decoding between Wide_String/Wide_Wide_String and UTF_8/UTF_16 would be useful to ASIS users and to the Ada community at large.
(See summary.)
The following clause is added as A.4.10:
A.4.10 String encoding

The language-defined package Strings.Encoding provides facilities for encoding and decoding strings in various character encoding schemes.
Static Semantics
The library package Strings.Encoding has the following declaration:
   package Ada.Strings.Encoding is
      pragma Pure (Encoding);

      type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE, UTF_16);
      subtype Short_Encoding is Encoding_Scheme range UTF_8  .. UTF_16LE;
      subtype Long_Encoding  is Encoding_Scheme range UTF_16 .. UTF_16;

      BOM_8    : constant String := Character'Val (16#EF#) &
                                    Character'Val (16#BB#) &
                                    Character'Val (16#BF#);
      BOM_16BE : constant String := Character'Val (16#FE#) &
                                    Character'Val (16#FF#);
      BOM_16LE : constant String := Character'Val (16#FF#) &
                                    Character'Val (16#FE#);

      BOM_16   : constant Wide_String := (1 => Wide_Character'Val (16#FEFF#));

      function Encode (Item   : in Wide_String;
                       Scheme : in Short_Encoding := UTF_8)
         return String;
      function Encode (Item   : in Wide_Wide_String;
                       Scheme : in Short_Encoding := UTF_8)
         return String;
      function Decode (Item   : in String;
                       Scheme : in Short_Encoding := UTF_8)
         return Wide_String;
      function Decode (Item   : in String;
                       Scheme : in Short_Encoding := UTF_8)
         return Wide_Wide_String;

      function Encode (Item   : in Wide_Wide_String;
                       Scheme : in Long_Encoding := UTF_16)
         return Wide_String;
      function Decode (Item   : in Wide_String;
                       Scheme : in Long_Encoding := UTF_16)
         return Wide_Wide_String;

      Encoding_Error : exception;
   end Ada.Strings.Encoding;
The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the encoding scheme defined by RFC 3629; UTF_16BE corresponds to the UTF-16 encoding scheme of Unicode on 8 bits, big endian; UTF_16LE corresponds to the UTF-16 encoding scheme of Unicode on 8 bits, little endian; and UTF_16 corresponds to the UTF-16 encoding scheme of Unicode on 16 bits.
The subtype Short_Encoding covers the values of Encoding_Scheme for 8-bit encoding schemes, and the subtype Long_Encoding covers the values of Encoding_Scheme for 16-bit encoding schemes.
Each of the Encode functions takes a Wide_String (respectively Wide_Wide_String) Item parameter and returns a String (respectively Wide_String) whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme specified by the Scheme parameter. For UTF_8, no overlong encoding shall be returned. The lower bound of the returned string shall be 1.
Each of the Decode functions takes a String (respectively Wide_String) Item parameter which is assumed to contain characters whose position values correspond to a valid encoding according to the encoding scheme specified by the Scheme parameter, and returns the corresponding Wide_String (respectively Wide_Wide_String). The exception Encoding_Error is propagated if the input string does not correspond to a valid encoding (including overlong encoding for UTF_8). The lower bound of the returned string shall be 1.
For each of the Decode functions whose Scheme parameter is of the Short_Encoding subtype, if the Item parameter starts with one of the BOM sequences then:
  - if this sequence identifies the same encoding as specified by the Scheme parameter, the sequence is ignored;
  - otherwise, Encoding_Error is raised.
The Encode functions shall not put BOM sequences in the result.
Implementation permission
An implementation is allowed to provide additional enumeration literals to the type Encoding_Scheme for other character encoding schemes. Literals corresponding to 8-bit encoding schemes shall belong to the Short_Encoding subtype and literals corresponding to 16-bit encoding schemes shall belong to the Long_Encoding subtype.
If a BOM is defined for an implementation-defined encoding scheme, a corresponding constant shall be added.
Reminder of the issues:
A character set is a set of abstract characters. An encoding assigns an integer value to each character; this value is called the code-point of the character. Normally, a character string would be represented as a sequence of code-points; however, this would waste a lot of space, since ISO 10646 defines 32-bit code-points. An encoding scheme is a more economical representation of a string of characters. Typically, an encoding scheme uses a sequence of integer values, where each code-point is represented by one or several consecutive values. UTF-8 is an encoding scheme that uses 8-bit values. In some cases, the UTF-8 bit patterns allow several possible encodings for a code-point; the shortest one should then be used, and the others are called overlong encodings. UTF-16 uses 16-bit values. UTF-32 uses 32-bit values, which is of little interest since nothing is gained compared to UCS-4 (raw encoding).
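As an illustration of the shortest-form rule, here is a minimal sketch of how one code-point could be turned into UTF-8 values. The names To_UTF_8, Tail and Show are hypothetical helpers introduced only for this example; they are not part of the proposed package, and the sketch raises Constraint_Error (rather than Encoding_Error) so that it stays self-contained.

```ada
with Ada.Text_IO;

procedure UTF_8_Sketch is

   --  Hypothetical helper, not part of the proposed package: returns the
   --  shortest UTF-8 form of one code point, following RFC 3629.
   function To_UTF_8 (Code : Natural) return String is

      --  Continuation value 2#10xxxxxx# carrying six bits of Code.
      function Tail (Shift : Natural) return Character is
      begin
         return Character'Val (16#80# + (Code / 2 ** Shift) mod 64);
      end Tail;

   begin
      if Code in 16#D800# .. 16#DFFF# or else Code > 16#10_FFFF# then
         raise Constraint_Error;  --  surrogates and values past U+10FFFF
      elsif Code < 16#80# then
         return (1 => Character'Val (Code));
      elsif Code < 16#800# then
         return Character'Val (16#C0# + Code / 2 ** 6) & Tail (0);
      elsif Code < 16#1_0000# then
         return Character'Val (16#E0# + Code / 2 ** 12)
           & Tail (6) & Tail (0);
      else
         return Character'Val (16#F0# + Code / 2 ** 18)
           & Tail (12) & Tail (6) & Tail (0);
      end if;
   end To_UTF_8;

   --  Print the position value of each element of an encoded string.
   procedure Show (Encoded : String) is
   begin
      for I in Encoded'Range loop
         Ada.Text_IO.Put (Integer'Image (Character'Pos (Encoded (I))));
      end loop;
      Ada.Text_IO.New_Line;
   end Show;

begin
   Show (To_UTF_8 (16#2F#));    --  '/'    : prints " 47"
   Show (To_UTF_8 (16#E9#));    --  U+00E9 : prints " 195 169"
   Show (To_UTF_8 (16#20AC#));  --  U+20AC : prints " 226 130 172"
end UTF_8_Sketch;
```

Because the ranges are tested from the smallest upward, the sketch never produces an overlong form: '/' is always emitted as the single value 16#2F#, never as the two-value sequence 16#C0# 16#AF#.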
There is no problem when using a String to encode UTF-8, or a Wide_String to encode UTF-16. However, it is sometimes useful to encode/decode a UTF-16 (or even UTF-32) encoded text into/from a String; in that case, characters must be paired to form 16-bit values (or grouped by four to form 32-bit values). This can be done in two ways, Big-Endian (high order character first) or Little-Endian (low order character first). A special value, called BOM (Byte Order Mark, 16#FEFF#), can be used at the beginning of an encoded text (with 4 leading zeroes for UTF-32). The BOM corresponds to no code-point, and is discarded when decoding, but it is used to recognize whether a stream of bytes is Big-Endian or Little-Endian UTF-16 or UTF-32. By extension, the sequence 16#EF# 16#BB# 16#BF# can be used as BOM to identify UTF-8 text (although there is no byte order issue in UTF-8; actually, use of BOM for UTF-8 is discouraged).
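The pairing of characters described above can be sketched as follows. The type Byte_Order and the helpers To_Bytes and Show are hypothetical names used only for this illustration; the proposal itself selects the byte order through the UTF_16BE and UTF_16LE literals.

```ada
with Ada.Text_IO;

procedure Byte_Order_Sketch is

   --  Hypothetical type and helper, introduced only for this sketch.
   type Byte_Order is (High_First, Low_First);

   --  Split each 16-bit value of Item into two Characters.
   function To_Bytes
     (Item : Wide_String; Order : Byte_Order) return String
   is
      Result : String (1 .. 2 * Item'Length);
      Next   : Positive := 1;
      Value  : Natural;
      High   : Character;
      Low    : Character;
   begin
      for I in Item'Range loop
         Value := Wide_Character'Pos (Item (I));
         High  := Character'Val (Value / 256);
         Low   := Character'Val (Value mod 256);
         if Order = High_First then
            Result (Next .. Next + 1) := High & Low;  --  Big-Endian
         else
            Result (Next .. Next + 1) := Low & High;  --  Little-Endian
         end if;
         Next := Next + 2;
      end loop;
      return Result;
   end To_Bytes;

   BOM_16 : constant Wide_String := (1 => Wide_Character'Val (16#FEFF#));

   --  Print the position value of each element of an encoded string.
   procedure Show (Encoded : String) is
   begin
      for I in Encoded'Range loop
         Ada.Text_IO.Put (Integer'Image (Character'Pos (Encoded (I))));
      end loop;
      Ada.Text_IO.New_Line;
   end Show;

begin
   Show (To_Bytes (BOM_16, High_First));  --  prints " 254 255"
   Show (To_Bytes (BOM_16, Low_First));   --  prints " 255 254"
end Byte_Order_Sketch;
```

The two results are the values 16#FE# 16#FF# and 16#FF# 16#FE#, which is why a reader of a byte stream can tell the two orders apart from the first two bytes.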
Note that UTF-8 encoding could be used for file names that include characters that are not in ASCII. This package would allow adding an Implementation Advice (to Text_IO, Sequential_IO, and so on) to the effect that it is recommended to support file names encoded in UTF-8.
Implementation choices:
Strictly speaking, an encoded text should be an array of bytes, not of (wide_)characters. This proposal uses (wide_)string, but the encoding is defined in terms of position values of characters rather than characters themselves. It could be argued that it should be defined in terms of internal representation of characters, but we know that they are the same as the position values for (Wide_)Character.
We chose to have a parameter to specify the encoding rather than providing one function for each encoding scheme, because it makes things easier for the user when the encoding scheme is recognized dynamically. It also makes it easier to provide implementation-defined encoding schemes.
There are many other possible encoding schemes, including UTF-EBCDIC, Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most useful ones, while leaving the possibility (through an implementation permission) to provide others.
When reading a file, a BOM can be expected as starting the first line of the file, but not subsequent lines. The proposed handling of BOM assumes the following pattern:
1) Read the first line. Check (with the help of provided constants) if the first line starts with a BOM. If yes, initialize the encoding scheme accordingly; if no, choose a default encoding scheme.
2) Decode all lines (including the first one) with the chosen encoding scheme. Since the BOM is ignored by Decode functions, it is not necessary to slice the first line specially.
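Step 1 of this pattern could look like the sketch below. To keep it compilable without the proposed package, the 8-bit schemes and BOM constants are redeclared locally; the type name Scheme and the functions Starts_With and Scheme_Of are hypothetical and appear nowhere in the proposal.

```ada
with Ada.Text_IO;

procedure BOM_Sketch is

   --  The 8-bit schemes of the proposal, redeclared locally so that the
   --  sketch compiles without the proposed package.
   type Scheme is (UTF_8, UTF_16BE, UTF_16LE);

   BOM_8    : constant String :=
     Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#);
   BOM_16BE : constant String :=
     Character'Val (16#FE#) & Character'Val (16#FF#);
   BOM_16LE : constant String :=
     Character'Val (16#FF#) & Character'Val (16#FE#);

   --  True if Line begins with Prefix.
   function Starts_With (Line, Prefix : String) return Boolean is
   begin
      return Line'Length >= Prefix'Length
        and then Line (Line'First .. Line'First + Prefix'Length - 1) = Prefix;
   end Starts_With;

   --  Hypothetical helper for step 1: pick the scheme announced by a BOM,
   --  or fall back to Default when the line starts with none.
   function Scheme_Of
     (First_Line : String; Default : Scheme := UTF_8) return Scheme is
   begin
      if Starts_With (First_Line, BOM_8) then
         return UTF_8;
      elsif Starts_With (First_Line, BOM_16BE) then
         return UTF_16BE;
      elsif Starts_With (First_Line, BOM_16LE) then
         return UTF_16LE;
      else
         return Default;
      end if;
   end Scheme_Of;

begin
   --  A little-endian "A" preceded by its BOM: prints "UTF_16LE".
   Ada.Text_IO.Put_Line
     (Scheme'Image (Scheme_Of (BOM_16LE & "A" & Character'Val (0))));
   --  No BOM: the default is used, prints "UTF_8".
   Ada.Text_IO.Put_Line (Scheme'Image (Scheme_Of ("plain text")));
end BOM_Sketch;
```

With the proposed package, step 2 would then pass every line, including the first one, to Decode with the scheme found here, relying on Decode to skip the matching BOM.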
For encoding, it does not seem useful to have the BOM handled by the encoding functions, since it is easy to catenate the appropriate constant.
Open issues:
Would it be better to have longer names (e.g. UTF_16_Big_Endian) for encoding schemes?
Should UTF_32BE and UTF_32LE be supported (as Short_Encoding, together with the corresponding BOM)?
Should UTF_32 be supported (and Long_Long_Encoding added, together with the corresponding BOM)?
Is it useful to support UTF_8 BOM, since its use is discouraged?
Should overlong UTF_8 sequences be accepted on input, rather than raising Encoding_Error?
Alternative designs:
Short_Encoding and Long_Encoding could be different types rather than subtypes of a same type.
Arrays of Unsigned_8 or Unsigned_16 could be used in place of (Wide_)String. That would enforce strong typing to differentiate between an Ada String and an encoded string. On the other hand, it is likely to be more of a burden than a help to most casual users. Moreover, it would not allow ASIS program text to be kept as a Wide_String.
Existing similar packages:
Similar conversion functions are provided as part of xmlada and qtada. xmlada provides much more sophisticated services, such as supporting conversions to various ccs, converting in place in buffers, etc. However, it seems reasonable to provide only basic functionalities in the standard.
GNAT provides the package System.WCh_Con, but it converts only individual characters (not strings), does not support UTF-8, and is provided as generics that require a user-provided input/output formal function. Although more general, this solution would be too heavy-weight for the casual user.

