Version 1.1 of ai05s/ai05-0137-1.txt
!standard A.4.10 09-02-12 AI05-0137-1/01
!class Amendment 09-02-12
!status work item 09-02-12
!status received 09-02-11
!priority Medium
!difficulty Easy
!subject New conversion package
!summary
A new child package of Ada.Strings is added to support conversions
between Wide_String/Wide_Wide_String and UTF_8/UTF_16 encoding.
!problem
SI99-0041 requires the adoption of UTF_16 for the encoding of program
text in ASIS. Similarly, many real-world applications use UTF-8 or
UTF-16 encodings. However, the Ada Standard provides no way to actually
construct or use such text strings.
It would be useful, not only for ASIS users but also for the Ada
community at large, to define a package that handles encoding and
decoding between Wide_String/Wide_Wide_String and UTF_8/UTF_16.
!proposal
(See summary.)
!wording
The following clause is added as A.4.10:
A.4.10 String encoding
The language-defined package Strings.Encoding provides facilities for
encoding and decoding strings in various character encoding schemes.
Static Semantics
The library package Strings.Encoding has the following declaration:
package Ada.Strings.Encoding is
   pragma Pure (Encoding);

   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE, UTF_16);
   subtype Short_Encoding is Encoding_Scheme range UTF_8  .. UTF_16LE;
   subtype Long_Encoding  is Encoding_Scheme range UTF_16 .. UTF_16;

   function Encode (Item   : in Wide_String;
                    Scheme : in Short_Encoding := UTF_8)
      return String;
   function Encode (Item   : in Wide_Wide_String;
                    Scheme : in Short_Encoding := UTF_8)
      return String;
   function Decode (Item   : in String;
                    Scheme : in Short_Encoding := UTF_8)
      return Wide_String;
   function Decode (Item   : in String;
                    Scheme : in Short_Encoding := UTF_8)
      return Wide_Wide_String;

   function Encode (Item   : in Wide_Wide_String;
                    Scheme : in Long_Encoding := UTF_16)
      return Wide_String;
   function Decode (Item   : in Wide_String;
                    Scheme : in Long_Encoding := UTF_16)
      return Wide_Wide_String;

   Encoding_Error : exception;
end Ada.Strings.Encoding;
The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds
to the UTF-8 encoding scheme defined by RFC 3629; UTF_16BE corresponds
to the UTF-16 encoding scheme of Unicode serialized as 8-bit values in
big-endian order; UTF_16LE corresponds to the UTF-16 encoding scheme of
Unicode serialized as 8-bit values in little-endian order; and UTF_16
corresponds to the UTF-16 encoding scheme of Unicode expressed as
16-bit values.
The subtype Short_Encoding covers the values of Encoding_Scheme for
8-bit encoding schemes, and the subtype Long_Encoding covers the values
of Encoding_Scheme for 16-bit encoding schemes.
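As an illustration only (not part of the proposed wording), the
following sketch lays out U+20AC and U+1D11E under the UTF_16,
UTF_16BE and UTF_16LE schemes. To_UTF_16 and To_Bytes are hypothetical
helpers introduced for this example; they are not operations of
Strings.Encoding.

```ada
with Ada.Text_IO;

procedure UTF_16_Layout_Sketch is

   --  Hypothetical helper: UTF-16 code units of a single code point,
   --  i.e. what the UTF_16 scheme would store in a Wide_String.
   function To_UTF_16 (Code : Natural) return Wide_String is
      Offset : Natural;
   begin
      if Code < 16#1_0000# then
         return (1 => Wide_Character'Val (Code));
      end if;
      --  Supplementary character: split into a surrogate pair.
      Offset := Code - 16#1_0000#;
      return (1 => Wide_Character'Val (16#D800# + Offset / 16#400#),
              2 => Wide_Character'Val (16#DC00# + Offset mod 16#400#));
   end To_UTF_16;

   --  Hypothetical helper: serialize 16-bit code units as 8-bit
   --  characters, as UTF_16BE and UTF_16LE would do.
   function To_Bytes (Units      : Wide_String;
                      Big_Endian : Boolean) return String
   is
      Result : String (1 .. 2 * Units'Length);
      Pos    : Natural := 0;
   begin
      for I in Units'Range loop
         declare
            U  : constant Natural   := Wide_Character'Pos (Units (I));
            Hi : constant Character := Character'Val (U / 256);
            Lo : constant Character := Character'Val (U mod 256);
         begin
            if Big_Endian then
               Result (Pos + 1) := Hi;
               Result (Pos + 2) := Lo;
            else
               Result (Pos + 1) := Lo;
               Result (Pos + 2) := Hi;
            end if;
            Pos := Pos + 2;
         end;
      end loop;
      return Result;
   end To_Bytes;

   Euro : constant Wide_String := To_UTF_16 (16#20AC#);
   Clef : constant Wide_String := To_UTF_16 (16#1_D11E#);

   Euro_BE : constant String :=
     (Character'Val (16#20#), Character'Val (16#AC#));
   Euro_LE : constant String :=
     (Character'Val (16#AC#), Character'Val (16#20#));
   Clef_BE : constant String :=
     (Character'Val (16#D8#), Character'Val (16#34#),
      Character'Val (16#DD#), Character'Val (16#1E#));
begin
   --  One code unit for U+20AC, a surrogate pair for U+1D11E.
   if Euro'Length /= 1 or else Clef'Length /= 2 then
      raise Program_Error;
   end if;
   --  Same code units, different byte order.
   if To_Bytes (Euro, Big_Endian => True) /= Euro_BE
     or else To_Bytes (Euro, Big_Endian => False) /= Euro_LE
     or else To_Bytes (Clef, Big_Endian => True) /= Clef_BE
   then
      raise Program_Error;
   end if;
   Ada.Text_IO.Put_Line ("UTF-16 layout checks passed");
end UTF_16_Layout_Sketch;
```

The point of the sketch is that UTF_16BE and UTF_16LE carry the same
code units as UTF_16 and differ only in how each unit is split into
two String characters.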
Each of the Encode functions takes a Wide_String or Wide_Wide_String
Item parameter and returns a String (for a Short_Encoding scheme) or a
Wide_String (for a Long_Encoding scheme) whose characters have position
values that correspond to the encoding of the Item parameter according
to the encoding scheme specified by the Scheme parameter. For UTF_8, no
overlong encoding shall be returned. The lower bound of the returned
string shall be 1.
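To make the "no overlong encoding" rule concrete, here is a minimal
sketch of what Encode for UTF_8 might do with a Wide_Wide_String. It is
illustrative only: Encode_UTF_8 is a hypothetical stand-in, the local
Encoding_Error merely mirrors the name used in the package, and raising
it for surrogates or out-of-range values is an assumption of this
sketch (the wording above does not say what Encode does in that case).

```ada
with Ada.Text_IO;

procedure UTF_8_Encode_Sketch is

   Encoding_Error : exception;  --  local stand-in, see lead-in

   --  Hypothetical stand-in for Encode (Item, Scheme => UTF_8).
   --  Each code point gets the shortest sequence that can hold it,
   --  so no overlong form is ever produced.
   function Encode_UTF_8 (Item : Wide_Wide_String) return String is
      Result : String (1 .. 4 * Item'Length);  --  worst case
      Last   : Natural := 0;

      procedure Put (Value : Natural) is
      begin
         Last := Last + 1;
         Result (Last) := Character'Val (Value);
      end Put;
   begin
      for I in Item'Range loop
         declare
            Code : constant Natural := Wide_Wide_Character'Pos (Item (I));
         begin
            if Code < 16#80# then
               Put (Code);
            elsif Code < 16#800# then
               Put (16#C0# + Code / 16#40#);
               Put (16#80# + Code mod 16#40#);
            elsif Code in 16#D800# .. 16#DFFF# then
               raise Encoding_Error;  --  assumption: reject surrogates
            elsif Code < 16#1_0000# then
               Put (16#E0# + Code / 16#1000#);
               Put (16#80# + (Code / 16#40#) mod 16#40#);
               Put (16#80# + Code mod 16#40#);
            elsif Code <= 16#10_FFFF# then
               Put (16#F0# + Code / 16#4_0000#);
               Put (16#80# + (Code / 16#1000#) mod 16#40#);
               Put (16#80# + (Code / 16#40#) mod 16#40#);
               Put (16#80# + Code mod 16#40#);
            else
               raise Encoding_Error;  --  assumption: beyond Unicode
            end if;
         end;
      end loop;
      return Result (1 .. Last);  --  lower bound is 1
   end Encode_UTF_8;

   Input : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#41#),      --  'A'
      Wide_Wide_Character'Val (16#E9#),      --  U+00E9
      Wide_Wide_Character'Val (16#20AC#));   --  U+20AC
   Expected : constant String :=
     (Character'Val (16#41#),
      Character'Val (16#C3#), Character'Val (16#A9#),
      Character'Val (16#E2#), Character'Val (16#82#),
      Character'Val (16#AC#));
   Encoded : constant String := Encode_UTF_8 (Input);
begin
   if Encoded /= Expected or else Encoded'First /= 1 then
      raise Program_Error;
   end if;
   Ada.Text_IO.Put_Line ("UTF-8 encoding checks passed");
end UTF_8_Encode_Sketch;
```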
Each of the Decode functions takes a String or Wide_String Item
parameter, which is assumed to contain characters whose position
values correspond to a valid encoding according to the encoding scheme
specified by the Scheme parameter, and returns the corresponding
Wide_String or Wide_Wide_String (a String Item can be decoded to
either; a Wide_String Item is decoded to a Wide_Wide_String). The
exception Encoding_Error is propagated if the input string does not
correspond to a valid encoding (including an overlong encoding for
UTF_8). The lower bound of the returned string shall be 1.
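The overlong check on decoding can be sketched as follows. This is a
minimal illustration under stated assumptions, not the proposed
implementation: Decode_UTF_8 is a hypothetical stand-in for Decode with
Scheme => UTF_8 returning Wide_Wide_String, and Encoding_Error is
declared locally to mirror the package's exception. Rejecting decoded
surrogates and values above 16#10FFFF# follows RFC 3629.

```ada
with Ada.Text_IO;

procedure UTF_8_Decode_Sketch is

   Encoding_Error : exception;  --  local stand-in, see lead-in

   function Decode_UTF_8 (Item : String) return Wide_Wide_String is
      Result : Wide_Wide_String (1 .. Item'Length);  --  worst case
      Last   : Natural := 0;
      Pos    : Natural := Item'First;

      --  Consume one continuation byte (10xxxxxx), return its payload.
      function Continuation return Natural is
         B : Natural;
      begin
         if Pos > Item'Last then
            raise Encoding_Error;  --  truncated sequence
         end if;
         B := Character'Pos (Item (Pos));
         if B / 16#40# /= 2 then
            raise Encoding_Error;  --  not a continuation byte
         end if;
         Pos := Pos + 1;
         return B mod 16#40#;
      end Continuation;
   begin
      while Pos <= Item'Last loop
         declare
            Lead  : constant Natural := Character'Pos (Item (Pos));
            Code  : Natural;
            Extra : Natural;  --  number of continuation bytes
            Min   : Natural;  --  smallest value allowed at this length
         begin
            Pos := Pos + 1;
            if Lead < 16#80# then
               Code := Lead;           Extra := 0; Min := 0;
            elsif Lead in 16#C0# .. 16#DF# then
               Code := Lead - 16#C0#;  Extra := 1; Min := 16#80#;
            elsif Lead in 16#E0# .. 16#EF# then
               Code := Lead - 16#E0#;  Extra := 2; Min := 16#800#;
            elsif Lead in 16#F0# .. 16#F7# then
               Code := Lead - 16#F0#;  Extra := 3; Min := 16#1_0000#;
            else
               raise Encoding_Error;   --  invalid lead byte
            end if;
            for K in 1 .. Extra loop
               Code := Code * 16#40# + Continuation;
            end loop;
            --  A value below Min means a shorter form existed: overlong.
            if Code < Min
              or else Code > 16#10_FFFF#
              or else Code in 16#D800# .. 16#DFFF#
            then
               raise Encoding_Error;
            end if;
            Last := Last + 1;
            Result (Last) := Wide_Wide_Character'Val (Code);
         end;
      end loop;
      return Result (1 .. Last);  --  lower bound is 1
   end Decode_UTF_8;

   Valid : constant String :=
     (Character'Val (16#C3#), Character'Val (16#A9#));  --  U+00E9
   Overlong : constant String :=
     (Character'Val (16#C0#), Character'Val (16#80#));  --  overlong NUL
   Expected : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#E9#));
   Overlong_Rejected : Boolean := False;
begin
   if Decode_UTF_8 (Valid) /= Expected then
      raise Program_Error;
   end if;
   begin
      declare
         Ignored : constant Wide_Wide_String := Decode_UTF_8 (Overlong);
      begin
         null;
      end;
   exception
      when Encoding_Error =>
         Overlong_Rejected := True;
   end;
   if not Overlong_Rejected then
      raise Program_Error;
   end if;
   Ada.Text_IO.Put_Line ("UTF-8 decoding checks passed");
end UTF_8_Decode_Sketch;
```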
Implementation Permissions
An implementation is allowed to add enumeration literals to the type
Encoding_Scheme for other character encoding schemes. Literals
corresponding to 8-bit encoding schemes shall belong to the
Short_Encoding subtype, and literals corresponding to 16-bit encoding
schemes shall belong to the Long_Encoding subtype.
!discussion
Strictly speaking, encoded text should be an array of bytes, not a
String. That would enforce strong typing to differentiate between an
Ada String and an encoded string. On the other hand, it is likely to
be more of a burden than a help to most casual users. Moreover, it
would not allow ASIS program text to be kept as a Wide_String.
Similar conversion functions are provided as part of xmlada and qtada.
xmlada provides much more sophisticated services, such as conversions
to various coded character sets and in-place conversion in buffers.
However, it seems reasonable to provide only basic functionality in
the standard.
Should UTF_32BE and UTF_32LE be supported (as Short_Encoding)?
Should UTF_32 be supported (and Long_Long_Encoding added)?
!appendix
****************************************************************