Version 1.6 of ai05s/ai05-0137-2.txt

!standard A.4.11          10-10-19 AI05-0137-2/03
!class Amendment 10-05-07
!status Amendment 2012 10-05-07
!status ARG Approved 6-0-3 10-06-20
!status work item 10-05-07
!status received 10-03-15
!priority Medium
!difficulty Easy
!subject String encoding packages
!summary
New child packages of Ada.Strings are added to support conversions between String/Wide_String/Wide_Wide_String and UTF_8/UTF_16 encoding. These packages are intended to replace the already approved versions from AI05-0137-1.
!problem
SI99-0041 requires the adoption of UTF_16 for the encoding of program text in ASIS. Similarly, many real-world applications use UTF-8 or UTF-16 encodings. However, the Ada Standard provides no way to actually construct or use such text strings.
It would be useful, both for ASIS users and for the Ada community at large, to define a package to handle encoding/decoding between String/Wide_String/Wide_Wide_String and UTF_8/UTF_16.
!proposal
(See summary.)
!wording
The following clause is added as A.4.11:
A.4.11 String encoding
Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
The encoding library packages have the following declarations:
    package Ada.Strings.UTF_Encoding is
       pragma Pure (UTF_Encoding);

       -- Declarations common to the string encoding packages
       type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

       subtype UTF_String is String;
       subtype UTF_8_String is String;
       subtype UTF_16_Wide_String is Wide_String;

       Encoding_Error : exception;

       BOM_8    : constant UTF_8_String :=
                    Character'Val (16#EF#) &
                    Character'Val (16#BB#) &
                    Character'Val (16#BF#);
       BOM_16BE : constant UTF_String :=
                    Character'Val (16#FE#) &
                    Character'Val (16#FF#);
       BOM_16LE : constant UTF_String :=
                    Character'Val (16#FF#) &
                    Character'Val (16#FE#);
       BOM_16   : constant UTF_16_Wide_String :=
                    (1 => Wide_Character'Val (16#FEFF#));

       function Encoding (Item    : UTF_String;
                          Default : Encoding_Scheme := UTF_8)
          return Encoding_Scheme;

    end Ada.Strings.UTF_Encoding;

    package Ada.Strings.UTF_Encoding.Conversions is
       pragma Pure (Conversions);

       -- Conversions between various encoding schemes
       function Convert (Item          : UTF_String;
                         Input_Scheme  : Encoding_Scheme;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False) return UTF_String;

       function Convert (Item         : UTF_String;
                         Input_Scheme : Encoding_Scheme;
                         Output_BOM   : Boolean := False)
          return UTF_16_Wide_String;

       function Convert (Item       : UTF_8_String;
                         Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Convert (Item          : UTF_16_Wide_String;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False) return UTF_String;

       function Convert (Item       : UTF_16_Wide_String;
                         Output_BOM : Boolean := False) return UTF_8_String;

    end Ada.Strings.UTF_Encoding.Conversions;

    package Ada.Strings.UTF_Encoding.Strings is
       pragma Pure (Strings);

       -- Encoding / decoding between String and various encoding schemes
       function Encode (Item          : String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

       function Encode (Item       : String;
                        Output_BOM : Boolean := False) return UTF_8_String;

       function Encode (Item       : String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme) return String;

       function Decode (Item : UTF_8_String) return String;

       function Decode (Item : UTF_16_Wide_String) return String;

    end Ada.Strings.UTF_Encoding.Strings;

    package Ada.Strings.UTF_Encoding.Wide_Strings is
       pragma Pure (Wide_Strings);

       -- Encoding / decoding between Wide_String and various encoding schemes
       function Encode (Item          : Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

       function Encode (Item       : Wide_String;
                        Output_BOM : Boolean := False) return UTF_8_String;

       function Encode (Item       : Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme) return Wide_String;

       function Decode (Item : UTF_8_String) return Wide_String;

       function Decode (Item : UTF_16_Wide_String) return Wide_String;

    end Ada.Strings.UTF_Encoding.Wide_Strings;

    package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
       pragma Pure (Wide_Wide_Strings);

       -- Encoding / decoding between Wide_Wide_String and various encoding schemes
       function Encode (Item          : Wide_Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

       function Encode (Item       : Wide_Wide_String;
                        Output_BOM : Boolean := False) return UTF_8_String;

       function Encode (Item       : Wide_Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme) return Wide_Wide_String;

       function Decode (Item : UTF_8_String) return Wide_Wide_String;

       function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;

    end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 8 bits, big endian; and UTF_16LE corresponds to the same UTF-16 encoding scheme stored in 8 bits, little endian.
The subtype UTF_String is used to represent a String of 8-bit values containing a sequence of values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent a String of 8-bit values containing a sequence of values encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent a Wide_String of 16-bit values containing a sequence of values encoded in UTF-16.
The BOM_8, BOM_16BE, BOM_16LE and BOM_16 constants correspond to values used at the start of a string to indicate the encoding.
For all Convert and Decode functions, an initial BOM in the input that matches the expected encoding scheme is ignored, and a different initial BOM causes Encoding_Error to be propagated.
For all Convert and Encode functions, a BOM is included at the start of the output string if the Output_BOM parameter is set to True.
The exception Encoding_Error is also propagated in the following situations:
* By a Decode function when a UTF encoded string contains an invalid encoding sequence.
* By a Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.
* By a Decode function yielding a String when the decoding of a sequence results in a code-point whose value exceeds 16#FF#.
* By a Decode function yielding a Wide_String when the decoding of a sequence results in a code-point whose value exceeds 16#FFFF#.
* By an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes.
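As an illustration only (this sketch is not part of the proposed wording), the third situation above could be observed as follows, assuming an implementation that provides the packages exactly as declared in this clause. The names Show_Encoding_Error, Euro, Str_Enc and W_Enc are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Encoding_Error is
   package UTF     renames Ada.Strings.UTF_Encoding;
   package Str_Enc renames Ada.Strings.UTF_Encoding.Strings;
   package W_Enc   renames Ada.Strings.UTF_Encoding.Wide_Strings;

   -- UTF-8 encoding of code point 16#20AC# (three 8-bit values)
   Euro : constant UTF.UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);
begin
   -- 16#20AC# fits in a Wide_Character, so this call is expected to succeed.
   declare
      W : constant Wide_String := W_Enc.Decode (Euro);
   begin
      Ada.Text_IO.Put_Line
        ("Wide_String result length:" & Natural'Image (W'Length));
   end;

   -- 16#20AC# exceeds 16#FF#, so Encoding_Error is expected here.
   declare
      S : constant String := Str_Enc.Decode (Euro);
   begin
      Ada.Text_IO.Put_Line
        ("Not expected: decoded" & Natural'Image (S'Length) & " characters");
   end;
exception
   when UTF.Encoding_Error =>
      Ada.Text_IO.Put_Line ("Encoding_Error propagated by Decode to String");
end Show_Encoding_Error;
```

The same input is acceptable to one Decode function and not to another; only the result type differs.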
Each of the Convert and Encode functions returns a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) value whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme required by the function or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned. The lower bound of the returned string is 1.
Each of the Decode functions takes a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) Item parameter which is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter, and returns the corresponding String, Wide_String, or Wide_Wide_String value. The lower bound of the returned string is 1.
function Encoding (Item : UTF_String; Default : Encoding_Scheme := UTF_8) return Encoding_Scheme;
Inspects a UTF_String value to determine whether it starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding to the BOM; returns the value of Default otherwise.
function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Converts from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and generates an output encoded in one of these three schemes as specified by Output_Scheme.
function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Converts from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and generates an output encoded in UTF-16.
function Convert (Item : UTF_8_String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Converts from input encoded in UTF-8 and generates an output encoded in UTF-16.
function Convert (Item : UTF_16_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Converts from input encoded in UTF-16 and generates an output encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
function Convert (Item : UTF_16_Wide_String; Output_BOM : Boolean := False) return UTF_8_String;
Converts from input encoded in UTF-16 and generates an output encoded in UTF-8.
function Encode (Item : String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Encodes from String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.
function Encode (Item : String; Output_BOM : Boolean := False) return UTF_8_String;
Encodes from String input, and generates an output encoded in UTF-8 encoding.
function Encode (Item : String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Encodes from String input, and generates an output encoded in UTF-16 encoding.
function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return String;
Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding String value.
function Decode (Item : UTF_8_String) return String;
Decodes from input encoded in UTF-8, and returns the corresponding String value.
function Decode (Item : UTF_16_Wide_String) return String;
Decodes from input encoded in UTF-16, and returns the corresponding String value.
function Encode (Item : Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Encodes from Wide_String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.
function Encode (Item : Wide_String; Output_BOM : Boolean := False) return UTF_8_String;
Encodes from Wide_String input, and generates an output encoded in UTF-8 encoding.
function Encode (Item : Wide_String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Encodes from Wide_String input, and generates an output encoded in UTF-16 encoding.
function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return Wide_String;
Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding Wide_String value.
function Decode (Item : UTF_8_String) return Wide_String;
Decodes from input encoded in UTF-8, and returns the corresponding Wide_String value.
function Decode (Item : UTF_16_Wide_String) return Wide_String;
Decodes from input encoded in UTF-16, and returns the corresponding Wide_String value.
function Encode (Item : Wide_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Encodes from Wide_Wide_String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.
function Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) return UTF_8_String;
Encodes from Wide_Wide_String input, and generates an output encoded in UTF-8 encoding.
function Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Encodes from Wide_Wide_String input, and generates an output encoded in UTF-16 encoding.
function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding Wide_Wide_String value.
function Decode (Item : UTF_8_String) return Wide_Wide_String;
Decodes from input encoded in UTF-8, and returns the corresponding Wide_Wide_String value.
function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
Decodes from input encoded in UTF-16, and returns the corresponding Wide_Wide_String value.
Implementation Advice
If an implementation supports other encoding schemes, similar children of Ada.Strings should be defined.
Note: A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or other entity to indicate the encoding; it is skipped when decoding. Typically, only the first line of a file or other entity contains a BOM. When decoding, the Encoding function can be called on the first line to determine the encoding; this encoding will then be used in subsequent calls to Decode to convert all of the lines to an internal format.
!discussion
Background on character encoding:
A character set is a set of abstract characters. An encoding assigns an integer value to each character; this value is called the code-point of the character. In principle, a character string could be represented directly as a sequence of code-points; however, this would waste a lot of space, since ISO 10646 defines 32-bit code-points. An encoding scheme is a more economical representation of a string of characters. Typically, an encoding scheme uses a sequence of integer values, where each code-point is represented by one or several consecutive values. UTF-8 is an encoding scheme that uses 8-bit values. In some cases, UTF-8 allows several possible encodings for a code-point; the shortest one should then be used, and the others are called overlong encodings. UTF-16 uses 16-bit values. UTF-32 uses 32-bit values, which is of little interest since nothing is gained compared to UCS-4 (raw encoding).
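To make the "one or several consecutive values" point concrete, here is a minimal sketch that encodes single code-points with the proposed Wide_Wide_Strings package and reports how many 8-bit (UTF-8) and 16-bit (UTF-16) values each one needs. It assumes an implementation of the packages as declared in this AI; the names Show_Encoded_Lengths, Report and WW_Enc are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Encoded_Lengths is
   package UTF    renames Ada.Strings.UTF_Encoding;
   package WW_Enc renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- Encode one code-point both ways and print the number of values used.
   procedure Report (Code : Natural) is
      Text  : constant Wide_Wide_String :=
        (1 => Wide_Wide_Character'Val (Code));
      -- The result subtype selects the UTF-8 or the UTF-16 profile of Encode.
      As_8  : constant UTF.UTF_8_String       := WW_Enc.Encode (Text);
      As_16 : constant UTF.UTF_16_Wide_String := WW_Enc.Encode (Text);
   begin
      Ada.Text_IO.Put_Line
        ("code-point" & Natural'Image (Code)
         & ": UTF-8 values =" & Natural'Image (As_8'Length)
         & ", UTF-16 values =" & Natural'Image (As_16'Length));
   end Report;
begin
   Report (16#41#);     -- 1 value in UTF-8, 1 in UTF-16
   Report (16#E9#);     -- 2 values in UTF-8, 1 in UTF-16
   Report (16#20AC#);   -- 3 values in UTF-8, 1 in UTF-16
   Report (16#1F600#);  -- 4 values in UTF-8, 2 in UTF-16 (surrogate pair)
end Show_Encoded_Lengths;
```

The counts in the comments follow from the definitions of UTF-8 and UTF-16 themselves; they do not depend on any particular implementation choice.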
There is no problem when using a String to encode UTF-8, or a Wide_String to encode UTF-16. However, it is sometimes useful to encode/decode a UTF-16 (or even UTF-32) encoded text into/from a String; in that case, characters must be paired to form 16-bit values (or 32-bit values). This can be done in two ways, Big Endian (high order character first) or Little Endian (low order character first). A special value, called BOM (Byte Order Mark, 16#FEFF#), can be used at the beginning of an encoded text (with 4 leading zeroes for UTF-32). The BOM corresponds to no code-point, and is discarded when decoding, but it is used to recognize whether a stream of bytes is Big Endian or Little Endian UTF-16 or UTF-32. By extension, the sequence 16#EF# 16#BB# 16#BF# can be used as BOM to identify UTF-8 text (although there is no byte order issue in UTF-8; actually, use of BOM for UTF-8 is discouraged).
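The pairing of characters and the role of the BOM can be sketched as follows, assuming an implementation of the packages as declared in this AI. The names Show_Byte_Order, Dump, With_BOM and Str_Enc are hypothetical; element values are printed in decimal.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Byte_Order is
   package UTF     renames Ada.Strings.UTF_Encoding;
   package Str_Enc renames Ada.Strings.UTF_Encoding.Strings;

   use type UTF.Encoding_Scheme;

   -- Print the position value of each 8-bit element of an encoded string.
   procedure Dump (Label : String; Item : UTF.UTF_String) is
   begin
      Ada.Text_IO.Put (Label & ":");
      for I in Item'Range loop
         Ada.Text_IO.Put (Integer'Image (Character'Pos (Item (I))));
      end loop;
      Ada.Text_IO.New_Line;
   end Dump;

   With_BOM : constant UTF.UTF_String :=
     Str_Enc.Encode ("A", UTF.UTF_16LE, Output_BOM => True);
begin
   -- "A" is code-point 16#41#; its 16-bit value is split into two elements.
   Dump ("UTF-16BE", Str_Enc.Encode ("A", UTF.UTF_16BE));  -- 16#00# 16#41#
   Dump ("UTF-16LE", Str_Enc.Encode ("A", UTF.UTF_16LE));  -- 16#41# 16#00#
   Dump ("UTF-16LE with BOM", With_BOM);  -- 16#FF# 16#FE# 16#41# 16#00#

   -- The BOM lets Encoding recognize the byte order; without a BOM,
   -- the Default (UTF_8 here) is returned.
   if UTF.Encoding (With_BOM) /= UTF.UTF_16LE
     or else UTF.Encoding ("A") /= UTF.UTF_8
   then
      raise Program_Error;
   end if;
end Show_Byte_Order;
```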
Note that UTF-8 encoding could be used for file names that include characters that are not in ASCII. This package would allow adding an Implementation Advice (to Text_IO, Sequential_IO, and so on) to the effect that it is recommended to support file names encoded in UTF-8.
Implementation choices:
Strictly speaking, an encoded text should be an array of bytes, not of (wide_)characters. This proposal uses (wide_)string, but the encoding is defined in terms of position values of characters rather than characters themselves. It could be argued that it should be defined in terms of internal representation of characters, but we know that they are the same as the position values for (Wide_)Character.
It is necessary to have decoding functions with a parameter that specifies the encoding, because it makes things easier when the encoding scheme is recognized dynamically. Functions whose encoding scheme is implicit are also provided for the most common encoding schemes to make it simpler for programs that require a statically defined encoding scheme.
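The two styles of call can be contrasted in a short sketch, assuming an implementation of the packages as declared in this AI. The names Show_Scheme_Selection, Source, Fixed and W_Enc are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Scheme_Selection is
   package UTF   renames Ada.Strings.UTF_Encoding;
   package W_Enc renames Ada.Strings.UTF_Encoding.Wide_Strings;

   Source : constant Wide_String := "Ada";

   -- Statically chosen scheme: no Scheme parameter; the UTF-8 profile of
   -- Encode is selected by the result subtype.
   Fixed : constant UTF.UTF_8_String := W_Enc.Encode (Source);
begin
   -- Dynamically chosen scheme: the same code handles all three schemes.
   for Scheme in UTF.Encoding_Scheme loop
      declare
         Encoded : constant UTF.UTF_String := W_Enc.Encode (Source, Scheme);
         Decoded : constant Wide_String    := W_Enc.Decode (Encoded, Scheme);
      begin
         if Decoded /= Source then
            raise Program_Error;
         end if;
      end;
   end loop;

   -- Implicit UTF-8 decoding, selected by the String argument.
   if W_Enc.Decode (Fixed) /= Source then
      raise Program_Error;
   end if;

   Ada.Text_IO.Put_Line ("all round trips succeeded");
end Show_Scheme_Selection;
```

A program that only ever uses the implicit UTF-8 profiles never names an Encoding_Scheme value at all.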
There are many other possible encoding schemes, including UTF-EBCDIC, Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most useful ones, while leaving the possibility (through Implementation Advice) to provide others.
When reading a file, a BOM can be expected as starting the first line of the file, but not subsequent lines. The proposed handling of BOM assumes the following pattern:
1) Read the first line. Call function Encoding on that line with an appropriate default to use if the line does not start with a BOM. Initialize the encoding scheme to the value returned by the function.
2) Decode all lines (including the first one) with the chosen encoding scheme. Since the BOM is ignored by Decode functions, it is not necessary to slice the first line specially.
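The two steps above can be sketched as follows, assuming an implementation of the packages as declared in this AI. To keep the sketch self-contained, the two "lines" are built in memory rather than read from a file; the names Show_BOM_Pattern, Line_1, Line_2 and WW_Enc are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_BOM_Pattern is
   package UTF    renames Ada.Strings.UTF_Encoding;
   package WW_Enc renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- Stand-ins for two lines read from a UTF-16LE file (hypothetical data);
   -- only the first line starts with a BOM.
   Line_1 : constant UTF.UTF_String :=
     WW_Enc.Encode ("first", UTF.UTF_16LE, Output_BOM => True);
   Line_2 : constant UTF.UTF_String :=
     WW_Enc.Encode ("second", UTF.UTF_16LE);

   -- Step 1: determine the scheme from the first line, defaulting to UTF-8.
   Scheme : constant UTF.Encoding_Scheme :=
     UTF.Encoding (Line_1, Default => UTF.UTF_8);

   -- Step 2: decode every line, including the first, with that scheme.
   -- The first line is passed unsliced; Decode is expected to skip its BOM.
   Text_1 : constant Wide_Wide_String := WW_Enc.Decode (Line_1, Scheme);
   Text_2 : constant Wide_Wide_String := WW_Enc.Decode (Line_2, Scheme);
begin
   Ada.Text_IO.Put_Line ("scheme: " & UTF.Encoding_Scheme'Image (Scheme));
   Ada.Text_IO.Put_Line ("line 1 length:" & Natural'Image (Text_1'Length));
   Ada.Text_IO.Put_Line ("line 2 length:" & Natural'Image (Text_2'Length));
end Show_BOM_Pattern;
```

Under the proposed wording, the detected scheme is UTF_16LE and the decoded lengths are 5 and 6, since the BOM of the first line is not part of the decoded text.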
The Wide_Wide_String functions have been put in a separate package to avoid dragging in the corresponding code when Wide_Wide_Strings are not used in the application code.
Alternative designs:
Arrays of Unsigned_8 or Unsigned_16 could be used in place of (Wide_)String. That would enforce strong typing to differentiate between an Ada String and an encoded string. On the other hand, it is likely to be more of a burden than a help to most casual users. Moreover, it would not allow ASIS program text to be kept as a Wide_String.
Existing similar packages:
Similar conversion functions are provided as part of xmlada and qtada. xmlada provides much more sophisticated services, such as supporting conversions to various ccs, converting in place in buffers, etc. However, it seems reasonable to provide only basic functionalities in the standard.
GNAT provides the package System.WCh_Con, but it converts only individual characters (not strings), does not support UTF-8, and is provided by generics that require a user-provided input/output formal function. Although more general, this solution would be too heavyweight for the casual user.
!corrigendum A.4.11(0)
Insert new clause:
Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
The encoding library packages have the following declarations:
    package Ada.Strings.UTF_Encoding is
       pragma Pure (UTF_Encoding);

       -- Declarations common to the string encoding packages
       type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

       subtype UTF_String is String;
       subtype UTF_8_String is String;
       subtype UTF_16_Wide_String is Wide_String;

       Encoding_Error : exception;

       BOM_8    : constant UTF_8_String :=
                    Character'Val (16#EF#) &
                    Character'Val (16#BB#) &
                    Character'Val (16#BF#);
       BOM_16BE : constant UTF_String :=
                    Character'Val (16#FE#) &
                    Character'Val (16#FF#);
       BOM_16LE : constant UTF_String :=
                    Character'Val (16#FF#) &
                    Character'Val (16#FE#);
       BOM_16   : constant UTF_16_Wide_String :=
                    (1 => Wide_Character'Val (16#FEFF#));

       function Encoding (Item    : UTF_String;
                          Default : Encoding_Scheme := UTF_8)
          return Encoding_Scheme;

    end Ada.Strings.UTF_Encoding;

    package Ada.Strings.UTF_Encoding.Conversions is
       pragma Pure (Conversions);

       -- Conversions between various encoding schemes
       function Convert (Item          : UTF_String;
                         Input_Scheme  : Encoding_Scheme;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False) return UTF_String;

       function Convert (Item         : UTF_String;
                         Input_Scheme : Encoding_Scheme;
                         Output_BOM   : Boolean := False)
          return UTF_16_Wide_String;

       function Convert (Item       : UTF_8_String;
                         Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Convert (Item          : UTF_16_Wide_String;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False) return UTF_String;

       function Convert (Item       : UTF_16_Wide_String;
                         Output_BOM : Boolean := False) return UTF_8_String;

    end Ada.Strings.UTF_Encoding.Conversions;

    package Ada.Strings.UTF_Encoding.Strings is
       pragma Pure (Strings);

       -- Encoding / decoding between String and various encoding schemes
       function Encode (Item          : String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

       function Encode (Item       : String;
                        Output_BOM : Boolean := False) return UTF_8_String;

       function Encode (Item       : String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme) return String;

       function Decode (Item : UTF_8_String) return String;

       function Decode (Item : UTF_16_Wide_String) return String;

    end Ada.Strings.UTF_Encoding.Strings;

    package Ada.Strings.UTF_Encoding.Wide_Strings is
       pragma Pure (Wide_Strings);

       -- Encoding / decoding between Wide_String and various encoding schemes
       function Encode (Item          : Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

       function Encode (Item       : Wide_String;
                        Output_BOM : Boolean := False) return UTF_8_String;

       function Encode (Item       : Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme) return Wide_String;

       function Decode (Item : UTF_8_String) return Wide_String;

       function Decode (Item : UTF_16_Wide_String) return Wide_String;

    end Ada.Strings.UTF_Encoding.Wide_Strings;

    package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
       pragma Pure (Wide_Wide_Strings);

       -- Encoding / decoding between Wide_Wide_String and various encoding schemes
       function Encode (Item          : Wide_Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

       function Encode (Item       : Wide_Wide_String;
                        Output_BOM : Boolean := False) return UTF_8_String;

       function Encode (Item       : Wide_Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;

       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme) return Wide_Wide_String;

       function Decode (Item : UTF_8_String) return Wide_Wide_String;

       function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;

    end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 8 bits, big endian; and UTF_16LE corresponds to the same UTF-16 encoding scheme stored in 8 bits, little endian.
The subtype UTF_String is used to represent a String of 8-bit values containing a sequence of values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent a String of 8-bit values containing a sequence of values encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent a Wide_String of 16-bit values containing a sequence of values encoded in UTF-16.
The BOM_8, BOM_16BE, BOM_16LE and BOM_16 constants correspond to values used at the start of a string to indicate the encoding.
For all Convert and Decode functions, an initial BOM in the input that matches the expected encoding scheme is ignored, and a different initial BOM causes Encoding_Error to be propagated.
For all Convert and Encode functions, a BOM is included at the start of the output string if the Output_BOM parameter is set to True.
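The exception Encoding_Error is also propagated in the following situations:
* By a Decode function when a UTF encoded string contains an invalid encoding sequence.
* By a Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.
* By a Decode function yielding a String when the decoding of a sequence results in a code-point whose value exceeds 16#FF#.
* By a Decode function yielding a Wide_String when the decoding of a sequence results in a code-point whose value exceeds 16#FFFF#.
* By an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes.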
Each of the Convert and Encode functions returns a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) value whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme required by the function or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned. The lower bound of the returned string is 1.
Each of the Decode functions takes a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) Item parameter which is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter, and returns the corresponding String, Wide_String, or Wide_Wide_String value. The lower bound of the returned string is 1.
function Encoding (Item : UTF_String; Default : Encoding_Scheme := UTF_8) return Encoding_Scheme;
Inspects a UTF_String value to determine whether it starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding to the BOM; returns the value of Default otherwise.
function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Converts from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and generates an output encoded in one of these three schemes as specified by Output_Scheme.
function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Converts from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and generates an output encoded in UTF-16.
function Convert (Item : UTF_8_String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Converts from input encoded in UTF-8 and generates an output encoded in UTF-16.
function Convert (Item : UTF_16_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Converts from input encoded in UTF-16 and generates an output encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
function Convert (Item : UTF_16_Wide_String; Output_BOM : Boolean := False) return UTF_8_String;
Converts from input encoded in UTF-16 and generates an output encoded in UTF-8.
function Encode (Item : String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Encodes from String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.
function Encode (Item : String; Output_BOM : Boolean := False) return UTF_8_String;
Encodes from String input, and generates an output encoded in UTF-8 encoding.
function Encode (Item : String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Encodes from String input, and generates an output encoded in UTF-16 encoding.
function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return String;
Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding String value.
function Decode (Item : UTF_8_String) return String;
Decodes from input encoded in UTF-8, and returns the corresponding String value.
function Decode (Item : UTF_16_Wide_String) return String;
Decodes from input encoded in UTF-16, and returns the corresponding String value.
function Encode (Item : Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Encodes from Wide_String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.
function Encode (Item : Wide_String; Output_BOM : Boolean := False) return UTF_8_String;
Encodes from Wide_String input, and generates an output encoded in UTF-8 encoding.
function Encode (Item : Wide_String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Encodes from Wide_String input, and generates an output encoded in UTF-16 encoding.
function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return Wide_String;
Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding Wide_String value.
function Decode (Item : UTF_8_String) return Wide_String;
Decodes from input encoded in UTF-8, and returns the corresponding Wide_String value.
function Decode (Item : UTF_16_Wide_String) return Wide_String;
Decodes from input encoded in UTF-16, and returns the corresponding Wide_String value.
function Encode (Item : Wide_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String;
Encodes from Wide_Wide_String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.
function Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) return UTF_8_String;
Encodes from Wide_Wide_String input, and generates an output encoded in UTF-8 encoding.
function Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) return UTF_16_Wide_String;
Encodes from Wide_Wide_String input, and generates an output encoded in UTF-16 encoding.
function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding Wide_Wide_String value.
function Decode (Item : UTF_8_String) return Wide_Wide_String;
Decodes from input encoded in UTF-8, and returns the corresponding Wide_Wide_String value.
function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
Decodes from input encoded in UTF-16, and returns the corresponding Wide_Wide_String value.
Implementation Advice
If an implementation supports other encoding schemes, similar children of Ada.Strings should be defined.
NOTE
14 A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or other entity to indicate the encoding; it is skipped when decoding. Typically, only the first line of a file or other entity contains a BOM. When decoding, the Encoding function can be called on the first line to determine the encoding; this encoding will then be used in subsequent calls to Decode to convert all of the lines to an internal format.
!appendix

From: Robert Dewar
Sent: Monday, March 15, 2010  7:54 AM

I am now implementing this package. As usual that is the first time I am really
looking at it carefully. It seems quite unfortunate to me that the routines take
Scheme as a dynamic parameter, this means that you always include all encoding
methods in your program, even if you are only interested in one of them.

I understand the usage of looking for a BOM and then passing the discovered
encoding scheme to Decode, but that's only one usage.

In practice, I think by far the most common usage will be with a fixed encoding
scheme, almost always UTF-8, and it seems unfortunate to have a situation where
for this usage you are forced to incorporate all the stuff for the other
encoding schemes.

****************************************************************

From: Tucker Taft
Sent: Monday, March 15, 2010  8:08 AM

An alternative approach would be to provide separate child packages for each
kind of encoding, and perhaps have one child that matches the currently proposed
package, which would call the appropriate one of the others based on the
run-time enumeration value.

Tagged types could probably be used in a creative way as well.

****************************************************************

From: Robert Dewar
Sent: Monday, March 15, 2010  8:21 AM

> An alternative approach would be to provide separate child packages
> for each kind of encoding, and perhaps have one child that matches the
> currently proposed package, which would call the appropriate one of
> the others based on the run-time enumeration value.

Perhaps, but it would be good enough to have specific routines

    Encode_UTF_8

etc.

One assumes any reasonable system can eliminate unused subprograms, so we don't
need separate packages.

> Tagged types could probably be used in a creative way as well.

Seems overkill to me.

****************************************************************

From: Jean-Pierre Rosen
Sent: Monday, March 15, 2010  10:09 AM

> I am now implementing this package. As usual that is the first time I
> am really looking at it carefully. It seems quite unfortunate to me
> that the routines take Scheme as a dynamic parameter, this means that
> you always include all encoding methods in your program, even if you
> are only interested in one of them.

I understand that implementation of UTF_8 and the various UTF_16 are different.
OTOH, I guess (wild guess admittedly) that the different forms of UTF_16 will
share most of the code. Given that the BOM is really useful only for UTF_16*, a
solution could be to just separate UTF_8 from UTF_16 in two different child
packages.

> I understand the usage of looking for a BOM and then passing the
> discovered encoding scheme to Decode, but that's only one usage.

But certainly one that happens in practice, and my concern was to avoid the "big
case". But it is likely that choosing between UTF_8 and UTF_16 is on the
application's side, while choosing between UTF_16BE and UTF_16LE is more on the
user's side, so the dynamic choice is more important here.

So, I'm suggesting having a package UTF_Encoding with the type Encoding_Scheme,
functions Encoding, and exception Encoding_Error, and children UTF_8 and UTF_16.
For those wanting to support both, calls would appear as UTF_8.Decode or
UTF_16.Decode, which would read nicely (we would keep the Scheme parameter for
UTF_16).

****************************************************************

From: Bob Duff
Sent: Monday, March 15, 2010  10:39 AM

> Perhaps, but it would be good enough to have specific routines
>
>     Encode_UTF_8
>
> etc.

If you mean in addition to the ones we already have, then I agree.

I don't see any need for child packages -- that would just complicate things.

****************************************************************

From: Robert Dewar
Sent: Monday, March 15, 2010  10:46 AM

> If you mean in addition to the ones we already have, then I agree.

right, in addition, then the dynamic package just makes calls to these specific
ones. I think only the UTF_8 routines are really important. The UTF_16 routines
are non-dynamic anyway (since there is only one possible scheme).

> I don't see any need for child packages -- that would just complicate
> things.

I agree

****************************************************************

From: Tucker Taft
Sent: Monday, March 15, 2010  11:01 AM

We might consider making them subpackages rather than child packages:

    package UTF_Encoding is
       ...
       package UTF_8 is
           function Encode(...
           function Decode(...
       end UTF_8;

       package UTF_16 is
           ...
       end UTF_16;
       ...
    end UTF_Encoding;

Then some of the tricks implementors use with Text_IO could be used to carve out
the subpackages if desired.

I also think UTF_8.Encode looks better than Encode_UTF_8 somehow... ;-)

Unfortunately, this doesn't really work since there is an existing enumeration
literal UTF_8.  In fact, even a child package named "UTF_8" would be illegal if
the enumeration were in the parent package.  If we restructured it into separate
child packages, with the enumeration only in the "dynamic" child, it would work,
but perhaps just some functions with names like Encode_UTF_8 are adequate. Not
elegant, though.

****************************************************************

From: Robert Dewar
Sent: Monday, March 15, 2010  11:21 AM

> Then some of the tricks implementors use with Text_IO could be used to
> carve out the subpackages if desired.

These tricks are not relevant I think, we certainly wouldn't bother. What we
would do at this stage is to rely on automatic elimination of unused
subprograms.

> I also think UTF_8.Encode looks better than Encode_UTF_8 somehow...
> ;-)

I really don't care about the name
>
> Unfortunately, this doesn't really work since there is an existing
> enumeration literal UTF_8.  In fact, even a child package named
> "UTF_8" would be illegal if the enumeration were in the parent
> package.  If we restructured it into separate child packages, with the
> enumeration only in the "dynamic" child, it would work, but perhaps
> just some functions with names like Encode_UTF_8 are adequate.
> Not elegant, though.

I would just add Encode_UTF_8 and Decode_UTF_8 to the existing spec and be done
with it.

****************************************************************

From: Robert Dewar
Sent: Friday, March 19, 2010  4:45 PM

Another issue for this package,

why is there no routine for decoding a UTF-16 string with Wide_String output?
This is not trivial, there are cases where two UTF-16 codes are required for a
valid Wide_Character output.

Similarly there should be a routine for encoding a Wide_String in UTF-16. I see
no reason for omitting these cases???

Note that you can encode between UTF-16BE/LE and Wide_String, so it seems odd
not to cover the cases of UTF-16 and Wide_String. I noted this because naturally
the UTF-16BE/LE cases are defined in terms of the missing routines!

---

In my package I have in the visible spec:

    function Encode
      (Item   : Wide_String;
       Scheme : Long_Encoding := UTF_16) return Wide_String;
    function Encode
      (Item   : Wide_Wide_String;
       Scheme : Long_Encoding := UTF_16) return Wide_String;

    function Decode
      (Item   : Wide_String;
       Scheme : Long_Encoding := UTF_16) return Wide_String;
    function Decode
      (Item   : Wide_String;
       Scheme : Long_Encoding := UTF_16) return Wide_Wide_String;


which is nicely symmetrical with the short encoding scheme routines

In the body I also have:

    function Decode_UTF_8 (Item : String) return Wide_String;
    --  Equivalent to Decode (Item, UTF_8), but smaller and faster

    function Decode_UTF_8 (Item : String) return Wide_Wide_String;
    --  Equivalent to Decode (Item, UTF_8), but smaller and faster

    function Encode_UTF_8 (Item : Wide_String) return String;
    --  Equivalent to Encode (Item, UTF_8) but smaller and faster

    function Encode_UTF_8 (Item : Wide_Wide_String) return String;
    --  Equivalent to Encode (Item, UTF_8) but smaller and faster

    function Decode_UTF_16 (Item : Wide_String) return Wide_String;
    --  Equivalent to Decode (Item, UTF_16)

    function Decode_UTF_16 (Item : Wide_String) return Wide_Wide_String;
    --  Equivalent to Decode (Item, UTF_16)

    function Encode_UTF_16 (Item : Wide_String) return Wide_String;
    --  Equivalent to Encode (Item, UTF_16)

    function Encode_UTF_16 (Item : Wide_Wide_String) return Wide_String;
    --  Equivalent to Encode (Item, UTF_16)

I think that at least the UTF_8 routines should be visible in the spec.

The UTF_16 routines are less critical, since the general dynamic routine is
silly in the UTF_16 case, since UTF_16 is the only possibility. Seeing as an
implementor is not free to add encodings as far as I can see, this is presumably
preparation for Ada 2020, where new long encodings will be added???

Perhaps implementors SHOULD be allowed to add BOM definitions and new encodings?

****************************************************************

From: Robert Dewar
Sent: Friday, March 19, 2010  5:11 PM

> why is there no routine for decoding a UTF-16 string with Wide_String
> output? This is not trivial, there are cases where two UTF-16 codes
> are required for a valid Wide_Character output.

I am wrong about this, you can never need two codes, but there are incorrect
input sequences in both directions that can be diagnosed.
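
To illustrate one kind of invalid input that can be diagnosed, the sketch below flags Wide_String values containing surrogate codes (16#D800# .. 16#DFFF#), which do not represent characters on their own. It is a minimal sketch, not part of the proposed package: the names Check_Surrogates, Is_Surrogate, and Has_Surrogate are hypothetical.

```ada
procedure Check_Surrogates is

   --  True if C lies in the UTF-16 surrogate range, which does not
   --  represent a character on its own.
   function Is_Surrogate (C : Wide_Character) return Boolean is
   begin
      return Wide_Character'Pos (C) in 16#D800# .. 16#DFFF#;
   end Is_Surrogate;

   --  True if any element of Item is a surrogate code.
   function Has_Surrogate (Item : Wide_String) return Boolean is
   begin
      for I in Item'Range loop
         if Is_Surrogate (Item (I)) then
            return True;
         end if;
      end loop;
      return False;
   end Has_Surrogate;

   Good : constant Wide_String := "abc";
   Bad  : constant Wide_String := "ab" & Wide_Character'Val (16#DC03#);

begin
   --  These checks only run when assertions are enabled.
   pragma Assert (not Has_Surrogate (Good));
   pragma Assert (Has_Surrogate (Bad));
   null;
end Check_Surrogates;
```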

****************************************************************

From: Bob Duff
Sent: Friday, March 19, 2010  5:23 PM

> Seeing as an implementor is not free to add encodings

An implementer is free to add child packages.
That seems appropriate, no?

****************************************************************

From: Robert Dewar
Sent: Friday, March 19, 2010  6:24 PM

> An implementer is free to add child packages.
> That seems appropriate, no?

Sure, but it does not help the fact that these routines:

>    function Encode
>      (Item   : Wide_Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_String;

>    function Decode
>      (Item   : Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_Wide_String;

Are a bit silly, since Scheme can have only one possible value UTF_16

Furthermore adding child packages does not really fit the general scheme.
Suppose I want to add a new long encoding called

Dewars_Improved_UTF_16

with its own new BOM,

I really want the Encoding routine to recognize that new BOM and return a new
entry in the enumeration type.

If I have to put stuff into a child package, it's silly, because the child
package won't use any of the parent package, and will simply drag it in
uselessly.

So rather than add Ada.Strings.UTF_Encoding.Dewars_Improved_UTF_16

I might just as well add

Ada.Strings.Dewars_Improved_UTF_16

Also is this package intended to be limited to UTF encodings? Its name suggests
this, so it is a bit silly to imply that there might be new long encodings in
the future. Basically the generality of the long encoding subtype is perfectly
silly orthogonality with the short encoding case, where there is more than one
possibility.

In my implementation, I have stupid routines that look e.g. like:

>    --  Wide_String input with Wide_Wide_String output (long encodings)
>
>    function Decode
>      (Item   : Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_Wide_String
>    is
>       pragma Unreferenced (Scheme);
>    begin
>       return Decode_UTF_16 (Item);
>    end Decode;

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010  7:44 AM

Another thought on the string encoding package

This package works by representing UTF-8/UTF-16LE/UTF-16BE encoded strings using
type String,

and UTF-16 encoded strings using type Wide_String

but really this is a misuse of these types, since these encoded strings do not
match the semantics of string as defined in the RM.

I do not suggest coming up with separate types, I think that would be plain
confusing.

But I think good coding standards would suggest use of

   subtype UTF_String is String;
   --  String value used to hold encoded UTF-8/UTF-16LE/UTF-16BE string

   subtype Wide_UTF_String is Wide_String;
   --  String value used to hold encoded UTF-16 string

I suggest defining these two subtypes in the package spec and then using them
where appropriate.

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010  8:49 AM

One more issue with this package

The Encoding routines do not generate a BOM at the start, this means that if you
want a BOM you have to concatenate it, which forces a full copy of the generated
string.

Wouldn't it be better to have an extra parameter on the encode
routines:

     Output_BOM : Boolean := False

which, if set to True would include the BOM in the output?
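
The effect of such a parameter can be sketched as below. The sketch assumes the Encode function of the Wide_Wide_Strings child package as given in the !wording section of this AI (with Output_BOM defaulting to False), not the version under discussion in this message. Show_Output_BOM and WWS are hypothetical names.

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Output_BOM is
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   Text : constant Wide_Wide_String := "Ada";

   --  Without the parameter, the BOM is concatenated by hand,
   --  which copies the encoded string.
   By_Hand : constant UTF_8_String := BOM_8 & WWS.Encode (Text);

   --  With the parameter, the encoder emits the BOM itself.
   Direct  : constant UTF_8_String := WWS.Encode (Text, Output_BOM => True);
begin
   --  These checks only run when assertions are enabled.
   pragma Assert (By_Hand = Direct);
   pragma Assert (Direct'Length = BOM_8'Length + 3);
   null;
end Show_Output_BOM;
```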

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010  9:01 AM

One MORE thought :-)

The Encoding routines return UTF_None if there is no BOM, meaning you have to
write something like

    E : constant Encoding_Method := Encoding (Input);
    X : String := Encode (Input, (if E = UTF_None then UTF_8 else E));

How about getting rid of UTF_None, and instead adding a parameter to the
Encoding routines:

     Default_Encoding : Encoding_Method := UTF_8/16

So we can write

    X : String := Encode (Input, Encoding (Input, UTF_8));

Alternatively, the Encoding routine could have an extra parameter

    Default_Scheme : Short_Encoding_Method := UTF_8;
    --  Encoding method if Scheme is UTF_None

No big deal, but I noticed this discrepancy writing some tests

****************************************************************

From: Bob Duff
Sent: Saturday, March 20, 2010  9:58 AM

>    subtype UTF_String is String;
>    --  String value used to hold encoded UTF-8/UTF-16LE/UTF-16BE
> string
>
>    subtype Wide_UTF_String is Wide_String;
>    --  String value used to hold encoded UTF-16 string
>
> I suggest defining these two subtypes in the package spec and then
> using them where appropriate.

Sounds good to me.

> Wouldn't it be better to have an extra parameter on the encode
> routines:
>
>      Output_BOM : Boolean := False
>
> which, if set to True would include the BOM in the output?

Yes.

> The Encoding routines return UTF_None if there is no BOM, meaning you
> have to write something like
>
>     E : constant Encoding_Method := Encoding (Input);
>     X : String := Encode (Input, (if E = UTF_None then UTF_8 else E));
>
> How about getting rid of UTF_None, and instead adding a parameter to
> the Encoding routines:
>
>      Default_Encoding : Encoding_Method := UTF_8/16
>
> So we can write
>
>     X : String := Encode (Input, Encoding (Input, UTF_8));
>
> Alternatively, the Encoding routine could have an extra parameter
>
>     Default_Scheme : Short_Encoding_Method := UTF_8;
>     --  Encoding method if Scheme is UTF_None

Sounds reasonable.  I guess I prefer the former.

> One consequence is that no part of this package can be available in
> Ada 95 mode, which is a pity.
>
> If we separated this into
>
> Ada.Strings.UTF_Encoding;
>
> Ada.Strings.Wide_UTF_Encoding;
>
> then the former package could be used in Ada 95 mode. I suspect in
> practice that apps would use only one of these packages, either you go
> all the way and use 32-bit chars everywhere, or you stick to 16-bit
> chars.

OK with me.

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010  7:44 AM

Here is the spec as I implemented it initially for GNAT.
I would still like to add Encode_UTF_8 and Decode_UTF_8

> ------------------------------------------------------------------------------
> --                                                                          --
> --                         GNAT RUN-TIME COMPONENTS                         --
> --                                                                          --
> --              A D A . S T R I N G S . U T F _ E N C O D I N G             --
> --                                                                          --
> --                                 S p e c                                  --
> --                                                                          --
> -- This specification is derived from the Ada Reference Manual for use with --
> -- GNAT.  In accordance with the copyright of that document, you can freely --
> -- copy and modify this specification,  provided that if you redistribute a --
> -- modified version,  any changes that you have made are clearly indicated. --
> --                                                                          --
> ------------------------------------------------------------------------------
>
> --  This is the Ada 2012 package defined in AI05-0137-1. It is used for
> --  encoding strings using UTF encodings (UTF-8, UTF-16LE, UTF-16BE, UTF-16).
>
> --  Compared with version 05 of the AI, we have added routines for UTF-16
> --  encoding and decoding of wide strings, which seems missing from the AI,
> --  added comments, and reordered the declarations.
>
> --  Note: although this is an Ada 2012 package, the earlier versions of the
> --  language permit the addition of new grandchildren of Ada, so we are able
> --  to add this package unconditionally for use in Ada 2005 mode. We cannot
> --  allow it in earlier versions, since it requires Wide_Wide_Character/String.
>
> package Ada.Strings.UTF_Encoding is
>    pragma Pure (UTF_Encoding);
>
>    type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16);
>
>    subtype Short_Encoding is Encoding_Scheme range UTF_8 .. UTF_16LE;
>    subtype Long_Encoding  is Encoding_Scheme range UTF_16 .. UTF_16;
>
>    --  The BOM (BYTE_ORDER_MARK) values defined here are used at the start of
>    --  a string to indicate the encoding. The convention in this package is
>    --  that decoding routines ignore a BOM, and output of encoding routines
>    --  does not include a BOM. If you want to include a BOM in the output,
>    --  you simply concatenate the appropriate value at the start of the string.
>
>    BOM_8    : constant String :=
>                 Character'Val (16#EF#) &
>                 Character'Val (16#BB#) &
>                 Character'Val (16#BF#);
>
>    BOM_16BE : constant String :=
>                 Character'Val (16#FE#) &
>                 Character'Val (16#FF#);
>
>    BOM_16LE : constant String :=
>                 Character'Val (16#FF#) &
>                 Character'Val (16#FE#);
>
>    BOM_16   : constant Wide_String :=
>                 (1 => Wide_Character'Val (16#FEFF#));
>
>    --  The encoding routines take a wide string or wide wide string as input
>    --  and encode the result using the specified UTF encoding method. For
>    --  UTF-16, the output is returned as a Wide_String, this is not a normal
>    --  Wide_String, since the codes in it may represent UTF-16 surrogate
>    --  characters used to encode large values. Similarly for UTF-8, UTF-16LE,
>    --  and UTF-16BE, the output is returned in a String, and again this String
>    --  is not a standard format string, since it may include UTF-8 surrogates.
>    --  As previously noted, the returned value does NOT start with a BOM.
>
>    --  Note: invalid codes in calls to one of the Encode routines represent
>    --  invalid values in the sense that they are not defined. For example, the
>    --  code 16#DC03# is not a valid wide character value. Such values result
>    --  in undefined behavior. For GNAT, Constraint_Error is raised with an
>    --  appropriate exception message.
>
>    function Encode
>      (Item   : Wide_String;
>       Scheme : Short_Encoding := UTF_8) return String;
>    function Encode
>      (Item   : Wide_Wide_String;
>       Scheme : Short_Encoding := UTF_8) return String;
>
>    function Encode
>      (Item   : Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_String;
>    function Encode
>      (Item   : Wide_Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_String;
>
>    --  The decoding routines take a String or Wide_String input which is an
>    --  encoded string using the specified encoding. The output is a normal
>    --  Ada Wide_String or Wide_Wide_String value representing the decoded
>    --  values. Note that a BOM in the input matching the encoding is skipped.
>
>    Encoding_Error : exception;
>    --  Exception raised if an invalid encoding sequence is encountered by
>    --  one of the Decode routines.
>
>    function Decode
>      (Item   : String;
>       Scheme : Short_Encoding := UTF_8) return Wide_String;
>    function Decode
>      (Item   : String;
>       Scheme : Short_Encoding := UTF_8) return Wide_Wide_String;
>
>    function Decode
>      (Item   : Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_String;
>    function Decode
>      (Item   : Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_Wide_String;
>
>    --  The Encoding functions inspect an encoded string or wide_string and
>    --  determine if a BOM is present. If so, the appropriate Encoding_Scheme
>    --  is returned. If not, then UTF_None is returned.
>
>    function Encoding (Item : String)      return Encoding_Scheme;
>    function Encoding (Item : Wide_String) return Encoding_Scheme;
>
> end Ada.Strings.UTF_Encoding;

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010  9:24 AM

One more thought

Generally we provide separate packages for the Wide_String and Wide_Wide_String
cases, but here we plop everything into one package.

One consequence is that no part of this package can be available in Ada 95 mode,
which is a pity.

If we separated this into

Ada.Strings.UTF_Encoding;

Ada.Strings.Wide_UTF_Encoding;

then the former package could be used in Ada 95 mode. I suspect in practice that
apps would use only one of these packages, either you go all the way and use
32-bit chars everywhere, or you stick to 16-bit chars.

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010  4:32 AM

I propose the following for the string encoding package.
Compared with the version in the current AI, the changes
are:

Separate into parent package and child packages. The primary motivation here is
to allow the use of the facilities for Wide_String's to be used in existing Ada
95 compilers (or multi-version compilers like GNAT running in Ada 95 mode). The
Wide_Wide routines require Ada 2005.

Add specific routines for UTF-8, since many applications do not need the
generality of handling UTF-16BE and UTF-16LE.

Eliminate the long encodings, there was only one (UTF-16) and it just seems
clutter to have parameters that can only have one value, given there is no
permission to add additional UTF encodings. If additional encodings are to be
handled in a given implementation, they should be added as child packages. Such
child packages cannot in any case add to the enumeration type specifying the
encoding method.

We could conceivably add implementation permission to add to the enumeration
type etc, but I really think this is more generality than is useful. Really it
is easier to have entirely separate child packages to handle additional
encodings.

Introduce subtypes UTF_String, UTF_8_String and UTF_16_Wide_String to clearly
document these non-standard usages of String and Wide_String and to suggest a
standard style for applications programs to use in conjunction with this
package.

Add conversion routines between UTF encodings, these are actually quite
important (and non-trivial, see http://u8u16.costar.sfu.ca)

Provide a convenient default parameter for Encoding, and eliminate UTF_None.

The four package specs are attached

If people agree with this approach, let's use this as the final version. I will
wait to implement this version till we have some feedback (yes, I know, not many
people can get excited about UTF encodings, so if you just want to say OK-by-me
without really reading too closely, that's fine, I have put a lot of thought into
this, so I think it's right now!)

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--              A D A . S T R I N G S . U T F _ E N C O D I N G             --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is one of the Ada 2012 packages defined in AI05-0137-1. It is a parent
--  package that contains declarations used in the child packages used for
--  handling UTF encoded strings. Note: this package is consistent with Ada 95,
--  and may be used in Ada 95, or Ada 2005 mode.

package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);

   subtype UTF_String is String;
   --  Used to represent a string of 8-bit values representing a string encoded
   --  in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). Typically used in
   --  connection with a Scheme parameter indicating which of the encodings
   --  applies. This is not strictly a String value in the sense defined in the
   --  Ada RM, but in practice type String accommodates all possible 256 codes,
   --  and can be used to hold any sequence of 8-bit codes. We use String
   --  directly rather than create a new type so that all existing facilities
   --  for manipulating type String (e.g. the child packages of Ada.Strings)
   --  are available for manipulation of UTF_Strings.

   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   --  Used to specify which of three possible encodings apply to a UTF_String

   subtype UTF_8_String is String;
   --  Similar to UTF_String but specifically represents a UTF-8 encoded string

   subtype UTF_16_Wide_String is Wide_String;
   --  This is similar to UTF_8_String but is used to represent a Wide_String
   --  value which is a sequence of 16-bit values encoded using UTF-16. Again
   --  this is not strictly a Wide_String in the sense of the Ada RM, but the
   --  type Wide_String can be used to represent a sequence of arbitrary 16-bit
   --  values, and it is more convenient to use Wide_String than a new type.

   Encoding_Error : exception;
   --  This exception is raised if a UTF encoded string contains an invalid
   --  coding sequence, or when generating a Wide_String output if the output
   --  value is out of range of Wide_Character, or if an input Wide_Character
   --  or Wide_Wide_Character value does not represent a valid 10646 character
   --  value (e.g. 16#DC03# is not a valid Unicode character and hence cannot
   --  be encoded in a UTF string).

   --  The BOM (BYTE_ORDER_MARK) values defined here are used at the start of
   --  a string to indicate the encoding. The convention in this package is
   --  that decoding routines ignore a BOM, and output of encoding routines
   --  may or may not include a BOM depending on the setting of Output_BOM.

   BOM_8    : constant String :=
                Character'Val (16#EF#) &
                Character'Val (16#BB#) &
                Character'Val (16#BF#);

   BOM_16BE : constant String :=
                Character'Val (16#FE#) &
                Character'Val (16#FF#);

   BOM_16LE : constant String :=
                Character'Val (16#FF#) &
                Character'Val (16#FE#);

   BOM_16   : constant Wide_String :=
                (1 => Wide_Character'Val (16#FEFF#));

   function Encoding
     (Item    : UTF_String;
      Default : Encoding_Scheme := UTF_8) return Encoding_Scheme;
   --  This function inspects a UTF_String value to determine whether it
   --  starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, the result
   --  is the scheme corresponding to the BOM. If no valid BOM is present
   --  then the result is the specified Default value.

end Ada.Strings.UTF_Encoding;
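
A short sketch of how the Default parameter of Encoding behaves, assuming the function as declared in the spec above. Show_Encoding_Default and the sample strings are hypothetical.

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Show_Encoding_Default is
   --  Hypothetical inputs: one UTF-16LE string with a BOM, one with no BOM.
   With_BOM    : constant UTF_String := BOM_16LE & "A" & Character'Val (0);
   Without_BOM : constant UTF_String := "plain text";
begin
   --  These checks only run when assertions are enabled.

   --  A recognized BOM determines the result, whatever Default is.
   pragma Assert (Encoding (With_BOM) = UTF_16LE);
   pragma Assert (Encoding (With_BOM, Default => UTF_16BE) = UTF_16LE);

   --  With no BOM, the result is Default, which itself defaults to UTF_8.
   pragma Assert (Encoding (Without_BOM) = UTF_8);
   pragma Assert (Encoding (Without_BOM, Default => UTF_16BE) = UTF_16BE);
   null;
end Show_Encoding_Default;
```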

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                  ADA.STRINGS.UTF_ENCODING.WIDE_ENCODING                  --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for encoding
--  and decoding Wide_String values using UTF encodings. Note: this package is
--  consistent with Ada 95, and may be included in Ada 95 implementations.

package Ada.Strings.UTF_Encoding.Wide_Encoding is
   pragma Pure (Wide_Encoding);

   --  The encoding routines take a Wide_String as input and encode the result
   --  using the specified UTF encoding method. The result includes a BOM if
   --  the Output_BOM parameter is set to True.

   function Encode
     (Item       : Wide_String;
      Scheme     : Encoding_Scheme := UTF_8;
      Output_BOM : Boolean  := True) return UTF_String;
   --  Encode Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
   --  specified by the Scheme parameter.

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean  := True) return UTF_8_String;
   --  Encode Wide_String using UTF-8 encoding

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean  := True) return UTF_16_Wide_String;
   --  Encode Wide_String using UTF_16 encoding

   --  The decoding routines take a UTF String as input, and return a decoded
   --  Wide_String. If the UTF String starts with a BOM that matches the
   --  encoding method, it is ignored.

   function Decode
     (Item   : UTF_String;
      Scheme : Encoding_Scheme := UTF_8) return Wide_String;
   --  The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by
   --  the Scheme parameter. It is decoded and returned as a Wide_String value.
   --  Note: a convenient form for Scheme may be Encoding (UTF_String).

   function Decode
     (Item : UTF_8_String) return Wide_String;
   --  The input is encoded in UTF-8 and returned as a Wide_String value

   function Decode
     (Item : UTF_16_Wide_String) return Wide_String;
   --  The input is encoded in UTF-16 and returned as a Wide_String value

end Ada.Strings.UTF_Encoding.Wide_Encoding;

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                ADA.STRINGS.UTF_ENCODING.WIDE_WIDE_ENCODING               --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for encoding
--  and decoding Wide_Wide_String values using UTF encodings. Note: this package
--  is consistent with Ada 2005, and may be used in Ada 2005 mode.

package Ada.Strings.UTF_Encoding.Wide_Wide_Encoding is
   pragma Pure (Wide_Wide_Encoding);

   --  The encoding routines take a Wide_Wide_String as input and encode the
   --  result using the specified UTF encoding method. The result includes a
   --  BOM if the Output_BOM parameter is set to True.

   function Encode
     (Item       : Wide_Wide_String;
      Scheme     : Encoding_Scheme := UTF_8;
      Output_BOM : Boolean  := True) return UTF_String;
   --  Encode Wide_Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
   --  specified by the Scheme parameter.

   function Encode
     (Item       : Wide_Wide_String;
      Output_BOM : Boolean  := True) return UTF_8_String;
   --  Encode Wide_Wide_String using UTF-8 encoding

   function Encode
     (Item       : Wide_Wide_String;
      Output_BOM : Boolean  := True) return UTF_16_Wide_String;
   --  Encode Wide_Wide_String using UTF-16 encoding

   --  The decoding routines take a UTF String as input, and return a decoded
   --  Wide_Wide_String. If the UTF String starts with a BOM that matches the
   --  encoding method, it is ignored.

   function Decode
     (Item   : UTF_String;
      Scheme : Encoding_Scheme := UTF_8) return Wide_Wide_String;
   --  The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by the
   --  Scheme parameter. It is decoded and returned as a Wide_Wide_String
   --  value. Note: a convenient form for Scheme may be Encoding (UTF_String).

   function Decode
     (Item : UTF_8_String) return Wide_Wide_String;
   --  The input is encoded in UTF-8 and returned as a Wide_Wide_String value

   function Decode
     (Item : UTF_16_Wide_String) return Wide_Wide_String;
   --  The input is encoded in UTF-16 and returned as a Wide_Wide_String value

end Ada.Strings.UTF_Encoding.Wide_Wide_Encoding;

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                   ADA.STRINGS.UTF_ENCODING.CONVERSIONS                   --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for
--  converting between different UTF encodings.

package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);

   --  In the following conversion routines, a BOM in the input that matches
   --  the encoding scheme is ignored. A BOM is present in the output if the
   --  Output_BOM parameter is set to True.

   function Convert
     (Item          : UTF_String;
      Input_Scheme  : Encoding_Scheme;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean := True) return UTF_String;
   --  Convert from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
   --  by the Input_Scheme argument, and generate an output encoded in one of
   --  these three schemes as specified by the Output_Scheme argument.

   function Convert
     (Item       : UTF_8_String;
      Output_BOM : Boolean := True) return UTF_16_Wide_String;
   --  Convert from UTF-8 to UTF-16

   function Convert
     (Item          : UTF_16_Wide_String;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean := True) return UTF_String;
   --  Convert from UTF-16 to UTF-8, UTF-16LE, or UTF-16BE as specified by
   --  the Output_Scheme argument.

   function Convert
     (Item       : UTF_16_Wide_String;
      Output_BOM : Boolean := True) return UTF_8_String;
   --  Convert from UTF-16 to UTF-8

end Ada.Strings.UTF_Encoding.Conversions;

****************************************************************

From: Jean-Pierre Rosen
Sent: Tuesday, March 23, 2010  5:50 AM

> I propose the following for the string encoding package.
> Compared with the version in the current AI, the changes
> are:

Making a concrete proposal is certainly the best thing to do!

> Eliminate the long encodings, there was only one (UTF-16) and it just
> seems clutter to have parameters that can only have one value, given
> there is no permission to add additional UTF encodings. If additional
> encodings are to be handled in a given implementation, they should be
> added as child packages.
> Such child packages cannot in any case add to the enumeration type
> specifying the encoding method.

Agreed. The first proposal suggested that the enumeration type could be extended
for implementation-defined encodings. This was later rejected, but the (single)
value stayed.

> We could conceivably add implementation permission to add to the
> enumeration type etc, but I really think this is more generality than
> is useful. Really it is easier to have entirely separate child
> packages to handle additional encodings.
That was the general opinion.

Some comments on your packages:

I'd rather have the default for Output_BOM to false. In general, the BOM is
output only for the first line of a file, so the general case is to not output
the BOM.

You don't seem to specify what happens if the string provided to Decode starts
with a BOM that does not correspond to the expected Scheme. Do you agree to
raise Encoding_Error?

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010  9:12 AM

> Some comments on your packages:
>
> I'd rather have the default for Output_BOM to false. In general, the
> BOM is output only for the first line of a file, so the general case
> is to not output the BOM.

That's actually a typo, I meant to make the default False, so no disagreement
here (I switched from Suppress_Bom => True to Output_BOM => True, and of course
I meant to switch to Output_BOM => False :-))

> You don't seem to specify what happens if the string provided to
> Decode starts with a BOM that does not correspond to the expected
> Scheme. Do you agree to raise Encoding_Error?

Sure, that's fine, and makes good sense.

Thanks for the input, so if I make these two changes are you good to go?

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010  1:17 PM

Oops, I left out one conversion possibility, here is the updated version of the
conversions package

[See later messages - Editor.]

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010  1:24 PM

These versions incorporate the changes suggested by JPR
[Attachments skipped as these were changed again soon - Editor.]

****************************************************************

From: Tucker Taft
Sent: Friday, March 19, 2010  3:35 PM

I think these two Encode routines are ambiguous if all parameters are defaulted.
Hence I would recommend you remove the default for "Scheme".

    function Encode
      (Item       : Wide_String;
       Scheme     : Encoding_Scheme := UTF_8;
       Output_BOM : Boolean  := False) return UTF_String;
    --  Encode Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
    --  specified by the Output_Scheme parameter.

    function Encode
      (Item       : Wide_String;
       Output_BOM : Boolean  := False) return UTF_8_String;
    --  Encode Wide_String using UTF-8 encoding

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010  3:45 PM

> I think these two Encode routines are ambiguous if all parameters are
> defaulted.  Hence I would recommend you remove the default for
> "Scheme".

Indeed it was my intention to remove the default for Scheme in this case, will
do so!

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010  3:49 AM

done! In fact the idea is that when dealing with UTF_String arguments which can
be UTF-8/UTF-16LE/UTF-16BE, you are forced to specify the encoding. It makes no
sense to have a UTF-8 default, since there is a specific routine for the UTF-8
case (essentially providing this default in any case).

So the Scheme parameter being explicitly specified is important to signal the
reader (and the compiler!) that we are dealing with a case where all three
encodings are possible.
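
[Editor's illustration, not part of the original message.] A minimal sketch of how the overloads resolve once Scheme has no default. It assumes an Ada 2012 compiler and uses the child package name Wide_Strings from the !wording section rather than the draft name used in these messages; Scheme_Demo and the local names are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Scheme_Demo is
   package WS renames Ada.Strings.UTF_Encoding.Wide_Strings;

   Text : constant Wide_String := "abc";

   --  No scheme argument: the general overload cannot match because its
   --  scheme parameter has no default, so only the UTF-8 specific routine
   --  returns a String here and the call is unambiguous.
   As_UTF_8 : constant UTF_8_String := WS.Encode (Text);

   --  General form: the scheme must be written out at the call site.
   As_UTF_16BE : constant UTF_String := WS.Encode (Text, UTF_16BE);
begin
   --  Three ASCII characters occupy 3 octets in UTF-8 and 6 in UTF-16BE.
   Ada.Text_IO.Put_Line
     ("UTF-8 octets:   " & Natural'Image (As_UTF_8'Length));
   Ada.Text_IO.Put_Line
     ("UTF-16BE octets:" & Natural'Image (As_UTF_16BE'Length));
end Scheme_Demo;
```

With a defaulted Scheme, the call WS.Encode (Text) in a String context would match both String-returning overloads, which is the ambiguity reported above.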

****************************************************************

From: Robert Dewar
Sent: Wednesday, March 24, 2010  8:31 AM

Randy asked me why four packages, and thought it was overkill.

Here is my reasoning:

Basically there are three packages and a parent package, you need the parent
because it has useful common declarations like the BOM values.

The three packages are

1. Wide_String conversions to and from UTF

2. Wide_Wide_String conversions to and from UTF

3. Conversions between UTF forms

I separated 3 off, because as per my previous reference there are interesting
complex target-dependent approaches for these conversions, and so from an
implementation point of view if nothing else you don't want to have to deal with
this in the context of 1 and 2.

1 and 2 are separated because the wide string package can be used with Ada 95.
The great majority of Ada users in the US are using Ada 95 rather than any other
version, so it is useful if they can have access to this functionality.

****************************************************************

From: Tucker Taft
Sent: Wednesday, March 24, 2010  8:41 AM

I think Robert's proposed structure is reasonable.  I think we could still
describe them in a single RM section (I realize that would be a bit radical ;-).

****************************************************************

From: Robert Dewar
Sent: Thursday, March 25, 2010  10:41 PM

Attached are the specs of the final versions that I implemented.
I incorporated all the suggestions that I received.

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--              A D A . S T R I N G S . U T F _ E N C O D I N G             --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is one of the Ada 2012 packages defined in AI05-0137-1. It is a parent
--  package that contains declarations used in the child packages for handling
--  UTF encoded strings. Note: this package is consistent with Ada 95, and may
--  be used in Ada 95 or Ada 2005 mode.

with Interfaces;
with Unchecked_Conversion;

package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);

   subtype UTF_String is String;
   --  Used to represent a string of 8-bit values containing a sequence of
   --  values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE).
   --  Typically used in connection with a Scheme parameter indicating which
   --  of the encodings applies. This is not strictly a String value in the
   --  sense defined in the Ada RM, but in practice type String accommodates
   --  all possible 256 codes, and can be used to hold any sequence of 8-bit
   --  codes. We use String directly rather than create a new type so that
   --  all existing facilities for manipulating type String (e.g. the child
   --  packages of Ada.Strings) are available for manipulation of UTF_Strings.

   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   --  Used to specify which of three possible encodings apply to a UTF_String

   subtype UTF_8_String is String;
   --  Similar to UTF_String but specifically represents a UTF-8 encoded string

   subtype UTF_16_Wide_String is Wide_String;
   --  This is similar to UTF_8_String but is used to represent a Wide_String
   --  value which is a sequence of 16-bit values encoded using UTF-16. Again
   --  this is not strictly a Wide_String in the sense of the Ada RM, but the
   --  type Wide_String can be used to represent a sequence of arbitrary 16-bit
   --  values, and it is more convenient to use Wide_String than a new type.

   Encoding_Error : exception;
   --  This exception is raised in the following situations:
   --    a) A UTF encoded string contains an invalid encoding sequence
   --    b) A UTF-16BE or UTF-16LE input string has an odd length
   --    c) An incorrect character value is present in the Input string
   --    d) The result for a Wide_Character output exceeds 16#FFFF#
   --  The exception message has the index value where the error occurred.

   --  The BOM (BYTE_ORDER_MARK) values defined here are used at the start of
   --  a string to indicate the encoding. The convention in this package is
   --  that on input a correct BOM is ignored and an incorrect BOM causes an
   --  Encoding_Error exception. On output, the output string may or may not
   --  include a BOM depending on the setting of Output_BOM.

   BOM_8    : constant UTF_8_String :=
                Character'Val (16#EF#) &
                Character'Val (16#BB#) &
                Character'Val (16#BF#);

   BOM_16BE : constant UTF_String :=
                Character'Val (16#FE#) &
                Character'Val (16#FF#);

   BOM_16LE : constant UTF_String :=
                Character'Val (16#FF#) &
                Character'Val (16#FE#);

   BOM_16   : constant UTF_16_Wide_String :=
                (1 => Wide_Character'Val (16#FEFF#));

   function Encoding
     (Item    : UTF_String;
      Default : Encoding_Scheme := UTF_8) return Encoding_Scheme;
   --  This function inspects a UTF_String value to determine whether it
   --  starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, the result
   --  is the scheme corresponding to the BOM. If no valid BOM is present
   --  then the result is the specified Default value.

end Ada.Strings.UTF_Encoding;
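
[Editor's illustration, not part of the original message.] A minimal sketch of the BOM convention described in the spec above: Encoding reports the scheme named by a leading BOM or falls back to Default, and a decoder given a BOM for a different scheme raises Encoding_Error. It assumes an Ada 2012 compiler and the String child package under its !wording name (Ada.Strings.UTF_Encoding.Strings); BOM_Demo and the local names are hypothetical.

```ada
with Ada.Text_IO;                      use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure BOM_Demo is
   package US renames Ada.Strings.UTF_Encoding.Strings;

   --  "A" in UTF-16LE, preceded by the UTF-16LE byte order mark.
   With_BOM    : constant UTF_String := BOM_16LE & 'A' & Character'Val (0);
   Without_BOM : constant UTF_String := "plain";
begin
   --  The scheme is taken from the BOM when one is present.
   Put_Line (Encoding_Scheme'Image (Encoding (With_BOM)));

   --  Without a BOM the Default parameter decides the result.
   Put_Line (Encoding_Scheme'Image (Encoding (Without_BOM)));
   Put_Line (Encoding_Scheme'Image (Encoding (Without_BOM, UTF_16BE)));

   --  A UTF-16BE BOM handed to a UTF-8 decoder is rejected.
   begin
      declare
         S : constant String := US.Decode (BOM_16BE & "ab", UTF_8);
      begin
         Put_Line ("unexpectedly decoded" & Natural'Image (S'Length));
      end;
   exception
      when Encoding_Error =>
         Put_Line ("Encoding_Error: BOM does not match UTF-8");
   end;
end BOM_Demo;
```

Passing Encoding (Item) as the scheme argument of a Decode call is the "convenient form" that the comments in the child packages mention.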

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                   ADA.STRINGS.UTF_ENCODING.CONVERSIONS                   --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It provides conversions
--  from one UTF encoding method to another. Note: this package is consistent
--  with Ada 95, and may be used in Ada 95 or Ada 2005 mode.

package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);

   --  In the following conversion routines, a BOM in the input that matches
   --  the encoding scheme is ignored, an incorrect BOM causes Encoding_Error
   --  to be raised. A BOM is present in the output if the Output_BOM parameter
   --  is set to True.

   function Convert
     (Item          : UTF_String;
      Input_Scheme  : Encoding_Scheme;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean := False) return UTF_String;
   --  Convert from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
   --  by the Input_Scheme argument, and generate an output encoded in one of
   --  these three schemes as specified by the Output_Scheme argument.

   function Convert
     (Item          : UTF_String;
      Input_Scheme  : Encoding_Scheme;
      Output_BOM    : Boolean := False) return UTF_16_Wide_String;
   --  Convert from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
   --  by the Input_Scheme argument, and generate an output encoded in UTF-16.

   function Convert
     (Item          : UTF_8_String;
      Output_BOM    : Boolean := False) return UTF_16_Wide_String;
   --  Convert from UTF-8 to UTF-16

   function Convert
     (Item          : UTF_16_Wide_String;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean := False) return UTF_String;
   --  Convert from UTF-16 to UTF-8, UTF-16LE, or UTF-16BE as specified by
   --  the Output_Scheme argument.

   function Convert
     (Item          : UTF_16_Wide_String;
      Output_BOM    : Boolean := False) return UTF_8_String;
   --  Convert from UTF-16 to UTF-8

end Ada.Strings.UTF_Encoding.Conversions;
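
[Editor's illustration, not part of the original message.] A minimal sketch of calling the Convert overloads in the spec above, assuming an Ada 2012 compiler that provides Ada.Strings.UTF_Encoding.Conversions with these profiles. Convert_Demo and the local names are hypothetical; the assertions are only checked when assertions are enabled (for example with -gnata in GNAT).

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Convert_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  U+00E9 (e with acute accent) is the two octets C3 A9 in UTF-8.
   Source : constant UTF_8_String :=
              Character'Val (16#C3#) & Character'Val (16#A9#);

   --  UTF-8 to UTF-16 held in a Wide_String: one 16-bit code unit.
   As_UTF_16 : constant UTF_16_Wide_String := Conv.Convert (Source);

   --  UTF-8 to UTF-16BE octets with a BOM: FE FF 00 E9.
   As_Octets : constant UTF_String :=
                 Conv.Convert
                   (Item          => Source,
                    Input_Scheme  => UTF_8,
                    Output_Scheme => UTF_16BE,
                    Output_BOM    => True);

   --  UTF-16 back to UTF-8 reproduces the original octets.
   Round_Trip : constant UTF_8_String := Conv.Convert (As_UTF_16);
begin
   pragma Assert (As_UTF_16'Length = 1);
   pragma Assert (As_Octets'Length = 4);
   pragma Assert (Round_Trip = Source);
   Ada.Text_IO.Put_Line
     ("UTF-16BE octets with BOM:" & Natural'Image (As_Octets'Length));
end Convert_Demo;
```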

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                  ADA.STRINGS.UTF_ENCODING.WIDE_ENCODING                  --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT.  In accordance with the copyright of that document, you can freely --
-- copy and modify this specification,  provided that if you redistribute a --
-- modified version,  any changes that you have made are clearly indicated. --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for encoding
--  and decoding Wide_String values using UTF encodings. Note: this package is
--  consistent with Ada 95, and may be included in Ada 95 implementations.

package Ada.Strings.UTF_Encoding.Wide_Encoding is
   pragma Pure (Wide_Encoding);

   --  The encoding routines take a Wide_String as input and encode the result
   --  using the specified UTF encoding method. The result includes a BOM if
   --  the Output_BOM argument is set to True. Encoding_Error is raised if an
   --  invalid character appears in the input. In particular the characters
   --  in the range 16#D800# .. 16#DFFF# are invalid because they conflict
   --  with UTF-16 surrogate encodings, and the characters 16#FFFE# and
   --  16#FFFF# are also invalid because they conflict with BOM codes.

   function Encode
     (Item          : Wide_String;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean  := False) return UTF_String;
   --  Encode Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
   --  specified by the Output_Scheme parameter.

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean  := False) return UTF_8_String;
   --  Encode Wide_String using UTF-8 encoding

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean  := False) return UTF_16_Wide_String;
   --  Encode Wide_String using UTF-16 encoding

   --  The decoding routines take a UTF String as input, and return a decoded
   --  Wide_String. If the UTF String starts with a BOM that matches the
   --  encoding method, it is ignored. An incorrect BOM raises Encoding_Error.

   function Decode
     (Item         : UTF_String;
      Input_Scheme : Encoding_Scheme) return Wide_String;
   --  The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by the
   --  Input_Scheme parameter. It is decoded and returned as a Wide_String
   --  value. Note: a convenient form for scheme may be Encoding (UTF_String).

   function Decode
     (Item : UTF_8_String) return Wide_String;
   --  The input is encoded in UTF-8 and returned as a Wide_String value

   function Decode
     (Item : UTF_16_Wide_String) return Wide_String;
   --  The input is encoded in UTF-16 and returned as a Wide_String value

end Ada.Strings.UTF_Encoding.Wide_Encoding;

****************************************************************

From: Robert Dewar
Sent: Saturday, March 27, 2010  10:10 AM

AARGH!

I just noticed that we completely forgot the case of converting to and from
type String. This is not trivial: it is easy, but the conversion between String
and UTF_8_String is not an identity!

I will add the necessary Ada.Strings.UTF_Encoding.Encoding package for the
String case.
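
[Editor's illustration, not part of the original message.] A minimal sketch of why String to UTF_8_String is not an identity: a Latin-1 character above 16#7F# becomes two octets in UTF-8. It assumes an Ada 2012 compiler and uses the package name Ada.Strings.UTF_Encoding.Strings from the !wording section; Latin_1_Demo and the local names are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Latin_1_Demo is
   package US renames Ada.Strings.UTF_Encoding.Strings;

   --  One Latin-1 character: e with acute accent, position 16#E9#.
   Latin_1 : constant String := (1 => Character'Val (16#E9#));

   --  Its UTF-8 form is the two octets C3 A9, so the lengths differ.
   Encoded : constant UTF_8_String := US.Encode (Latin_1);

   --  Decoding restores the original single character.
   Decoded : constant String := US.Decode (Encoded);
begin
   Ada.Text_IO.Put_Line
     ("String length:" & Natural'Image (Latin_1'Length));
   Ada.Text_IO.Put_Line
     ("UTF-8 length: " & Natural'Image (Encoded'Length));
   Ada.Text_IO.Put_Line
     ("Round trip ok: " & Boolean'Image (Decoded = Latin_1));
end Latin_1_Demo;
```

Only characters in the range 0 .. 16#7F# encode to themselves, so treating a String as already being UTF-8 is safe only for pure ASCII text.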

****************************************************************

From: Robert Dewar
Sent: Saturday, March 27, 2010  10:55 AM

Oops, that name conflicts with Encoding in the parent package, it's a bit too
generic anyway. I will change the names of the three child packages to

Ada.Strings.UTF_Encoding.String_Encoding;
Ada.Strings.UTF_Encoding.Wide_String_Encoding;
Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding;

Really better in any case ...

****************************************************************

From: Jean-Pierre Rosen
Sent: Saturday, March 27, 2010  11:30 AM

>> AARGH!
[...]
> Oops
[...]

I think I'll still hold my breath a little bit longer before wording
AI05-0137-2 ...

****************************************************************

From: Robert Dewar
Sent: Saturday, March 27, 2010  11:34 AM

I will send out final versions of packages later today ...
Really hard to get things right till you write the whole implementation along
with complete test programs.

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, April 28, 2010  11:51 AM

I'm currently trying to reword this AI after Robert's latest proposal (which he
didn't change for a while ;-) ). Since it is largely different from the previous
(and ARG approved) proposal, I'll dub it AI05-0137-2.

1) Is there any reason to keep these packages as children of Ada.Strings? There
   is nothing used from the Strings package, and nothing in common with the
   other children of Ada.Strings.

2) Is it a problem to put this stuff as A.5, and shift all following subclause
   numbers by 1 ?

****************************************************************

From: Randy Brukardt
Sent: Wednesday, April 28, 2010  12:07 PM

I'd much prefer to avoid clause number changes, because it invalidates every
existing clause reference (in any forum: web, compiler error messages, AIs,
books, etc.) without any indication (there would be no /3 on unchanged but moved
paragraphs). I was dubious about rearranging the containers alone (although the
group decided to do that) for this reason, and those are solely new text only
found in Ada 2005. You're talking about renumbering all of the I/O (sequential,
direct, stream) as well as Ada.Directories and Ada.Containers. That seems like
way too much.

I think (1) follows from (2). It is perfectly logical to have this under
Ada.Strings: it is a set of string operations, it operates on the same types as
Ada.Strings.Fixed, etc. People are used to looking in Ada.Strings for string
operations, why not conversions? And it means less name pollution and a logical
place in the Standard that doesn't renumber everything.

If we were starting from scratch, I'd be more sympathetic to making it a
separate package, but then I think we'd want to integrate it better with the
rest of the library (meaning it probably would still end up part of
Ada.Strings).

****************************************************************

From: Bob Duff
Sent: Wednesday, April 28, 2010  12:53 PM

I agree with Randy -- better not to renumber sections, and it's just fine under
Strings.

****************************************************************

From: Jean-Pierre Rosen
Sent: Thursday, April 29, 2010  8:33 AM

> I'd much prefer to avoid clause number changes, because it invalidates
> every existing clause reference (in any forum: web, compiler error
> messages, AIs, books, etc.) without any indication (there would be no
> /3 on unchanged but moved paragraphs). I was dubious about rearranging
> the containers alone (although the group decided to do that) for this
> reason, and those are solely new text only found in Ada 2005. You're
> talking about renumbering all of the I/O (sequential, direct, stream)
> as well as Ada.Directories and Ada.Containers. That seems like way too much.

I suspected that, that's why I asked. Now, the problem is that I have to put 4
packages in the same subclause (I don't think we are allowed to have a 4th level
of titles, and I can find no precedent in the RM on how to present that, save
for 3-line units).

Should I put the 4 packages upfront ("the following packages exist:") and then
all the explanations? Group explanations with each package specs (i.e. like if
there were a fourth level)?

****************************************************************

From: Randy Brukardt
Sent: Thursday, April 29, 2010  12:26 PM

There are two answers here.

First, ISO allows 6(!) levels of subclauses. Moreover, since ASIS originally
used 5 levels, I had to update the tools to support 4 levels (we got rid of the
5th level). So there is no problem using 4 levels if you need to. The objection
might be from people who have tools that only support 3 levels (Janus/Ada
compiler error messages are like that, we used a binary format for clause
references to support never-built tools that directly link to the RM text).

Second, we've put multiple packages in a clause before. 9.6.1 (Formatting, Time
Zones, and other operations for Time) comes to mind. (I think there are others
but I can't remember any right now.) If you copy that style, you would be
consistent with at least something.

****************************************************************

From: Robert Dewar
Sent: Thursday, April 29, 2010  12:40 PM

> I think (1) follows from (2). It is perfectly logical to have this
> under
> Ada.Strings: it is a set of string operations, it operates on the same
> types as Ada.Strings.Fixed, etc. People are used to looking in
> Ada.Strings for string operations, why not conversions? And it means
> less name pollution and a logical place in the Standard that doesn't renumber
> everything.

I prefer to keep it in Ada.Strings for the reasons Randy states above

****************************************************************

From: Randy Brukardt
Sent: Friday, June 11, 2010  1:35 AM

> I have emptied the !corrigendum section, but left the !appendix
> section; although the discussion is not about this version, it seemed
> still relevant. I trust Randy to do The Right Thing anyway.

OK, here are some problems with the wording:

(1) The introductory text vanished. Not sure why (9.6.1 is just additional
    operations, and they are a real scatter, but there is no such problem with
    this). This is a separate block of functionality and it ought to be
    described. So I put the introduction back as:

Facilities for encoding, decoding, and converting strings in various character
encoding schemes are provided by packages Strings.UTF_Encoding,
Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Wide_Encoding, and
Strings.UTF_Encoding.Wide_Wide_Encoding.

This format is similar to those for A.4.7 and A.4.8. (A.4.9 and A.4.10 are also
missing this introductory text: anyone want to volunteer to write those??)

(2) The lead-in text is very different than the text from the other clauses in
    A.4. Replace it by:

   The encoding library packages have the following declaration:

(3) The following text uses an ordered list:

The exception Encoding_Error is also propagated in the following
situations:
    a) By a Decode function when an UTF encoded string contains an
       invalid encoding sequence.
    b) By a Decode function when the expected encoding is UTF-16BE or
       UTF-16LE and the input string has an odd length.
    c) By a Decode function yielding a Wide_String when the decoding
       of a sequence results in a code-point whose value exceeds
       16#FFFF#
    d) by an Encode function taking a Wide_String as input when an
       invalid character appears in the input. In particular the
       characters whose position is in the range 16#D800# .. 16#DFFF#
       are invalid because they conflict with UTF-16 surrogate
       encodings, and the characters whose position is 16#FFFE# or
       16#FFFF# are also invalid because they conflict with BOM codes.

But there is no ordering here, thus this violates the ISO rules for writing
Standards (and good taste). These should be bullets. (Having the same
capitalization would be nice, too.)

****************************************************************

From: Randy Brukardt
Sent: Friday, June 11, 2010  1:49 AM

I just noticed that the proposed packages are missing part of Robert's proposal.

Robert's last "ARGGH" message (mail of March 27th) noted that we needed to add
an encoding package for type String, and that forces renaming all of the
packages. Neither of these things was done.

Robert had suggested names of:

Ada.Strings.UTF_Encoding.String_Encoding;
Ada.Strings.UTF_Encoding.Wide_String_Encoding;
Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding;

These seem like candidates for inclusion in the museum of redundancy museum,
with both "String" and "Encoding" repeated twice.

There also seems to be something wrong here, because these are very odd packages
for Strings: there are no Wide_String and Wide_Wide_String versions. (All
existing Ada.Strings packages come in all three versions.) But of course there
*are* Wide and Wide_Wide versions, they just have the *wrong* names! There
aren't supposed to be any Wide_Wide_String operations under Ada.Strings! These
really ought to be called:

Ada.Strings.UTF_Encoding.String_Encoding;
Ada.Wide_Strings.UTF_Encoding.Wide_String_Encoding;
Ada.Wide_Wide_Strings.UTF_Encoding.Wide_Wide_String_Encoding;

but that doesn't work because we can't share the parent in that case.

What to do?? One possibility is to put them under Ada.Characters (these aren't
that different than Ada.Characters.Handling, and it is usual under
Ada.Characters to not have Wide and Wide_Wide versions). Then the structure can
be left the same.

Alternatively, we could split these up and use appropriate with clauses for the
shared text:

Ada.Characters.UTF_Encoding;
Ada.Strings.UTF_Encoding;
Ada.Wide_Strings.Wide_UTF_Encoding;
Ada.Wide_Wide_Strings.Wide_Wide_UTF_Encoding;

Perhaps there is even a better idea??

****************************************************************

From: Robert Dewar
Sent: Friday, June 11, 2010  7:06 AM

> Robert had suggested names of:
>
> Ada.Strings.UTF_Encoding.String_Encoding;
> Ada.Strings.UTF_Encoding.Wide_String_Encoding;
> Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding;

Note that's what is implemented now in GNAT, I am happy with these names, but
don't mind too much if someone wants to put these packages somewhere else, I
would be willing to move them around and rename them.

> Alternatively, we could split these up and use appropriate with
> clauses for the shared text:
>
> Ada.Characters.UTF_Encoding;
> Ada.Strings.UTF_Encoding;
> Ada.Wide_Strings.Wide_UTF_Encoding;
> Ada.Wide_Wide_Strings.Wide_Wide_UTF_Encoding;

I do not like at all any further reorganization of the packages.
I wasted enough time on these (and believe me, no one else at AdaCore has the
slightest interest in them :-) :-))

****************************************************************

From: Robert Dewar
Sent: Monday, July 26, 2010  6:53 AM

I have reviewed the packages in this AI (but not all the commentary, I will try
to find time to do that some time).

Right now GNAT implements:

> a-stuten.ads:package Ada.Strings.UTF_Encoding
> a-suenco.ads:package Ada.Strings.UTF_Encoding.Conversions
> a-suesen.ads:package Ada.Strings.UTF_Encoding.String_Encoding
> a-suewse.ads:package Ada.Strings.UTF_Encoding.Wide_String_Encoding
> a-suezse.ads:package Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding

And the proposal has the packages

> package Ada.Strings.UTF_Encoding
> package Ada.Strings.UTF_Encoding.Conversions
> package Ada.Strings.UTF_Encoding.Wide_Encoding
> package Ada.Strings.UTF_Encoding.Wide_Wide_Encoding

I somewhat prefer my names, but I don't mind changing. I do intend to keep the
GNAT package that is now called String_Encoding; its contents are:

> package Ada.Strings.UTF_Encoding.String_Encoding is
>    pragma Pure (String_Encoding);
>
>    --  The encoding routines take a String as input and encode the result
>    --  using the specified UTF encoding method. The result includes a BOM if
>    --  the Output_BOM argument is set to True. All 256 values of type Character
>    --  are valid, so Encoding_Error cannot be raised for string input data.
>
>    function Encode
>      (Item          : String;
>       Output_Scheme : Encoding_Scheme;
>       Output_BOM    : Boolean  := False) return UTF_String;
>    --  Encode String using UTF-8, UTF-16LE or UTF-16BE encoding as specified by
>    --  the Output_Scheme parameter.
>
>    function Encode
>      (Item       : String;
>       Output_BOM : Boolean  := False) return UTF_8_String;
>    --  Encode String using UTF-8 encoding
>
>    function Encode
>      (Item       : String;
>       Output_BOM : Boolean  := False) return UTF_16_Wide_String;
>    --  Encode String using UTF_16 encoding
>
>    --  The decoding routines take a UTF String as input, and return a decoded
>    --  String. If the UTF String starts with a BOM that matches the
>    --  encoding method, it is ignored. An incorrect BOM raises Encoding_Error,
>    --  as does a code out of range of type Character.
>
>    function Decode
>      (Item         : UTF_String;
>       Input_Scheme : Encoding_Scheme) return String;
>    --  The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by the
>    --  Input_Scheme parameter. It is decoded and returned as a String value.
>    --  Note: a convenient form for scheme may be Encoding (UTF_String).
>
>    function Decode
>      (Item : UTF_8_String) return String;
>    --  The input is encoded in UTF-8 and returned as a String value
>
>    function Decode
>      (Item : UTF_16_Wide_String) return String;
>    --  The input is encoded in UTF-16 and returned as a String value
>
> end Ada.Strings.UTF_Encoding.String_Encoding;

I don't really care one way or another if this is included in the standard or
not (it doesn't affect us), obviously I think it should be included.

I do have to think about its name. If I change the names of the Wide and
Wide_Wide packages to match the AI, then I suppose I will call this additional
package

Ada.Strings.UTF_Encoding.Encoding

which seems a bit redundant.

I really prefer my names, they emphasize that the three packages deal with
Strings, Wide_Strings and Wide_Wide_Strings. In fact I think I will keep my
names anyway, but I will provide the AI names as renamings if people really feel
they are better.

Thoughts?
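
For illustration only, here is a minimal sketch of how a client might call the
String-specific routines quoted above. It assumes the package name from the AI
wording (Ada.Strings.UTF_Encoding.Strings) rather than the GNAT name
String_Encoding used in this message, and it assumes the Encoding function
declared in the parent package, which the quoted comment mentions. The
procedure name UTF_Round_Trip and the sample text are invented for the
example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure UTF_Round_Trip is
   --  Assumption: the package names are those of the AI wording, not the
   --  String_Encoding name that GNAT used when this message was written.
   package UTF renames Ada.Strings.UTF_Encoding;
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  Sample text (hypothetical): "caf" followed by Latin-1 e-acute
   Original : constant String := "caf" & Character'Val (16#E9#);

   --  The result type selects the overload: UTF-8 bytes held in a String
   As_UTF_8 : constant UTF.UTF_8_String := Enc.Encode (Original);

   --  The result type selects the overload: UTF-16 held in a Wide_String
   As_UTF_16 : constant UTF.UTF_16_Wide_String := Enc.Encode (Original);

   --  Explicit scheme, with a BOM placed at the front of the result
   As_UTF_16BE : constant UTF.UTF_String :=
     Enc.Encode (Original, UTF.UTF_16BE, Output_BOM => True);
begin
   Ada.Text_IO.Put_Line
     ("UTF-8 length    :" & Integer'Image (As_UTF_8'Length));
   Ada.Text_IO.Put_Line
     ("UTF-16 length   :" & Integer'Image (As_UTF_16'Length));
   Ada.Text_IO.Put_Line
     ("UTF-16BE length :" & Integer'Image (As_UTF_16BE'Length));

   --  Decode each form and compare it with the original text. For the
   --  UTF_String form, Encoding inspects the BOM to pick the scheme.
   if Enc.Decode (As_UTF_8) = Original
     and then Enc.Decode (As_UTF_16) = Original
     and then Enc.Decode (As_UTF_16BE, UTF.Encoding (As_UTF_16BE)) = Original
   then
      Ada.Text_IO.Put_Line ("Round trip OK");
   else
      Ada.Text_IO.Put_Line ("Round trip FAILED");
   end if;
end UTF_Round_Trip;
```

Because UTF_8_String and UTF_String are both subtypes of String, the Encode
and Decode overloads are told apart by their parameter profiles and by the
Wide_String result or argument, not by the subtype names.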

****************************************************************

From: Tucker Taft
Sent: Monday, July 26, 2010  9:34 AM

We debated various namings and came
up with the ones in the AI.  It is not
clear whether there is sufficient reason to reopen the AI.

****************************************************************

From: Robert Dewar
Sent: Monday, July 26, 2010  10:03 AM

Yes, but note the additional factor that you are missing a package, and the name

Ada.Strings.UTF_Encoding.Encoding

for the String specific version is distinctly unpleasant in my view! It was the
addition of this package (which is definitely needed) that made me add String to
the names. The real issue here is whether to add the missing package.

Once that is decided, the issue of what to call it can be addressed.

In the case of

package Ada.Strings.UTF_Encoding.Wide_Wide_Encoding

You know it has to do with Wide_Wide_String, but that hint is not there for the
base package.

As I say, I am going to maintain the existing names in GNAT for this reason, but
it's of course trivial to add renames for the 2/3 packages that are in the
standard.

****************************************************************

From: Randy Brukardt
Sent: Monday, July 26, 2010  1:19 PM

That would be true, except that (a) the AI Robert referenced (version /01)
doesn't reflect the discussion from Valencia; and (b) we only voted intent in
Valencia because of the extensive changes that the AI underwent there.

Jean-Pierre sent a newer version of this AI to the ARG list on June 25th; I
haven't posted it yet, but anyone planning to comment on the AI needs to refer
to that most recent draft and not previous versions.

Thus I agree with Tucker that Robert's comments are moot, but for a different
reason: he's referring to an obsolete version of the AI.

P.S. Note that the package names were changed again, this time because of a
complaint that is recorded as coming from Ed: too many "Encoding"s in the names.

****************************************************************

From: Robert Dewar
Sent: Monday, July 26, 2010  5:42 PM

Can we at least see the current modified specs ASAP. I really don't care about
all the text, just the specs are enough.

BTW, when you use the word moot, are you using it in the correct sense of
"undecided, arguable", or in the peculiar common US modern sense of
"irrelevant". I recommend against using the word at all because of this
confusion.

****************************************************************

From: Randy Brukardt
Sent: Monday, July 26, 2010  6:40 PM

> Can we at least see the current modified specs ASAP. I really don't
> care about all the text, just the specs are enough.

I've forwarded J-P's old message to you (note that I haven't edited it yet, so
there may be some small changes before it gets posted). For any one else that is
interested, it was sent June 25th to this list.

> BTW, when you use the word moot, are you using it in the correct sense
> of "undecided, arguable", or in the peculiar common US modern sense of
> "irrelevant". I recommend against using the word at all because of
> this confusion.

I think I used it correctly, but I think whether I used it correctly is moot
(using it correctly). :-)

I actually thought about this before I used the word, because I knew that you
have complained about inaccurate use of this word in the past and that the
definition isn't "irrelevant". In this case, I meant that your comments ought
"to be left undecided", regardless of their technical merits, because they were
superseded by the full ARG meeting.

****************************************************************

From: Robert Dewar
Sent: Monday, July 26, 2010  6:47 PM

> I've forwarded J-P's old message to you (note that I haven't edited it
> yet, so there may be some small changes before it gets posted). For
> any one else that is interested, it was sent June 25th to this list.

So this message basically represents the current proposal?

>> BTW, when you use the word moot, are you using it in the correct
>> sense of "undecided, arguable", or in the peculiar common US modern
>> sense of "irrelevant". I recommend against using the word at all
>> because of this confusion.
>
> I think I used it correctly, but I think whether I used it correctly
> is moot (using it correctly). :-)
>
> I actually thought about this before I used the word, because I knew
> that you have complained about inaccurate use of this word in the past
> and that the definition isn't "irrelevant". In this case, I meant that
> your comments ought "to be left undecided", regardless of their
> technical merits, because they were superseded by the full ARG meeting.

fair enough

****************************************************************

From: Robert Dewar
Sent: Monday, July 26, 2010  6:49 PM

I will look at the new version and comment ASAP!

****************************************************************

From: Robert Dewar
Sent: Monday, July 26, 2010  7:18 PM

I read the specs in the latest version Randy sent to me, and they are fine, I
agree that removing the second Encoding from the names makes sense, and I will
make that change in the GNAT implementation.

****************************************************************

From: Randy Brukardt
Sent: Monday, July 26, 2010  7:42 PM

For the record, I've now checked over Jean-Pierre's version carefully, and while
I found a couple of typos and the !corrigendum section was incompletely updated,
I didn't find any significant problems.

****************************************************************

