!standard A.4.11                                 10-10-19  AI05-0137-2/03
!class Amendment 10-05-07
!status Amendment 2012 10-05-07
!status WG9 Approved 10-10-28
!status ARG Approved 6-0-3  10-06-20
!status work item 10-05-07
!status received 10-03-15
!priority Medium
!difficulty Easy
!subject String encoding packages

!summary

New child packages of Ada.Strings are added to support conversions between String/Wide_String/Wide_Wide_String and UTF_8/UTF_16 encoding. These packages are intended to replace the already approved versions from AI05-0137-1.

!problem

SI99-0041 requires the adoption of UTF_16 for the encoding of program text in ASIS. Similarly, many real-world applications use UTF-8 or UTF-16 encodings. However, the Ada Standard provides no way to actually construct or use such text strings.

It would be useful for ASIS users, but also for the Ada community at large to define a package to handle encoding/decoding between String/Wide_String/Wide_Wide_String and UTF_8/UTF_16.

!proposal

(See summary.)

!wording

The following clause is added as A.4.11:

A.4.11 String encoding

Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.

Static Semantics

The encoding library packages have the following declarations:

   package Ada.Strings.UTF_Encoding is
      pragma Pure (UTF_Encoding);

      -- Declarations common to the string encoding packages
      type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

      subtype UTF_String is String;

      subtype UTF_8_String is String;

      subtype UTF_16_Wide_String is Wide_String;

      Encoding_Error : exception;

      BOM_8    : constant UTF_8_String :=
                   Character'Val (16#EF#) &
                   Character'Val (16#BB#) &
                   Character'Val (16#BF#);

      BOM_16BE : constant UTF_String :=
                   Character'Val (16#FE#) &
                   Character'Val (16#FF#);

      BOM_16LE : constant UTF_String :=
                   Character'Val (16#FF#) &
                   Character'Val (16#FE#);

      BOM_16   : constant UTF_16_Wide_String :=
                   (1 => Wide_Character'Val (16#FEFF#));

      function Encoding (Item    : UTF_String;
                         Default : Encoding_Scheme := UTF_8)
         return Encoding_Scheme;

   end Ada.Strings.UTF_Encoding;

   package Ada.Strings.UTF_Encoding.Conversions is
      pragma Pure (Conversions);

      -- Conversions between various encoding schemes
      function Convert (Item          : UTF_String;
                        Input_Scheme  : Encoding_Scheme;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

      function Convert (Item          : UTF_String;
                        Input_Scheme  : Encoding_Scheme;
                        Output_BOM    : Boolean := False)
         return UTF_16_Wide_String;

      function Convert (Item          : UTF_8_String;
                        Output_BOM    : Boolean := False)
         return UTF_16_Wide_String;

      function Convert (Item          : UTF_16_Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False) return UTF_String;

      function Convert (Item          : UTF_16_Wide_String;
                        Output_BOM    : Boolean := False) return UTF_8_String;

   end Ada.Strings.UTF_Encoding.Conversions;

   package Ada.Strings.UTF_Encoding.Strings is
      pragma Pure (Strings);

      -- Encoding / decoding between String and various encoding schemes
      function Encode (Item          : String;
                       Output_Scheme : Encoding_Scheme;
                       Output_BOM    : Boolean := False) return UTF_String;

      function Encode (Item       : String;
                       Output_BOM : Boolean := False) return UTF_8_String;

      function Encode (Item       : String;
                       Output_BOM : Boolean := False)
         return UTF_16_Wide_String;

      function Decode (Item         : UTF_String;
                       Input_Scheme : Encoding_Scheme) return String;

      function Decode (Item : UTF_8_String) return String;

      function Decode (Item : UTF_16_Wide_String) return String;

   end Ada.Strings.UTF_Encoding.Strings;

   package Ada.Strings.UTF_Encoding.Wide_Strings is
      pragma Pure (Wide_Strings);

      -- Encoding / decoding between Wide_String and various encoding schemes
      function Encode (Item          : Wide_String;
                       Output_Scheme : Encoding_Scheme;
                       Output_BOM    : Boolean := False) return UTF_String;

      function Encode (Item       : Wide_String;
                       Output_BOM : Boolean := False) return UTF_8_String;

      function Encode (Item       : Wide_String;
                       Output_BOM : Boolean := False)
         return UTF_16_Wide_String;

      function Decode (Item         : UTF_String;
                       Input_Scheme : Encoding_Scheme) return Wide_String;

      function Decode (Item : UTF_8_String) return Wide_String;

      function Decode (Item : UTF_16_Wide_String) return Wide_String;

   end Ada.Strings.UTF_Encoding.Wide_Strings;

   package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
      pragma Pure (Wide_Wide_Strings);

      -- Encoding / decoding between Wide_Wide_String and various encoding schemes
      function Encode (Item          : Wide_Wide_String;
                       Output_Scheme : Encoding_Scheme;
                       Output_BOM    : Boolean := False) return UTF_String;

      function Encode (Item       : Wide_Wide_String;
                       Output_BOM : Boolean := False) return UTF_8_String;

      function Encode (Item       : Wide_Wide_String;
                       Output_BOM : Boolean := False)
         return UTF_16_Wide_String;

      function Decode (Item         : UTF_String;
                       Input_Scheme : Encoding_Scheme) return Wide_Wide_String;

      function Decode (Item : UTF_8_String) return Wide_Wide_String;

      function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;

   end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646.
UTF_16BE corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 8 bits, big endian; and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits, little endian.

The subtype UTF_String is used to represent a String of 8-bit values containing a sequence of values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent a String of 8-bit values containing a sequence of values encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent a Wide_String of 16-bit values containing a sequence of values encoded in UTF-16.

The BOM_8, BOM_16BE, BOM_16LE and BOM_16 constants correspond to values used at the start of a string to indicate the encoding.

For all Convert and Decode functions, an initial BOM in the input that matches the expected encoding scheme is ignored, and a different initial BOM causes Encoding_Error to be propagated.

For all Convert and Encode functions, a BOM is included at the start of the output string if the Output_BOM parameter is set to True.

The exception Encoding_Error is also propagated in the following situations:

  * By a Decode function when a UTF encoded string contains an invalid encoding sequence.

  * By a Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.

  * By a Decode function yielding a String when the decoding of a sequence results in a code-point whose value exceeds 16#FF#.

  * By a Decode function yielding a Wide_String when the decoding of a sequence results in a code-point whose value exceeds 16#FFFF#.

  * By an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes.
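
As an illustration only, the following sketch exercises three of the situations listed above through the proposed Strings child package. The procedure name, the helper Fails, and the test strings are hypothetical and not part of the wording; the expected outcomes in the comments follow from the rules above rather than from any particular implementation.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Encoding_Error_Demo is
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_Str renames Ada.Strings.UTF_Encoding.Strings;

   --  Hypothetical helper (not part of the proposal): True if decoding
   --  Item to a String under Scheme propagates Encoding_Error.
   function Fails
     (Item : UTF.UTF_String; Scheme : UTF.Encoding_Scheme) return Boolean is
   begin
      declare
         Ignored : constant String := UTF_Str.Decode (Item, Scheme);
      begin
         return False;
      end;
   exception
      when UTF.Encoding_Error =>
         return True;
   end Fails;

   --  UTF-8 encoding of U+0100, which does not fit in a Character.
   Too_Wide : constant UTF.UTF_String :=
     Character'Val (16#C4#) & Character'Val (16#80#);

   --  Three bytes cannot hold a whole number of 16-bit values.
   Odd_Length : constant UTF.UTF_String :=
     Character'Val (0) & 'A' & Character'Val (0);

   --  A UTF-16LE BOM in front of text that is then decoded as UTF-8.
   Wrong_BOM : constant UTF.UTF_String := UTF.BOM_16LE & "A";

   procedure Report (Label : String; Raised : Boolean) is
   begin
      Ada.Text_IO.Put_Line (Label & ": " & Boolean'Image (Raised));
   end Report;
begin
   --  The first three calls should report TRUE under the rules above;
   --  the last one is a valid input and should report FALSE.
   Report ("code point above 16#FF#", Fails (Too_Wide, UTF.UTF_8));
   Report ("odd-length UTF-16BE   ", Fails (Odd_Length, UTF.UTF_16BE));
   Report ("mismatched BOM        ", Fails (Wrong_BOM, UTF.UTF_8));
   Report ("plain ASCII           ", Fails ("A", UTF.UTF_8));
end Encoding_Error_Demo;
```

The same Too_Wide input decodes without error through the Wide_Strings child, since U+0100 fits in a Wide_Character.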

Each of the Convert and Encode functions returns a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) value whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme required by the function or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned. The lower bound of the returned string is 1.

Each of the Decode functions takes a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) Item parameter which is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter, and returns the corresponding String, Wide_String or Wide_Wide_String value. The lower bound of the returned string is 1.

   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
      return Encoding_Scheme;

      Inspects a UTF_String value to determine whether it starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding to the BOM; returns the value of Default otherwise.

   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;

      Converts from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and generates an output encoded in one of these three schemes as specified by Output_Scheme.

   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;

      Converts from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and generates an output encoded in UTF-16.

   function Convert (Item          : UTF_8_String;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;

      Converts from input encoded in UTF-8 and generates an output encoded in UTF-16.

   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;

      Converts from input encoded in UTF-16 and generates an output encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.

   function Convert (Item          : UTF_16_Wide_String;
                     Output_BOM    : Boolean := False) return UTF_8_String;

      Converts from input encoded in UTF-16 and generates an output encoded in UTF-8.

   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;

      Encodes from String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.

   function Encode (Item       : String;
                    Output_BOM : Boolean := False) return UTF_8_String;

      Encodes from String input, and generates an output encoded in UTF-8 encoding.

   function Encode (Item       : String;
                    Output_BOM : Boolean := False) return UTF_16_Wide_String;

      Encodes from String input, and generates an output encoded in UTF-16 encoding.

   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return String;

      Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding String value.

   function Decode (Item : UTF_8_String) return String;

      Decodes from input encoded in UTF-8, and returns the corresponding String value.

   function Decode (Item : UTF_16_Wide_String) return String;

      Decodes from input encoded in UTF-16, and returns the corresponding String value.

   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;

      Encodes from Wide_String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.

   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean := False) return UTF_8_String;

      Encodes from Wide_String input, and generates an output encoded in UTF-8 encoding.

   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean := False) return UTF_16_Wide_String;

      Encodes from Wide_String input, and generates an output encoded in UTF-16 encoding.

   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_String;

      Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding Wide_String value.

   function Decode (Item : UTF_8_String) return Wide_String;

      Decodes from input encoded in UTF-8, and returns the corresponding Wide_String value.

   function Decode (Item : UTF_16_Wide_String) return Wide_String;

      Decodes from input encoded in UTF-16, and returns the corresponding Wide_String value.

   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;

      Encodes from Wide_Wide_String input, and generates an output encoded in UTF-8, UTF-16LE or UTF-16BE encoding as specified by Output_Scheme.

   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean := False) return UTF_8_String;

      Encodes from Wide_Wide_String input, and generates an output encoded in UTF-8 encoding.

   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean := False) return UTF_16_Wide_String;

      Encodes from Wide_Wide_String input, and generates an output encoded in UTF-16 encoding.

   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_Wide_String;

      Decodes from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme, and returns the corresponding Wide_Wide_String value.

   function Decode (Item : UTF_8_String) return Wide_Wide_String;

      Decodes from input encoded in UTF-8, and returns the corresponding Wide_Wide_String value.

   function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;

      Decodes from input encoded in UTF-16, and returns the corresponding Wide_Wide_String value.
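
A minimal sketch of how the Encode, Convert, and Decode overloads described above might be combined, assuming an implementation of the packages as specified. The procedure name and the sample text are hypothetical. Note that the overloads without a scheme parameter are selected by the expected result type or by the type of the actual parameter.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Round_Trip_Demo is
   package UTF  renames Ada.Strings.UTF_Encoding;
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;
   package WWS  renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Sample text: U+00E9 (e with acute) followed by U+20AC (euro sign).
   Text : constant Wide_Wide_String :=
     Wide_Wide_Character'Val (16#E9#) & Wide_Wide_Character'Val (16#20AC#);

   --  Static scheme: the String result type selects the UTF-8 overload.
   As_UTF_8 : constant UTF.UTF_8_String := WWS.Encode (Text);

   --  UTF-8 to UTF-16 without passing through Wide_Wide_String.
   As_UTF_16 : constant UTF.UTF_16_Wide_String := Conv.Convert (As_UTF_8);

   --  Dynamic scheme: UTF-16LE stored in a String, with a leading BOM.
   As_16LE : constant UTF.UTF_String :=
     WWS.Encode (Text, UTF.UTF_16LE, Output_BOM => True);

   --  Decoding either form should give back the original text.
   Back_1 : constant Wide_Wide_String := WWS.Decode (As_UTF_16);
   Back_2 : constant Wide_Wide_String := WWS.Decode (As_16LE, UTF.UTF_16LE);
begin
   --  U+00E9 takes 2 bytes in UTF-8 and U+20AC takes 3, so 5 in total.
   Ada.Text_IO.Put_Line ("UTF-8 length   :" & Integer'Image (As_UTF_8'Length));
   --  Both characters are below 16#10000#: one 16-bit value each.
   Ada.Text_IO.Put_Line ("UTF-16 length  :" & Integer'Image (As_UTF_16'Length));
   --  2 bytes of BOM plus 2 bytes per character.
   Ada.Text_IO.Put_Line ("UTF-16LE length:" & Integer'Image (As_16LE'Length));
   Ada.Text_IO.Put_Line
     ("round trips    : " & Boolean'Image (Back_1 = Text and Back_2 = Text));
end Round_Trip_Demo;
```

The BOM written by Encode does not need to be stripped before calling Decode with the matching scheme, since a matching initial BOM is ignored.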

Implementation Advice

If an implementation supports other encoding schemes, similar children of Ada.Strings should be defined.

Note: A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or other entity to indicate the encoding; it is skipped when decoding. Typically, only the first line of a file or other entity contains a BOM. When decoding, the Encoding function can be called on the first line to determine the encoding; this encoding will then be used in subsequent calls to Decode to convert all of the lines to an internal format.

!discussion

Background on character encoding:

A character set is a set of abstract characters. An encoding assigns an integer value to each character; this value is called the code-point of the character.

Normally, a character string should be represented as a sequence of code-points; however, this would waste a lot of space, since ISO 10646 defines 32-bit code-points. An encoding scheme is a more economical representation of a string of characters. Typically, an encoding scheme uses a sequence of integer values, where each code-point is represented by one or several consecutive values.

UTF-8 is an encoding scheme that uses 8-bit values. In some cases, UTF-8 defines several possible encodings for a code-point; in this case, the shortest one should be used; other encodings are called overlong encodings. UTF-16 uses 16-bit values. UTF-32 uses 32-bit values, which is of little interest since nothing is gained compared to UCS-4 (raw encoding).

There is no problem when using a String to encode UTF-8, or a Wide_String to encode UTF-16. However, it is sometimes useful to encode/decode a UTF-16 (or even UTF-32) encoded text into/from a String; in that case, characters must be paired to form 16-bit values (or grouped by four to form 32-bit values). This can be done in two ways, Big Endian (high order character first) or Little Endian (low order character first).
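
To make the byte-order point concrete, here is a small sketch that encodes one character as UTF-16BE and as UTF-16LE in a String, dumps the resulting bytes, and then uses the Encoding function as the note above suggests. The procedure name, the Dump helper, and the sample character are hypothetical; the byte values in the comments follow from the definitions of BOM_16BE and BOM_16LE and from the two byte orders.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Byte_Order_Demo is
   package UTF renames Ada.Strings.UTF_Encoding;
   package WS  renames Ada.Strings.UTF_Encoding.Wide_Strings;

   --  Sample text: the single character U+20AC (euro sign).
   Text : constant Wide_String := (1 => Wide_Character'Val (16#20AC#));

   BE : constant UTF.UTF_String :=
     WS.Encode (Text, UTF.UTF_16BE, Output_BOM => True);
   LE : constant UTF.UTF_String :=
     WS.Encode (Text, UTF.UTF_16LE, Output_BOM => True);

   Hex : constant String := "0123456789ABCDEF";

   --  Hypothetical helper: print each 8-bit value of Item in hexadecimal.
   procedure Dump (Label : String; Item : UTF.UTF_String) is
   begin
      Ada.Text_IO.Put (Label);
      for C of Item loop
         Ada.Text_IO.Put
           (' ' & Hex (Character'Pos (C) / 16 + 1)
                & Hex (Character'Pos (C) mod 16 + 1));
      end loop;
      Ada.Text_IO.New_Line;
   end Dump;

   --  Determine the scheme from the BOM; UTF_8 is used if there is none.
   Detected : constant UTF.Encoding_Scheme := UTF.Encoding (LE);
begin
   Dump ("UTF-16BE:", BE);   --  high-order byte first: FE FF 20 AC
   Dump ("UTF-16LE:", LE);   --  low-order byte first:  FF FE AC 20

   Ada.Text_IO.Put_Line
     ("detected: " & UTF.Encoding_Scheme'Image (Detected));

   --  The BOM is ignored by Decode, so the string is passed unsliced.
   Ada.Text_IO.Put_Line
     ("decoded matches: " & Boolean'Image (WS.Decode (LE, Detected) = Text));
end Byte_Order_Demo;
```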

A special value, called BOM (Byte Order Mark, 16#FEFF#), can be used at the beginning of an encoded text (with 4 leading zeroes for UTF-32). The BOM corresponds to no code-point, and is discarded when decoding, but it is used to recognize whether a stream of bytes is Big Endian or Little Endian UTF-16 or UTF-32. By extension, the sequence 16#EF# 16#BB# 16#BF# can be used as BOM to identify UTF-8 text (although there is no byte order issue in UTF-8; actually, use of BOM for UTF-8 is discouraged).

Note that UTF-8 encoding could be used for file names that include characters that are not in ASCII. This package would allow adding an Implementation Advice (to Text_IO, Sequential_IO, and so on) to the effect that it is recommended to support file names encoded in UTF-8.

Implementation choices:

Strictly speaking, an encoded text should be an array of bytes, not of (wide_)characters. This proposal uses (wide_)string, but the encoding is defined in terms of position values of characters rather than characters themselves. It could be argued that it should be defined in terms of internal representation of characters, but we know that they are the same as the position values for (Wide_)Character.

It is necessary to have decoding functions with a parameter that specifies the encoding, because it makes things easier when the encoding scheme is recognized dynamically. Functions whose encoding scheme is implicit are also provided for the most common encoding schemes to make it simpler for programs that require a statically defined encoding scheme.

There are many other possible encoding schemes, including UTF-EBCDIC, Shift-JIS, SCSU, BOCU-1... It seemed sensible to provide only the most useful ones, while leaving the possibility (through Implementation Advice) to provide others.

When reading a file, a BOM can be expected as starting the first line of the file, but not subsequent lines. The proposed handling of BOM assumes the following pattern:

1) Read the first line. Call function Encoding on that line with an appropriate default to use if the line does not start with a BOM. Initialize the encoding scheme to the value returned by the function.

2) Decode all lines (including the first one) with the chosen encoding scheme. Since the BOM is ignored by Decode functions, it is not necessary to slice the first line specially.

The Wide_Wide_String functions have been put in a separate package to avoid dragging in the corresponding code when Wide_Wide_Strings are not used in the application code.

Alternative designs:

Arrays of Unsigned_8 or Unsigned_16 could be used in place of (Wide_)String. That would enforce strong typing to differentiate between an Ada String and an encoded string. On the other hand, it is likely to be more of a burden than a help to most casual users. Moreover, it would not allow ASIS program text to be kept as a Wide_String.

Existing similar packages:

Similar conversion functions are provided as part of xmlada and qtada. xmlada provides much more sophisticated services, such as supporting conversions to various ccs, converting in place in buffers, etc. However, it seems reasonable to provide only basic functionality in the standard.

GNAT provides the package System.WCh_Con, but it converts only individual characters (not strings), does not support UTF-8, and is provided by generics that require a user-provided input/output formal function. Although more general, this solution would be too heavyweight for the casual user.

!corrigendum A.4.11(0)

@dinsc

Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
@s8<@i> The encoding library packages have the following declarations: @xcode<@b Ada.Strings.UTF_Encoding @b @b Pure (UTF_Encoding); -- Declarations common to the string encoding packages @b Encoding_Scheme @b (UTF_8, UTF_16BE, UTF_16LE); @b UTF_String @b String; @b UTF_8_String @b String; @b UTF_16_Wide_String @b Wide_String; Encoding_Error : @b; BOM_8 : @b UTF_8_String := Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#); BOM_16BE : @b UTF_String := Character'Val (16#FE#) & Character'Val (16#FF#); BOM_16LE : @b UTF_String := Character'Val (16#FF#) & Character'Val (16#FE#); BOM_16 : @b UTF_16_Wide_String := (1 =@> Wide_Character'Val (16#FEFF#)); @b Encoding (Item : UTF_String; Default : Encoding_Scheme := UTF_8) @b Encoding_Scheme; @b Ada.Strings.UTF_Encoding; @b Ada.Strings.UTF_Encoding.Conversions @b @b Pure (Conversions); -- Conversions between various encoding schemes @b Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String; @b Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_16_Wide_String; @b Convert (Item : UTF_8_String; Output_BOM : Boolean := False) @b UTF_16_Wide_String; @b Convert (Item : UTF_16_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String; @b Convert (Item : UTF_16_Wide_String; Output_BOM : Boolean := False) @b UTF_8_String; @b Ada.Strings.UTF_Encoding.Conversions; @b Ada.Strings.UTF_Encoding.Strings @b @b Pure (Strings); -- Encoding / decoding between String and various encoding schemes @b Encode (Item : String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String; @b Encode (Item : String; Output_BOM : Boolean := False) @b UTF_8_String; @b Encode (Item : String; Output_BOM : Boolean := False) @b UTF_16_Wide_String; @b Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) @b String; @b Decode (Item : UTF_8_String) @b 
String; @b Decode (Item : UTF_16_Wide_String) @b String; @b Ada.Strings.UTF_Encoding.Strings; @b Ada.Strings.UTF_Encoding.Wide_Strings @b @b Pure (Wide_Strings); -- Encoding / decoding between Wide_String and various encoding schemes @b Encode (Item : Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String; @b Encode (Item : Wide_String; Output_BOM : Boolean := False) @b UTF_8_String; @b Encode (Item : Wide_String; Output_BOM : Boolean := False) @b UTF_16_Wide_String; @b Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) @b Wide_String; @b Decode (Item : UTF_8_String) @b Wide_String; @b Decode (Item : UTF_16_Wide_String) @b Wide_String; @b Ada.Strings.UTF_Encoding.Wide_Strings; @b Ada.Strings.UTF_Encoding.Wide_Wide_Strings @b @b Pure (Wide_Wide_Strings); -- Encoding / decoding between Wide_Wide_String and various encoding schemes @b Encode (Item : Wide_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String; @b Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) @b UTF_8_String; @b Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) @b UTF_16_Wide_String; @b Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) @b Wide_Wide_String; @b Decode (Item : UTF_8_String) @b Wide_Wide_String; @b Decode (Item : UTF_16_Wide_String) @b Wide_Wide_String; @b Ada.Strings.UTF_Encoding.Wide_Wide_Strings;> The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 stored in 8 bits, big endian; and UTF_16LE corresponds to the UTF-16 encoding scheme on 8 bits, little endian. The subtype UTF_String is used to represent a String of 8-bit values containing a sequence of values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). 
The subtype UTF_8_String is used to represent a String of 8-bit values containing a sequence of values encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent a Wide_String of 16-bit values containing a sequence of values encoded in UTF-16.

The BOM_8, BOM_16BE, BOM_16LE and BOM_16 constants correspond to values used at the start of a string to indicate the encoding.

For all Convert and Decode functions, an initial BOM in the input that matches the expected encoding scheme is ignored, and a different initial BOM causes Encoding_Error to be propagated.

For all Convert and Encode functions, a BOM is included at the start of the output string if the Output_BOM parameter is set to True.

The exception Encoding_Error is also propagated in the following situations:

@xbullet<By a Decode function when a UTF encoded string contains an invalid encoding sequence.>

@xbullet<By a Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.>

@xbullet<By a Decode function yielding a String when the decoding of a sequence results in a code-point whose value exceeds 16#FF#.>

@xbullet<By a Decode function yielding a Wide_String when the decoding of a sequence results in a code-point whose value exceeds 16#FFFF#.>

@xbullet<By an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes.>

Each of the Convert and Encode functions returns a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) value whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme required by the function or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned. The lower bound of the returned string is 1.

Each of the Decode functions takes a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) Item parameter which is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter, and returns the corresponding String, Wide_String, or Wide_Wide_String value. The lower bound of the returned string is 1.
@xcode<@b Encoding (Item : UTF_String; Default : Encoding_Scheme := UTF_8) @b Encoding_Scheme;> @xindent @xcode<@b Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String;> @xindent @xcode<@b Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_16_Wide_String;> @xindent @xcode<@b Convert (Item : UTF_8_String; Output_BOM : Boolean := False) @b UTF_16_Wide_String;> @xindent @xcode<@b Convert (Item : UTF_16_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String;> @xindent @xcode<@b Convert (Item : UTF_16_Wide_String; Output_BOM : Boolean := False) @b UTF_8_String;> @xindent @xcode<@b Encode (Item : String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String;> @xindent @xcode<@b Encode (Item : String; Output_BOM : Boolean := False) @b UTF_8_String;> @xindent @xcode<@b Encode (Item : String; Output_BOM : Boolean := False) @b UTF_16_Wide_String;> @xindent @xcode<@b Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) @b String;> @xindent @xcode<@b Decode (Item : UTF_8_String) @b String;> @xindent @xcode<@b Decode (Item : UTF_16_Wide_String) @b String;> @xindent @xcode<@b Encode (Item : Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String;> @xindent @xcode<@b Encode (Item : Wide_String; Output_BOM : Boolean := False) @b UTF_8_String;> @xindent @xcode<@b Encode (Item : Wide_String; Output_BOM : Boolean := False) @b UTF_16_Wide_String;> @xindent @xcode<@b Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) @b Wide_String;> @xindent @xcode<@b Decode (Item : UTF_8_String) @b Wide_String;> @xindent @xcode<@b Decode (Item : UTF_16_Wide_String) @b Wide_String;> @xindent @xcode<@b Encode (Item : Wide_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) @b UTF_String;> @xindent @xcode<@b Encode (Item : Wide_Wide_String; 
Output_BOM : Boolean := False) @b UTF_8_String;> @xindent @xcode<@b Encode (Item : Wide_Wide_String; Output_BOM : Boolean := False) @b UTF_16_Wide_String;> @xindent @xcode<@b Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) @b Wide_Wide_String;> @xindent @xcode<@b Decode (Item : UTF_8_String) @b Wide_Wide_String;> @xindent @xcode<@b Decode (Item : UTF_16_Wide_String) @b Wide_Wide_String;> @xindent @s8<@i> If an implementation supports other encoding schemes, similar children of Ada.Strings should be defined. NOTE@hr @s9<14 A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or other entity to indicate the encoding; it is skipped when decoding. Typically, only the first line of a file or other entity contains a BOM. When decoding, the Encoding function can be called on the first line to determine the encoding; this encoding will then be used in subsequent calls to Decode to convert all of the lines to an internal format.> !appendix From: Robert Dewar Sent: Monday, March 15, 2010 7:54 AM I am now implementing this package. As usual that is the first time I am really looking at it carefully. It seems quite unfortunate to me that the routines take Scheme as a dynamic parameter, this means that you always include all encoding methods in your program, even if you are only interested in one of them. I understand the usage of looking for a BOM and then passing the discovered encoding scheme to Decode, but that's only one usage. In practice, I think by far the most common usage will be with a fixed encoding scheme, almost always UTF-8, and it seems unfortunate to have a situation where for this usage you are forced to incorporate all the stuff for the other encoding schemes. 

****************************************************************

From: Tucker Taft
Sent: Monday, March 15, 2010  8:08 AM

An alternative approach would be to provide separate child packages for each kind of encoding, and perhaps have one child that matches the currently proposed package, which would call the appropriate one of the others based on the run-time enumeration value.

Tagged types could probably be used in a creative way as well.

****************************************************************

From: Robert Dewar
Sent: Monday, March 15, 2010  8:21 AM

> An alternative approach would be to provide separate child packages
> for each kind of encoding, and perhaps have one child that matches the
> currently proposed package, which would call the appropriate one of
> the others based on the run-time enumeration value.

Perhaps, but it would be good enough to have specific routines Encode_UTF_8 etc. One assumes any reasonable system can eliminate unused subprograms, so we don't need separate packages.

> Tagged types could probably be used in a creative way as well.

Seems overkill to me.

****************************************************************

From: Jean-Pierre Rosen
Sent: Monday, March 15, 2010  10:09 AM

> I am now implementing this package. As usual that is the first time I
> am really looking at it carefully. It seems quite unfortunate to me
> that the routines take Scheme as a dynamic parameter, this means that
> you always include all encoding methods in your program, even if you
> are only interested in one of them.

I understand that the implementations of UTF_8 and the various UTF_16 are different. OTOH, I guess (wild guess admittedly) that the different forms of UTF_16 will share most of the code. Given that the BOM is really useful only for UTF_16*, a solution could be to just separate UTF_8 from UTF_16 in two different child packages.
> I understand the usage of looking for a BOM and then passing the > discovered encoding scheme to Decode, but that's only one usage. But certainly one that happens in practice, and my concern was to avoid the "big case". But it is likely that choosing between UTF_8 and UTF_16 is on the application's side, while choosing between UTF_16BE and UTF_16LE is more on the user's side, so the dynamic choice is more important here. So, I'm suggesting having a package UTF_Encoding with the type Encoding_Scheme, functions Encoding, and exception Encoding_Error, and children UTF_8 and UTF_16. For those wanting to support both, calls would appear as UTF_8.Decode or UTF_16.Decode, which would read nicely (we would keep the Scheme parameter for UTF_16). **************************************************************** From: Bob Duff Sent: Monday, March 15, 2010 10:39 AM > Perhaps, but it would be good enough to have specific routines > > Encode_UTF_8 > > etc. If you mean in addition to the ones we already have, then I agree. I don't see any need for child packages -- that would just complicate things. **************************************************************** From: Robert Dewar Sent: Monday, March 15, 2010 10:46 AM > If you mean in addition to the ones we already have, then I agree. right, in addition, then the dynamic package just makes calls to these specific ones. I think only the UTF_8 routines are really important. The UTF_16 routines are non-dynamic anyway (since there is only one possible scheme). > I don't see any need for child packages -- that would just complicate > things. I agree **************************************************************** From: Tucker Taft Sent: Monday, March 15, 2010 11:01 AM We might consider making them subpackages rather than child packages: package UTF_Encoding is ... package UTF_8 is function Encode(... function Decode(... end UTF_8; package UTF_16 is ... end UTF_16; ... 
end UTF_Encoding; Then some of the tricks implementors use with Text_IO could be used to carve out the subpackages if desired. I also think UTF_8.Encode looks better than Encode_UTF_8 somehow... ;-) Unfortunately, this doesn't really work since there is an existing enumeration literal UTF_8. In fact, even a child package named "UTF_8" would be illegal if the enumeration were in the parent package. If we restructured it into separate child packages, with the enumeration only in the "dynamic" child, it would work, but perhaps just some functions with names like Encode_UTF_8 are adequate. Not elegant, though. **************************************************************** From: Robert Dewar Sent: Monday, March 15, 2010 11:21 AM > Then some of the tricks implementors use with Text_IO could be used to > carve out the subpackages if desired. These tricks are not relevant I think, we certainly wouldn't bother. What we would do at this stage is to rely on automatic elimination of unused subprograms. > I also think UTF_8.Encode looks better than Encode_UTF_8 somehow... > ;-) I really don't care about the name > > Unfortunately, this doesn't really work since there is an existing > enumeration literal UTF_8. In fact, even a child package named > "UTF_8" would be illegal if the enumeration were in the parent > package. If we restructured it into separate child packages, with the > enumeration only in the "dynamic" child, it would work, but perhaps > just some functions with names like Encode_UTF_8 are adequate. > Not elegant, though. I would just add Encode_UTF_8 and Decode_UTF_8 to the existing spec and be done with it. **************************************************************** From: Robert Dewar Sent: Friday, March 19, 2010 4:45 PM Another issue for this package, why is there no routine for decoding a UTF-16 string with Wide_String output? This is not trivial, there are cases where two UTF-16 codes are required for a valid Wide_Character output. 
Similarly there should be a routine for encoding a Wide_String in UTF-16.
I see no reason for omitting these cases???

Note that you can encode between UTF-16BE/LE and Wide_String, so it seems odd
not to cover the cases of UTF-16 and Wide_String.

I noted this because naturally the UTF-16BE/LE cases are defined in terms of
the missing routines!

---

In my package I have in the visible spec:

   function Encode
     (Item   : Wide_String;
      Scheme : Long_Encoding := UTF_16) return Wide_String;
   function Encode
     (Item   : Wide_Wide_String;
      Scheme : Long_Encoding := UTF_16) return Wide_String;

   function Decode
     (Item   : Wide_String;
      Scheme : Long_Encoding := UTF_16) return Wide_String;
   function Decode
     (Item   : Wide_String;
      Scheme : Long_Encoding := UTF_16) return Wide_Wide_String;

which is nicely symmetrical with the short encoding scheme routines.

In the body I also have:

   function Decode_UTF_8 (Item : String) return Wide_String;
   --  Equivalent to Decode (Item, UTF_8), but smaller and faster

   function Decode_UTF_8 (Item : String) return Wide_Wide_String;
   --  Equivalent to Decode (Item, UTF_8), but smaller and faster

   function Encode_UTF_8 (Item : Wide_String) return String;
   --  Equivalent to Encode (Item, UTF_8) but smaller and faster

   function Encode_UTF_8 (Item : Wide_Wide_String) return String;
   --  Equivalent to Encode (Item, UTF_8) but smaller and faster

   function Decode_UTF_16 (Item : String) return Wide_String;
   --  Equivalent to Decode (Item, UTF_16)

   function Decode_UTF_16 (Item : String) return Wide_Wide_String;
   --  Equivalent to Decode (Item, UTF_16)

   function Encode_UTF_16 (Item : Wide_String) return String;
   --  Equivalent to Encode (Item, UTF_16)

   function Encode_UTF_16 (Item : Wide_Wide_String) return String;
   --  Equivalent to Encode (Item, UTF_16)

I think that at least the UTF_8 routines should be visible in the spec. The
UTF_16 routines are less critical, since the general dynamic routine is silly
in the UTF_16 case, since UTF_16 is the only possibility.
Seeing as an implementor is not free to add encodings as far as I can see, this is presumably preparation for Ada 2020, where new long encodings will be added??? Perhaps implementors SHOULD be allowed to add BOM definitions and new encodings? **************************************************************** From: Robert Dewar Sent: Friday, March 19, 2010 5:11 PM > why is there no routine for decoding a UTF-16 string with Wide_String > output? This is not trivial, there are cases where two UTF-16 codes > are required for a valid Wide_Character output. I am wrong about this, you can never need two codes, but there are incorrect input sequences in both directions that can be diagnosed. **************************************************************** From: Bob Duff Sent: Friday, March 19, 2010 5:23 PM > Seeing as an implementor is not free to add encodings An implementer is free to add child packages. That seems appropriate, no? **************************************************************** From: Robert Dewar Sent: Friday, March 19, 2010 6:24 PM > An implementer is free to add child packages. > That seems appropriate, no? Sure, but it does not help the fact that these routines: > function Encode > (Item : Wide_Wide_String; > Scheme : Long_Encoding := UTF_16) return Wide_String; > function Decode > (Item : Wide_String; > Scheme : Long_Encoding := UTF_16) return Wide_Wide_String; Are a bit silly, since Scheme can have only one possible value UTF_16 Furthermore adding child packages does not really fit the general scheme. Suppose I want to add a new long encoding called Dewars_Improved_UTF_16 with its own new BOM, I really want the Encoding routine to recognize that new BOM and return a new entry in the enumeration type. If I have to put stuff into a child package, it's silly, because the child package won't use any of the parent package, and will simply drag it in uselessly. 
So rather than add

   Ada.Strings.UTF_Encoding.Dewars_Improved_UTF_16

I might just as well add

   Ada.Strings.Dewars_Improved_UTF_16

Also is this package intended to be limited to UTF encodings? Its name
suggests this, so it is a bit silly to imply that there might be new long
encodings in the future.

Basically the generality of the long encoding subtype is perfectly silly
orthogonality with the short encoding case, where there is more than one
possibility. In my implementation, I have stupid routines that look e.g. like:

>    --  Wide_String input with Wide_Wide_String output (long encodings)
>
>    function Decode
>      (Item   : Wide_String;
>       Scheme : Long_Encoding := UTF_16) return Wide_Wide_String
>    is
>       pragma Unreferenced (Scheme);
>    begin
>       return Decode_UTF_16 (Item);
>    end Decode;

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010 7:44 AM

Another thought on the string encoding package

This package works by representing UTF-8/UTF-16LE/UTF-16BE encoded strings
using type String, and UTF-16 encoded strings using type Wide_String, but
really this is a misuse of these types, since these encoded strings do not
match the semantics of string as defined in the RM.

I do not suggest coming up with separate types, I think that would be plain
confusing. But I think good coding standards would suggest use of

   subtype UTF_String is String;
   --  String value used to hold encoded UTF-8/UTF-16LE/UTF-16BE string

   subtype Wide_UTF_String is Wide_String;
   --  String value used to hold encoded UTF-16 string

I suggest defining these two subtypes in the package spec and then using them
where appropriate.

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010 8:49 AM

One more issue with this package

The Encoding routines do not generate a BOM at the start, this means that if
you want a BOM you have to concatenate it, which forces a full copy of the
generated string.
Wouldn't it be better to have an extra parameter on the encode routines:

   Output_BOM : Boolean := False

which, if set to True would include the BOM in the output?

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010 9:01 AM

One MORE thought :-)

The Encoding routines return UTF_None if there is no BOM, meaning you have to
write something like

   E : constant Encoding_Method := Encoding (Input);
   X : String := Encode (Input, (if E = UTF_None then UTF_8 else E));

How about getting rid of UTF_None, and instead adding a parameter to the
Encoding routines:

   Default_Encoding : Encoding_Method := UTF_8/16

So we can write

   X : String := Encode (Input, Encoding (Input, UTF_8));

Alternatively, the Encoding routine could have an extra parameter

   Default_Scheme : Short_Encoding_Method := UTF_8;
   --  Encoding method if Scheme is UTF_None

No big deal, but I noticed this discrepancy writing some tests

****************************************************************

From: Bob Duff
Sent: Saturday, March 20, 2010 9:58 AM

>     subtype UTF_String is String;
>     --  String value used to hold encoded UTF-8/UTF-16LE/UTF-16BE
>     string
>
>     subtype Wide_UTF_String is Wide_String;
>     --  String value used to hold encoded UTF-16 string
>
> I suggest defining these two subtypes in the package spec and then
> using them where appropriate.

Sounds good to me.

> Wouldn't it be better to have an extra parameter on the encode
> routines:
>
>     Output_BOM : Boolean := False
>
> which, if set to True would include the BOM in the output?

Yes.
> The Encoding routines return UTF_None if there is no BOM, meaning you
> have to write something like
>
>     E : constant Encoding_Method := Encoding (Input);
>     X : String := Encode (Input, (if E = UTF_None then UTF_8 else E));
>
> How about getting rid of UTF_None, and instead adding a parameter to
> the Encoding routines:
>
>     Default_Encoding : Encoding_Method := UTF_8/16
>
> So we can write
>
>     X : String := Encode (Input, Encoding (Input, UTF_8));
>
> Alternatively, the Encoding routine could have an extra parameter
>
>     Default_Scheme : Short_Encoding_Method := UTF_8;
>     --  Encoding method if Scheme is UTF_None

Sounds reasonable. I guess I prefer the former.

> One consequence is that no part of this package can be available in
> Ada 95 mode, which is a pity.
>
> If we separated this into
>
>     Ada.Strings.UTF_Encoding;
>
>     Ada.Strings.Wide_UTF_Encoding;
>
> then the former package could be used in Ada 95 mode. I suspect in
> practice that apps would use only one of these packages, either you go
> all the way and use 32-bit chars everywhere, or you stick to 16-bit
> chars.

OK with me.

****************************************************************

From: Robert Dewar
Sent: Saturday, March 20, 2010 7:44 AM

Here is the spec as I implemented it initially for GNAT. I would still like to
add Encode_UTF_8 and Decode_UTF_8

> ------------------------------------------------------------------------------
> --                                                                          --
> --                         GNAT RUN-TIME COMPONENTS                         --
> --                                                                          --
> --             A D A . S T R I N G S . U T F _ E N C O D I N G              --
> --                                                                          --
> --                                 S p e c                                  --
> --                                                                          --
> -- This specification is derived from the Ada Reference Manual for use with --
> -- GNAT. In accordance with the copyright of that document, you can freely  --
> -- copy and modify this specification, provided that if you redistribute a  --
> -- modified version, any changes that you have made are clearly indicated.
-- > -- -- > ------------------------------------------------------------------------------ > > -- This is the Ada 2012 package defined in AI05-0137-1. It is used for > -- encoding strings using UTF encodings (UTF-8, UTF-16LE, UTF-16BE, UTF-16). > > -- Compared with version 05 of the AI, we have added routines for UTF-16 > -- encoding and decoding of wide strings, which seems missing from the AI, > -- added comments, and reordered the declarations. > > -- Note: although this is an Ada 2012 package, the earlier versions of the > -- language permit the addition of new grandchildren of Ada, so we are able > -- to add this package unconditionally for use in Ada 2005 mode. We cannot > -- allow it in earlier versions, since it requires Wide_Wide_Character/String. > > package Ada.Strings.UTF_Encoding is > pragma Pure (UTF_Encoding); > > type Encoding_Scheme is (UTF_None, UTF_8, UTF_16BE, UTF_16LE, UTF_16); > > subtype Short_Encoding is Encoding_Scheme range UTF_8 .. UTF_16LE; > subtype Long_Encoding is Encoding_Scheme range UTF_16 .. UTF_16; > > -- The BOM (BYTE_ORDER_MARK) values defined here are used at the start of > -- a string to indicate the encoding. The convention in this package is > -- that decoding routines ignore a BOM, and output of encoding routines > -- does not include a BOM. If you want to include a BOM in the output, > -- you simply concatenate the appropriate value at the start of the string. > > BOM_8 : constant String := > Character'Val (16#EF#) & > Character'Val (16#BB#) & > Character'Val (16#BF#); > > BOM_16BE : constant String := > Character'Val (16#FE#) & > Character'Val (16#FF#); > > BOM_16LE : constant String := > Character'Val (16#FF#) & > Character'Val (16#FE#); > > BOM_16 : constant Wide_String := > (1 => Wide_Character'Val (16#FEFF#)); > > -- The encoding routines take a wide string or wide wide string as input > -- and encode the result using the specified UTF encoding method. 
For > -- UTF-16, the output is returned as a Wide_String, this is not a normal > -- Wide_String, since the codes in it may represent UTF-16 surrogate > -- characters used to encode large values. Similarly for UTF-8, UTF-16LE, > -- and UTF-16BE, the output is returned in a String, and again this String > -- is not a standard format string, since it may include UTF-8 surrogates. > -- As previously noted, the returned value does NOT start with a BOM. > > -- Note: invalid codes in calls to one of the Encode routines represent > -- invalid values in the sense that they are not defined. For example, the > -- code 16#DC03# is not a valid wide character value. Such values result > -- in undefined behavior. For GNAT, Constraint_Error is raised with an > -- appropriate exception message. > > function Encode > (Item : Wide_String; > Scheme : Short_Encoding := UTF_8) return String; > function Encode > (Item : Wide_Wide_String; > Scheme : Short_Encoding := UTF_8) return String; > > function Encode > (Item : Wide_String; > Scheme : Long_Encoding := UTF_16) return Wide_String; > function Encode > (Item : Wide_Wide_String; > Scheme : Long_Encoding := UTF_16) return Wide_String; > > -- The decoding routines take a String or Wide_String input which is an > -- encoded string using the specified encoding. The output is a normal > -- Ada Wide_String or Wide_Wide_String value representing the decoded > -- values. Note that a BOM in the input matching the encoding is skipped. > > Encoding_Error : exception; > -- Exception raised if an invalid encoding sequence is encountered by > -- one of the Decode routines. 
> > function Decode > (Item : String; > Scheme : Short_Encoding := UTF_8) return Wide_String; > function Decode > (Item : String; > Scheme : Short_Encoding := UTF_8) return Wide_Wide_String; > > function Decode > (Item : Wide_String; > Scheme : Long_Encoding := UTF_16) return Wide_String; > function Decode > (Item : Wide_String; > Scheme : Long_Encoding := UTF_16) return Wide_Wide_String; > > -- The Encoding functions inspect an encoded string or wide_string and > -- determine if a BOM is present. If so, the appropriate Encoding_Scheme > -- is returned. If not, then UTF_None is returned. > > function Encoding (Item : String) return Encoding_Scheme; > function Encoding (Item : Wide_String) return Encoding_Scheme; > > end Ada.Strings.UTF_Encoding; **************************************************************** From: Robert Dewar Sent: Saturday, March 20, 2010 9:24 AM One more thought Generally we provide separate packages for the Wide_String and Wide_Wide_String cases, but here we plop everything into one package. One consequence is that no part of this package can be available in Ada 95 mode, which is a pity. If we separated this into Ada.Strings.UTF_Encoding; Ada.Strings.Wide_UTF_Encoding; then the former package could be used in Ada 95 mode. I suspect in practice that apps would use only one of these packages, either you go all the way and use 32-bit chars everywhere, or you stick to 16-bit chars. **************************************************************** From: Robert Dewar Sent: Tuesday, March 23, 2010 4:32 AM I propose the following for the string encoding package. Compared with the version in the current AI, the changes are: Separate into parent package and child packages. The primary motivation here is to allow the use of the facilities for Wide_String's to be used in existing Ada 95 compilers (or multi-version compilers like GNAT running in Ada 95 mode). The Wide_Wide routines require Ada 2005. 
Add specific routines for UTF-8, since many applications do not need the
generality of handling UTF-16BE and UTF-16LE.

Eliminate the long encodings, there was only one (UTF-16) and it just seems
clutter to have parameters that can only have one value, given there is no
permission to add additional UTF encodings. If additional encodings are to be
handled in a given implementation, they should be added as child packages.
Such child packages cannot in any case add to the enumeration type specifying
the encoding method.

We could conceivably add implementation permission to add to the enumeration
type etc, but I really think this is more generality than is useful. Really it
is easier to have entirely separate child packages to handle additional
encodings.

Introduce subtypes UTF_String, UTF_8_String and UTF_16_Wide_String to clearly
document these non-standard usages of String and Wide_String and to suggest a
standard style for applications programs to use in conjunction with this
package.

Add conversion routines between UTF encodings, these are actually quite
important (and non-trivial, see http://u8u16.costar.sfu.ca)

Provide a convenient default parameter for Encoding, and eliminate UTF_None.

The four package specs are attached

If people agree with this approach, let's use this as the final version. I will
wait to implement this version till we have some feedback (yes, I know, not
many people can get excited about UTF encodings, so if you just want to say
OK-by-me without really reading too closely, that's fine, I have put a lot of
thought into this, so I think it's right now!)

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--             A D A . S T R I N G S . U T F _ E N C O D I N G              --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT. In accordance with the copyright of that document, you can freely  --
-- copy and modify this specification, provided that if you redistribute a  --
-- modified version, any changes that you have made are clearly indicated.  --
--                                                                          --
------------------------------------------------------------------------------

--  This is one of the Ada 2012 packages defined in AI05-0137-1. It is a parent
--  package that contains declarations used in the child packages used for
--  handling UTF encoded strings. Note: this package is consistent with Ada 95,
--  and may be used in Ada 95, or Ada 2005 mode.

package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);

   subtype UTF_String is String;
   --  Used to represent a string of 8-bit values representing a string encoded
   --  in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). Typically used in
   --  connection with a Scheme parameter indicating which of the encodings
   --  applies. This is not strictly a String value in the sense defined in the
   --  Ada RM, but in practice type String accommodates all possible 256 codes,
   --  and can be used to hold any sequence of 8-bit codes. We use String
   --  directly rather than create a new type so that all existing facilities
   --  for manipulating type String (e.g. the child packages of Ada.Strings)
   --  are available for manipulation of UTF_Strings.

   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   --  Used to specify which of three possible encodings apply to a UTF_String

   subtype UTF_8_String is String;
   --  Similar to UTF_String but specifically represents a UTF-8 encoded string

   subtype UTF_16_Wide_String is Wide_String;
   --  This is similar to UTF_8_String but is used to represent a Wide_String
   --  value which is a sequence of 16-bit values encoded using UTF-16. Again
   --  this is not strictly a Wide_String in the sense of the Ada RM, but the
   --  type Wide_String can be used to represent a sequence of arbitrary 16-bit
   --  values, and it is more convenient to use Wide_String than a new type.
   Encoding_Error : exception;
   --  This exception is raised if a UTF encoded string contains an invalid
   --  coding sequence, or when generating a Wide_String output if the output
   --  value is out of range of Wide_Character, or if an input Wide_Character
   --  or Wide_Wide_Character value does not represent a valid 10646 character
   --  value (e.g. 16#DC03# is not a valid Unicode character and hence cannot
   --  be encoded in a UTF string).

   --  The BOM (BYTE_ORDER_MARK) values defined here are used at the start of
   --  a string to indicate the encoding. The convention in this package is
   --  that decoding routines ignore a BOM, and output of encoding routines
   --  may or may not include a BOM depending on the setting of Output_BOM.

   BOM_8 : constant String :=
             Character'Val (16#EF#) &
             Character'Val (16#BB#) &
             Character'Val (16#BF#);

   BOM_16BE : constant String :=
                Character'Val (16#FE#) &
                Character'Val (16#FF#);

   BOM_16LE : constant String :=
                Character'Val (16#FF#) &
                Character'Val (16#FE#);

   BOM_16 : constant Wide_String :=
              (1 => Wide_Character'Val (16#FEFF#));

   function Encoding
     (Item    : UTF_String;
      Default : Encoding_Scheme := UTF_8) return Encoding_Scheme;
   --  This function inspects a UTF_String value to determine whether it
   --  starts with a BOM for UTF-8, UTF-16BE, or UTF_16LE. If so, the result
   --  is the scheme corresponding to the BOM. If no valid BOM is present
   --  then the result is the specified Default value.

end Ada.Strings.UTF_Encoding;

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                  ADA.STRINGS.UTF_ENCODING.WIDE_ENCODING                  --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT. In accordance with the copyright of that document, you can freely  --
-- copy and modify this specification, provided that if you redistribute a  --
-- modified version, any changes that you have made are clearly indicated.  --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for encoding
--  and decoding Wide_String values using UTF encodings. Note: this package is
--  consistent with Ada 95, and may be included in Ada 95 implementations.

package Ada.Strings.UTF_Encoding.Wide_Encoding is
   pragma Pure (Wide_Encoding);

   --  The encoding routines take a Wide_String as input and encode the result
   --  using the specified UTF encoding method. The result includes a BOM if
   --  the Output_BOM parameter is set to True.

   function Encode
     (Item       : Wide_String;
      Scheme     : Encoding_Scheme := UTF_8;
      Output_BOM : Boolean         := True) return UTF_String;
   --  Encode Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
   --  specified by the Scheme parameter.

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean := True) return UTF_8_String;
   --  Encode Wide_String using UTF-8 encoding

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean := True) return UTF_16_Wide_String;
   --  Encode Wide_String using UTF_16 encoding

   --  The decoding routines take a UTF String as input, and return a decoded
   --  Wide_String. If the UTF String starts with a BOM that matches the
   --  encoding method, it is ignored.

   function Decode
     (Item   : UTF_String;
      Scheme : Encoding_Scheme := UTF_8) return Wide_String;
   --  The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by
   --  the Scheme parameter. It is decoded and returned as a Wide_String value.
   --  Note: a convenient form for Scheme may be Encoding (UTF_String).
   function Decode
     (Item : UTF_8_String) return Wide_String;
   --  The input is encoded in UTF-8 and returned as a Wide_String value

   function Decode
     (Item : UTF_16_Wide_String) return Wide_String;
   --  The input is encoded in UTF-16 and returned as a Wide_String value

end Ada.Strings.UTF_Encoding.Wide_Encoding;

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--               ADA.STRINGS.UTF_ENCODING.WIDE_WIDE_ENCODING                --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT. In accordance with the copyright of that document, you can freely  --
-- copy and modify this specification, provided that if you redistribute a  --
-- modified version, any changes that you have made are clearly indicated.  --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for encoding
--  and decoding Wide_Wide_String values using UTF encodings. Note: this
--  package is consistent with Ada 2005, and may be used in Ada 2005 mode.

package Ada.Strings.UTF_Encoding.Wide_Wide_Encoding is
   pragma Pure (Wide_Wide_Encoding);

   --  The encoding routines take a Wide_Wide_String as input and encode the
   --  result using the specified UTF encoding method. The result includes a
   --  BOM if the Output_BOM parameter is set to True.

   function Encode
     (Item       : Wide_Wide_String;
      Scheme     : Encoding_Scheme := UTF_8;
      Output_BOM : Boolean         := True) return UTF_String;
   --  Encode Wide_Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
   --  specified by the Scheme parameter.
   function Encode
     (Item       : Wide_Wide_String;
      Output_BOM : Boolean := True) return UTF_8_String;
   --  Encode Wide_Wide_String using UTF-8 encoding

   function Encode
     (Item       : Wide_Wide_String;
      Output_BOM : Boolean := True) return UTF_16_Wide_String;
   --  Encode Wide_Wide_String using UTF_16 encoding

   --  The decoding routines take a UTF String as input, and return a decoded
   --  Wide_Wide_String. If the UTF String starts with a BOM that matches the
   --  encoding method, it is ignored.

   function Decode
     (Item   : UTF_String;
      Scheme : Encoding_Scheme := UTF_8) return Wide_Wide_String;
   --  The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by the
   --  Scheme parameter. It is decoded and returned as a Wide_Wide_String
   --  value. Note: a convenient form for Scheme may be Encoding (UTF_String).

   function Decode
     (Item : UTF_8_String) return Wide_Wide_String;
   --  The input is encoded in UTF-8 and returned as a Wide_Wide_String value

   function Decode
     (Item : UTF_16_Wide_String) return Wide_Wide_String;
   --  The input is encoded in UTF-16 and returned as a Wide_Wide_String value

end Ada.Strings.UTF_Encoding.Wide_Wide_Encoding;

------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--                   ADA.STRINGS.UTF_ENCODING.CONVERSIONS                   --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
-- This specification is derived from the Ada Reference Manual for use with --
-- GNAT. In accordance with the copyright of that document, you can freely  --
-- copy and modify this specification, provided that if you redistribute a  --
-- modified version, any changes that you have made are clearly indicated.  --
--                                                                          --
------------------------------------------------------------------------------

--  This is an Ada 2012 package defined in AI05-0137-1. It is used for
--  converting between different UTF encodings.
package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);

   --  In the following conversion routines, a BOM in the input that matches
   --  the encoding scheme is ignored. A BOM is present in the output if the
   --  Output_BOM parameter is set to True.

   function Convert
     (Item          : UTF_String;
      Input_Scheme  : Encoding_Scheme;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean := True) return UTF_String;
   --  Convert from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
   --  by the Input_Scheme argument, and generate an output encoded in one of
   --  these three schemes as specified by the Output_Scheme argument.

   function Convert
     (Item       : UTF_8_String;
      Output_BOM : Boolean := True) return UTF_16_Wide_String;
   --  Convert from UTF-8 to UTF-16

   function Convert
     (Item          : UTF_16_Wide_String;
      Output_Scheme : Encoding_Scheme;
      Output_BOM    : Boolean := True) return UTF_String;
   --  Convert from UTF-16 to UTF-8, UTF-16LE, or UTF-16BE as specified by
   --  the Output_Scheme argument.

   function Convert
     (Item       : UTF_16_Wide_String;
      Output_BOM : Boolean := True) return UTF_8_String;
   --  Convert from UTF-16 to UTF-8

end Ada.Strings.UTF_Encoding.Conversions;

****************************************************************

From: Jean-Pierre Rosen
Sent: Tuesday, March 23, 2010 5:50 AM

> I propose the following for the string encoding package.
> Compared with the version in the current AI, the changes
> are:

Making a concrete proposal is certainly the best thing to do!

> Eliminate the long encodings, there was only one (UTF-16) and it just
> seems clutter to have parameters that can only have one value, given
> there is no permission to add additional UTF encodings. If additional
> encodings are to be handled in a given implementation, they should be
> added as child packages.
> Such child packages cannot in any case add to the enumeration type
> specifying the encoding method.

Agreed.
The first proposal suggested that the enumeration type could be extended for implementation-defined encodings. This was later rejected, but the (single) value stayed. > We could conceivably add implementation permission to add to the > enumeration type etc, but I really think this is more generality than > is useful. Really it is easier to have entirely separate child > packages to handle additional encodings. That was the general opinion. Some comments on your packages: I'd rather have the default for Output_BOM to false. In general, the BOM is output only for the first line of a file, so the general case is to not output the BOM. You don't seem to specify what happens if the string provided to Decode starts with a BOM that does not correspond to the expected Scheme. Do you agree to raise Encoding_Error? **************************************************************** From: Robert Dewar Sent: Tuesday, March 23, 2010 9:12 AM > Some comments on your packages: > > I'd rather have the default for Output_BOM to false. In general, the > BOM is output only for the first line of a file, so the general case > is to not output the BOM. That's actually a typo, I meant to make the default False, so no disagreement here (I switched from Suppress_Bom => True to Output_BOM => True, and of course I meant to switch to Output_BOM => False :-)) > You don't seem to specify what happens if the string provided to > Decode starts with a BOM that does not correspond to the expected > Scheme. Do you agree to raise Encoding_Error? Sure, that's fine, and makes good sense. Thanks for the input, so if I make these two changes are you good to go? **************************************************************** From: Robert Dewar Sent: Tuesday, March 23, 2010 1:17 PM Oops, I left out one conversion possibility, here is the updated version of the conversions package [See later messages - Editor.] 
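
[Editor's illustration, not part of the original mail. The BOM conventions agreed in the exchange above (Output_BOM defaulting to False, Encoding falling back to its Default when no BOM is found, a matching BOM being skipped on decode) can be sketched as follows. The sketch assumes the package and parameter names of the final wording, Ada.Strings.UTF_Encoding.Wide_Strings with Output_Scheme and Input_Scheme, rather than the Wide_Encoding draft discussed here; BOM_Demo and its local names are hypothetical. Whether a mismatched BOM raises Encoding_Error is what the thread agrees on, but the behavior of any particular implementation is an assumption, so both outcomes are handled.]

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure BOM_Demo is
   use Ada.Strings.UTF_Encoding;
   package WS renames Ada.Strings.UTF_Encoding.Wide_Strings;

   Text : constant Wide_String := "Ada";

   --  Output_BOM is left at its default (False): no BOM is emitted.
   Plain : constant UTF_8_String := WS.Encode (Text);

   --  A BOM is requested explicitly; the scheme must be named.
   Marked : constant UTF_String :=
     WS.Encode (Text, UTF_16BE, Output_BOM => True);
begin
   --  No BOM in Plain, so Encoding returns its Default (UTF_8).
   Ada.Text_IO.Put_Line
     ("Plain : " & Encoding_Scheme'Image (Encoding (Plain)));

   --  Marked starts with BOM_16BE, so the scheme is detected from it.
   Ada.Text_IO.Put_Line
     ("Marked: " & Encoding_Scheme'Image (Encoding (Marked)));

   --  A BOM matching the scheme is skipped by Decode.
   Ada.Text_IO.Put_Line
     ("Round trip equal: "
      & Boolean'Image (WS.Decode (Marked, Encoding (Marked)) = Text));

   --  Decoding with a scheme that contradicts the BOM. The thread agrees
   --  this should raise Encoding_Error; both outcomes are reported here.
   begin
      declare
         Bad : constant Wide_String := WS.Decode (Marked, UTF_16LE);
      begin
         Ada.Text_IO.Put_Line
           ("Decoded without error, length" & Integer'Image (Bad'Length));
      end;
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error: BOM does not match scheme");
   end;
end BOM_Demo;
```

[Passing Encoding (Marked) as the scheme is the usage that the Default parameter was introduced to make convenient: no UTF_None test is needed before the call. - Editor.]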
****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010 1:24 PM

These versions incorporate the changes suggested by JPR

[Attachments skipped as these were changed again soon - Editor.]

****************************************************************

From: Tucker Taft
Sent: Friday, March 19, 2010 3:35 PM

I think these two Encode routines are ambiguous if all parameters are
defaulted. Hence I would recommend you remove the default for "Scheme".

   function Encode
     (Item       : Wide_String;
      Scheme     : Encoding_Scheme := UTF_8;
      Output_BOM : Boolean         := False) return UTF_String;
   --  Encode Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as
   --  specified by the Output_Scheme parameter.

   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean := False) return UTF_8_String;
   --  Encode Wide_String using UTF-8 encoding

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010 3:45 PM

> I think these two Encode routines are ambiguous if all parameters are
> defaulted. Hence I would recommend you remove the default for
> "Scheme".

Indeed it was my intention to remove the default for Scheme in this case, will
do so!

****************************************************************

From: Robert Dewar
Sent: Tuesday, March 23, 2010 3:49 AM

done!

In fact the idea is that when dealing with UTF_String arguments which can be
UTF-8/UTF-16LE/UTF-16BE, you are forced to specify the encoding. It makes no
sense to have a UTF-8 default, since there is a specific routine for the UTF-8
case (essentially providing this default in any case). So the Scheme parameter
being explicitly specified is important to signal the reader (and the
compiler!) that we are dealing with a case where all three encodings are
possible.
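
[Editor's illustration, not part of the original mail. The ambiguity discussed above arises because UTF_String and UTF_8_String are both subtypes of String, so the two Encode profiles differ only in the Scheme parameter. The sketch below uses local, hypothetical stand-ins for the declarations (Overload_Demo and the function bodies are invented for the example) and shows the repaired form, in which Scheme has no default.]

```ada
with Ada.Text_IO;

procedure Overload_Demo is

   --  Local stand-ins for the declarations under discussion
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   subtype UTF_String   is String;
   subtype UTF_8_String is String;

   --  General routine. Scheme has NO default, as Tucker recommends.
   --  With "Scheme : Encoding_Scheme := UTF_8" instead, the call
   --  Encode (W) below would match both functions and be rejected
   --  by the compiler as ambiguous.
   function Encode
     (Item       : Wide_String;
      Scheme     : Encoding_Scheme;
      Output_BOM : Boolean := False) return UTF_String is
   begin
      return "general routine, " & Encoding_Scheme'Image (Scheme);
   end Encode;

   --  UTF-8-specific routine
   function Encode
     (Item       : Wide_String;
      Output_BOM : Boolean := False) return UTF_8_String is
   begin
      return "UTF-8-specific routine";
   end Encode;

   W : constant Wide_String := "abc";

begin
   --  Only the UTF-8-specific profile can be called with one argument.
   Ada.Text_IO.Put_Line (Encode (W));            --  UTF-8-specific routine

   --  Naming a scheme selects the general routine.
   Ada.Text_IO.Put_Line (Encode (W, UTF_16LE));  --  general routine, UTF_16LE
end Overload_Demo;
```

[The stand-in bodies ignore Item and Output_BOM and only report which overload was selected; the real routines would of course encode Item. - Editor.]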
**************************************************************** From: Robert Dewar Sent: Wednesday, March 24, 2010 8:31 AM Randy asked me why four packages, and thought it was overkill Here is my reasoning Basically there are three packages and a parent package, you need the parent because it has useful common declarations like the BOM values. The three packages are 1. Wide_String conversions to and from UTF 2. Wide_Wide_String conversions to and from UTF 3. Conversions between UTF forms I separated 3 off, because as per my previous reference there are interesting complex target-dependent approaches for these conversions, and so from an implementation point of view if nothing else you don't want to have to deal with this in the context of 1 and 2. 1 and 2 are separated because the wide string package can be used with Ada 95. The great majority of Ada users in the US are using Ada 95 rather than any other version, so it is useful if they can have access to this functionality. **************************************************************** From: Tucker Taft Sent: Wednesday, March 24, 2010 8:41 AM I think Robert's proposed structure is reasonable. I think we could still describe them in a single RM section (I realize that would be a bit radical ;-). **************************************************************** From: Robert Dewar Sent: Thursday, March 25, 2010 10:41 PM Attached are the specs of the final versions that I implemented. I incorporated all the suggestions that I received. ------------------------------------------------------------------------------ -- -- -- GNAT RUN-TIME COMPONENTS -- -- -- -- A D A . S T R I N G S . U T F _ E N C O D I N G -- -- -- -- S p e c -- -- -- -- This specification is derived from the Ada Reference Manual for use with -- -- GNAT. 
In accordance with the copyright of that document, you can freely -- -- copy and modify this specification, provided that if you redistribute a -- -- modified version, any changes that you have made are clearly indicated. -- -- -- ------------------------------------------------------------------------------ -- This is one of the Ada 2012 packages defined in AI05-0137-1. It is a parent -- package that contains declarations used in the child packages for handling -- UTF encoded strings. Note: this package is consistent with Ada 95, and may -- be used in Ada 95 or Ada 2005 mode. with Interfaces; with Unchecked_Conversion; package Ada.Strings.UTF_Encoding is pragma Pure (UTF_Encoding); subtype UTF_String is String; -- Used to represent a string of 8-bit values containing a sequence of -- values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). -- Typically used in connection with a Scheme parameter indicating which -- of the encodings applies. This is not strictly a String value in the -- sense defined in the Ada RM, but in practice type String accommodates -- all possible 256 codes, and can be used to hold any sequence of 8-bit -- codes. We use String directly rather than create a new type so that -- all existing facilities for manipulating type String (e.g. the child -- packages of Ada.Strings) are available for manipulation of UTF_Strings. type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE); -- Used to specify which of three possible encodings apply to a UTF_String subtype UTF_8_String is String; -- Similar to UTF_String but specifically represents a UTF-8 encoded string subtype UTF_16_Wide_String is Wide_String; -- This is similar to UTF_8_String but is used to represent a Wide_String -- value which is a sequence of 16-bit values encoded using UTF-16. 
Again -- this is not strictly a Wide_String in the sense of the Ada RM, but the -- type Wide_String can be used to represent a sequence of arbitrary 16-bit -- values, and it is more convenient to use Wide_String than a new type. Encoding_Error : exception; -- This exception is raised in the following situations: -- a) A UTF encoded string contains an invalid encoding sequence -- b) A UTF-16BE or UTF-16LE input string has an odd length -- c) An incorrect character value is present in the Input string -- d) The result for a Wide_Character output exceeds 16#FFFF# -- The exception message has the index value where the error occurred. -- The BOM (BYTE_ORDER_MARK) values defined here are used at the start of -- a string to indicate the encoding. The convention in this package is -- that on input a correct BOM is ignored and an incorrect BOM causes an -- Encoding_Error exception. On output, the output string may or may not -- include a BOM depending on the setting of Output_BOM. BOM_8 : constant UTF_8_String := Character'Val (16#EF#) & Character'Val (16#BB#) & Character'Val (16#BF#); BOM_16BE : constant UTF_String := Character'Val (16#FE#) & Character'Val (16#FF#); BOM_16LE : constant UTF_String := Character'Val (16#FF#) & Character'Val (16#FE#); BOM_16 : constant UTF_16_Wide_String := (1 => Wide_Character'Val (16#FEFF#)); function Encoding (Item : UTF_String; Default : Encoding_Scheme := UTF_8) return Encoding_Scheme; -- This function inspects a UTF_String value to determine whether it -- starts with a BOM for UTF-8, UTF-16BE, or UTF_16LE. If so, the result -- is the scheme corresponding to the BOM. If no valid BOM is present -- then the result is the specified Default value. 
end Ada.Strings.UTF_Encoding; ------------------------------------------------------------------------------ -- -- -- GNAT RUN-TIME COMPONENTS -- -- -- -- ADA.STRINGS.UTF_ENCODING.CONVERSIONS -- -- -- -- S p e c -- -- -- -- This specification is derived from the Ada Reference Manual for use with -- -- GNAT. In accordance with the copyright of that document, you can freely -- -- copy and modify this specification, provided that if you redistribute a -- -- modified version, any changes that you have made are clearly indicated. -- -- -- ------------------------------------------------------------------------------ -- This is an Ada 2012 package defined in AI05-0137-1. It provides conversions -- from one UTF encoding method to another. Note: this package is consistent -- with Ada 95, and may be used in Ada 95 or Ada 2005 mode. package Ada.Strings.UTF_Encoding.Conversions is pragma Pure (Conversions); -- In the following conversion routines, a BOM in the input that matches -- the encoding scheme is ignored, an incorrect BOM causes Encoding_Error -- to be raised. A BOM is present in the output if the Output_BOM parameter -- is set to True. function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String; -- Convert from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified -- by the Input_Scheme argument, and generate an output encoded in one of -- these three schemes as specified by the Output_Scheme argument. function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_16_Wide_String; -- Convert from input encoded in UTF-8, UTF-16LE, or UTF-16BE as specified -- by the Input_Scheme argument, and generate an output encoded in UTF-16. 
function Convert (Item : UTF_8_String; Output_BOM : Boolean := False) return UTF_16_Wide_String; -- Convert from UTF-8 to UTF-16 function Convert (Item : UTF_16_Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String; -- Convert from UTF-16 to UTF-8, UTF-16LE, or UTF-16BE as specified by -- the Output_Scheme argument. function Convert (Item : UTF_16_Wide_String; Output_BOM : Boolean := False) return UTF_8_String; -- Convert from UTF-16 to UTF-8 end Ada.Strings.UTF_Encoding.Conversions; ------------------------------------------------------------------------------ -- -- -- GNAT RUN-TIME COMPONENTS -- -- -- -- ADA.STRINGS.UTF_ENCODING.WIDE_ENCODING -- -- -- -- S p e c -- -- -- -- This specification is derived from the Ada Reference Manual for use with -- -- GNAT. In accordance with the copyright of that document, you can freely -- -- copy and modify this specification, provided that if you redistribute a -- -- modified version, any changes that you have made are clearly indicated. -- -- -- ------------------------------------------------------------------------------ -- This is an Ada 2012 package defined in AI05-0137-1. It is used for encoding -- and decoding Wide_String values using UTF encodings. Note: this package is -- consistent with Ada 95, and may be included in Ada 95 implementations. package Ada.Strings.UTF_Encoding.Wide_Encoding is pragma Pure (Wide_Encoding); -- The encoding routines take a Wide_String as input and encode the result -- using the specified UTF encoding method. The result includes a BOM if -- the Output_BOM argument is set to True. Encoding_Error is raised if an -- invalid character appears in the input. In particular the characters -- in the range 16#D800# .. 16#DFFF# are invalid because they conflict -- with UTF-16 surrogate encodings, and the characters 16#FFFE# and -- 16#FFFF# are also invalid because they conflict with BOM codes. 
function Encode (Item : Wide_String; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean := False) return UTF_String; -- Encode Wide_String using UTF-8, UTF-16LE or UTF-16BE encoding as -- specified by the Output_Scheme parameter. function Encode (Item : Wide_String; Output_BOM : Boolean := False) return UTF_8_String; -- Encode Wide_String using UTF-8 encoding function Encode (Item : Wide_String; Output_BOM : Boolean := False) return UTF_16_Wide_String; -- Encode Wide_String using UTF_16 encoding -- The decoding routines take a UTF String as input, and return a decoded -- Wide_String. If the UTF String starts with a BOM that matches the -- encoding method, it is ignored. An incorrect BOM raises Encoding_Error. function Decode (Item : UTF_String; Input_Scheme : Encoding_Scheme) return Wide_String; -- The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by the -- Input_Scheme parameter. It is decoded and returned as a Wide_String -- value. Note: a convenient form for scheme may be Encoding (UTF_String). function Decode (Item : UTF_8_String) return Wide_String; -- The input is encoded in UTF-8 and returned as a Wide_String value function Decode (Item : UTF_16_Wide_String) return Wide_String; -- The input is encoded in UTF-16 and returned as a Wide_String value end Ada.Strings.UTF_Encoding.Wide_Encoding; **************************************************************** From: Robert Dewar Sent: Saturday, March 27, 2010 10:10 AM AARGH! I just noticed that we completely forgot the case of converting to and from type String, this is not trivial, it is easy, but the conversion between String and UTF_8_String is not an identity! I will add the necessary Ada.Strings.UTF_Encoding.Encoding package for the String case. **************************************************************** From: Robert Dewar Sent: Saturday, March 27, 2010 10:55 AM Oops, that name conflicts with Encoding in the parent package, it's a bit too generic anyway. 
I will change the names of the three child packages to Ada.Strings.UTF_Encoding.String_Encoding; Ada.Strings.UTF_Encoding.Wide_String_Encoding; Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding; Really better in any case ... **************************************************************** From: Jean-Pierre Rosen Sent: Saturday, March 27, 2010 11:30 AM >> AARGH! [...] > Oops [...] I think I'll still hold my breath a little bit longer before wording AI05-137-2 ... **************************************************************** From: Robert Dewar Sent: Saturday, March 27, 2010 11:34 AM I will send out final versions of packages later today ... Really hard to get things right till you write the whole implementation along with complete test programs. **************************************************************** From: Jean-Pierre Rosen Sent: Wednesday, April 28, 2010 11:51 AM I'm currently trying to reword this AI after Robert's latest proposal (which he didn't change for a while ;-) ). Since it is largely different from the previous (and ARG approved) proposal, I'll dub it AI05-0137-2. 1) Is there any reason to keep these packages as children of Ada.Strings? There is nothing used from the Strings package, and nothing in common with the other children of Ada.Strings. 2) Is it a problem to put this stuff as A.5, and shift all following subclause numbers by 1 ? **************************************************************** From: Randy Brukardt Sent: Wednesday, April 28, 2010 12:07 PM I'd much prefer to avoid clause number changes, because it invalidates every existing clause reference (in any forum: web, compiler error messages, AIs, books, etc.) without any indication (there would be no /3 on unchanged but moved paragraphs). I was dubious about rearranging the containers alone (although the group decided to do that) for this reason, and those are solely new text only found in Ada 2005. 
You're talking about renumbering all of the I/O (sequential, direct, stream) as well as Ada.Directories and Ada.Containers. That seems like way too much. I think (1) follows from (2). It is perfectly logical to have this under Ada.Strings: it is a set of string operations, it operates on the same types as Ada.Strings.Fixed, etc. People are used to looking in Ada.Strings for string operations, why not conversions? And it means less name pollution and a logical place in the Standard that doesn't renumber everything. If we were starting from scratch, I'd be more sympathetic to making it a separate package, but then I think we'd want to integrate it better with the rest of the library (meaning it probably would still end up part of Ada.Strings). **************************************************************** From: Bob Duff Sent: Wednesday, April 28, 2010 12:53 PM I agree with Randy -- better not to renumber sections, and it's just fine under Strings. **************************************************************** From: Jean-Pierre Rosen Sent: Thursday, April 29, 2010 8:33 AM > I'd much prefer to avoid clause number changes, because it invalidates > every existing clause reference (in any forum: web, compiler error > messages, AIs, books, etc.) without any indication (there would be no > /3 on unchanged but moved paragraphs). I was dubious about rearranging > the containers alone (although the group decided to do that) for this > reason, and those are solely new text only found in Ada 2005. You're > talking about renumbering all of the I/O (sequential, direct, stream) > as well as Ada.Directories and Ada.Containers. That seems like way too much. I suspected that, that's why I asked. Now, the problem is that I have to put 4 packages in the same subclause (I don't think we are allowed to have a 4th level of titles, and I can find no precedent in the RM on how to present that, save for 3-liner units). 
Should I put the 4 packages upfront ("the following packages exist:") and then all the explanations? Group explanations with each package specs (i.e. like if there were a fourth level)? **************************************************************** From: Randy Brukardt Sent: Thursday, April 29, 2010 12:26 PM There are two answers here. First, ISO allows 6(!) levels of subclauses. Moreover, since ASIS originally used 5 levels, I had to update the tools to support 4 levels (we got rid of the 5th level). So there is no problem using 4 levels if you need to. The objection might be from people who have tools that only support 3 levels (Janus/Ada compiler error messages are like that, we used a binary format for clause references to support never-built tools that directly link to the RM text). Second, we've put multiple packages in a clause before. 9.6.1 (Formatting, Time Zones, and other operations for Time) comes to mind. (I think there are others but I can't remember any right now.) If you copy that style, you would be consistent with at least something. **************************************************************** From: Robert Dewar Sent: Thursday, April 29, 2010 12:40 PM > I think (1) follows from (2). It is perfectly logical to have this > under > Ada.Strings: it is a set of string operations, it operates on the same > types was Ada.Strings.Fixed, etc. People are used to looking in > Ada.Strings for string operations, why not conversions? And it means > less name pollution and a logical place in the Standard that doesn't renumber > everything. I prefer to keep it in Ada.Strings for the reasons Randy states above **************************************************************** From: Randy Brukardt Sent: Friday, June 11, 2010 1:35 AM > I have emptied the !corrigendum section, but left the !appendix > section; although the discussion is not about this version, it seemed > still relevant. I trust Randy to do The Right Thing anyway. 
OK, here are some problems with the wording: (1) The introductory text vanished. Not sure why (9.6.1 is just additional operations, and they are a real scatter, but there is no such problem with this). This is a separate block of functionality and it ought to be described. So I put the introduction back as: Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Wide_Encoding, and Strings.UTF_Encoding.Wide_Wide_Encoding. This format is similar to those for A.4.7 and A.4.8. (A.4.9 and A.4.10 are also missing this introductory text: anyone want to volunteer to write those??) (2) The lead-in text is very different than the text from the other clauses in A.4. Replace it by: The encoding library packages have the following declaration: (3) The following text uses an ordered list: The exception Encoding_Error is also propagated in the following situations: a) By a Decode function when an UTF encoded string contains an invalid encoding sequence. b) By a Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length. c) By a Decode function yielding a Wide_String when the decoding of a sequence results in a code-point whose value exceeds 16#FFFF# d) by an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes. But there is no ordering here, thus this violates the ISO rules for writing Standards (and good taste). These should be bullets. (Having the same capitalization would be nice, too.) 
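[Editor's illustration, not part of the original mail.] Two of the Decode situations listed above can be exercised with a small program. This is a sketch under assumptions: it uses the package name Ada.Strings.UTF_Encoding.Wide_Strings from the final wording of this AI, and the names Decode_Errors, Try, and the three test constants are hypothetical. It does not cover the BOM cases.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Decode_Errors is
   package WS renames Ada.Strings.UTF_Encoding.Wide_Strings;

   --  Decode Item under Scheme and report whether Encoding_Error occurred.
   procedure Try (Label : String; Item : UTF_String; Scheme : Encoding_Scheme) is
   begin
      declare
         Result : constant Wide_String := WS.Decode (Item, Scheme);
      begin
         Ada.Text_IO.Put_Line
           (Label & ": decoded" & Integer'Image (Result'Length) & " character(s)");
      end;
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line (Label & ": Encoding_Error");
   end Try;

   --  Well-formed UTF-8 for the single character 'A'
   Valid : constant UTF_String := (1 => 'A');

   --  Three bytes cannot be a whole number of 16-bit code units
   Odd_Length : constant UTF_String :=
     Character'Val (16#00#) & Character'Val (16#41#) & Character'Val (16#00#);

   --  UTF-8 for U+1F600, a code point above 16#FFFF#
   Beyond_BMP : constant UTF_String :=
     Character'Val (16#F0#) & Character'Val (16#9F#)
     & Character'Val (16#98#) & Character'Val (16#80#);
begin
   Try ("valid UTF-8", Valid, UTF_8);
   Try ("odd-length UTF-16BE", Odd_Length, UTF_16BE);
   Try ("code point above 16#FFFF#", Beyond_BMP, UTF_8);
end Decode_Errors;
```

Under the rules quoted above, the second and third calls are expected to take the Encoding_Error branch, while the first decodes normally.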
**************************************************************** From: Randy Brukardt Sent: Friday, June 11, 2010 1:49 AM I just noticed that the proposed packages are missing part of Robert's proposal. Robert's last "ARGGH" message (mail of March 27th) noted that we needed to add an encoding package for type String, and that forces renaming all of the packages. Neither of these things was done. Robert had suggested names of: Ada.Strings.UTF_Encoding.String_Encoding; Ada.Strings.UTF_Encoding.Wide_String_Encoding; Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding; These seem like candidates for inclusion in the museum of redundancy museum, with both "String" and "Encoding" repeated twice. There also seems to be something wrong here, because these are very odd packages for Strings: there are no Wide_String and Wide_Wide_String versions. (All existing Ada.Strings packages come in all three versions.) But of course there *are* Wide and Wide_String versions, they just have the *wrong* names! There aren't supposed to be any Wide_Wide_String operations under Ada.Strings! These really ought to be called: Ada.Strings.UTF_Encoding.String_Encoding; Ada.Wide_Strings.UTF_Encoding.Wide_String_Encoding; Ada.Wide_Wide_Strings.UTF_Encoding.Wide_Wide_String_Encoding; but that doesn't work because we can't share the parent in that case. What to do?? One possibility is to put them under Ada.Characters (these aren't that different than Ada.Characters.Handling, and it is usual under Ada.Characters to not have Wide and Wide_Wide versions). Then the structure can be left the same. Alternatively, we could split these up and use appropriate with clauses for the shared text: Ada.Characters.UTF_Encoding; Ada.Strings.UTF_Encoding; Ada.Wide_Strings.Wide_UTF_Encoding; Ada.Wide_Wide_Strings.Wide_Wide_UTF_Encoding; Perhaps there is even a better idea?? 
**************************************************************** From: Robert Dewar Sent: Friday, June 11, 2010 7:06 AM > Robert had suggested names of: > > Ada.Strings.UTF_Encoding.String_Encoding; > Ada.Strings.UTF_Encoding.Wide_String_Encoding; > Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding; Note that's what is implemented now in GNAT, I am happy with these names, but don't mind too much if someone wants to put these packages somewhere else, I would be willing to move them around and rename them > Alternatively, we could split these up and use appropriate with > clauses for the shared text: > > Ada.Characters.UTF_Encoding; > Ada.Strings.UTF_Encoding; > Ada.Wide_Strings.Wide_UTF_Encoding; > Ada.Wide_Wide_Strings.Wide_Wide_UTF_Encoding; I do not like at all any further reorganization of the packages. I wasted enough time on these (and believe me, no one else at AdaCore has the slightest interest in them :-) :-)) **************************************************************** From: Jean-Pierre Rosen Sent: Wednesday, June 16, 2010 4:37 AM > I just noticed that the proposed packages are missing part of Robert's > proposal. Unless I missed something, I don't think Robert sent the String version. (Robert are you listening...;-) ) **************************************************************** From: Robert Dewar Sent: Monday, July 26, 2010 6:53 AM I have reviewed the packages in this AI (but not all the commentary, I will try to find time to do that some time). 
Right now GNAT implements: > a-stuten.ads:package Ada.Strings.UTF_Encoding > a-suenco.ads:package Ada.Strings.UTF_Encoding.Conversions > a-suesen.ads:package Ada.Strings.UTF_Encoding.String_Encoding > a-suewse.ads:package Ada.Strings.UTF_Encoding.Wide_String_Encoding > a-suezse.ads:package Ada.Strings.UTF_Encoding.Wide_Wide_String_Encoding And the proposal has the packages > package Ada.Strings.UTF_Encoding > package Ada.Strings.UTF_Encoding.Conversions > package Ada.Strings.UTF_Encoding.Wide_Encoding > package Ada.Strings.UTF_Encoding.Wide_Wide_Encoding I somewhat prefer my names, but I don't mind changing. I do intend to keep the GNAT package that is now called String_Encoding, its contents are: > package Ada.Strings.UTF_Encoding.String_Encoding is > pragma Pure (String_Encoding); > > -- The encoding routines take a String as input and encode the result > -- using the specified UTF encoding method. The result includes a BOM if > -- the Output_BOM argument is set to True. All 256 values of type Character > -- are valid, so Encoding_Error cannot be raised for string input data. > > function Encode > (Item : String; > Output_Scheme : Encoding_Scheme; > Output_BOM : Boolean := False) return UTF_String; > -- Encode String using UTF-8, UTF-16LE or UTF-16BE encoding as specified by > -- the Output_Scheme parameter. > > function Encode > (Item : String; > Output_BOM : Boolean := False) return UTF_8_String; > -- Encode String using UTF-8 encoding > > function Encode > (Item : String; > Output_BOM : Boolean := False) return UTF_16_Wide_String; > -- Encode String using UTF_16 encoding > > -- The decoding routines take a UTF String as input, and return a decoded > -- Wide_String. If the UTF String starts with a BOM that matches the > -- encoding method, it is ignored. An incorrect BOM raises Encoding_Error, > -- as does a code out of range of type Character. 
> > function Decode > (Item : UTF_String; > Input_Scheme : Encoding_Scheme) return String; > -- The input is encoded in UTF_8, UTF_16LE or UTF_16BE as specified by the > -- Input_Scheme parameter. It is decoded and returned as a String value. > -- Note: a convenient form for scheme may be Encoding (UTF_String). > > function Decode > (Item : UTF_8_String) return String; > -- The input is encoded in UTF-8 and returned as a String value > > function Decode > (Item : UTF_16_Wide_String) return String; > -- The input is encoded in UTF-16 and returned as a String value > > end Ada.Strings.UTF_Encoding.String_Encoding; I don't really care one way or another if this is included in the standard or not (it doesn't affect us), obviously I think it should be included. I do have to think about its name. If I change the names of the Wide and Wide_Wide packages to match the AI, then I suppose I will call this additional package Ada.Strings.UTF_Encoding.Encoding which seems a bit redundant. I really prefer my names, they emphasize that the three packages deal with Strings, Wide_Strings and Wide_Wide_Strings. In fact I think I will keep my names anyway, but I will provide the AI names as renamings if people really feel they are better. Thoughts? **************************************************************** From: Tucker Taft Sent: Monday, July 26, 2010 9:34 AM We debated various namings and came up with the ones in the AI. It is not clear whether there is sufficient reason to reopen the AI. **************************************************************** From: Robert Dewar Sent: Monday, July 26, 2010 10:03 AM Yes, but note the additional factor that you are missing a package, and the name Ada.Strings.UTF_Encoding.Encoding for the String specific version is distinctly unpleasant in my view! It was the addition of this package (which is definitely needed) that made me add String to the names. The real issue here is whether to add the missing package. 
Once that is decided, the issue of what to call it can be addressed In the case of package Ada.Strings.UTF_Encoding.Wide_Wide_Encoding You know it has to do with Wide_Wide_String, but that hint is not there for the base package. As I say, I am going to maintain the existing names in GNAT for this reason, but it's of course trivial to add renames for the 2/3 packages that are in the standard. **************************************************************** From: Randy Brukardt Sent: Monday, July 26, 2010 1:19 PM That would be true, except that (a) the AI Robert referenced (version /01) doesn't reflect the discussion from Valencia; and (b) we only voted intent in Valencia because of the extensive changes that the AI underwent there. Jean-Pierre sent a newer version of this AI to the ARG list on June 25th; I haven't posted it yet, but anyone planning to comment on the AI needs to refer to that most recent draft and not previous versions. Thus I agree with Tucker that Robert's comments are moot, but for a different reason: he's referring to an obsolete version of the AI. P.S. Note that the package names were changed again, this time because of a complaint that is recorded as coming from Ed: too many "Encoding"s in the names. **************************************************************** From: Robert Dewar Sent: Monday, July 26, 2010 5:42 PM Can we at least see the current modified specs ASAP. I really don't care about all the text, just the specs are enough. BTW, when you use the word moot, are you using it in the correct sense of "undecided, arguable", or in the peculiar common US modern sense of "irrelevant". I recommend against using the word at all because of this confusion. **************************************************************** From: Randy Brukardt Sent: Monday, July 26, 2010 6:40 PM > Can we at least see the current modified specs ASAP. I really don't > care about all the text, just the specs are enough. 
I've forwarded J-P's old message to you (note that I haven't edited it yet, so there may be some small changes before it gets posted). For any one else that is interested, it was sent June 25th to this list. > BTW, when you use the word moot, are you using it in the correct sense > of "undecided, arguable", or in the peculiar common US modern sense of > "irrelevant". I recommend against using the word at all because of > this confusion. I think I used it correctly, but I think whether I used it correctly is moot (using it correctly). :-) I actually thought about this before I used the word, because I knew that you have complained about inaccurate use of this word in the past and that the definition isn't "irrelevant". In this case, I meant that your comments ought "to be left undecided", regardless of their technical merits, because they were superseded by the full ARG meeting. **************************************************************** From: Robert Dewar Sent: Monday, July 26, 2010 6:47 PM > I've forwarded J-P's old message to you (note that I haven't edited it > yet, so there may be some small changes before it gets posted). For > any one else that is interested, it was sent June 25th to this list. So this message basically represents the current proposal? >> BTW, when you use the word moot, are you using it in the correct >> sense of "undecided, arguable", or in the peculiar common US modern >> sense of "irrelevant". I recommend against using the word at all >> because of this confusion. > > I think I used it correctly, but I think whether I used it correctly > is moot (using it correctly). :-) > > I actually thought about this before I used the word, because I knew > that you have complained about inaccurate use of this word in the past > and that the definition isn't "irrelevant". In this case, I meant that > your comments ought "to be left undecided", regardless of their > technical merits, because they were superseded by the full ARG meeting. 
fair enough **************************************************************** From: Robert Dewar Sent: Monday, July 26, 2010 6:49 PM I will look at the new version and comment ASAP! **************************************************************** From: Robert Dewar Sent: Monday, July 26, 2010 7:18 PM I read the specs in the latest version Randy sent to me, and they are fine, I agree that removing the second Encoding from the names makes sense, and I will make that change in the GNAT implementation. **************************************************************** From: Randy Brukardt Sent: Monday, July 26, 2010 7:42 PM For the record, I've now checked over Jean-Pierre's version carefully, and while I found a couple of typos and the !corrigendum section was incompletely updated, I didn't find any significant problems. ****************************************************************