Version 1.20 of ais/ai-00285.txt


!standard 2.1(1)          04-11-10 AI95-00285/11
!standard 2.1(2)
!standard 2.1(3)
!standard 2.1(4)
!standard 2.1(5)
!standard 2.1(7)
!standard 2.1(8)
!standard 2.1(9)
!standard 2.1(10)
!standard 2.1(11)
!standard 2.1(12)
!standard 2.1(13)
!standard 2.1(14)
!standard 2.1(15)
!standard 2.1(16)
!standard 2.1(17)
!standard 0.3(32)
!standard 0.3(34)
!standard 1.1.4(14)
!standard 2.2(03)
!standard 2.2(04)
!standard 2.2(05)
!standard 2.2(08)
!standard 2.2(09)
!standard 2.3(02)
!standard 2.3(03)
!standard 2.3(05)
!standard 2.6(06)
!standard 3.5(28)
!standard 3.5(29)
!standard 3.5(34)
!standard 3.5(37)
!standard 3.5(40)
!standard 3.5(41)
!standard 3.5(42)
!standard 3.5(43)
!standard 3.5(44)
!standard 3.5(45)
!standard 3.5(51)
!standard 3.5(55)
!standard 3.5(56)
!standard 3.5(59)
!standard 3.5.2(2)
!standard 3.5.2(3)
!standard 3.5.2(4)
!standard 3.5.2(5)
!standard 3.6.3(2)
!standard 3.6.3(4)
!standard A.1(36)
!standard A.1(42)
!standard A.1(49)
!standard A.3(1)
!standard A.3.2(13)
!standard A.3.2(14)
!standard A.3.2(16)
!standard A.3.2(18)
!standard A.3.2(42)
!standard A.3.2(43)
!standard A.3.2(44)
!standard A.3.2(45)
!standard A.3.2(46)
!standard A.3.2(47)
!standard A.3.2(48)
!standard A.3.2(49)
!standard A.4(1)
!standard A.4.1(4)
!standard A.4.8(1)
!standard A.6(1)
!standard A.7(4)
!standard A.7(10)
!standard A.7(13)
!standard A.7(15)
!standard A.11(00)
!standard A.11(01)
!standard A.11(02)
!standard A.11(03)
!standard A.12(01)
!standard A.12.4(01)
!standard B.3(39)
!standard B.3(40)
!standard B.3(60)
!standard C.5(07)
!standard F(04)
!standard F.3(01)
!standard F.3(06)
!standard F.3(19)
!standard F.3(20)
!standard F.3.5(01)
!standard G.1.5(01)
!standard H.4(20)
!class amendment 02-01-23
!status Amendment 200Y 04-09-27
!status ARG Approved 8-0-1 04-09-17
!status work item 02-09-24
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Support for 16-bit and 32-bit characters
!summary
Support is added for program text using the entire set of characters from ISO/IEC 10646:2003, and for operating on characters outside of the BMP at run-time.
!problem
SC22 directed its working groups to provide support for the ISO/IEC 10646 character set. Resolution 02-24 "Recommendation on Coded Character Sets Support" of the SC22 2002 plenary states:
"JTC 1/SC 22 believes that programming languages should offer the appropriate support for ISO/IEC 10646, and the Unicode character set where appropriate."
Moreover, ISO/IEC 10646:2003 makes use of planes other than the BMP.
!proposal
The essence of this proposal is to allow the source of the program to be written using 16-bit characters (from the BMP) or 32-bit characters. Also, it makes it possible to operate on 32-bit characters at run-time.
The main difficulty in supporting characters beyond Row 00 of the BMP in the program text is to define how identifiers and literals are built (which characters are letters, digits, etc.) and to define the lower/upper case equivalence rules. Fortunately, the Unicode Consortium has already done most of the work for us, so it's only a matter of defining how we want to piggyback on their categorization and conversion rules.
Unicode defines a "character database" which describes all the properties of each character. The most important property for our purposes is the "General Category". General categories are disjoint. The following categories are of interest for describing Ada program text:
- Letter, Uppercase -- e.g., LATIN CAPITAL LETTER A
- Letter, Lowercase -- e.g., LATIN SMALL LETTER A
- Letter, Titlecase -- e.g., LATIN CAPITAL LETTER L WITH SMALL LETTER J
- Letter, Modifier -- e.g., MODIFIER LETTER APOSTROPHE
- Letter, Other -- e.g., HEBREW LETTER ALEF
- Mark, Non-Spacing -- e.g., COMBINING GRAVE ACCENT
- Mark, Spacing Combining -- e.g., MUSICAL SYMBOL COMBINING AUGMENTATION DOT
- Number, Decimal Digit -- e.g., DIGIT ZERO
- Number, Letter -- e.g., ROMAN NUMERAL TWO
- Other, Control -- e.g., NULL
- Other, Format -- e.g., ACTIVATE ARABIC FORM SHAPING
- Other, Private Use -- e.g., <Private Use, First>
- Other, Surrogate -- e.g., <Non Private Use High Surrogate, First>
- Punctuation, Connector -- e.g., LOW LINE
- Separator, Space -- e.g., SPACE
- Separator, Line -- e.g., LINE SEPARATOR
- Separator, Paragraph -- e.g., PARAGRAPH SEPARATOR
(See http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#General_Category_Values for details on the categorization.)
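The General Category of a character can be looked up with any library that exposes the Unicode character database. As an illustration only, the following sketch uses Python's standard unicodedata module to print the two-letter category abbreviation for some of the sample characters listed above. The choice of Python, the SAMPLES table and the helper general_category are assumptions of this example and not part of the proposal; the module's tables also track a later Unicode version than 4.0.

```python
import unicodedata

# (code position, character name) pairs taken from the list of categories above.
SAMPLES = [
    (0x0041, "LATIN CAPITAL LETTER A"),
    (0x0061, "LATIN SMALL LETTER A"),
    (0x01C8, "LATIN CAPITAL LETTER L WITH SMALL LETTER J"),
    (0x02BC, "MODIFIER LETTER APOSTROPHE"),
    (0x05D0, "HEBREW LETTER ALEF"),
    (0x0300, "COMBINING GRAVE ACCENT"),
    (0x0030, "DIGIT ZERO"),
    (0x2161, "ROMAN NUMERAL TWO"),
    (0x005F, "LOW LINE"),
    (0x0020, "SPACE"),
    (0x2028, "LINE SEPARATOR"),
    (0x2029, "PARAGRAPH SEPARATOR"),
]


def general_category(code_position):
    """Return the two-letter General Category abbreviation (e.g. 'Lu')."""
    return unicodedata.category(chr(code_position))


for code_position, name in SAMPLES:
    # Print the position in Ada based-literal style next to its category.
    print(f"16#{code_position:04X}#  {general_category(code_position)}  {name}")
```

The abbreviations correspond to the long names used in this AI: for instance "Lu" is "Letter, Uppercase" and "Pc" is "Punctuation, Connector".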
In paragraph 2.1 we define a non-terminal of the grammar for each of the above categories, e.g., letter_uppercase, letter_lowercase, etc.
The characters in the category other_format are effectively ignored in most lexical elements, with the exception that they are illegal in string_literals and character_literals.
Throughout the syntax rules, we specify which characters are allowed for the lexical elements. For instance, the E in the exponent part of a numeric literal may not be a "GREEK CAPITAL LETTER EPSILON", even though a capital E and a capital epsilon look very much the same. Similar considerations apply to the extended digits, the point, etc.
Unicode proposes to define identifiers for programming languages as follows (see annex 7 of UAX #15 at http://www.unicode.org/reports/tr15/tr15-23.html#Programming_Language_Identifiers):
identifier ::= identifier_start {identifier_start | identifier_extend}
identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter
identifier_extend ::= mark_non_spacing | mark_spacing_combining | number_decimal_digit | punctuation_connector | other_format
This definition was made with C in mind, and is not exactly appropriate for Ada, as it would allow consecutive underlines. Because the underline is the only character of Row 00 of the BMP which is a punctuation_connector, it seems sensible to remain close to the existing syntax rules of 2.3(2-3), and to use the following definitions:
identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter
identifier_extend ::= identifier_start | mark_non_spacing | mark_spacing_combining | number_decimal_digit | other_format
identifier ::= identifier_start {[punctuation_connector] identifier_extend}
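To make the effect of these rules concrete, here is a minimal sketch of a recognizer for the Ada-style identifier syntax given above, written in Python with the standard unicodedata module. The function name is_identifier and the category sets are assumptions of this example; the sketch does not check for reserved words and uses whatever Unicode version the Python runtime ships.

```python
import unicodedata

# General Category abbreviations for the non-terminals of the grammar.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = IDENTIFIER_START | {"Mn", "Mc", "Nd", "Cf"}
PUNCTUATION_CONNECTOR = "Pc"


def is_identifier(text):
    """Check text against:
    identifier ::= identifier_start {[punctuation_connector] identifier_extend}
    """
    if not text or unicodedata.category(text[0]) not in IDENTIFIER_START:
        return False
    index = 1
    while index < len(text):
        if unicodedata.category(text[index]) == PUNCTUATION_CONNECTOR:
            # A connector must be followed by an identifier_extend.
            index += 1
            if index == len(text):
                return False
        if unicodedata.category(text[index]) not in IDENTIFIER_EXTEND:
            return False
        index += 1
    return True
```

With this sketch, is_identifier("Max_Value") holds, while "Max__Value", "Max_" and "_Max" are rejected, which is the behavior the existing rules of 2.3(2-3) give for the underline.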
Unicode recommends that, before storing or comparing identifiers, the following transformations be applied:
o Characters in category other_format are filtered out.
o For languages which have case insensitive identifiers, Normalization Form KC is applied (see http://www.unicode.org/reports/tr15/tr15-23.html#Specification). This is to ensure that identifiers which look visually the same are considered as identical, even if they are composed of different characters.
o _Full_ case folding, as described in the table http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt, is used to find the uppercase version of each character.
We decided not to apply Normalization Form KC, as there seems to be insufficient experience on using normalization forms. This seems to be a lose-lose situation anyway: without normalization, texts that look alike don't have the same meaning; with normalization the widely available text tools like grep, awk, etc. don't work. We follow the lead of C# (ECMA-334) in specifying that a program which is not in Normalization Form KC has an implementation-defined effect. This ensures that a program text which is normalized is portable. It also allows an implementation to provide useful support for non-normalized texts if appropriate in a particular computing environment (in that case, the implementation must document how it handles such texts).
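The following sketch illustrates what it means for a text to be, or not to be, in Normalization Form KC. It uses Python's standard unicodedata.normalize; the helper names to_nfkc and is_nfkc are assumptions of this example, and the sketch says nothing about how an Ada implementation would treat non-normalized text.

```python
import unicodedata


def to_nfkc(text):
    """Return text converted to Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


def is_nfkc(text):
    """True if text is already in Normalization Form KC."""
    return to_nfkc(text) == text


# Two spellings that look alike but are different character sequences.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E, COMBINING ACUTE ACCENT

print(composed == decomposed)           # compared character by character
print(to_nfkc(decomposed) == composed)  # compared after normalization
print(is_nfkc(composed), is_nfkc(decomposed))
```

A tool such as grep compares the raw sequences, which is why the two spellings are distinct to it unless the text has been normalized beforehand.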
Unicode doesn't provide guidance for the composition of numeric literals, so we don't change them. The use of the digits at positions 16#30# to 16#39# is universal in computer science, and allowing digits from other cultures could cause confusion while bringing little benefit.
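As a small illustration of the distinction, the sketch below contrasts the digits permitted in numeric literals (positions 16#30# to 16#39#) with the wider Unicode category "Number, Decimal Digit". The helper is_numeric_literal_digit and the sample character are assumptions of this example.

```python
import unicodedata


def is_numeric_literal_digit(ch):
    """True only for the digits at code positions 16#30# .. 16#39#."""
    return 0x30 <= ord(ch) <= 0x39


# ARABIC-INDIC DIGIT ZERO is a decimal digit for Unicode, but it is not
# one of the ten digits retained for numeric literals.
ARABIC_INDIC_DIGIT_ZERO = "\u0660"

print(unicodedata.category(ARABIC_INDIC_DIGIT_ZERO))
print(is_numeric_literal_digit(ARABIC_INDIC_DIGIT_ZERO))
print(is_numeric_literal_digit("7"))
```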
The definition and role of format_effectors is modified to include the characters at positions 16#85#, 16#2028# and 16#2029#. These characters may be used to terminate lines, as recommended by section 5.8 of Unicode 4.0 (see http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G10213).
Note that characters in category other_format are forbidden in character_literals and string_literals, because their sole purpose is to affect the presentation of characters. If a program needs to operate on these characters, it can do that by using Wide_Wide_Character'Val (...).
Private use characters are not considered to be graphic characters (even though for some applications they may actually turn out to be graphic). The reason is that we wouldn't be able to define the normalization and case folding rules for these characters, so it seems better to disallow them, except in comments where they cannot do any harm.
We are removing 3.5.2(5) since an implementation may want to provide a nonstandard mode where the set of graphic characters is not a proper subset of that defined in ISO/IEC 10646:2003, for instance to deal with private use characters. We don't want to prevent implementations from doing anything useful. This paragraph has no force anyway, since in a nonstandard mode an implementation may do pretty much what it likes.
In order to represent 32-bit characters at run-time, we add new declarations to Standard. We also provide the following new predefined packages for 32-bit characters:
Ada.Strings.Wide_Wide_Bounded
Ada.Strings.Wide_Wide_Fixed
Ada.Strings.Wide_Wide_Maps
Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants
Ada.Strings.Wide_Wide_Unbounded
Ada.Wide_Wide_Text_IO
Ada.Wide_Wide_Text_IO.Text_Streams
Ada.Wide_Wide_Text_IO.Complex_IO
Ada.Wide_Wide_Text_IO.Editing
These packages are similar to their Wide_ equivalents, with Wide_Wide_ substituted for Wide_ everywhere. In addition the following declaration is present in Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants:
Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
It contains each Wide_Wide_Character value in the BMP of ISO/IEC 10646:2003.
The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width are also provided. Their definition is similar to that of Wide_Image, Wide_Value and Wide_Width, respectively, with Wide_Character and Wide_String replaced by Wide_Wide_Character and Wide_Wide_String.
Note that the dynamic semantics of a number of operations (attribute Value, procedures Get in Text_IO, procedures Trim in the string packages, etc.) are defined in terms of "space" and "blank". A space is the character at position 16#20# and a blank is either a space or a horizontal tabulation. We are not changing the definition of space or blank, so characters like NO-BREAK SPACE or IDEOGRAPHIC SPACE are not considered to be space or blank in this context.
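The consequence can be modeled in a few lines. The sketch below is a Python model of the definitions of space and blank given above, not of any Ada library routine; the names is_space, is_blank and trim_spaces are assumptions of this example.

```python
SPACE = "\u0020"
HORIZONTAL_TABULATION = "\u0009"
NO_BREAK_SPACE = "\u00a0"
IDEOGRAPHIC_SPACE = "\u3000"


def is_space(ch):
    """A space is only the character at position 16#20#."""
    return ch == SPACE


def is_blank(ch):
    """A blank is a space or a horizontal tabulation."""
    return ch in (SPACE, HORIZONTAL_TABULATION)


def trim_spaces(text):
    """Remove leading and trailing spaces (16#20# only)."""
    return text.strip(SPACE)


# The no-break space survives trimming because it is not a space here.
print(repr(trim_spaces("  42 ")))
print(repr(trim_spaces(NO_BREAK_SPACE + "42")))
```

Under this model, an image with a leading NO-BREAK SPACE keeps that character after trimming, so it would not have the syntax of a numeric literal.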
SC22/WG14 is planning to include support for Unicode 16- and 32-bit characters in C. Their proposal is presented in ISO/IEC TR 19769:2004 (http://www.open-std.org/jtc1/sc22//WG14/www/docs/n1040.pdf). In order to provide compatibility with the upcoming C standard, new types are added to Interfaces.C that correspond to C char16_t and char32_t. It is recognized that adding new declarations to predefined units can cause incompatibilities, but it is thought that the new identifiers are unlikely to conflict with existing code.
There has been considerable discussion in the ARG regarding the best reference material to use for this AI. ISO/IEC 10646:2003, ISO/IEC TR 10176 (4th edition) and Unicode 4.0 are all relevant. To clarify the matter, we presented an earlier version of this AI (version 8) to the September 2004 SC22 plenary meeting in Jeju, Korea (document N3758). SC22 passed the following resolution:
"Resolution 04-15: Coded Character Sets: JTC 1/SC 22 agrees that the proposed implementation of coded character set support described in document N 3758 agrees with the principles for coded character set support previously adopted by SC 22, notably resolution 02-24. JTC 1/SC 22 instructs WG 9 to consider referencing ISO/IEC TR 10176 Annex A in the revision of the Ada language standard."
The AARM note in section 2.1(4-14) of the wording explains why we decided to use ISO/IEC 10646:2003 and Unicode 4.0 instead of ISO/IEC TR 10176.
!wording
In Introduction (32) change:
... Character, [and Wide_Character]{Wide_Character, and Wide_Wide_Character} ...
In Introduction (34) change:
... String [and Wide_String]{, Wide_String, and Wide_Wide_String} ...
Add after 1.1.4(14):
The terminals of the grammar, including reserved words, punctuation and components of lexical elements, are exclusively made of the characters whose code position is between 16#20# and 16#7E#, inclusively. [For example, the character E in the definition of exponent is the character whose name is "LATIN CAPITAL LETTER E", not "GREEK CAPITAL LETTER EPSILON".]
Replace 2.1(1) by:
The characters whose code position is 16#FFFE# or 16#FFFF# are not allowed anywhere in the text of a program. The characters in categories other_control, other_private_use, and other_surrogate are only allowed in comments.
Delete 2.1(2-3).
Replace 2.1(4-14) by:
The character repertoire for the text of an Ada program consists of the collection of characters described by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. The coded representation for these characters is implementation defined (it need not be a representation defined within ISO/IEC 10646:2003).
The semantics of an Ada program whose text is not in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) are implementation defined.
The description of the language definition in this International Standard uses the character properties General Category, Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of the documents referenced by the note in section 1 of ISO/IEC 10646:2003. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified.
The categories of characters are defined as follows:
letter_uppercase Any character whose General Category is defined to be "Letter, Uppercase".
letter_lowercase Any character whose General Category is defined to be "Letter, Lowercase".
letter_titlecase Any character whose General Category is defined to be "Letter, Titlecase".
letter_modifier Any character whose General Category is defined to be "Letter, Modifier".
letter_other Any character whose General Category is defined to be "Letter, Other".
mark_non_spacing Any character whose General Category is defined to be "Mark, Non-Spacing".
mark_spacing_combining Any character whose General Category is defined to be "Mark, Spacing Combining".
number_decimal_digit Any character whose General Category is defined to be "Number, Decimal Digit".
number_letter Any character whose General Category is defined to be "Number, Letter".
other_control Any character whose General Category is defined to be "Other, Control".
other_format Any character whose General Category is defined to be "Other, Format".
other_private_use Any character whose General Category is defined to be "Other, Private Use".
other_surrogate Any character whose General Category is defined to be "Other, Surrogate".
punctuation_connector Any character whose General Category is defined to be "Punctuation, Connector".
separator_space Any character whose General Category is defined to be "Separator, Space".
separator_line Any character whose General Category is defined to be "Separator, Line".
separator_paragraph Any character whose General Category is defined to be "Separator, Paragraph".
format_effector The characters whose code position is 16#09# (CHARACTER TABULATION), 16#0A# (LINE FEED(LF)), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED(FF)), 16#0D# (CARRIAGE RETURN(CR)), 16#85# (NEXT LINE(NEL)), and the characters in categories separator_line and separator_paragraph. The names mentioned in parentheses in this list are not defined by ISO/IEC 10646:2003; they are only used for convenience in this International Standard.
graphic_character Any character which is not in the categories other_control, other_private_use, other_surrogate, other_format, format_effector, and whose code position is neither 16#FFFE# nor 16#FFFF#.
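The definitions of format_effector and graphic_character above can be sketched as two predicates. This is an illustration in Python using the standard unicodedata module; the function names are assumptions of this example, and unassigned code positions are treated as graphic simply because the definition above does not exclude them.

```python
import unicodedata

# Code positions named explicitly in the definition of format_effector.
FORMAT_EFFECTOR_POSITIONS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}


def is_format_effector(ch):
    """The listed positions plus separator_line and separator_paragraph."""
    return (ord(ch) in FORMAT_EFFECTOR_POSITIONS
            or unicodedata.category(ch) in ("Zl", "Zp"))


def is_graphic_character(ch):
    """Not control, private use, surrogate, format or format_effector,
    and not at code position 16#FFFE# or 16#FFFF#."""
    if ord(ch) in (0xFFFE, 0xFFFF):
        return False
    if unicodedata.category(ch) in ("Cc", "Co", "Cs", "Cf"):
        return False
    return not is_format_effector(ch)
```

The explicit test for 16#FFFE# and 16#FFFF# is needed because those positions do not belong to any of the excluded categories.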
AARM NOTE
We considered basing the definition of lexical elements on Annex A of ISO/IEC TR 10176 (4th edition), which lists the characters which should be supported in identifiers for all programming languages, but we finally decided against this option. Note that it is not our intent to diverge from ISO/IEC TR 10176, except to the extent that ISO/IEC TR 10176 itself diverges from ISO/IEC 10646:2003 (which is the case at the time of this writing).
More precisely, we intend to align strictly with ISO/IEC 10646:2003. It must be noted that ISO/IEC TR 10176 is a Technical Report while ISO/IEC 10646:2003 is a Standard. If one has to make a choice, one should conform with the Standard rather than with the Technical Report. And, it turns out that one must make a choice because there are important differences between the two:
o ISO/IEC TR 10176 is still based on ISO/IEC 10646:2000 while ISO/IEC 10646:2003 has already been published for a year.
o There are considerable differences between the two editions of ISO/IEC 10646, notably in supporting characters beyond the BMP (this might be significant for some languages, e.g. Korean).
o ISO/IEC TR 10176 is a moving target. It is in its fourth edition already, and nevertheless needs additional revision to catch up with ISO/IEC 10646:2003. We cannot afford to revise the Ada language and the vendors cannot afford to change the compilers each time ISO/IEC TR 10176 changes. And we cannot afford to delay the adoption of our amendment until ISO/IEC TR 10176 has been revised; we would run out of interest, money, and the ISO time table before then.
o ISO/IEC TR 10176 does not define case conversion tables, which are essential for a case-insensitive language like Ada. To get case conversion tables, we would have to reference either ISO/IEC 10646:2003 or Unicode, or we would have to invent our own.
For the purpose of defining the lexical elements of the language, we need character properties like categorization, as well as case conversion tables. These are mentioned in ISO/IEC 10646:2003 as useful for implementations, with a reference to Unicode. Machine-readable tables are available on the web at URLs:
http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt
with an explanatory document found at URL:
http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html
The actual text of the standard only makes specific references to the corresponding clauses of ISO/IEC 10646:2003, not to Unicode.
END AARM NOTE
Change 2.1(15):
Replace the leading sentence by: The following names are used when referring to certain characters (the first name is that given in ISO/IEC 10646:2003):
Replace "symbol" by "graphic symbol" in the column headers.
Delete the last four characters (Ada doesn't use brackets).
Add ! Exclamation Mark and % Percent Sign to the table.
Replace the AARM Note:
This table serves to show the correspondence between ISO 10646 names and the graphic symbols (glyphs) used in this International Standard. These are the characters that play a special role in the syntax of Ada.
Delete 2.1(16).
Delete 2.1(17).
Replace 2.2(3-5) by:
In some cases an explicit _separator_ is required to separate adjacent lexical elements. A separator is any of a separator_space, a format_effector or the end of a line, as follows:
o A separator_space is a separator except within a comment, a string_literal, or a character_literal.
o Character Tabulation is a separator except within a comment.
Replace 2.2(8) by:
A delimiter is either one of the following characters:
Replace 2.3(2-3) by:
identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter
identifier_extend ::= identifier_start | mark_non_spacing | mark_spacing_combining | number_decimal_digit | other_format
identifier ::= identifier_start {[punctuation_connector] identifier_extend}
Replace 2.3(5) by:
Two identifiers are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order):
o The characters in category other_format are eliminated.
o Locale-independent full case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003, is applied to obtain the uppercase version of each character.
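The two transformations can be sketched as follows. The example is written in Python and relies on str.casefold, which applies Unicode full case folding and yields a folded (mostly lowercase) form rather than an uppercase one; only equality of the resulting keys is used here. The names identifier_key and same_identifier are assumptions of this example, and the folding tables are those of the Python runtime, not necessarily those of Unicode 4.0.

```python
import unicodedata


def identifier_key(identifier):
    """Apply the two transformations, in order."""
    # 1. Eliminate the characters in category other_format ("Cf").
    without_format = "".join(
        ch for ch in identifier if unicodedata.category(ch) != "Cf"
    )
    # 2. Apply locale-independent full case folding.
    return without_format.casefold()


def same_identifier(left, right):
    """True if the two identifiers are considered the same."""
    return identifier_key(left) == identifier_key(right)
```

Because the folding is full rather than simple, an identifier containing LATIN SMALL LETTER SHARP S compares equal to the same identifier spelled with "SS". Since no normalization is applied, a composed and a decomposed spelling of an accented letter remain distinct.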
Add after 2.6(6):
No modification is performed on the sequence of characters in a string_literal.
Replace 3.5(28-29) by:
S'Wide_Wide_Image S'Wide_Wide_Image denotes a function with the following specification:
function S'Wide_Wide_Image (Arg : S'Base) return Wide_Wide_String;
Add after 3.5(34):
S'Wide_Image S'Wide_Image denotes a function with the following specification:
function S'Wide_Image (Arg : S'Base) return Wide_String;
The function returns an image of the value of Arg as a Wide_String. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Wide_Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).
Replace 3.5(37) by:
The function returns an image of the value of Arg as a String. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).
Add after 3.5(37):
S'Wide_Wide_Width S'Wide_Wide_Width denotes the maximum length of a Wide_Wide_String returned by S'Wide_Wide_Image over all the values of S. It denotes zero for a subtype that has a null range. Its type is universal_integer.
Replace 3.5(40-45) by:
S'Wide_Wide_Value S'Wide_Wide_Value denotes a function with the following specification:
function S'Wide_Wide_Value (Arg : Wide_Wide_String) return S'Base;
This function returns a value given an image of the value as a Wide_Wide_String, ignoring any leading or trailing spaces.
For the evaluation of a call on S'Wide_Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Wide_Image for a nongraphic character of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised.
For the evaluation of a call on S'Wide_Wide_Value for an integer subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an integer literal, with an optional leading sign character (plus or minus for a signed type; only plus for a modular type), and the corresponding numeric value belongs to the base range of the type of S, then that value is the result; otherwise Constraint_Error is raised.
For the evaluation of a call on S'Wide_Wide_Value for a real subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of one of the following:
Add after 3.5(51):
S'Wide_Value S'Wide_Value denotes a function with the following specification:
function S'Wide_Value (Arg : Wide_String) return S'Base;
This function returns a value given an image of the value as a Wide_String, ignoring any leading or trailing spaces.
For the evaluation of a call on S'Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Image for a value of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised.
For a numeric subtype S, the evaluation of a call on S'Wide_Value with Arg of type Wide_String is equivalent to a call on S'Wide_Wide_Value for a corresponding Arg of type Wide_Wide_String.
At the end of 3.5(55) change:
... to a call on [S'Wide_Value]{S'Wide_Wide_Value} for a corresponding Arg of type [Wide_String]{Wide_Wide_String}.
In 3.5(56) change:
... {Wide_Wide_Value,} Wide_Value, Value, {Wide_Wide_Image,} Wide_Image, and Image ...
In 3.5(59) change:
... as [does]{do} S'Wide_Value (S'Wide_Image (V)) {and S'Wide_Wide_Value (S'Wide_Wide_Image (V))} ...
In the middle of 3.5.2(2), change:
... the attributes [(Wide_)Image and (Wide_)Value]{Image, Wide_Image, Wide_Wide_Image, Value, Wide_Value, and Wide_Wide_Value}
Replace 3.5.2(3) with:
The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding character_literal in Wide_Character. The first 256 values of Wide_Character have the same character_literal or language-defined name as defined for Character. Each of the graphic_characters has a corresponding character_literal.
The predefined type Wide_Wide_Character is a character type whose values correspond to the 2147483648 code positions of the ISO/IEC 10646:2003 character set. Each of the graphic_characters has a corresponding character_literal in Wide_Wide_Character. The first 65536 values of Wide_Wide_Character have the same character_literal or language-defined name as defined for Wide_Character.
In types Wide_Character and Wide_Wide_Character, the characters whose code positions are 16#FFFE# and 16#FFFF# are assigned the language-defined names FFFE and FFFF. The other characters whose code position is larger than 16#FF# and which are not graphic_characters have language-defined names which are formed by appending to the string "Character_" the representation of their code position in hexadecimal as eight extended digits. As with other language-defined names, these names are usable only with the attributes (Wide_)Wide_Image and (Wide_)Wide_Value; they are not usable as enumeration literals.
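The naming rule can be sketched as a function from code position to name. This is an illustration in Python under several assumptions: the function names are hypothetical, the hexadecimal digits are written in upper case (the wording does not say which case is used), only code positions up to 16#10FFFF# are handled because that is the range covered by Python's unicodedata, and positions up to 16#FF# are left to the names already defined for Character.

```python
import unicodedata

FORMAT_EFFECTOR_POSITIONS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}
# Categories excluded from graphic_character, including the separator_line
# and separator_paragraph members of format_effector.
NON_GRAPHIC_CATEGORIES = {"Cc", "Co", "Cs", "Cf", "Zl", "Zp"}


def is_graphic_character(code_position):
    """Model of graphic_character as defined in 2.1."""
    if code_position in (0xFFFE, 0xFFFF):
        return False
    if code_position in FORMAT_EFFECTOR_POSITIONS:
        return False
    category = unicodedata.category(chr(code_position))
    return category not in NON_GRAPHIC_CATEGORIES


def language_defined_name(code_position):
    """Return the name described above, or None when the rule does not
    apply (graphic characters and positions up to 16#FF#)."""
    if not 0 <= code_position <= 0x10FFFF:
        raise ValueError("code position outside the range of this sketch")
    if code_position in (0xFFFE, 0xFFFF):
        return f"{code_position:04X}"
    if code_position > 0xFF and not is_graphic_character(code_position):
        # "Character_" followed by eight extended digits.
        return f"Character_{code_position:08X}"
    return None
```

For example, the first private use position 16#E000# gets a name formed from "Character_" and the eight digits 0000E000, whereas a graphic character such as CYRILLIC CAPITAL LETTER ZHE has a character_literal and therefore no such name.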
In 3.5.2(4) change:
... Character [and Wide_Character]{, Wide_Character, and Wide_Wide_Character} ...
Delete 3.5.2(5).
Replace 3.6.3(2) by:
There are three predefined string types, String, Wide_String, and Wide_Wide_String, each indexed by the value of the predefined subtype Positive; these are declared in the visible part of package Standard:
Replace 3.6.3(4) by:
type String is array (Positive range <>) of Character;
type Wide_String is array (Positive range <>) of Wide_Character;
type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
Fix the list in A(2/1).
Add in the middle of A.1(36)
-- The declaration of type Wide_Wide_Character is based on the full
-- ISO/IEC 10646:2003 character set. The first 65536 positions have the
-- same contents as type Wide_Character. See 3.5.2.
type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF, ...);
Add after A.1(42):
type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
pragma Pack (Wide_Wide_String);
-- The predefined operators for this type correspond to those for String.
Replace the beginning of A.1(49) by:
In each of the types Character [and Wide_Character]{, Wide_Character, and Wide_Wide_Character} ...
In A.3(1) change:
... with Wide_Character {and Wide_Wide_Character} data ...
In A.3.2(13) change:
... between {Wide_Wide_Character, } Wide_Character{,} ...
Add after A.3.2(14):
function Is_Character (Item : in Wide_Wide_Character) return Boolean;
function Is_String (Item : in Wide_Wide_String) return Boolean;
function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean;
function Is_Wide_String (Item : in Wide_Wide_String) return Boolean;
Add after A.3.2(16):
function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character;
function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String;
Add after A.3.2(18):
function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character;
function To_Wide_String (Item : in Wide_Wide_String; Substitute : in Wide_Character := ' ') return Wide_String;
function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character;
function To_Wide_Wide_String (Item : in String) return Wide_Wide_String;
function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character;
function To_Wide_Wide_String (Item : in Wide_String) return Wide_Wide_String;
Replace A.3.2(42-48) by:
The following functions test Wide_Wide_Character or Wide_Character values for membership in Wide_Character or Character, or convert between corresponding characters of Wide_Wide_Character, Wide_Character, and Character.
function Is_Character (Item : in Wide_Character) return Boolean;
Returns True if Wide_Character'Pos(Item) <= Character'Pos(Character'Last).
function Is_Character (Item : in Wide_Wide_Character) return Boolean;
Returns True if Wide_Wide_Character'Pos(Item) <= Character'Pos(Character'Last).
function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean;
Returns True if Wide_Wide_Character'Pos(Item) <= Wide_Character'Pos(Wide_Character'Last).
function Is_String (Item : in Wide_String) return Boolean;
function Is_String (Item : in Wide_Wide_String) return Boolean;
Returns True if Is_Character(Item(I)) is True for each I in Item'Range.
function Is_Wide_String (Item : in Wide_Wide_String) return Boolean;
Returns True if Is_Wide_Character(Item(I)) is True for each I in Item'Range.
function To_Character (Item : in Wide_Character; Substitute : in Character := ' ') return Character;
function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character;
Returns the Character corresponding to Item if Is_Character(Item), and returns the Substitute Character otherwise.
function To_Wide_Character (Item : in Character) return Wide_Character;
Returns the Wide_Character X such that Character'Pos(Item) = Wide_Character'Pos (X).
function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character;
Returns the Wide_Character corresponding to Item if Is_Wide_Character(Item), and returns the Substitute Wide_Character otherwise.
function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character;
Returns the Wide_Wide_Character X such that Character'Pos(Item) = Wide_Wide_Character'Pos (X).
function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character;
Returns the Wide_Wide_Character X such that Wide_Character'Pos(Item) = Wide_Wide_Character'Pos (X).
function To_String (Item : in Wide_String; Substitute : in Character := ' ') return String;
function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String;
Returns the String whose range is 1..Item'Length and each of whose elements is given by To_Character of the corresponding element in Item.
function To_Wide_String (Item : in String) return Wide_String;
Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item.
function To_Wide_String (Item : in Wide_Wide_String; Substitute : in Wide_Character := ' ') return Wide_String;
Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item with the given Substitute Wide_Character.
function To_Wide_Wide_String (Item : in String) return Wide_Wide_String;
function To_Wide_Wide_String (Item : in Wide_String) return Wide_Wide_String;
Returns the Wide_Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Wide_Character of the corresponding element in Item.
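The membership tests and narrowing conversions described above amount to comparisons on code positions. The following Python model is a sketch of those semantics only; it is not the Ada package, the lower-case function names are assumptions of this example, and a single Python str stands in for all three Ada string types.

```python
CHARACTER_LAST = 0xFF         # Character'Pos (Character'Last)
WIDE_CHARACTER_LAST = 0xFFFF  # Wide_Character'Pos (Wide_Character'Last)


def is_character(item):
    """Model of Is_Character: the position fits in Character."""
    return ord(item) <= CHARACTER_LAST


def is_wide_character(item):
    """Model of Is_Wide_Character: the position fits in Wide_Character."""
    return ord(item) <= WIDE_CHARACTER_LAST


def to_character(item, substitute=" "):
    """Model of To_Character: Item if it fits, Substitute otherwise."""
    return item if is_character(item) else substitute


def to_wide_character(item, substitute=" "):
    """Model of To_Wide_Character for a Wide_Wide_Character Item."""
    return item if is_wide_character(item) else substitute


def is_string(item):
    """Model of Is_String: every element satisfies Is_Character."""
    return all(is_character(ch) for ch in item)


def is_wide_string(item):
    """Model of Is_Wide_String: every element satisfies Is_Wide_Character."""
    return all(is_wide_character(ch) for ch in item)


def to_string(item, substitute=" "):
    """Model of To_String: element-wise To_Character."""
    return "".join(to_character(ch, substitute) for ch in item)


def to_wide_string(item, substitute=" "):
    """Model of To_Wide_String: element-wise To_Wide_Character."""
    return "".join(to_wide_character(ch, substitute) for ch in item)
```

The widening conversions (To_Wide_Character from Character, and both forms of To_Wide_Wide_Character) need no model here because they preserve the code position and cannot fail.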
Delete A.3.2(49).
In A.4(1) change:
... [both] String [and Wide_String]{, Wide_String, and Wide_Wide_String} ...
Add after A.4.1(4):
Wide_Wide_Space : constant Wide_Wide_Character := ' ';
Add after A.4.7 a new section, A.4.8:
A.4.8 Wide_Wide_String Handling
Facilities for handling strings of Wide_Wide_Character components are found in the packages Strings.Wide_Wide_Maps, Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, Strings.Wide_Wide_Unbounded, and Strings.Wide_Wide_Maps.Wide_Wide_Constants. They provide the same string-handling operations as the corresponding packages for strings of Character components.
Static Semantics
The package Strings.Wide_Wide_Maps has the following declaration.
package Ada.Strings.Wide_Wide_Maps is
   pragma Preelaborate(Wide_Wide_Maps);

   -- Representation for a set of Wide_Wide_Character values:
   type Wide_Wide_Character_Set is private;

   Null_Set : constant Wide_Wide_Character_Set;

   type Wide_Wide_Character_Range is
      record
         Low  : Wide_Wide_Character;
         High : Wide_Wide_Character;
      end record;
   -- Represents Wide_Wide_Character range Low..High

   type Wide_Wide_Character_Ranges is array (Positive range <>) of Wide_Wide_Character_Range;

   function To_Set (Ranges : in Wide_Wide_Character_Ranges) return Wide_Wide_Character_Set;
   function To_Set (Span : in Wide_Wide_Character_Range) return Wide_Wide_Character_Set;
   function To_Ranges (Set : in Wide_Wide_Character_Set) return Wide_Wide_Character_Ranges;

   function "=" (Left, Right : in Wide_Wide_Character_Set) return Boolean;
   function "not" (Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "and" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "or" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "xor" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "-" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;

   function Is_In (Element : in Wide_Wide_Character; Set : in Wide_Wide_Character_Set) return Boolean;
   function Is_Subset (Elements : in Wide_Wide_Character_Set; Set : in Wide_Wide_Character_Set) return Boolean;
   function "<=" (Left : in Wide_Wide_Character_Set; Right : in Wide_Wide_Character_Set) return Boolean renames Is_Subset;

   -- Alternative representation for a set of Wide_Wide_Character values:
   subtype Wide_Wide_Character_Sequence is Wide_Wide_String;

   function To_Set (Sequence : in Wide_Wide_Character_Sequence) return Wide_Wide_Character_Set;
   function To_Set (Singleton : in Wide_Wide_Character) return Wide_Wide_Character_Set;
   function To_Sequence (Set : in Wide_Wide_Character_Set) return Wide_Wide_Character_Sequence;

   -- Representation for a Wide_Wide_Character to Wide_Wide_Character
   -- mapping:
   type Wide_Wide_Character_Mapping is private;

   function Value (Map : in Wide_Wide_Character_Mapping; Element : in Wide_Wide_Character) return Wide_Wide_Character;

   Identity : constant Wide_Wide_Character_Mapping;

   function To_Mapping (From, To : in Wide_Wide_Character_Sequence) return Wide_Wide_Character_Mapping;
   function To_Domain (Map : in Wide_Wide_Character_Mapping) return Wide_Wide_Character_Sequence;
   function To_Range (Map : in Wide_Wide_Character_Mapping) return Wide_Wide_Character_Sequence;

   type Wide_Wide_Character_Mapping_Function is
      access function (From : in Wide_Wide_Character) return Wide_Wide_Character;

private
   ... -- not specified by the language
end Ada.Strings.Wide_Wide_Maps;
The context clause for each of the packages Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, and Strings.Wide_Wide_Unbounded identifies Strings.Wide_Wide_Maps instead of Strings.Maps.
For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and Strings.Maps.Constants the corresponding wide wide string package has the same contents except that:
o Wide_Wide_Space replaces Space
o Wide_Wide_Character replaces Character
o Wide_Wide_String replaces String
o Wide_Wide_Character_Set replaces Character_Set
o Wide_Wide_Character_Mapping replaces Character_Mapping
o Wide_Wide_Character_Mapping_Function replaces Character_Mapping_Function
o Wide_Wide_Maps replaces Maps
o Bounded_Wide_Wide_String replaces Bounded_String
o Null_Bounded_Wide_Wide_String replaces Null_Bounded_String
o To_Bounded_Wide_Wide_String replaces To_Bounded_String
o To_Wide_Wide_String replaces To_String
o Unbounded_Wide_Wide_String replaces Unbounded_String
o Null_Unbounded_Wide_Wide_String replaces Null_Unbounded_String
o Wide_Wide_String_Access replaces String_Access
o To_Unbounded_Wide_Wide_String replaces To_Unbounded_String
The following additional declarations are present in Strings.Wide_Wide_Maps.Wide_Wide_Constants:
Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
-- Contains each Wide_Wide_Character value WWC such that
-- Characters.Handling.Is_Character(WWC) is True
Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
-- Contains each Wide_Wide_Character value WWC such that
-- Characters.Handling.Is_Wide_Character (WWC) is True
[Author's note: the preceding comment is missing ".Handling" in A.4.7(46).]
NOTES If a null Wide_Wide_Character_Mapping_Function is passed to any of the Wide_Wide_String handling subprograms, Constraint_Error is propagated.
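As an illustration only (not part of the wording), the set and mapping operations of Strings.Wide_Wide_Maps might be used as sketched below. The procedure name Maps_Sketch, the code-point range chosen for the set, and the sample strings are assumptions made for the example.

```ada
with Ada.Strings.Wide_Wide_Maps; use Ada.Strings.Wide_Wide_Maps;
with Ada.Strings.Wide_Wide_Fixed;
with Ada.Text_IO;

procedure Maps_Sketch is
   --  A set built from a single range of code positions
   --  (16#0370# .. 16#03FF#, chosen purely for illustration).
   Sample_Set : constant Wide_Wide_Character_Set :=
     To_Set (Wide_Wide_Character_Range'
               (Low  => Wide_Wide_Character'Val (16#0370#),
                High => Wide_Wide_Character'Val (16#03FF#)));

   Inside : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#03A9#);

   --  A mapping that sends 'a' to 'x' and 'b' to 'y'; every other
   --  character maps to itself.
   Map : constant Wide_Wide_Character_Mapping :=
     To_Mapping (From => "ab", To => "xy");

   Translated : constant Wide_Wide_String :=
     Ada.Strings.Wide_Wide_Fixed.Translate ("abc", Map);
begin
   Ada.Text_IO.Put_Line (Boolean'Image (Is_In (Inside, Sample_Set)));  --  TRUE
   Ada.Text_IO.Put_Line (Boolean'Image (Is_In ('A', Sample_Set)));     --  FALSE
   Ada.Text_IO.Put_Line (Boolean'Image (Value (Map, 'a') = 'x'));      --  TRUE
   Ada.Text_IO.Put_Line (Boolean'Image (Translated = "xyc"));          --  TRUE
end Maps_Sketch;
```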
In A.6(1) change:
... packages Text_IO [and Wide_Text_IO]{, Wide_Text_IO, and Wide_Wide_Text_IO} ...
In A.7(4) change:
... data, [and] Wide_Text_IO for Wide_Character and Wide_String data {, and Wide_Wide_Text_IO for Wide_Wide_Character and Wide_Wide_String data} ...
In A.7(10) change:
... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, and Stream_IO ...
In A.7(13) change:
... Direct_IO, Text_IO [and Wide_Text_IO]{, Wide_Text_IO, and Wide_Wide_Text_IO} ...
In A.7(15) change:
... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, and Stream_IO ...
Replace A.11 by:
A.11 Wide Text Input-Output and Wide Wide Text Input-Output
The packages Wide_Text_IO and Wide_Wide_Text_IO provide facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters (or wide wide characters) grouped into lines, and as a sequence of lines grouped into pages.
Static Semantics
The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String. Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO.
The specification of package Wide_Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Wide_Character, and any occurrence of String is replaced by Wide_Wide_String. Nongeneric equivalents of Wide_Wide_Text_IO.Integer_IO and Wide_Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Wide_Text_IO, Ada.Long_Integer_Wide_Wide_Text_IO, Ada.Float_Wide_Wide_Text_IO, Ada.Long_Float_Wide_Wide_Text_IO.
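A minimal sketch of how the proposed package might be used, offered as an illustration rather than as wording. The procedure name WWIO_Sketch is hypothetical, and the external encoding of the character written is implementation defined, so no particular output is claimed.

```ada
with Ada.Wide_Wide_Text_IO; use Ada.Wide_Wide_Text_IO;

procedure WWIO_Sketch is
   --  A character outside the range of Wide_Character would work equally
   --  well; 16#03A9# is used only as an example value.
   Sample : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#03A9#);

   --  The generic child units mirror those of Text_IO.
   package Int_IO is new Integer_IO (Integer);
begin
   Put ("Sample character: ");   --  literal resolves to Wide_Wide_String
   Put (Sample);                 --  Put for a single Wide_Wide_Character
   New_Line;
   Int_IO.Put (42, Width => 0);  --  numeric output works as in Text_IO
   New_Line;
end WWIO_Sketch;
```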
In A.12(1) change:
... Text_IO.Text_Streams [and Wide_Text_IO.Text_Streams]{, Wide_Text_IO.Text_Streams, and Wide_Wide_Text_IO.Text_Streams} ...
Add a new section after A.12.3:
A.12.4 The Package Wide_Wide_Text_IO.Text_Streams
The package Wide_Wide_Text_IO.Text_Streams provides a function for treating a wide wide text file as a stream.
Static Semantics
The library package Wide_Wide_Text_IO.Text_Streams has the following declaration:
with Ada.Streams;
package Ada.Wide_Wide_Text_IO.Text_Streams is
   type Stream_Access is access all Streams.Root_Stream_Type'Class;

   function Stream (File : in File_Type) return Stream_Access;
end Ada.Wide_Wide_Text_IO.Text_Streams;
The Stream function has the same effect as the corresponding function in Streams.Stream_IO.
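For illustration only, the Stream function might be used as in the sketch below to write stream elements to a wide wide text file. The procedure name Stream_Sketch is hypothetical; the sketch assumes an implementation that already provides the package.

```ada
with Ada.Wide_Wide_Text_IO.Text_Streams;

procedure Stream_Sketch is
   --  Obtain a stream view of the standard output file.
   Out_Stream : constant Ada.Wide_Wide_Text_IO.Text_Streams.Stream_Access :=
     Ada.Wide_Wide_Text_IO.Text_Streams.Stream
       (Ada.Wide_Wide_Text_IO.Standard_Output);
begin
   --  String'Write sends the characters through the stream, bypassing
   --  the Put procedures of Wide_Wide_Text_IO.
   String'Write (Out_Stream, "written through the stream" & Character'Val (10));
end Stream_Sketch;
```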
Add after B.3(39):
-- ISO/IEC 10646:2003 compatible types defined by SC22/WG14 document N1010.
type char16_t is <implementation-defined character type>;
char16_nul : constant char16_t := implementation-defined;
function To_C (Item : in Wide_Character) return char16_t; function To_Ada (Item : in char16_t) return Wide_Character;
type char16_array is array (size_t range <>) of aliased char16_t;
pragma Pack(char16_array);
function Is_Nul_Terminated (Item : in char16_array) return Boolean;
function To_C (Item : in Wide_String; Append_Nul : in Boolean := True) return char16_array;
function To_Ada (Item : in char16_array; Trim_Nul : in Boolean := True) return Wide_String;
procedure To_C (Item : in Wide_String; Target : out char16_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char16_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
type char32_t is <implementation-defined character type>;
char32_nul : constant char32_t := implementation-defined;
function To_C (Item : in Wide_Wide_Character) return char32_t; function To_Ada (Item : in char32_t) return Wide_Wide_Character;
type char32_array is array (size_t range <>) of aliased char32_t;
pragma Pack(char32_array);
function Is_Nul_Terminated (Item : in char32_array) return Boolean; function To_C (Item : in Wide_Wide_String; Append_Nul : in Boolean := True) return char32_array;
function To_Ada (Item : in char32_array; Trim_Nul : in Boolean := True) return Wide_Wide_String;
procedure To_C (Item : in Wide_Wide_String; Target : out char32_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char32_array; Target : out Wide_Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
In B.3(43) change:
The types int, short, long, unsigned, ptrdiff_t, size_t, double, char [, and wchar_t]{, wchar_t, char16_t, and char32_t} correspond respectively to the C types having the same names.
Add after B.3(60):
function Is_Nul_Terminated (Item : in char16_array) return Boolean;
The result of Is_Nul_Terminated is True if Item contains char16_nul, and is False otherwise.
function To_C (Item : in Wide_Character) return char16_t; function To_Ada (Item : in char16_t ) return Wide_Character;
To_C and To_Ada provide mappings between the Ada and C 16-bit character types.
function To_C (Item : in Wide_String; Append_Nul : in Boolean := True) return char16_array;
function To_Ada (Item : in char16_array; Trim_Nul : in Boolean := True) return Wide_String;
procedure To_C (Item : in Wide_String; Target : out char16_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char16_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
The To_C and To_Ada subprograms that convert between Wide_String and char16_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that char16_nul is used instead of nul.
function Is_Nul_Terminated (Item : in char32_array) return Boolean;
The result of Is_Nul_Terminated is True if Item contains char32_nul, and is False otherwise.
function To_C (Item : in Wide_Wide_Character) return char32_t; function To_Ada (Item : in char32_t ) return Wide_Wide_Character;
To_C and To_Ada provide the mappings between the Ada and C 32-bit character types.
function To_C (Item : in Wide_Wide_String; Append_Nul : in Boolean := True) return char32_array;
function To_Ada (Item : in char32_array; Trim_Nul : in Boolean := True) return Wide_Wide_String;
procedure To_C (Item : in Wide_Wide_String; Target : out char32_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char32_array; Target : out Wide_Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
The To_C and To_Ada subprograms that convert between Wide_Wide_String and char32_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that char32_nul is used instead of nul.
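The round trip between Wide_Wide_String and char32_array can be sketched as follows. This is an illustration, not wording; it assumes an implementation of Interfaces.C that already includes the char32_t declarations proposed above, and the procedure name Char32_Sketch is hypothetical.

```ada
with Ada.Text_IO;
with Interfaces.C; use Interfaces.C;

procedure Char32_Sketch is
   Ada_Text : constant Wide_Wide_String := "abc";

   --  By default a terminating char32_nul is appended.
   C_Text : constant char32_array := To_C (Ada_Text);

   --  Without the terminator the array holds only the three characters.
   No_Nul : constant char32_array := To_C (Ada_Text, Append_Nul => False);

   --  By default To_Ada stops at (and drops) the first char32_nul.
   Round_Trip : constant Wide_Wide_String := To_Ada (C_Text);
begin
   Ada.Text_IO.Put_Line (size_t'Image (C_Text'Length));                --  4
   Ada.Text_IO.Put_Line (Boolean'Image (Is_Nul_Terminated (C_Text)));  --  TRUE
   Ada.Text_IO.Put_Line (Boolean'Image (Is_Nul_Terminated (No_Nul)));  --  FALSE
   Ada.Text_IO.Put_Line (Boolean'Image (Round_Trip = Ada_Text));       --  TRUE
end Char32_Sketch;
```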
At the beginning of C.5(7) change:
If the pragma applies to an enumeration type, then the semantics of the Wide_Wide_Image and Wide_Wide_Value attributes are implementation defined for that type; the semantics of Image, Wide_Image, Value, and Wide_Value are still defined in terms of Wide_Wide_Image and Wide_Wide_Value...
In F(4) change:
... Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing} ...
At the beginning of F.3(1) change:
The child packages Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing}...
Add at the end of F.3(6):
... For Wide_Wide_Text_IO.Editing their types are Wide_Wide_String and Wide_Wide_Character, respectively.
In F.3(19) change:
... Text_IO.Decimal_IO [and Wide_Text_IO.Decimal_IO]{, Wide_Text_IO.Decimal_IO, and Wide_Wide_Text_IO.Decimal_IO}
In F.3(20) change:
... [both] for {all of} Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing} ...
Add a new section after F.3.4:
F.3.5 The Package Wide_Wide_Text_IO.Editing
Static Semantics
The child package Wide_Wide_Text_IO.Editing has the same contents as Text_IO.Editing, except that:
o each occurrence of Character is replaced by Wide_Wide_Character,
o each occurrence of Text_IO is replaced by Wide_Wide_Text_IO,
o the subtype of Default_Currency is Wide_Wide_String rather than String, and
o each occurrence of String in the generic package Decimal_Output is replaced by Wide_Wide_String.
NOTES
Each of the functions Wide_Wide_Text_IO.Editing.Valid, To_Picture, and Pic_String has String (versus Wide_Wide_String) as its parameter or result subtype, since a picture String is not localizable.
Add a new section after G.1.4:
G.1.5 The Package Wide_Wide_Text_IO.Complex_IO
Static Semantics
Implementations shall also provide the generic library package Wide_Wide_Text_IO.Complex_IO. Its declaration is obtained from that of Text_IO.Complex_IO by systematically replacing Text_IO by Wide_Wide_Text_IO and String by Wide_Wide_String; the description of its behavior is obtained by additionally replacing references to particular characters (commas, parentheses, etc.) by those for the corresponding wide wide characters.
In H.4(20) change:
... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, or Stream_IO ...
Fix annex K. [Author's note: I'm pretty sure it's auto-generated...]
!discussion
See proposal.
!example
An example would show identifiers using characters from the CJKV ideographs or from non-Latin alphabets (Cyrillic, Greek, Arabic, etc.). But that's hard to do in a Latin-1, plain text file...
!comment Introduction clause
!corrigendum 0.3(32)
Replace the paragraph:
An enumeration type defines an ordered set of distinct enumeration literals, for example a list of states or an alphabet of characters. The enumeration types Boolean, Character, and Wide_Character are predefined.
by:
An enumeration type defines an ordered set of distinct enumeration literals, for example a list of states or an alphabet of characters. The enumeration types Boolean, Character, Wide_Character, and Wide_Wide_Character are predefined.
!comment Introduction clause
!corrigendum 0.3(34)
Replace the paragraph:
Composite types allow definitions of structured objects with related components. The composite types in the language include arrays and records. An array is an object with indexed components of the same type. A record is an object with named components of possibly different types. Task and protected types are also forms of composite types. The array types String and Wide_String are predefined.
by:
Composite types allow definitions of structured objects with related components. The composite types in the language include arrays and records. An array is an object with indexed components of the same type. A record is an object with named components of possibly different types. Task and protected types are also forms of composite types. The array types String, Wide_String, and Wide_Wide_String are predefined.
!corrigendum 1.1.4(14)
Insert after the paragraph:
the new paragraph:
The terminals of the grammar, including reserved words, punctuation and components of lexical elements, are exclusively made of the characters whose code position is between 16#20# and 16#7E#, inclusively. For example, the character E in the definition of exponent is the character whose name is "LATIN CAPITAL LETTER E", not "GREEK CAPITAL LETTER EPSILON".
!corrigendum 2.1(1)
Replace the paragraph:
The only characters allowed outside of comments are the graphic_characters and format_effectors.
by:
The characters whose code position is 16#FFFE# or 16#FFFF# are not allowed anywhere in the text of a program. The characters in categories other_control, other_private_use, and other_surrogate are only allowed in comments.
!corrigendum 2.1(2)
Delete the paragraph:
character ::= graphic_character | format_effector | other_control_function
!corrigendum 2.1(3)
Delete the paragraph:
graphic_character ::= identifier_letter | digit | space_character | special_character
!corrigendum 2.1(4)
Replace the paragraph:
The character repertoire for the text of an Ada program consists of the collection of characters called the Basic Multilingual Plane (BMP) of the ISO 10646 Universal Multiple-Octet Coded Character Set, plus a set of format_effectors and, in comments only, a set of other_control_functions; the coded representation for these characters is implementation defined (it need not be a representation defined within ISO-10646-1).
by:
The character repertoire for the text of an Ada program consists of the collection of characters described by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. The coded representation for these characters is implementation defined (it need not be a representation defined within ISO/IEC 10646:2003).
The semantics of an Ada program whose text is not in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) are implementation defined.
!corrigendum 2.1(5)
Replace the paragraph:
The description of the language definition in this International Standard uses the graphic symbols defined for Row 00: Basic Latin and Row 00: Latin-1 Supplement of the ISO 10646 BMP; these correspond to the graphic symbols of ISO 8859-1 (Latin-1); no graphic symbols are used in this International Standard for characters outside of Row 00 of the BMP. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified.
by:
The description of the language definition in this International Standard uses the character properties General Category, Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of the documents referenced by the note in section 1 of ISO/IEC 10646:2003. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified.
!corrigendum 2.1(7)
Delete the paragraph:
identifier_letter
upper_case_identifier_letter | lower_case_identifier_letter
!corrigendum 2.1(8)
Replace the paragraph:
upper_case_identifier_letter
Any character of Row 00 of ISO 10646 BMP whose name begins ``Latin Capital Letter''.
by:
letter_uppercase
Any character whose General Category is defined to be "Letter, Uppercase".
!corrigendum 2.1(9)
Replace the paragraph:
lower_case_identifier_letter
Any character of Row 00 of ISO 10646 BMP whose name begins ``Latin Small Letter''.
by:
letter_lowercase
Any character whose General Category is defined to be "Letter, Lowercase".
letter_titlecase
Any character whose General Category is defined to be "Letter, Titlecase".
letter_modifier
Any character whose General Category is defined to be "Letter, Modifier".
letter_other
Any character whose General Category is defined to be "Letter, Other".
mark_non_spacing
Any character whose General Category is defined to be "Mark, Non-Spacing".
mark_spacing_combining
Any character whose General Category is defined to be "Mark, Spacing Combining".
!corrigendum 2.1(10)
Replace the paragraph:
digit
One of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
by:
number_decimal_digit
Any character whose General Category is defined to be "Number, Decimal Digit".
number_letter
Any character whose General Category is defined to be "Number, Letter".
!corrigendum 2.1(11)
Delete the paragraph:
space_character
The character of ISO 10646 BMP named ``Space''.
!corrigendum 2.1(12)
Replace the paragraph:
special_character
Any character of the ISO 10646 BMP that is not reserved for a control function, and is not the space_character, an identifier_letter, or a digit.
by:
other_control
Any character whose General Category is defined to be "Other, Control".
other_format
Any character whose General Category is defined to be "Other, Format".
other_private_use
Any character whose General Category is defined to be "Other, Private Use".
other_surrogate
Any character whose General Category is defined to be "Other, Surrogate".
punctuation_connector
Any character whose General Category is defined to be "Punctuation, Connector".
separator_space
Any character whose General Category is defined to be "Separator, Space".
separator_line
Any character whose General Category is defined to be "Separator, Line".
separator_paragraph
Any character whose General Category is defined to be "Separator, Paragraph".
!corrigendum 2.1(13)
Replace the paragraph:
format_effector
The control functions of ISO 6429 called character tabulation (HT), line tabulation (VT), carriage return (CR), line feed (LF), and form feed (FF).
by:

format_effector
The characters whose code position is 16#09# (CHARACTER TABULATION), 16#0A# (LINE FEED(LF)), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED(FF)), 16#0D# (CARRIAGE RETURN(CR)), 16#85# (NEXT LINE(NEL)), and the characters in categories separator_line and separator_paragraph. The names mentioned in parentheses in this list are not defined by ISO/IEC 10646:2003; they are only used for convenience in this International Standard.
!corrigendum 2.1(14)
Replace the paragraph:
other_control_function
Any control function, other than a format_effector, that is allowed in a comment; the set of other_control_functions allowed in comments is implementation defined.
by:
graphic_character
Any character which is not in the categories other_control, other_private_use, other_surrogate, other_format, format_effector, and whose code position is neither 16#FFFE# nor 16#FFFF#.
!corrigendum 2.1(15)
Replace the paragraph:
The following names are used when referring to certain special_characters:
by:
The following names are used when referring to certain characters (the first name is that given in ISO/IEC 10646:2003):
!comment I'm not going to try to update the table here, as it would be very
!comment difficult to format properly. Moreover, there is nothing important
!comment wrong with it. I'll make the suggested changes as editorial
!comment corrections in the Standard.
!corrigendum 2.1(16)
Delete the paragraph:
In a nonstandard mode, the implementation may support a different character repertoire; in particular, the set of characters that are considered identifier_letters can be extended or changed to conform to local conventions.
!corrigendum 2.1(17)
Delete the paragraph:
1 Every code position of ISO 10646 BMP that is not reserved for a control function is defined to be a graphic_character by this International Standard. This includes all code positions other than 0000 - 001F, 007F - 009F, and FFFE - FFFF.
!corrigendum 2.2(3)
Replace the paragraph:
In some cases an explicit separator is required to separate adjacent lexical elements. A separator is any of a space character, a format effector, or the end of a line, as follows:
by:
In some cases an explicit separator is required to separate adjacent lexical elements. A separator is any of a separator_space, a format_effector or the end of a line, as follows:
!corrigendum 2.2(4)
Replace the paragraph:
by:
!corrigendum 2.2(5)
Replace the paragraph:
by:
!corrigendum 2.2(8)
Replace the paragraph:
A delimiter is either one of the following special characters:
by:
A delimiter is either one of the following characters:
!corrigendum 2.3(02)
Replace the paragraph:
identifier ::= identifier_letter {[underline] letter_or_digit}
by:
identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter identifier_extend ::= identifier_start | mark_non_spacing | mark_spacing_combining | number_decimal_digit | other_format identifier ::= identifier_start {[punctuation_connector] identifier_extend}
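The shape of the new identifier syntax can be sketched with a small recognizer. This is an illustration only and makes a strong simplifying assumption: the character categories are approximated with ASCII (letters for identifier_start, letters and digits for identifier_extend, and '_' as the only punctuation_connector). A real implementation would consult the General Category data referenced by ISO/IEC 10646:2003. All names in the sketch are hypothetical.

```ada
with Ada.Text_IO;

procedure Identifier_Sketch is

   --  ASCII-only stand-in for identifier_start (assumption).
   function Is_Start (C : Character) return Boolean is
   begin
      return C in 'A' .. 'Z' or else C in 'a' .. 'z';
   end Is_Start;

   --  ASCII-only stand-in for identifier_extend (assumption).
   function Is_Extend (C : Character) return Boolean is
   begin
      return Is_Start (C) or else C in '0' .. '9';
   end Is_Extend;

   --  identifier ::= identifier_start {[punctuation_connector] identifier_extend}
   function Is_Identifier (S : String) return Boolean is
      After_Connector : Boolean := False;
   begin
      if S'Length = 0 or else not Is_Start (S (S'First)) then
         return False;
      end if;
      for I in S'First + 1 .. S'Last loop
         if S (I) = '_' then
            if After_Connector then
               return False;          --  two connectors in a row
            end if;
            After_Connector := True;
         elsif Is_Extend (S (I)) then
            After_Connector := False;
         else
            return False;             --  character not allowed here
         end if;
      end loop;
      return not After_Connector;     --  a trailing connector is not allowed
   end Is_Identifier;

   procedure Show (S : String) is
   begin
      Ada.Text_IO.Put_Line (S & " : " & Boolean'Image (Is_Identifier (S)));
   end Show;

begin
   Show ("Max_Value_2");  --  TRUE
   Show ("2nd");          --  FALSE: starts with a digit
   Show ("A__B");         --  FALSE: adjacent connectors
   Show ("Tail_");        --  FALSE: ends with a connector
end Identifier_Sketch;
```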
!corrigendum 2.3(03)
Delete the paragraph:
letter_or_digit ::= identifier_letter | digit
!corrigendum 2.3(05)
Replace the paragraph:
All characters of an identifier are significant, including any underline character. Identifiers differing only in the use of corresponding upper and lower case letters are considered the same.
by:
Two identifiers are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order):
!corrigendum 2.6(06)
Insert after the paragraph:
A null string literal is a string_literal with no string_elements between the quotation marks.
the new paragraph:
No modification is performed on the sequence of characters in a string_literal.
!corrigendum 3.5(28)
Replace the paragraph:
S'Wide_Image
S'Wide_Image denotes a function with the following specification:
by:
S'Wide_Wide_Image
S'Wide_Wide_Image denotes a function with the following specification:
!corrigendum 3.5(29)
Replace the paragraph:
function S'Wide_Image(Arg : S'Base) return Wide_String
by:
function S'Wide_Wide_Image(Arg : S'Base) return Wide_Wide_String
!corrigendum 3.5(34)
Insert after the paragraph:
The image of a fixed point value is a decimal real literal best approximating the value (rounded away from zero if halfway between) with a single leading character that is either a minus sign or a space, one or more digits before the decimal point (with no redundant leading zeros), a decimal point, and S'Aft (see 3.5.10) digits after the decimal point.
the new paragraphs:
S'Wide_Image
S'Wide_Image denotes a function with the following specification:
function S'Wide_Image(Arg : S'Base) return Wide_String
The function returns an image of the value of Arg as a Wide_String. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Wide_Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).
!corrigendum 3.5(37)
Replace the paragraph:
The function returns an image of the value of Arg as a String. The lower bound of the result is one. The image has the same sequence of graphic characters as that defined for S'Wide_Image if all the graphic characters are defined in Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Image for the same value of Arg).
by:
The function returns an image of the value of Arg as a String. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).
S'Wide_Wide_Width
S'Wide_Wide_Width denotes the maximum length of a Wide_Wide_String returned by S'Wide_Wide_Image over all the values of S. It denotes zero for a subtype that has a null range. Its type is universal_integer.
!corrigendum 3.5(40)
Replace the paragraph:
S'Wide_Value
S'Wide_Value denotes a function with the following specification:
by:
S'Wide_Wide_Value
S'Wide_Wide_Value denotes a function with the following specification:
!corrigendum 3.5(41)
Replace the paragraph:
function S'Wide_Value(Arg : Wide_String) return S'Base
by:
function S'Wide_Wide_Value(Arg : Wide_Wide_String) return S'Base
!corrigendum 3.5(42)
Replace the paragraph:
This function returns a value given an image of the value as a Wide_String, ignoring any leading or trailing spaces.
by:
This function returns a value given an image of the value as a Wide_Wide_String, ignoring any leading or trailing spaces.
!corrigendum 3.5(43)
Replace the paragraph:
For the evaluation of a call on S'Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Image for a nongraphic character of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised.
by:
For the evaluation of a call on S'Wide_Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Wide_Image for a nongraphic character of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised.
!corrigendum 3.5(44)
Replace the paragraph:
For the evaluation of a call on S'Wide_Value (or S'Value) for an integer subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an integer literal, with an optional leading sign character (plus or minus for a signed type; only plus for a modular type), and the corresponding numeric value belongs to the base range of the type of S, then that value is the result; otherwise Constraint_Error is raised.
by:
For the evaluation of a call on S'Wide_Wide_Value for an integer subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an integer literal, with an optional leading sign character (plus or minus for a signed type; only plus for a modular type), and the corresponding numeric value belongs to the base range of the type of S, then that value is the result; otherwise Constraint_Error is raised.
!corrigendum 3.5(45)
Replace the paragraph:
For the evaluation of a call on S'Wide_Value (or S'Value) for a real subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of one of the following:
by:
For the evaluation of a call on S'Wide_Wide_Value for a real subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of one of the following:
!corrigendum 3.5(51)
Insert after the paragraph:
with an optional leading sign character (plus or minus), and if the corresponding numeric value belongs to the base range of the type of S, then that value is the result; otherwise Constraint_Error is raised. The sign of a zero value is preserved (positive if none has been specified) if S'Signed_Zeros is True.
the new paragraphs:
S'Wide_Value
S'Wide_Value denotes a function with the following specification:
function S'Wide_Value(Arg : Wide_String) return S'Base
This function returns a value given an image of the value as a Wide_String, ignoring any leading or trailing spaces. For the evaluation of a call on S'Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Image for a value of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised. For a numeric subtype S, the evaluation of a call on S'Wide_Value with Arg of type Wide_String is equivalent to a call on S'Wide_Wide_Value for a corresponding Arg of type Wide_Wide_String.
!corrigendum 3.5(55)
Replace the paragraph:
For the evaluation of a call on S'Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Image for a value of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised. For a numeric subtype S, the evaluation of a call on S'Value with Arg of type String is equivalent to a call on S'Wide_Value for a corresponding Arg of type Wide_String.
by:
For the evaluation of a call on S'Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Image for a value of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised. For a numeric subtype S, the evaluation of a call on S'Value with Arg of type String is equivalent to a call on S'Wide_Wide_Value for a corresponding Arg of type Wide_Wide_String.
!corrigendum 3.5(56)
Replace the paragraph:
An implementation may extend the Wide_Value, Value, Wide_Image, and Image attributes of a floating point type to support special values such as infinities and NaNs.
by:
An implementation may extend the Wide_Wide_Value, Wide_Value, Value, Wide_Wide_Image, Wide_Image, and Image attributes of a floating point type to support special values such as infinities and NaNs.
!corrigendum 3.5(59)
Replace the paragraph:
21 For any value V (including any nongraphic character) of an enumeration subtype S, S'Value(S'Image(V)) equals V, as does S'Wide_Value(S'Wide_Image(V)). Neither expression ever raises Constraint_Error.
by:
21 For any value V (including any nongraphic character) of an enumeration subtype S, S'Value(S'Image(V)) equals V, as do S'Wide_Value(S'Wide_Image(V)) and S'Wide_Wide_Value(S'Wide_Wide_Image(V)). None of these expressions ever raises Constraint_Error.
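As an illustration only (not part of the wording), the round-trip property in this note can be sketched as follows. The enumeration type Colour and the procedure name are hypothetical, and the sketch assumes an implementation that already provides the Wide_Wide_Image and Wide_Wide_Value attributes.

```ada
with Ada.Wide_Wide_Text_IO;

procedure Round_Trip_Sketch is
   --  Hypothetical enumeration type, used only for illustration.
   type Colour is (Red, Green, Blue);
begin
   for V in Colour loop
      --  Each Value/Image pair is expected to reproduce V.
      pragma Assert (Colour'Value (Colour'Image (V)) = V);
      pragma Assert (Colour'Wide_Value (Colour'Wide_Image (V)) = V);
      pragma Assert
        (Colour'Wide_Wide_Value (Colour'Wide_Wide_Image (V)) = V);
      Ada.Wide_Wide_Text_IO.Put_Line (Colour'Wide_Wide_Image (V));
   end loop;
end Round_Trip_Sketch;
```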
!corrigendum 3.5.2(2)
Replace the paragraph:
The predefined type Character is a character type whose values correspond to the 256 code positions of Row 00 (also known as Latin-1) of the ISO 10646 Basic Multilingual Plane (BMP). Each of the graphic characters of Row 00 of the BMP has a corresponding character_literal in Character. Each of the nongraphic positions of Row 00 (0000-001F and 007F-009F) has a corresponding language-defined name, which is not usable as an enumeration literal, but which is usable with the attributes (Wide_)Image and (Wide_)Value; these names are given in the definition of type Character in A.1, ``The Package Standard'', but are set in italics.
by:
The predefined type Character is a character type whose values correspond to the 256 code positions of Row 00 (also known as Latin-1) of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of Row 00 of the BMP has a corresponding character_literal in Character. Each of the nongraphic positions of Row 00 (0000-001F and 007F-009F) has a corresponding language-defined name, which is not usable as an enumeration literal, but which is usable with the attributes Image, Wide_Image, Wide_Wide_Image, Value, Wide_Value, and Wide_Wide_Value; these names are given in the definition of type Character in A.1, ``The Package Standard'', but are set in italics.
!corrigendum 3.5.2(3)
Replace the paragraph:
The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO 10646 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding character_literal in Wide_Character. The first 256 values of Wide_Character have the same character_literal or language-defined name as defined for Character. The last 2 values of Wide_Character correspond to the nongraphic positions FFFE and FFFF of the BMP, and are assigned the language-defined names FFFE and FFFF. As with the other language-defined names for nongraphic characters, the names FFFE and FFFF are usable only with the attributes (Wide_)Image and (Wide_)Value; they are not usable as enumeration literals. All other values of Wide_Character are considered graphic characters, and have a corresponding character_literal.
by:
The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding character_literal in Wide_Character. The first 256 values of Wide_Character have the same character_literal or language-defined name as defined for Character. Each of the graphic_characters has a corresponding character_literal.
The predefined type Wide_Wide_Character is a character type whose values correspond to the 2147483648 code positions of the ISO/IEC 10646:2003 character set. Each of the graphic_characters has a corresponding character_literal in Wide_Wide_Character. The first 65536 values of Wide_Wide_Character have the same character_literal or language-defined name as defined for Wide_Character.
In types Wide_Character and Wide_Wide_Character, the characters whose code positions are 16#FFFE# and 16#FFFF# are assigned the language-defined names FFFE and FFFF. The other characters whose code position is larger than 16#FF# and which are not graphic_characters have language-defined names which are formed by appending to the string "Character_" the representation of their code position in hexadecimal as eight extended digits. As with other language-defined names, these names are usable only with the attributes (Wide_)Wide_Image and (Wide_)Wide_Value; they are not usable as enumeration literals.
!corrigendum 3.5.2(4)
Replace the paragraph:
In a nonstandard mode, an implementation may provide other interpretations for the predefined types Character and Wide_Character, to conform to local conventions.
by:
In a nonstandard mode, an implementation may provide other interpretations for the predefined types Character, Wide_Character, and Wide_Wide_Character to conform to local conventions.
!corrigendum 3.5.2(5)
Delete the paragraph:
If an implementation supports a mode with alternative interpretations for Character and Wide_Character, the set of graphic characters of Character should nevertheless remain a proper subset of the set of graphic characters of Wide_Character. Any character set ``localizations'' should be reflected in the results of the subprograms defined in the language-defined package Characters.Handling (see A.3) available in such a mode. In a mode with an alternative interpretation of Character, the implementation should also support a corresponding change in what is a legal identifier_letter.
!corrigendum 3.6.3(2)
Replace the paragraph:
There are two predefined string types, String and Wide_String, each indexed by values of the predefined subtype Positive; these are declared in the visible part of package Standard:
by:
There are three predefined string types, String, Wide_String, and Wide_Wide_String, each indexed by values of the predefined subtype Positive; these are declared in the visible part of package Standard:
!corrigendum 3.6.3(4)
Replace the paragraph:
type String is array (Positive range <>) of Character;
type Wide_String is array (Positive range <>) of Wide_Character;
by:
type String is array (Positive range <>) of Character;
type Wide_String is array (Positive range <>) of Wide_Character;
type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
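As an illustration only (not part of the wording), the following sketch declares one object of each predefined string type. The procedure and object names are hypothetical.

```ada
procedure String_Types_Sketch is
   S   : constant String           := "Ada";
   WS  : constant Wide_String      := "Ada";
   WWS : constant Wide_Wide_String := "Ada";
begin
   --  All three types are indexed by Positive, so the bounds of a
   --  string literal start at 1.
   pragma Assert (S'First = 1 and WS'First = 1 and WWS'First = 1);
   pragma Assert (WWS'Length = 3);
   --  The first 256 positions agree across the character types.
   pragma Assert
     (Wide_Wide_Character'Pos (WWS (1)) = Character'Pos (S (1)));
   null;
end String_Types_Sketch;
```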
!corrigendum A.1(36)
Replace the paragraph:
-- The predefined operators for the type Character are the same as for
-- any enumeration type.
-- The declaration of type Wide_Character is based on the standard ISO 10646 BMP character set.
-- The first 256 positions have the same contents as type Character. See 3.5.2.
type Wide_Character is (nul, soh ... FFFE, FFFF);
package ASCII is ... end ASCII; --Obsolescent; see J.5
by:
-- The predefined operators for the type Character are the same as for
-- any enumeration type.
-- The declaration of type Wide_Character is based on the standard ISO 10646 BMP character set.
-- The first 256 positions have the same contents as type Character. See 3.5.2.
type Wide_Character is (nul, soh ... FFFE, FFFF);
-- The declaration of type Wide_Wide_Character is based on the full
-- ISO/IEC 10646:2003 character set. The first 65536 positions have the
-- same contents as type Wide_Character. See 3.5.2.
type Wide_Wide_Character is (nul, soh ... FFFE, FFFF, ...);
package ASCII is ... end ASCII; --Obsolescent; see J.5
!corrigendum A.1(42)
Insert after the paragraph:
-- The predefined operators for this type correspond to those for String
the new paragraphs:
type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
pragma Pack (Wide_Wide_String);
-- The predefined operators for this type correspond to those for String.
!corrigendum A.1(49)
Replace the paragraph:
In each of the types Character and Wide_Character, the character literals for the space character (position 32) and the non-breaking space character (position 160) correspond to different values. Unless indicated otherwise, each occurrence of the character literal ' ' in this International Standard refers to the space character. Similarly, the character literals for hyphen (position 45) and soft hyphen (position 173) correspond to different values. Unless indicated otherwise, each occurrence of the character literal '-' in this International Standard refers to the hyphen character.
by:
In each of the types Character, Wide_Character, and Wide_Wide_Character, the character literals for the space character (position 32) and the non-breaking space character (position 160) correspond to different values. Unless indicated otherwise, each occurrence of the character literal ' ' in this International Standard refers to the space character. Similarly, the character literals for hyphen (position 45) and soft hyphen (position 173) correspond to different values. Unless indicated otherwise, each occurrence of the character literal '-' in this International Standard refers to the hyphen character.
!corrigendum A.3(1)
Replace the paragraph:
This clause presents the packages related to character processing: an empty pure package Characters and child packages Characters.Handling and Characters.Latin_1. The package Characters.Handling provides classification and conversion functions for Character data, and some simple functions for dealing with Wide_Character data. The child package Characters.Latin_1 declares a set of constants initialized to values of type Character.
by:
This clause presents the packages related to character processing: an empty pure package Characters and child packages Characters.Handling and Characters.Latin_1. The package Characters.Handling provides classification and conversion functions for Character data, and some simple functions for dealing with Wide_Character and Wide_Wide_Character data. The child package Characters.Latin_1 declares a set of constants initialized to values of type Character.
!corrigendum A.3.2(13)
Replace the paragraph:
--Classifications of and conversions between Wide_Character and Character.
by:
--Classifications of and conversions between Wide_Wide_Character, Wide_Character, and Character.
!corrigendum A.3.2(14)
Insert after the paragraph:
function Is_Character (Item : in Wide_Character) return Boolean;
function Is_String (Item : in Wide_String) return Boolean;
the new paragraph:
function Is_Character (Item : in Wide_Wide_Character) return Boolean;
function Is_String (Item : in Wide_Wide_String) return Boolean;
function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean;
function Is_Wide_String (Item : in Wide_Wide_String) return Boolean;
!corrigendum A.3.2(16)
Insert after the paragraph:
function To_String (Item : in Wide_String; Substitute : in Character := ' ') return String;
the new paragraph:
function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character;
function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String;
!corrigendum A.3.2(18)
Insert after the paragraph:
function To_Wide_String (Item : in String) return Wide_String;
the new paragraphs:
function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character;
function To_Wide_String (Item : in Wide_Wide_String; Substitute : in Wide_Character := ' ') return Wide_String;
function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character;
function To_Wide_Wide_String (Item : in String) return Wide_Wide_String;
function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character;
function To_Wide_Wide_String (Item : in Wide_String) return Wide_Wide_String;
!corrigendum A.3.2(42)
Replace the paragraph:
The following set of functions test Wide_Character values for membership in Character, or convert between corresponding characters of Wide_Character and Character.
by:
The following functions test Wide_Wide_Character or Wide_Character values for membership in Wide_Character or Character, or convert between corresponding characters of Wide_Wide_Character, Wide_Character, and Character.
!corrigendum A.3.2(43)
Replace the paragraph:
Is_Character
Returns True if Wide_Character'Pos(Item) <= Character'Pos(Character'Last).
by:
function Is_Character (Item : in Wide_Character) return Boolean;
Returns True if Wide_Character'Pos(Item) <= Character'Pos(Character'Last).
function Is_Character (Item : in Wide_Wide_Character) return Boolean;
Returns True if Wide_Wide_Character'Pos(Item) <= Character'Pos(Character'Last).
function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean;
Returns True if Wide_Wide_Character'Pos(Item) <= Wide_Character'Pos(Wide_Character'Last).
!corrigendum A.3.2(44)
Replace the paragraph:
Is_String
Returns True if Is_Character(Item(I)) is True for each I in Item'Range.
by:
function Is_String (Item : in Wide_String) return Boolean;
function Is_String (Item : in Wide_Wide_String) return Boolean;
Returns True if Is_Character(Item(I)) is True for each I in Item'Range.
function Is_Wide_String (Item : in Wide_Wide_String) return Boolean;
Returns True if Is_Wide_Character(Item(I)) is True for each I in Item'Range.
!corrigendum A.3.2(45)
Replace the paragraph:
To_Character
Returns the Character corresponding to Item if Is_Character(Item), and returns the Substitute Character otherwise.
by:
function To_Character (Item : in Wide_Character; Substitute : in Character := ' ') return Character;
function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character;
Returns the Character corresponding to Item if Is_Character(Item), and returns the Substitute Character otherwise.
function To_Wide_Character (Item : in Character) return Wide_Character;
Returns the Wide_Character X such that Character'Pos(Item) = Wide_Character'Pos (X).
function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character;
Returns the Wide_Character corresponding to Item if Is_Wide_Character(Item), and returns the Substitute Wide_Character otherwise.
function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character;
Returns the Wide_Wide_Character X such that Character'Pos(Item) = Wide_Wide_Character'Pos (X).
function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character;
Returns the Wide_Wide_Character X such that Wide_Character'Pos(Item) = Wide_Wide_Character'Pos (X).
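As an illustration only (not part of the wording), the position-number rules above can be restated as ordinary Ada. The helper names Fits_In_Character, Narrow, and Widen are hypothetical local functions, not the language-defined ones; they assume only the 'Pos and 'Val attributes.

```ada
procedure Conversion_Sketch is
   --  Hypothetical helper mirroring the rule given for Is_Character.
   function Fits_In_Character
     (Item : Wide_Wide_Character) return Boolean is
   begin
      return Wide_Wide_Character'Pos (Item)
               <= Character'Pos (Character'Last);
   end Fits_In_Character;

   --  Hypothetical helper mirroring the rule given for To_Character.
   function Narrow
     (Item       : Wide_Wide_Character;
      Substitute : Character := ' ') return Character is
   begin
      if Fits_In_Character (Item) then
         return Character'Val (Wide_Wide_Character'Pos (Item));
      else
         return Substitute;
      end if;
   end Narrow;

   --  Hypothetical helper mirroring To_Wide_Wide_Character.
   function Widen (Item : Character) return Wide_Wide_Character is
   begin
      return Wide_Wide_Character'Val (Character'Pos (Item));
   end Widen;
begin
   --  Widening then narrowing preserves a Character value.
   pragma Assert (Narrow (Widen ('A')) = 'A');
   --  A code position above Character'Last yields the substitute.
   pragma Assert
     (Narrow (Wide_Wide_Character'Val (16#1_0000#), '?') = '?');
   null;
end Conversion_Sketch;
```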
!corrigendum A.3.2(46)
Replace the paragraph:
To_String
Returns the String whose range is 1..Item'Length and each of whose elements is given by To_Character of the corresponding element in Item.
by:
function To_String (Item : in Wide_String; Substitute : in Character := ' ') return String;
function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String;
Returns the String whose range is 1..Item'Length and each of whose elements is given by To_Character of the corresponding element in Item.
function To_Wide_String (Item : in String) return Wide_String;
Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item.
function To_Wide_String (Item : in Wide_Wide_String; Substitute : in Wide_Character := ' ') return Wide_String;
Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item with the given Substitute Wide_Character.
function To_Wide_Wide_String (Item : in String) return Wide_Wide_String;
function To_Wide_Wide_String (Item : in Wide_String) return Wide_Wide_String;
Returns the Wide_Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Wide_Character of the corresponding element in Item.
!corrigendum A.3.2(47)
Delete the paragraph:
To_Wide_Character
Returns the Wide_Character X such that Character'Pos(Item) = Wide_Character'Pos(X).
!corrigendum A.3.2(48)
Delete the paragraph:
To_Wide_String
Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item.


!corrigendum A.3.2(49)
Delete the paragraph:
If an implementation provides a localized definition of Character or Wide_Character, then the effects of the subprograms in Characters.Handling should reflect the localizations. See also 3.5.2.
!corrigendum A.4(1)
Replace the paragraph:
This clause presents the specifications of the package Strings and several child packages, which provide facilities for dealing with string data. Fixed-length, bounded-length, and unbounded-length strings are supported, for both String and Wide_String. The string-handling subprograms include searches for pattern strings and for characters in program-specified sets, translation (via a character-to-character mapping), and transformation (replacing, inserting, overwriting, and deleting of substrings).
by:
This clause presents the specifications of the package Strings and several child packages, which provide facilities for dealing with string data. Fixed-length, bounded-length, and unbounded-length strings are supported, for String, Wide_String, and Wide_Wide_String. The string-handling subprograms include searches for pattern strings and for characters in program-specified sets, translation (via a character-to-character mapping), and transformation (replacing, inserting, overwriting, and deleting of substrings).
!corrigendum A.4.1(4)
Replace the paragraph:
Space : constant Character := ' ';
Wide_Space : constant Wide_Character := ' ';
by:
Space : constant Character := ' ';
Wide_Space : constant Wide_Character := ' ';
Wide_Wide_Space : constant Wide_Wide_Character := ' ';
!corrigendum A.4.7(46)
Replace the paragraph:
Character_Set : constant Wide_Maps.Wide_Character_Set; -- Contains each Wide_Character value WC such that Characters.Is_Character(WC) is True
by:
Character_Set : constant Wide_Maps.Wide_Character_Set; -- Contains each Wide_Character value WC such that Characters.Handling.Is_Character(WC) is True
!comment Updated for AI-161 change.
!corrigendum A.4.8(01)
Insert new clause:
Facilities for handling strings of Wide_Wide_Character elements are found in the packages Strings.Wide_Wide_Maps, Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, Strings.Wide_Wide_Unbounded, and Strings.Wide_Wide_Maps.Wide_Wide_Constants. They provide the same string-handling operations as the corresponding packages for strings of Character elements.
Static Semantics
The package Strings.Wide_Wide_Maps has the following declaration.
package Ada.Strings.Wide_Wide_Maps is
   pragma Preelaborate(Wide_Wide_Maps);
   -- Representation for a set of Wide_Wide_Character values:
   type Wide_Wide_Character_Set is private;
   pragma Preelaborable_Initialization(Wide_Wide_Character_Set);
   Null_Set : constant Wide_Wide_Character_Set;
   type Wide_Wide_Character_Range is
      record
         Low : Wide_Wide_Character;
         High : Wide_Wide_Character;
      end record;
   -- Represents Wide_Wide_Character range Low..High
   type Wide_Wide_Character_Ranges is array (Positive range <>) of Wide_Wide_Character_Range;
   function To_Set (Ranges : in Wide_Wide_Character_Ranges) return Wide_Wide_Character_Set;
   function To_Set (Span : in Wide_Wide_Character_Range) return Wide_Wide_Character_Set;
   function To_Ranges (Set : in Wide_Wide_Character_Set) return Wide_Wide_Character_Ranges;
   function "=" (Left, Right : in Wide_Wide_Character_Set) return Boolean;
   function "not" (Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "and" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "or" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "xor" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function "-" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set;
   function Is_In (Element : in Wide_Wide_Character; Set : in Wide_Wide_Character_Set) return Boolean;
   function Is_Subset (Elements : in Wide_Wide_Character_Set; Set : in Wide_Wide_Character_Set) return Boolean;
   function "<=" (Left : in Wide_Wide_Character_Set; Right : in Wide_Wide_Character_Set) return Boolean renames Is_Subset;
   -- Alternative representation for a set of Wide_Wide_Character values:
   subtype Wide_Wide_Character_Sequence is Wide_Wide_String;
   function To_Set (Sequence : in Wide_Wide_Character_Sequence) return Wide_Wide_Character_Set;
   function To_Set (Singleton : in Wide_Wide_Character) return Wide_Wide_Character_Set;
   function To_Sequence (Set : in Wide_Wide_Character_Set) return Wide_Wide_Character_Sequence;
   -- Representation for a Wide_Wide_Character to Wide_Wide_Character
   -- mapping:
   type Wide_Wide_Character_Mapping is private;
   pragma Preelaborable_Initialization(Wide_Wide_Character_Mapping);
   function Value (Map : in Wide_Wide_Character_Mapping; Element : in Wide_Wide_Character) return Wide_Wide_Character;
   Identity : constant Wide_Wide_Character_Mapping;
   function To_Mapping (From, To : in Wide_Wide_Character_Sequence) return Wide_Wide_Character_Mapping;
   function To_Domain (Map : in Wide_Wide_Character_Mapping) return Wide_Wide_Character_Sequence;
   function To_Range (Map : in Wide_Wide_Character_Mapping) return Wide_Wide_Character_Sequence;
   type Wide_Wide_Character_Mapping_Function is access function (From : in Wide_Wide_Character) return Wide_Wide_Character;
private
   ... -- not specified by the language
end Ada.Strings.Wide_Wide_Maps;
The context clause for each of the packages Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, and Strings.Wide_Wide_Unbounded identifies Strings.Wide_Wide_Maps instead of Strings.Maps.
For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and Strings.Maps.Constants the corresponding wide wide string package has the same contents except that
The following additional declarations are present in Strings.Wide_Wide_Maps.Wide_Wide_Constants:
Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
-- Contains each Wide_Wide_Character value WWC such that
-- Characters.Handling.Is_Character(WWC) is True
Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
-- Contains each Wide_Wide_Character value WWC such that
-- Characters.Handling.Is_Wide_Character(WWC) is True
NOTES
14 If a null Wide_Wide_Character_Mapping_Function is passed to any of the Wide_Wide_String handling subprograms, Constraint_Error is propagated.
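As an illustration only (not part of the wording), a minimal use of the set and mapping operations declared in Strings.Wide_Wide_Maps might look as follows. The procedure and object names are hypothetical, and the sketch assumes the package behaves analogously to Strings.Maps, in particular that a character outside the mapping's domain maps to itself.

```ada
with Ada.Strings.Wide_Wide_Maps; use Ada.Strings.Wide_Wide_Maps;

procedure Maps_Sketch is
   --  A set built from a single range Low .. High.
   Lower : constant Wide_Wide_Character_Set :=
     To_Set (Wide_Wide_Character_Range'(Low => 'a', High => 'z'));
   --  A mapping built from two sequences of equal length.
   To_Up : constant Wide_Wide_Character_Mapping :=
     To_Mapping (From => "abc", To => "ABC");
begin
   pragma Assert (Is_In ('q', Lower));
   pragma Assert (not Is_In ('Q', Lower));
   pragma Assert (Value (To_Up, 'b') = 'B');
   --  Assumed by analogy with Strings.Maps: 'z' is not in From,
   --  so it maps to itself.
   pragma Assert (Value (To_Up, 'z') = 'z');
   null;
end Maps_Sketch;
```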
!corrigendum A.6(01)
Replace the paragraph:
Input-output is provided through language-defined packages, each of which is a child of the root package Ada. The generic packages Sequential_IO and Direct_IO define input-output operations applicable to files containing elements of a given type. The generic package Storage_IO supports reading from and writing to an in-memory buffer. Additional operations for text input-output are supplied in the packages Text_IO and Wide_Text_IO. Heterogeneous input-output is provided through the child packages Streams.Stream_IO and Text_IO.Text_Streams (see also 13.13). The package IO_Exceptions defines the exceptions needed by the predefined input-output packages.
by:
Input-output is provided through language-defined packages, each of which is a child of the root package Ada. The generic packages Sequential_IO and Direct_IO define input-output operations applicable to files containing elements of a given type. The generic package Storage_IO supports reading from and writing to an in-memory buffer. Additional operations for text input-output are supplied in the packages Text_IO, Wide_Text_IO, and Wide_Wide_Text_IO. Heterogeneous input-output is provided through the child packages Streams.Stream_IO and Text_IO.Text_Streams (see also 13.13). The package IO_Exceptions defines the exceptions needed by the predefined input-output packages.
!corrigendum A.7(04)
Replace the paragraph:
Input-output for direct access files is likewise defined by a generic package called Direct_IO. Input-output in human-readable form is defined by the (nongeneric) packages Text_IO for Character and String data, and Wide_Text_IO for Wide_Character and Wide_String data. Input-output for files containing streams of elements representing values of possibly different types is defined by means of the (nongeneric) package Streams.Stream_IO.
by:
Input-output for direct access files is likewise defined by a generic package called Direct_IO. Input-output in human-readable form is defined by the (nongeneric) packages Text_IO for Character and String data, Wide_Text_IO for Wide_Character and Wide_String data, and Wide_Wide_Text_IO for Wide_Wide_Character and Wide_Wide_String data. Input-output for files containing streams of elements representing values of possibly different types is defined by means of the (nongeneric) package Streams.Stream_IO.
!corrigendum A.7(10)
Replace the paragraph:
type File_Mode is (In_File, Out_File, Append_File); -- for Sequential_IO, Text_IO, Wide_Text_IO, and Stream_IO
by:
type File_Mode is (In_File, Out_File, Append_File); -- for Sequential_IO, Text_IO, Wide_Text_IO, Wide_Wide_Text_IO, and Stream_IO
!corrigendum A.7(13)
Replace the paragraph:
Several file management operations are common to Sequential_IO, Direct_IO, Text_IO, and Wide_Text_IO. These operations are described in subclause A.8.2 for sequential and direct files. Any additional effects concerning text input-output are described in subclause A.10.2.
by:
Several file management operations are common to Sequential_IO, Direct_IO, Text_IO, Wide_Text_IO, and Wide_Wide_Text_IO. These operations are described in subclause A.8.2 for sequential and direct files. Any additional effects concerning text input-output are described in subclause A.10.2.
!corrigendum A.7(15)
Replace the paragraph:
18 Each instantiation of the generic packages Sequential_IO and Direct_IO declares a different type File_Type. In the case of Text_IO, Wide_Text_IO, and Streams.Stream_IO, the corresponding type File_Type is unique.
by:
18 Each instantiation of the generic packages Sequential_IO and Direct_IO declares a different type File_Type. In the case of Text_IO, Wide_Text_IO, Wide_Wide_Text_IO, and Streams.Stream_IO, the corresponding type File_Type is unique.
!corrigendum A.11(00)
Replace the paragraph:
Wide Text Input-Output
by:
Wide Text Input-Output and Wide Wide Text Input-Output
!corrigendum A.11(01)
Replace the paragraph:
The package Wide_Text_IO provides facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters grouped into lines, and as a sequence of lines grouped into pages.
by:
The packages Wide_Text_IO and Wide_Wide_Text_IO provide facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters (or wide wide characters) grouped into lines, and as a sequence of lines grouped into pages.
!corrigendum A.11(02)
Replace the paragraph:
The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String.
by:
The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String. Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO.
!corrigendum A.11(03)
Replace the paragraph:
Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO.
by:
The specification of package Wide_Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Wide_Character, and any occurrence of String is replaced by Wide_Wide_String. Nongeneric equivalents of Wide_Wide_Text_IO.Integer_IO and Wide_Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Wide_Text_IO, Ada.Long_Integer_Wide_Wide_Text_IO, Ada.Float_Wide_Wide_Text_IO, Ada.Long_Float_Wide_Wide_Text_IO.
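As an illustration only (not part of the wording), the substitution rule above means that calls read the same as for Text_IO, with the wider types as arguments. The procedure and object names are hypothetical; how the characters appear externally depends on the implementation's encoding and is not shown here.

```ada
with Ada.Wide_Wide_Text_IO;
with Ada.Integer_Wide_Wide_Text_IO;

procedure Text_IO_Sketch is
   Greeting : constant Wide_Wide_String := "Hello";
begin
   --  Put takes a Wide_Wide_Character where Text_IO takes a Character.
   Ada.Wide_Wide_Text_IO.Put (Wide_Wide_Character'Val (16#263A#));
   --  Put_Line takes a Wide_Wide_String where Text_IO takes a String.
   Ada.Wide_Wide_Text_IO.Put_Line (Greeting);
   --  Nongeneric Integer_IO equivalent for the predefined type Integer.
   Ada.Integer_Wide_Wide_Text_IO.Put (42);
   Ada.Wide_Wide_Text_IO.New_Line;
end Text_IO_Sketch;
```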
!corrigendum A.12(01)
Replace the paragraph:
The packages Streams.Stream_IO, Text_IO.Text_Streams, and Wide_Text_IO.Text_Streams provide stream-oriented operations on files.
by:
The packages Streams.Stream_IO, Text_IO.Text_Streams, Wide_Text_IO.Text_Streams, and Wide_Wide_Text_IO.Text_Streams provide stream-oriented operations on files.
!corrigendum A.12.4(01)
Insert new clause:
The package Wide_Wide_Text_IO.Text_Streams provides a function for treating a wide wide text file as a stream.
Static Semantics
The library package Wide_Wide_Text_IO.Text_Streams has the following declaration:
with Ada.Streams;
package Ada.Wide_Wide_Text_IO.Text_Streams is
   type Stream_Access is access all Streams.Root_Stream_Type'Class;
   function Stream (File : in File_Type) return Stream_Access;
end Ada.Wide_Wide_Text_IO.Text_Streams;
The Stream function has the same effect as the corresponding function in Text_IO.Text_Streams.
!corrigendum B.3(39)
Insert after the paragraph:
procedure To_Ada (Item : in wchar_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
the new paragraphs:
-- ISO/IEC 10646:2003 compatible types defined by SC22/WG14 document N1010.
type char16_t is <implementation-defined character type>;
char16_nul : constant char16_t := <implementation-defined>;
function To_C (Item : in Wide_Character) return char16_t;
function To_Ada (Item : in char16_t) return Wide_Character;
type char16_array is array (size_t range <>) of aliased char16_t;
pragma Pack(char16_array);
function Is_Nul_Terminated (Item : in char16_array) return Boolean;
function To_C (Item : in Wide_String; Append_Nul : in Boolean := True) return char16_array;
function To_Ada (Item : in char16_array; Trim_Nul : in Boolean := True) return Wide_String;
procedure To_C (Item : in Wide_String; Target : out char16_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char16_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
type char32_t is <implementation-defined character type>;
char32_nul : constant char32_t := <implementation-defined>;
function To_C (Item : in Wide_Wide_Character) return char32_t;
function To_Ada (Item : in char32_t) return Wide_Wide_Character;
type char32_array is array (size_t range <>) of aliased char32_t;
pragma Pack(char32_array);
function Is_Nul_Terminated (Item : in char32_array) return Boolean;
function To_C (Item : in Wide_Wide_String; Append_Nul : in Boolean := True) return char32_array;
function To_Ada (Item : in char32_array; Trim_Nul : in Boolean := True) return Wide_Wide_String;
procedure To_C (Item : in Wide_Wide_String; Target : out char32_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char32_array; Target : out Wide_Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
!corrigendum B.3(43)
Replace the paragraph:
The types int, short, long, unsigned, ptrdiff_t, size_t, double, char, and wchar_t correspond respectively to the C types having the same names. The types signed_char, unsigned_short, unsigned_long, unsigned_char, C_float, and long_double correspond respectively to the C types signed char, unsigned short, unsigned long, unsigned char, float, and long double.
by:
The types int, short, long, unsigned, ptrdiff_t, size_t, double, char, wchar_t, char16_t, and char32_t correspond respectively to the C types having the same names. The types signed_char, unsigned_short, unsigned_long, unsigned_char, C_float, and long_double correspond respectively to the C types signed char, unsigned short, unsigned long, unsigned char, float, and long double.
!corrigendum B.3(60)
Insert after the paragraph:
The To_C and To_Ada subprograms that convert between Wide_String and wchar_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that wide_nul is used instead of nul.
the new paragraphs:
function Is_Nul_Terminated (Item : in char16_array) return Boolean;
The result of Is_Nul_Terminated is True if Item contains char16_nul, and is False otherwise.
function To_C (Item : in Wide_Character) return char16_t;
function To_Ada (Item : in char16_t) return Wide_Character;
To_C and To_Ada provide mappings between the Ada and C 16-bit character types.
function To_C (Item : in Wide_String; Append_Nul : in Boolean := True) return char16_array;
function To_Ada (Item : in char16_array; Trim_Nul : in Boolean := True) return Wide_String;
procedure To_C (Item : in Wide_String; Target : out char16_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char16_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
The To_C and To_Ada subprograms that convert between Wide_String and char16_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that char16_nul is used instead of nul.
function Is_Nul_Terminated (Item : in char32_array) return Boolean;
The result of Is_Nul_Terminated is True if Item contains char32_nul, and is False otherwise.
function To_C (Item : in Wide_Wide_Character) return char32_t;
function To_Ada (Item : in char32_t) return Wide_Wide_Character;
To_C and To_Ada provide mappings between the Ada and C 32-bit character types.
function To_C (Item : in Wide_Wide_String; Append_Nul : in Boolean := True) return char32_array;
function To_Ada (Item : in char32_array; Trim_Nul : in Boolean := True) return Wide_Wide_String;
procedure To_C (Item : in Wide_Wide_String; Target : out char32_array; Count : out size_t; Append_Nul : in Boolean := True);
procedure To_Ada (Item : in char32_array; Target : out Wide_Wide_String; Count : out Natural; Trim_Nul : in Boolean := True);
The To_C and To_Ada subprograms that convert between Wide_Wide_String and char32_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that char32_nul is used instead of nul.
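As an illustration only (not part of the proposed wording), the following sketch round-trips a Wide_Wide_String through char32_array using the declarations proposed above. It assumes an implementation whose Interfaces.C already provides char32_t and the associated subprograms; the procedure name Show_Char32_Round_Trip is hypothetical.

```ada
--  Enable assertion checks for this unit (Ada 2005 configuration pragma).
pragma Assertion_Policy (Check);

with Interfaces.C; use Interfaces.C;

procedure Show_Char32_Round_Trip is
   Source : constant Wide_Wide_String := "Ada";

   --  Append_Nul defaults to True, so the result carries one extra
   --  element holding char32_nul.
   C_Text : constant char32_array := To_C (Source);

   --  Trim_Nul defaults to True, so the terminator is dropped again.
   Back : constant Wide_Wide_String := To_Ada (C_Text);
begin
   pragma Assert (C_Text'Length = Source'Length + 1);
   pragma Assert (Is_Nul_Terminated (C_Text));
   pragma Assert (C_Text (C_Text'Last) = char32_nul);
   pragma Assert (Back = Source);
   null;
end Show_Char32_Round_Trip;
```

The char16_array subprograms can be exercised the same way, with Wide_String in place of Wide_Wide_String and char16_nul in place of char32_nul.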
!corrigendum C.5(7)
Replace the paragraph:
If the pragma applies to an enumeration type, then the semantics of the Wide_Image and Wide_Value attributes are implementation defined for that type; the semantics of Image and Value are still defined in terms of Wide_Image and Wide_Value. In addition, the semantics of Text_IO.Enumeration_IO are implementation defined. If the pragma applies to a tagged type, then the semantics of the Tags.Expanded_Name function are implementation defined for that type. If the pragma applies to an exception, then the semantics of the Exceptions.Exception_Name function are implementation defined for that exception.
by:
If the pragma applies to an enumeration type, then the semantics of the Wide_Wide_Image and Wide_Wide_Value attributes are implementation defined for that type; the semantics of Image, Wide_Image, Value, and Wide_Value are still defined in terms of Wide_Wide_Image and Wide_Wide_Value. In addition, the semantics of Text_IO.Enumeration_IO are implementation defined. If the pragma applies to a tagged type, then the semantics of the Tags.Expanded_Name function are implementation defined for that type. If the pragma applies to an exception, then the semantics of the Exceptions.Exception_Name function are implementation defined for that exception.
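As an illustration only (not part of the proposed wording), a minimal sketch contrasting an ordinary enumeration type with one to which pragma Discard_Names applies. The type and procedure names are hypothetical. For the second type the text produced is implementation defined, so no particular result is assumed.

```ada
with Ada.Wide_Wide_Text_IO;

procedure Show_Discard_Names is
   type Color is (Red, Green, Blue);

   type Quiet_Color is (Cyan, Magenta, Yellow);
   pragma Discard_Names (Quiet_Color);
begin
   --  Language-defined result: the literal in upper case, "GREEN".
   Ada.Wide_Wide_Text_IO.Put_Line (Color'Wide_Wide_Image (Green));

   --  Implementation-defined result, because Discard_Names applies
   --  to Quiet_Color.
   Ada.Wide_Wide_Text_IO.Put_Line (Quiet_Color'Wide_Wide_Image (Magenta));
end Show_Discard_Names;
```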
!corrigendum F(4)
Replace the paragraph:
by:
!corrigendum F.3(1)
Replace the paragraph:
The child packages Text_IO.Editing and Wide_Text_IO.Editing provide localizable formatted text output, known as edited output, for decimal types. An edited output string is a function of a numeric value, program-specifiable locale elements, and a format control value. The numeric value is of some decimal type. The locale elements are:
by:
The child packages Text_IO.Editing, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing provide localizable formatted text output, known as edited output, for decimal types. An edited output string is a function of a numeric value, program-specifiable locale elements, and a format control value. The numeric value is of some decimal type. The locale elements are:
!corrigendum F.3(6)
Replace the paragraph:
For Text_IO.Editing the edited output and currency strings are of type String, and the locale characters are of type Character. For Wide_Text_IO.Editing their types are Wide_String and Wide_Character, respectively.
by:
For Text_IO.Editing the edited output and currency strings are of type String, and the locale characters are of type Character. For Wide_Text_IO.Editing their types are Wide_String and Wide_Character, respectively. For Wide_Wide_Text_IO.Editing their types are Wide_Wide_String and Wide_Wide_Character, respectively.
!corrigendum F.3(19)
Replace the paragraph:
The generic packages Text_IO.Decimal_IO and Wide_Text_IO.Decimal_IO (see A.10.9, ''Input-Output for Real Types'') provide text input and non-edited text output for decimal types.
by:
The generic packages Text_IO.Decimal_IO, Wide_Text_IO.Decimal_IO, and Wide_Wide_Text_IO.Decimal_IO (see A.10.9, ''Input-Output for Real Types'') provide text input and non-edited text output for decimal types.
!corrigendum F.3(20)
Replace the paragraph:
2 A picture String is of type Standard.String, both for Text_IO.Editing and Wide_Text_IO.Editing.
by:
2 A picture String is of type Standard.String, for all of Text_IO.Editing, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing.
!corrigendum F.3.5(01)
Insert new clause:
Static Semantics
The child package Wide_Wide_Text_IO.Editing has the same contents as Text_IO.Editing, except that:
NOTES
6 Each of the functions Wide_Wide_Text_IO.Editing.Valid, To_Picture, and Pic_String has String (versus Wide_Wide_String) as its parameter or result subtype, since a picture String is not localizable.
!corrigendum G.1.5(01)
Insert new clause:
Static Semantics
Implementations shall also provide the generic library package Wide_Wide_Text_IO.Complex_IO. Its declaration is obtained from that of Text_IO.Complex_IO by systematically replacing Text_IO by Wide_Wide_Text_IO and String by Wide_Wide_String; the description of its behavior is obtained by additionally replacing references to particular characters (commas, parentheses, etc.) by those for the corresponding wide wide characters.
!corrigendum H.4(20)
Replace the paragraph:
No_IO
Semantic dependence on any of the library units Sequential_IO, Direct_IO, Text_IO, Wide_Text_IO, or Stream_IO is not allowed.
by:
No_IO
Semantic dependence on any of the library units Sequential_IO, Direct_IO, Text_IO, Wide_Text_IO, Wide_Wide_Text_IO, or Stream_IO is not allowed.
!ACATS test
ACATS tests need to be constructed for these facilities.
!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002  4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a
variant of the Latin 1 character set that is designated Latin 9.

A web page describing Latin 9 can be viewed at:

  http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor
differences between Latin 1 and Latin 9:

  ISO Latin 9 as compared with ISO Latin 1

  The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
  ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
  sign and some national letters used e.g. in French and Finnish have been
  introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9,
analogous to Ada.Characters.Latin_1, to define character constants for this
new character set.

Robert Dewar asked me to post the following remarks from him
re Latin-9 and Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official
names of all characters, and I have copied its abbreviated naming style
for the new characters in Latin-9.

I have a gripe with the RM here. The setup for Ada.Characters.Latin_1 is
to have separate packages for separate character sets, which makes perfectly
good sense:

27   An implementation may provide additional packages as children of
Ada.Characters, to declare names for the symbols of the local character set
or other character sets.

But for Characters.Handling, we have the odd statement:

49   If an implementation provides a localized definition of Character or
Wide_Character, then the effects of the subprograms in Characters.Handling
should reflect the localizations. See also 3.5.2.

which implies that some mysterious transformation happens on this package
(under what circumstances?). I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd
for the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program,
whereas the design for Ada.Characters.Latin_1 etc. seems to accommodate this.

My recommendation is that an implementation generate separate packages,
called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002  5:05 PM

>   The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
>   ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
>   sign and some national letters used e.g. in French and Finnish have been
>   introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and that
we now can write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002  1:44 AM

Better put that on the agenda for the next ARG. Ada 2005
should use Latin 9 rather than Latin 1.  A minor change.
Might be a few incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002  12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to
write French makes it look very interesting to me.

On the other hand, it is not too useful for Ada to support Latin 9 if the
OSes don't: if I emit the character OE and it prints out as 1/4 on my screen,
I didn't gain much.

So while I agree that we should consider supporting Latin 9 _in_addition_ to
Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I
am ready to bet that we will still have Latin 1 OSes ten years from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002  1:33 AM

It was somewhat of a jokey suggestion as I am sure you are aware.

Indeed I had a big problem when writing my book and
displaying the type Character. I wrote it in QuarkXpress on
a PC and it was fine. The publishers moved it to a Mac
before printing and some characters came out wrong.  One of
them came out as a picture of an apple. Moreover, someone
had bitten a lump out of it. So much for standards I
thought.

But supporting Latin-9 would be nice. All those adverts on
the Paris Metro for eating an oeuf can then be printed
properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002  1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual
where type Character lives.  I think Randy had some trouble with the
updated RM, too.  At least we didn't try to show type Wide_Character in
its full glory.  ;-)

7-bit ascii will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002  2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose
from.  ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
> being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

That makes sense.

But I think the RM statement you complain about is envisioning a
nonstandard version of Standard.[Wide_]Character, which is a separate
issue.  I don't see that as a big deal -- if you don't think it's a good
idea, don't implement any such thing.  I tend to agree that compiler
switches and the like shouldn't normally be meddling with the semantics
of packages Standard and Characters.Handling without a very good reason.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> But I think the RM statement you complain about is envisioning a
> nonstandard version of Standard.[Wide_]Character, which is a separate
> issue.

If you use Latin 9 for Standard.Character, this is certainly a
nonstandard version, and Ada.Characters.Handling has to be modified
to remain useful.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> Better put that on the agenda for the next ARG. Ada 2005
> should use Latin 9 rather than Latin 1.  A minor change.
> Might be a few incompatibilities.

I disagree.  With Latin 9, the mapping from Character to
Wide_Character is less straightforward, and this could have unexpected
results.

OTOH, it seems that Wide_Character is not widely used (unless you are
forced to do so by ASIS), so this might not matter much.

In addition, we really should add Wide_Wide_Character (which covers
the sixteen additional planes), or make Wide_Character itself wider.
Otherwise, using Unicode with standard Ada will be rather painful.

*************************************************************

From: Florian Weimer
Sent: Saturday, April 20, 2002  3:18 AM

ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
and 10646-2:2001 allocates characters outside the Basic Multilingual
Plane.

Not too long ago, quite a few people assumed that characters beyond
the BMP would be interesting only for rather esoteric scholarly use
(Linear B is a perfect example).  However, we now have got at least two
different sets of code positions outside the BMP which will see more
widespread use eventually: the mathematical alphabets and Plane 14
Language Tags (which are required to make some Japanese people happy
who fear that Japanese characters are rendered using Chinese glyphs).

Therefore, I think Ada 200X should somehow support characters outside
the BMP.

A few random thoughts (sorry, I'm probably not using strict ISO 10646
terminology):

  * Several major vendors have adopted ISO 10646-1:1993 early, using a
    16 bit representation for characters (i.e. wchar_t in C is 16
    bits).

These vendors include Sun (Java) and Microsoft (Windows), and probably
most proprietary UNIX vendors.  These vendor implementations now cover
the code positions beyond the BMP using UTF-16, which uses surrogate
pairs (a single character is represented using two 16 bit values from
reserved ranges in the BMP).

UTF-16 has got a few drawbacks: the ordering (in terms of UCS code
positions) is no longer lexicographic (which leads us to such brain
damage as CESU-8), dealing with individual characters is complicated,
and you cannot implement the C wide character functions properly.

For Ada, numerous changes would be required if we want to expose the
UTF-16 representation to programmers, for example by declaring
Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
longer be arrays of characters indexed by position).

GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
characters in a single 32 bit value, that is, UTF-32), and while this
is certainly not the "industry standard" (it is encouraged by ISO
9899:1999, though), I really hope we can use this approach (UTF-32
internal representation) for Ada, as it simplifies things
considerably, especially if we want to add character properties
support (see below).

  * We could add Wide_Wide_Character and Wide_Wide_String types to
    package Standard (and extending the Ada.Strings hierarchy), which
    are encoded in UTF-32.

I don't know if this is necessary.  IIRC, Robert Dewar once said that
the only applications using Wide_Character are based on ASIS, where
using Wide_Character is not really voluntary.  Maybe it is possible
to bump Wide_Character'Size to 32 bits instead, without really
breaking backwards compatibility.

Of course, we would need a way to convert UTF-32 strings to UTF-16
strings and vice versa (the UTF-16 string type could become a
second-class citizen, though, without full support in the Ada.Strings
hierarchy).

  * External representation of UCS characters is rapidly moving
    towards UTF-8 (especially in Internet standards).

Ada should provide an interface for converting between the wide string
type(s) and UTF-8 octet sequences.  It should be possible to use
string literals where UTF-8 strings are expected.

  * Supporting higher levels of Unicode (e.g. accessing the character
    properties database, normalization forms) would be interesting,
    too.

Such documents will eventually follow in the ISO 10646 series, but I
don't know if the ISO standard will be ready for Ada 200X.  Currently,
only the Unicode Consortium has standardized or documented issues like
character properties or terminal behavior in detail.

I don't know how ISO reacts if ISO standards refer to competing
standardization efforts.  IEEE POSIX.1 (and probably, or already, ISO
POSIX) standardizes the BSD sockets interface, and not OSI, so maybe
this isn't an issue.

In any case, this point is mostly a library issue which can be
addressed by a community implementation effort, it does not require
changes in the Ada language (adding Wide_Wide_Character does, for
example).

*************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002  8:32 AM

> ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
> and 10646-2:2001 allocates characters outside the Basic Multilingual
> Plane.
>
> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

The normalization of new character sets (both as part of 10646 and of 8859)
was actually discussed at the last ARG meeting, and I was given an action
item to somehow integrate them in the language, probably as some kind of
amendment AI.

> A few random thoughts (sorry, I'm probably not using strict ISO 10646
> terminology):
>
>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>     16 bit representation for characters (i.e. wchar_t in C is 16
>     bits).

Which is fine as it maps directly to Ada's wide character.  I still think
that we want to retain the capacity of using 16-bit blobs to represent
characters in the BMP, as 99.5% of practical applications will only need the
BMP.

> For Ada, numerous changes would be required if we want to expose the
> UTF-16 representation to programmers, for example by declaring
> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
> longer be arrays of characters indexed by position).

Changes to Wide_Character and Wide_String are pretty much out of the
question.  On the other hand, the type that is intended for interfacing with
C is Interfaces.C.wchar_array, and it would be straightforward to provide
(in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
compiler does).

> I really hope we can use this approach (UTF-32
> internal representation) for Ada, as it simplifies things
> considerably, especially if we want to add character properties
> support (see below).

I would think that we would want to use UCS-4, since it's an ISO standard.
Moreover, UTF-32 has a number of consistency rules (eg code points below
16#10ffff#) which seem irrelevant for internal manipulation of strings.

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     package Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

Wide_Wide_ types seem like the natural way to add this capability to the
language, except that some compilers may not be quite prepared to deal with
enumeration types with 2 ** 32 literals (ours isn't).

> (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

As far as I can tell, there is no support for UTF-16, only for UCS-2.
Anyway, I don't think it is reasonable to force applications to go to the
full 32-bit overhead just because they use, say, the French OE ligature.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use
> string literals where UTF-8 strings are expected.

External representation is best handled by Text_IO and friends, typically by
using a form parameter to specify the encoding (and there are many more
encodings than just UCS and UTF).  The ARG won't get into the business of
specifying the details of the form parameter, so this is something that will
remain non-portable for the foreseeable future.  (Where do we stop?  Do we
want to require all validated compilers to support UTF-8?  What about the
Chinese Big5 or the JIS encodings?)

>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

We certainly don't want to get into that business.  The designers of Ada 95
wisely decided to lump all of the characters in the range 16#0100# ..
16#FFFD# into the category special_character, so that they don't have to
decide which is a letter, a number, etc.  Similarly they didn't provide
classification functions or upper/lower conversions for wide characters.
This seems reasonable if we don't want to have to amend Ada each time a
bunch of characters are added to 10646.

*************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002  7:31 PM

> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

I agree.

> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
> characters in a single 32 bit value, that is, UTF-32), and while this is
> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
> though), I really hope we can use this approach (UTF-32 internal
> representation) for Ada, as it simplifies things considerably, especially
> if we want to add character properties support (see below).

I agree very strongly!

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     package Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

I must say I would prefer the identifiers Universal_Character and
Universal_String. I see the logic of Wide_Wide_ but it seems clumsy!

> I don't know if this is necessary.  IIRC, Robert Dewar once said that
> the only applications using Wide_Character are based on ASIS, where
> using Wide_Character is not really voluntary.  Maybe it is possible
> to bump Wide_Character'Size to 32 bits instead, without really
> breaking backwards compatibility.

I disagree with this idea.

> Of course, we would need a way to convert UTF-32 strings to UTF-16
> strings and vice versa (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

Possibly these support packages should be in an optional annex.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use string
> literals where UTF-8 strings are expected.
>
>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

Again, perhaps all this should really be in (or moved into) an optional annex.

*************************************************************

From: Robert Dewar
Sent: Wednesday, April 24, 2002  9:50 PM

I suspect that the work on wide_wide_character will in practice turn
out to be nearly useless in the short or medium term. We certainly
put in a lot of work in GNAT in implementing wide character with many
different representation schemes, but this feature has been very little
used (ASIS being the main use :-). In practice I think the 16-bit character
type defined in Ada now will be adequate for almost all use, and I see no
reason in requiring implementations to go beyond this in the absence of
real market demand.

Yes, it's fun to talk about character set issues (after all I was chair of
the CRG, so I appreciate this), but there is no point in increasing
implementation burdens unless it's really valuable.

I would just give clear permission for an implementation to add additional
character types in standard (indeed that permission exists today in Ada 95),
and leave it at that.

*************************************************************

From: John Barnes
Sent: Thursday, April 25, 2002  1:46 AM

The BSI is looking at character set issues across languages
and your message reminded me of the CRG. Was there ever a
final report that I could refer to?

*************************************************************

From: Robert Dewar
Sent: Thursday, April 26, 2002  10:25 PM

I think there was a final report, perhaps Jim could track it down.

*************************************************************

From: Randy Brukardt
Sent: Thursday, April 25, 2002  3:44 PM

> We certainly put in a lot of work in GNAT in implementing wide
> character with many different representation schemes, but this
> feature has been very little used (ASIS being the main use :-).

To add another data point: Claw was designed so that a wide character version
could be easily created. But we've never implemented that version, mainly
because we've never had a paying customer ask for it. So I have to wonder how
important "Really_Wide_Character" would be.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002  5:41 AM

> I suspect that the work on wide_wide_character will in practice turn
> out to be nearly useless in the short or medium term.

Using Ada for internationalized applications on GNU systems (using GNU
facilities) almost requires 32 bit Wide_Wide_Character support, since
GNU uses a 32 bit wchar_t internally.

(See a similar discussion on the GCC development list.)

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002  7:32 AM

We have seen zero demand for such functionality, so would not invest any time
at all in either design or implementation work here. If such a feature is
added to Ada, I would definitely suggest it be optional.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002  6:00 AM

>>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>>     16 bit representation for characters (i.e. wchar_t in C is 16
>>     bits).
>
> Which is fine as it maps directly to Ada's wide character.  I still think
> that we want to retain the capacity of using 16-bit blobs to represent
> characters in the BMP, as 99.5% of practical applications will only need the
> BMP.

Quite a few people have already changed their minds about the 99.5%
figure (mathematical characters and Plane 14 Language Tags being the
reason).  Maybe it's true for the character count, but I doubt it for
the application count.

> Changes to Wide_Character and Wide_String are pretty much out of the
> question.

Okay, accepted.

> On the other hand, the type that is intended for interfacing with
> C is Interfaces.C.wchar_array, and it would be straightforward to provide
> (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
> Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
> compiler does).

I doubt that C compilers can use UTF-16 for wchar_t.  You cannot apply
iswlower() to a single surrogate character. :-/

> I would think that we would want to use UCS-4, since it's an ISO standard.
> Moreover, UTF-32 has a number of consistency rules (eg code points below
> 16#10ffff#) which seem irrelevant for internal manipulation of strings.

Yes, UCS-4 is indeed the correct encoding form to use.

>>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>>     package Standard (and extending the Ada.Strings hierarchy), which
>>     are encoded in UTF-32.
>
> Wide_Wide_ types seem like the natural way to add this capability to the
> language, except that some compilers may not be quite prepared to deal with
> enumeration types with 2 ** 32 literals (ours isn't).

Ah, this could be a problem indeed, together with the large
universal_integer returned by Wide_Wide_Character'Pos.

>> (the UTF-16 string type could become a
>> second-class citizen, though, without full support in the Ada.Strings
>> hierarchy).
>
> As far as I can tell, there is no support for UTF-16, only for UCS-2.

At the moment, yes, but I think we need some UTF-16 support, too,
because many operating system interfaces use it.

> Anyway, I don't think it is reasonable to force applications to go to the
> full 32-bit overhead just because they use, say, the French OE ligature.

Most people apparently refuse to use Wide_Character, too, for the same
reason.  They either go for ISO 8859-15 or Windows 1252, or don't use
the OE ligature at all.

> External representation is best handled by Text_IO and friends, typically by
> using a form parameter to specify the encoding (and there are many more
> encodings than just UCS and UTF).

There was a recent discussion to add other I/O facilities.  UTF-8 is
becoming more and more common in the Internet context, and often, you
can determine the encoding of a file only after reading the first
couple of lines (think of a MIME-encoded mail message).  Furthermore,
UTF-8 already plays an important role in interacting with other
libraries (not written in Ada).

> (Where do we stop?  Do we want to require all validated compilers to
> support UTF-8?

Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

> What about the chinese Big5 or the JIS encodings?)

If there is support for UCS-4, handling these encodings could be
performed by a mechanism similar to POSIX iconv().

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002  7:43 AM

> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

Why not = because there is no real demand. Especially this time around we need
to be very careful not to require things that no one is really interested in.
If we do this, the vendors will simply ignore any new standard. In fact I
think if there is a new standard, it will only be implemented as a result of
direct customer interest in features in this standard. The value of formal
conformance and validation has largely disappeared from the Ada marketplace
at this stage (in terms of customer demand). That's not to say that the Ada
marketplace is not very vital and dynamic, we get dozens of requests for
enhancements from our users every month, but there is precious little
intersection between the things users seem to need and want and these
kind of discussions.

In GNAT, we put a lot of effort into implementing multiple character sets
(we just added the new Latin set with the Euro symbol, because customers
needed that for example). Some of it has been useful (like this Euro
addition), but mostly these features are of entertainment and advertising
value only. In fact the only serious user that we have for Wide_Character
and Wide_String is us (from ASIS :-)

One thing to remember here is that very little is needed in the way of
language support for fancy character sets (most of the effort in GNAT
for example for 8-bit sets is in csets, which gives proper case mapping
for identifiers, and it is easy enough to add new tables to this -- someone
contributed a new Cyrillic table just a few months ago). Most of the issues
are representational issues, and the Ada standard has nothing to say about
source representation (and this should not change in any new standard).

*************************************************************

From: Pascal Leroy
Sent: Tuesday, May 21, 2002  4:03 AM

> > Which is fine as it maps directly to Ada's wide character.  I still think
> > that we want to retain the capacity of using 16-bit blobs to represent
> > characters in the BMP, as 99.5% of practical applications will only need the
> > BMP.
>
> Quite a few people have already changed their minds about the 99.5%
> figure (mathematical characters and Plane 14 Language being the
> reason).  Maybe it's true for the character count, but I doubt it for
> the application count.

Remember, we are talking Ada applications here.  There are probably many
applications out there that deal with mathematical symbols or with Tengwar, but
I doubt that they are written in Ada.

> > External representation is best handled by Text_IO and friends, typically by
> > using a form parameter to specify the encoding (and there are many more
> > encodings than just UCS and UTF).
>
> There was a recent discussion to add other I/O facilities.  UTF-8 is
> becoming more and more common in the Internet context, and often, you
> can determine the encoding of a file only after reading the first
> couple of lines (think of a MIME-encoded mail message).  Furthermore,
> UTF-8 already plays an important role in interacting with other
> libraries (not written in Ada).

Maybe we need a predefined unit to convert UCS-2 to/from UTF-8.  But then such
conversion functions could easily be written by the user, too, or provided by
some public domain stuff.

> > (Where do we stop?  Do we want to require all validated compilers to
> > support UTF-8?
>
> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

You don't sell many compilers if you don't support 8859-1.  As for UCS-2, well,
that's pretty much the default representation of wide characters anyway.  Other
than that, it would seem that we should let the market decide.  Speaking for
Rational, we have had wide character support for about 7 years, and I don't
recall seeing a single bug report or request for enhancement on this topic.
This may indicate that our technology is perfect, but there are other
explanations ;-) .  (As a matter of fact we probably have very few licenses
installed in countries where 8859-1 is not sufficient to write the native
language -- ignoring the problem with the OE ligature in French.)

One option would be to add Wide_Wide_Character in a new annex, and let users
decide if they want their vendors to support this annex. Of course, chances are
that nobody would care, in which case that would be a lot of standardization
effort for nothing.

*************************************************************

From: Robert Dewar
Sent: Tuesday, May 21, 2002  4:39 AM

I agree with everything Pascal had to say about wide character. We do have
one Japanese customer using wide characters, and as I mentioned earlier,
ASIS uses wide strings to represent source texts, but other than that,
we have heard very little about wide strings. The only real input we have
got from customers on character set issues was the request to support
Latin-9 with the new Euro symbol and we got contributed tables for
Cyrillic from a Russian enthusiast (not a customer, but it seemed a
harmless addition :-)

*************************************************************

From: Florian Weimer
Sent: Tuesday, May 21, 2002  1:42 PM

> I agree with everything Pascal had to say about wide character. We do have
> one Japanese customer using wide characters, and as I mentioned earlier,
> ASIS uses wide strings to represent source texts, but other than that,
> we have heard very little about wide strings.

I guess this customer doesn't use Wide_Character in the way it was
intended (for storing ISO 10646 code positions), so this example is a
bit dubious.

> The only real input we have got from customers on character set
> issues was the request to support Latin-9 with the new Euro symbol

Even in this rather innocent case, Wide_Character is no longer using
UCS-2 with GNAT.

*************************************************************

From: Michael F. Yoder
Sent: Monday, October 21, 2002  10:58 AM

This is one of the items on my homework list.

UTF = UCS Transformation Format. UCS = Universal Multiple-Octet Coded
Character Set. I guess the MOC is silent.  :-)

UTF-8 encodes 31-bit values as sequences of 8-bit octets, as follows.

0xxxxxxx                     encodes itself (the coding is ASCII-compatible)
110xxxxx 10Y                 encodes xxxxxY where Y stands for yyyyyy
1110xxxx 10Y 10Z             encodes xxxxYZ
11110xxx 10Y 10Z 10U         encodes xxxYZU
111110xx 10Y 10Z 10U 10V     encodes xxYZUV
1111110x 10Y 10Z 10U 10V 10W encodes xYZUVW

The octets 11111110 and 11111111 aren't used in the encoding. So,
excepting these 2, octets starting with 11 are headers, those starting
with 10 are trailers, and those starting with 0 are singletons.

It's forbidden to use the redundant encodings (you must use the shortest
encoding allowed). There are security reasons for this, aside from the
fact that doing so breaks the string search property mentioned below.

The encoding is self-synchronizing: if you start in the middle of a
string of octets, you skip octets of the form 10xxxxxx to get to the
next start of character.

If the encoding is proper, string searches for an encoded pattern within
an encoded string will work as desired to yield all occurrences of the
pattern. (For case-folded searches and the like this only works if the
string is mapped before being converted to UTF-8.)
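
As a rough illustration of the table above, here is a minimal sketch in
Ada of an encoder that always picks the shortest form.  The names
Octet_Array, Encode and Put_Octets are made up for this example; they
are not part of any existing or proposed package.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF8_Encode_Demo is

   type Octet_Array is array (Positive range <>) of Unsigned_8;

   Hex : constant String := "0123456789ABCDEF";

   --  Hypothetical helper: encode one 31-bit value in its shortest form.
   function Encode (Code : Unsigned_32) return Octet_Array is
      Trailers : Natural;      --  number of 10xxxxxx octets required
      Header   : Unsigned_32;  --  fixed leading bits of the header octet
   begin
      if Code > 16#7FFF_FFFF# then
         raise Constraint_Error;  --  does not fit in 31 bits
      elsif Code < 16#80# then
         return (1 => Unsigned_8 (Code));  --  singleton, ASCII-compatible
      elsif Code < 16#800# then
         Trailers := 1;
         Header   := 2#1100_0000#;
      elsif Code < 16#1_0000# then
         Trailers := 2;
         Header   := 2#1110_0000#;
      elsif Code < 16#20_0000# then
         Trailers := 3;
         Header   := 2#1111_0000#;
      elsif Code < 16#400_0000# then
         Trailers := 4;
         Header   := 2#1111_1000#;
      else
         Trailers := 5;
         Header   := 2#1111_1100#;
      end if;

      declare
         Result : Octet_Array (1 .. Trailers + 1);
         Rest   : Unsigned_32 := Code;
      begin
         --  Fill the trailers from the right, six bits at a time.
         for I in reverse 2 .. Result'Last loop
            Result (I) :=
              Unsigned_8 (2#1000_0000# or (Rest and 2#0011_1111#));
            Rest := Shift_Right (Rest, 6);
         end loop;
         --  Whatever is left fits in the x bits of the header.
         Result (1) := Unsigned_8 (Header or Rest);
         return Result;
      end;
   end Encode;

   procedure Put_Octets (Octets : Octet_Array) is
   begin
      for I in Octets'Range loop
         Ada.Text_IO.Put (Hex (Natural (Octets (I) / 16) + 1));
         Ada.Text_IO.Put (Hex (Natural (Octets (I) mod 16) + 1));
         Ada.Text_IO.Put (' ');
      end loop;
      Ada.Text_IO.New_Line;
   end Put_Octets;

begin
   Put_Octets (Encode (16#E9#));    --  e-acute:   C3 A9
   Put_Octets (Encode (16#20AC#));  --  Euro sign: E2 82 AC
end UTF8_Encode_Demo;
```

Because the thresholds select the smallest number of trailers that can
hold the value, the sketch never produces a redundant encoding.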

*************************************************************

From: Robert Dewar
Sent: Monday, October 21, 2002  11:03 AM

Is anyone using UTF-8 encoding with Ada? We have some customers using wide
character encodings but none to our knowledge uses UTF-8.

*************************************************************

From: Robert A. Duff
Sent: Monday, October 21, 2002  11:43 AM

> It's forbidden to use the redundant encodings (you must use the shortest
> encoding allowed). There are security reasons for this,...

I'm curious: why is that?  (Not quite curious enough to go RTFM.  ;-))

>... aside from the
> fact that doing so breaks the string search property mentioned below.

Yes, I understand that.

*************************************************************

From: Michael F. Yoder
Sent: Monday, October 21, 2002  1:15 PM

This problem is one my previous employer is having to deal with.
Basically, it's that redundant encodings can be used to sneak things
past filters if the redundant encodings aren't rejected; if redundant
encodings are allowed, writing (say) a regular expression that will
match exactly all possible encoded forms is a pain, is error-prone, and
is probably significantly slower to check.

Here's a contrived case. A program reads a command, and if it's the
special command 'shazam' it checks the user's authorization; otherwise
it passes on the command unmodified, because all other commands are
safe. If there's a redundant encoding of 'shazam' that the filter
misses, an unauthorized user can bypass the checking if he can arrange
to supply that encoding.
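
To make the point concrete, here is a minimal sketch in Ada of a
decoder for a single character that refuses redundant encodings.  The
names Octet_Array, Minimum and Decode are hypothetical, chosen only for
this illustration.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF8_Strict_Decode_Demo is

   type Octet_Array is array (Positive range <>) of Unsigned_8;

   --  Smallest value that genuinely needs N trailer octets (N = 0 .. 5).
   Minimum : constant array (0 .. 5) of Unsigned_32 :=
     (0, 16#80#, 16#800#, 16#1_0000#, 16#20_0000#, 16#400_0000#);

   --  Hypothetical helper: decode exactly one character from Octets.
   --  Valid is False for malformed input, including redundant encodings.
   procedure Decode
     (Octets : Octet_Array;
      Code   : out Unsigned_32;
      Valid  : out Boolean)
   is
      Trailers : Natural;
      Head     : Unsigned_8;
   begin
      Code  := 0;
      Valid := False;
      if Octets'Length = 0 then
         return;
      end if;
      Head := Octets (Octets'First);

      if Head < 16#80# then
         Trailers := 0;
         Code     := Unsigned_32 (Head);
      elsif Head < 16#C0# then
         return;  --  a trailer cannot start a character
      elsif Head < 16#E0# then
         Trailers := 1;
         Code     := Unsigned_32 (Head and 16#1F#);
      elsif Head < 16#F0# then
         Trailers := 2;
         Code     := Unsigned_32 (Head and 16#0F#);
      elsif Head < 16#F8# then
         Trailers := 3;
         Code     := Unsigned_32 (Head and 16#07#);
      elsif Head < 16#FC# then
         Trailers := 4;
         Code     := Unsigned_32 (Head and 16#03#);
      elsif Head < 16#FE# then
         Trailers := 5;
         Code     := Unsigned_32 (Head and 16#01#);
      else
         return;  --  11111110 and 11111111 are never used
      end if;

      if Octets'Length /= Trailers + 1 then
         return;
      end if;

      for I in Octets'First + 1 .. Octets'Last loop
         if (Octets (I) and 16#C0#) /= 16#80# then
            return;  --  not of the form 10xxxxxx
         end if;
         Code :=
           Shift_Left (Code, 6) or Unsigned_32 (Octets (I) and 16#3F#);
      end loop;

      --  The shortest-form rule: a value below the minimum for this
      --  length could have been written with fewer octets.
      Valid := Code >= Minimum (Trailers);
   end Decode;

   Code  : Unsigned_32;
   Valid : Boolean;

begin
   Decode ((1 => 16#41#), Code, Valid);      --  'A' as a singleton
   Ada.Text_IO.Put_Line (Boolean'Image (Valid));   --  TRUE
   Decode ((16#C1#, 16#81#), Code, Valid);   --  'A' in two octets
   Ada.Text_IO.Put_Line (Boolean'Image (Valid));   --  FALSE
end UTF8_Strict_Decode_Demo;
```

A filter that decodes with such a check sees only one spelling of each
character, so a two-octet 'A' cannot be used to disguise a command.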

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002  5:46 PM

This is the easy part of my homework. The identifier character ranges
are defined in terms of multiple character categories (see below), so I
can't get the harder part without a little coding.

This is using Unicode version 3.2.

A "space" is itself a normative category.  It is anything in the range
U+2000 to U+200B, plus 5 other scattered characters.

A "separator" is any space plus the two characters "Line Separator"
U+2028 and "Paragraph Separator" U+2029. These are each in a normative
category containing just 1 value.

A "decimal digit" is itself a normative category. There are 25 ranges of
these, 23 including the digits 0 through 9 and 2 with only the digits 1
through 9. (These two scripts use the ASCII zero rather than encoding a
separate one.) Five of these ranges are above U+FFFF, that is, out of
the BMP (their character descriptions all start with "mathematical").
The digits 1 through 9 in these scripts don't in general look much like
our 1 through 9.

The rules for identifiers say (I'm condensing and interpreting) that the
syntax for identifiers should start with their basic definition and
fiddle it as appropriate to include extra characters (for Ada, that
means underscore). Their basic definition is

   identifier ::= id-start { id-start | id-extend }

id-start is any letter (which come in 5 subcategories) or a "letter
number." There are a lot of letters outside the BMP, including the large
range "CJK Ideograph Extension B."

id-extend is decimal digits plus nonspacing marks, spacing combining
marks, connector punctuation, and formatting codes.
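
The basic rule can be sketched in Ada as follows.  This is only an
illustration: Is_Id_Start and Is_Id_Extend are hypothetical names, and
they are restricted to Latin-1 (via Ada.Characters.Handling) as
stand-ins for the real Unicode categories, with underscore as the only
connector punctuation.

```ada
with Ada.Characters.Handling; use Ada.Characters.Handling;
with Ada.Text_IO;

procedure Identifier_Syntax_Demo is

   --  Stand-in for "any letter or letter number" (Latin-1 only here).
   function Is_Id_Start (C : Character) return Boolean is
   begin
      return Is_Letter (C);
   end Is_Id_Start;

   --  Stand-in for decimal digits and connector punctuation; the
   --  combining marks and formatting codes are not modelled.
   function Is_Id_Extend (C : Character) return Boolean is
   begin
      return Is_Digit (C) or else C = '_';
   end Is_Id_Extend;

   --  identifier ::= id-start { id-start | id-extend }
   function Is_Identifier (S : String) return Boolean is
   begin
      if S'Length = 0 or else not Is_Id_Start (S (S'First)) then
         return False;
      end if;
      for I in S'First + 1 .. S'Last loop
         if not (Is_Id_Start (S (I)) or else Is_Id_Extend (S (I))) then
            return False;
         end if;
      end loop;
      return True;
   end Is_Identifier;

begin
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_Identifier ("Wide_Wide_String")));  --  TRUE
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_Identifier ("2nd_Try")));           --  FALSE
end Identifier_Syntax_Demo;
```

Note that the basic definition accepts doubled or trailing underscores;
Ada's own rules would have to be layered on top of it.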

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  7:19 PM

I am completely confused. Why exactly are we discussing this? Can you
be clear as to the goals of this discussion?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  2:50 PM

I know I don't count, :-)
but I've had several requests to extend my spam filter to support UTF-8
encodings. Because I'm not asking for any money for the filter, and I
haven't had any significant amount of UTF-8 mail, I haven't done anything
about it yet. But it seems likely that I will need to do this at some point
(I've seen occasional UTF-8 encoded mail, but not enough good mail that
handling it manually is a problem.)

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  4:29 PM

Oh sure, UTF-8 encoded spam is common indeed, but that was not what I was
talking about (unless you have some spam messages written in Ada source code :-)

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  4:59 PM

I think you misunderstand. I have written an anti-spam plugin for the IMS
mailserver that I use. It is written in Ada, of course, and I've had
requests for it to be able to handle UTF-8 encoded mail. For me, it's fine
to treat such mail as all spam, but that is not true for some of the other
users of it. (I've made it available to the community of IMS users, as they
have made many useful plugins available that I have been using for years.)

In order to properly support UTF-8 mail, I'd need at least to convert the
search patterns (in Latin-1, of course) into UTF-8. I'd also need to verify
that the rules that Mike noted are followed (a common trick of spammers is
to violate basic encoding rules, as most decoders don't check. But the
illegal encodings tend to get ignored by filters, because they don't match
exactly. That was one of the prime reasons I wrote the plugin in the first
place, because a lot of spam is now coming encoded in one way or another,
and thus is not picked up by a plain text scan).

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  7:17 PM

Oh! I was confused then, I thought this was something to do with Ada.

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  7:46 PM

Of course it has to do with Ada. You asked "Is anyone using UTF-8 encoding
with Ada." And I answered that I have an Ada program that needs to process
UTF-8 text (but doesn't yet). And I tried to explain what the program is and
why it needs to process UTF-8 text and what support from Ada would be
valuable.

Perhaps I should have just answered your original question "Yes"? :-)

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  8:09 PM

Sorry, when I meant "using UTF-8 encoding with Ada", I was talking about
language features for wide character representation.

The fact that your program is in Ada does not seem to be particularly
informative. I am completely confused here, what ARG-related language
problem is this thread addressing?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  8:32 PM

As I recall, one of the facets of UTF-8 support in Ada would be the
equivalent of Ada.Characters.Handling for UTF-8 represented Strings. Those
operations would be valuable for this application, particularly
To_Wide_String (UTF_8_String) or To_UTF_8_String (String). A UTF-8 Text_IO
would also be valuable, although I'd find that overkill for this application
(usually the text has to be decoded to UTF-8 from some 7-bit representation
anyway).

I'm not sure where else UTF-8 would appear in the standard. Source
representation and external file representations are outside of the scope of
the standard. The regular string operations seem to work for most (all?)
operations. Everything else seems to already be covered by the existing wide
character support.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  8:45 PM

Well, harmless I suppose, but I doubt worth the effort. Again, I would
generate packages on the basis of packages that exist, have proved useful
and are actually widely used. It seems a mistake to get into the "here's
a neat idea for a package that would help with something I happen to be
doing".

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002  5:46 PM

>  I am completely confused here, what ARG-related language
>problem is this thread addressing?

Kiyoshi Ishihata stated at the last meeting that there was an interest
in some countries in being able to write programs as much as possible in
native languages, the primary deficit in this regard being that
identifiers are entirely in Latin-1 characters. He didn't specify which
countries to my recollection, but Japan, Russia, China, and India are
obvious cases where the commonly used scripts are disjoint from Latin-1.

The information being supplied is exploratory in nature: the idea is to
find out how hard it would be to extend existing compilers so as to
satisfy all the national groups at once, and whether and to what extent
the ARG should be involved in specifying standards for such extensions.

There was a separate issue involving the fact that ISO 10646-n (I forget
what n is) now has mapped characters outside the BMP. This had to
happen, given that the code now maps some 70,000 Han characters.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  8:54 PM

Well I would just allow arbitrary wide characters in identifiers, why not,
it does not cause any problems. GNAT has implemented an option for this
for ever. I would specify that there is no upper/lower case equivalence
in this case, since otherwise you get into a huge mess that is simply not
worth the effort.

*************************************************************

From: Tucker Taft
Sent: Thursday, October 24, 2002  10:10 PM

I suggest you read the ARG minutes when they are available.  Kiyoshi
indicated specifically that they wanted to restrict usage to
characters that "make sense" as identifier characters.  I will admit
I was in your camp that the simplest is to just allow anything.
However, I will leave it to Kiyoshi to explain his reasoning.
He certainly knows more than I do about the requirements.  You
should perhaps discuss it directly with Kiyoshi if you don't agree.

Mike indicated that UTF-8 encoding makes it easy to support even
very wide characters in identifiers, because it provides a canonical
representation, as a stream of bytes.  We asked him to share his
knowledge in this area, so we didn't all have to become experts in
ISO-10646 to evaluate the implementation issues in this area.

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  10:29 PM

Here are my notes on the Wide_Character in identifiers issue, which will be
turned into the minutes.

"What about full source representation of the language in Wide_Character?
Kiyoshi reports that there is a push in SC22 to allow full wide characters
in identifiers.

How do you define which characters are letters? How do you define case
equivalence? Mike says just use "letter" in the character standard. But this
is likely to be very complex in the compiler and in the run-time. Tucker
suggests that anything outside row 00 be treated as a letter. Kiyoshi says that
would not be acceptable to Japan, which is preparing a standard for which
characters are allowed in identifiers."

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  4:11 AM

> I suggest you read the ARG minutes when they are available.  Kiyoshi
> indicated specifically that they wanted to restrict usage to
> characters that "make sense" as identifier characters.  I will admit
> I was in your camp that the simplest is to just allow anything.
> However, I will leave it to Kiyoshi to explain his reasoning.
> He certainly knows more than I do about the requirements.  You
> should perhaps discuss it directly with Kiyoshi if you don't agree.

I would leave such restrictions up to either local coding standards,
enforced e.g. by ASIS tools, or enforced by compiler restrictions.
Getting into what makes sense in different languages is way way out
of scope (I speak as the former chair of the CRG; character issues
are very difficult to deal with). In the context of the CRG work, we
spent ages discussing the issue of whether E and E-acute should be
equivalent in identifiers, and came to the conclusion that the answer
might be different in different languages.

There is no point in adding a huge national dependent mess here. Indeed I
would consider in the ISO standard saying specifically that national bodies
are welcome to devise local sub-standards for identifiers and character
set requirements and leave it at that.

I perfectly well understand where Kiyoshi is coming from. I am sure he feels
as strongly that only certain characters be used as Jean Ichbiah felt about
the E/E-acute issue. But it just is not practical for the international
standard to get into the business of deciding what are and what are not
useful identifier names in all the languages of the world, or even just
for the P members :-)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  4:16 AM

OK, so great, very appropriate, there can be a Japanese National standard
that specifies that for Ada compilers to meet this standard, there must be
a mode in which identifiers are only allowed to contain bla bla characters.
Other countries in the world are free to devise similar national standards
but I fail to see why they should be a matter for an international standard.

What would be marginally useful in the international standard would be to
devise a general framework for those national standards, and make it clear
that it is an acceptable thing for Ada compilers to implement one or more
of these standards. Frankly I think that the standard already does that,
but it would be fine to make it explicit. GNAT for example allows lots
of localization of identifier characters sets, e.g. Latin-2, Cyrillic etc.

*************************************************************

From: Pascal Leroy
Sent: Friday, October 25, 2002  6:54 AM

> But it just is not practical for the international
> standard to get into the business of deciding what are and what are not
> useful identifier names in all the languages of the world...

It has certainly never been the intent to have the ARG discuss the
identifier characters for all the languages in the world.  However, there is
an ISO working group in charge of developing and maintaining the ISO 10646
standard, and the intent was to piggyback on the work done there.

10646 defines precisely what is a character (and so yes, E and E-acute are
distinct, as are uppercase A and uppercase alpha, even though they really
look the same), what is a letter, a digit, how the uppercase/lowercase
conversions work, etc.  I see no reason why the Ada standard couldn't use
these definitions.  (And Mike gave us a feeling of what this would look
like, and it doesn't seem unreasonably complicated to me.)

Note that Java does exactly that, and defines letters and digits in a way
which is derived from Unicode (itself a close approximation to 10646).  I
don't see why Ada would lag behind in this area: it would not be a big
implementation effort, and it would improve usability of the language.

I don't buy the notion that national bodies have a role to play here (except
of course that they probably want to influence 10646).  It's already hard to
define one language standard and ensure that it's implemented with a minimum
of consistency, I don't see how users or implementers could live with the
coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".

Pascal

PS: Note that the E vs. E-acute discussion is moot, since this is already
settled by Latin-1 and yes, they are different.

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  7:55 PM

> I don't buy the notion that national bodies have a role to play here (except
> of course that they probably want to influence 10646).  It's already hard to
> define one language standard and ensure that it's implemented with a minimum
> of consistency, I don't see how users or implementers could live with the
> coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".

Well GNAT implements lots of different localized character sets, and no one
seems to have dropped dead :-)

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002  9:13 AM

> Kiyoshi Ishihata stated at the last meeting that there was an interest
> in some countries in being able to write programs as much as possible in
> native languages, the primary deficit in this regard being that
> identifiers are entirely in Latin-1 characters.

Yes, but it was also mentioned at the meeting that SC22 is trying to get
programming languages to do something-or-other related to this.
I.e. allow 31-bit characters in identifiers, and have some uniformity
across programming languages about which characters are allowed in
identifiers.  I suppose WG9 is supposed to "obey" SC22 on this point?

By the way, let's mention the AI number being discussed in these
messages, so we don't get the "What the heck are you talking about?"
kinds of messages from Robert or others who might have missed part of
the discussion.  ;-)  I believe Pascal raised the issue many months ago,
and it has an AI number, and one can presumably search for that AI
number in the meeting minutes (once Randy publishes them).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  8:32 PM

I tried, I could not find the AI number on this one

Of course if there are uniform rules at the SC22 level, then it is fine
to adopt them in Ada. I just think it is not something we should expend
our own very limited resources on.

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002  8:59 PM

This was discussed as part of AI-285, which started life as an AI about
Latin-9. That discussion took up the entire afternoon of the third day of
the meeting.

These other issues came up since it was felt that better Wide_Character
support would (might?) make it unnecessary for the standard to directly deal
with Latin-9. (Implementations still would have to, in all likelihood.)

There are a lot of notes in this area, and I haven't gotten that far in the
minutes yet. So my summary might be suspect... (And I haven't posted the
mail yet, either, but it's likely that it will all go on AI-285.)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  9:12 PM

> This was discussed as part of AI-285, which started life as an AI about
> Latin-9. That discussion took up the entire afternoon of the third day of
> the meeting.

Be careful not to be eaten alive by character discussions. It was quite
intentional that we banned discussion of these issues from the main group
in the Ada 9X effort and shoveled them off to the CRG. Spending one of
six sessions on this issue alone to me says that things are already getting
out of control :-) I quite understand how this happens (remember I was chair
of the CRG!)

> > These other issues came up since it was felt that better Wide_Character
> > support would (might?) make it unnecessary for the standard to directly deal
> > with Latin-9. (Implementations still would have to, in all likelihood.)

Well of course in practice Latin-9 is barely interesting, it just introduces
a different name for the Euro character. But for sure most computing with
Ada will be done using latin-9 whatever the Ada standard says :-)

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002  10:14 PM

Well, it sounds worse than it is. The afternoon session of the last day is
typically short. We didn't get back from lunch until about 2:15, and we
adjourned at 3:28. Still, I probably would have dozed off during this
discussion if I hadn't been taking notes...

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002  9:19 AM

I agree that the ARG should not spend time thinking about characters.
And we should not add all kinds of verbiage about character sets to the
RM.  But if there is a character-set standard that can be simply
referred to, why not.  Apparently, there *is* a definition of which
31-bit characters are "letters".  I thought the intent was to simply
refer to that definition (which of course changes from year to year).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  8:45 PM

Probably that's reasonable, although I worry that this will generate a lot
of busy work in implementations for extraordinarily little gain.

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002  9:58 AM

Yes.  The purpose of Mike Yoder's "homework assignment" was to determine
how difficult it is to write the "Is_Letter" function that the Ada lexer
would need.  And a case conversion routine, I guess.  And how
inefficient these would have to be.  (People at the meeting were
concerned about huge character-set tables having to be in the compiler.)

I'm not at all interested in these character set issues.  If folks can
make an AI that is trivial to implement (efficiently), and invokes all
character-set junk by reference to other standards, then I suppose it's
OK with me.

[ Insert my usual rant about what's important, here.  ;-) ]

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002  10:14 AM

I agree with Bob in all respects, including the parenthetical comment

*************************************************************

From: Pascal Leroy
Sent: Wednesday, November 27, 2002  4:27 AM

During the last meeting we discussed the possibility of allowing any Unicode
character (er, I mean, ISO 10646) in Ada source.  Some people were concerned
that the classification tables and the uppercase translation tables would be
huge and complex to produce.

Mike Y provided some input on this topic a while back, but since I (and
probably other people) prefer to see the real tables, I spent a couple of hours
writing a little Ada program to parse the Unicode database and spit out
aggregates for these tables.  I am attaching to this message three
classification tables (letters, digits, and spaces) as well as the table that
converts to uppercase.

The latter is the largest one, and it only has 419 entries, for a total of 5028
bytes.  And that's with a representation that is not particularly compact: a
more space-efficient representation could be obtained for instance by storing
the ranges as (First, Length) instead of (First, Last).

The tables would change slightly depending on the rules that we choose (e.g.
for the syntax of identifiers) but their size would not be substantially
modified.

This demonstrates two things:

1 - The tables are easy to produce from the Unicode database.
2 - The tables are small.
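
To show how cheap a lookup in such a table can be, here is a minimal
sketch of a binary search over a sorted table of (First, Last) pairs.
The declarations of Code_Point, Code_Range and Ranges are assumptions
made for this example, as is the function Is_In; the table is named
Some_Digits because "digits" is a reserved word in Ada, and it holds
only a few entries copied from the digits table.

```ada
with Ada.Text_IO;

procedure Range_Lookup_Demo is

   --  Assumed declarations; only the aggregates are attached.
   type Code_Point is range 0 .. 16#7FFF_FFFF#;

   type Code_Range is record
      First, Last : Code_Point;
   end record;

   type Ranges is array (Positive range <>) of Code_Range;

   --  A few entries from the digits table, in increasing order.
   Some_Digits : constant Ranges :=
     ((16#30#, 16#39#),        --  DIGIT ZERO .. DIGIT NINE
      (16#660#, 16#669#),      --  ARABIC-INDIC DIGIT ZERO .. NINE
      (16#FF10#, 16#FF19#),    --  FULLWIDTH DIGIT ZERO .. NINE
      (16#1D7CE#, 16#1D7FF#)); --  MATHEMATICAL digits

   --  Binary search; relies on the ranges being sorted and disjoint.
   function Is_In (Table : Ranges; C : Code_Point) return Boolean is
      Low  : Integer := Table'First;
      High : Integer := Table'Last;
      Mid  : Integer;
   begin
      while Low <= High loop
         Mid := Low + (High - Low) / 2;
         if C < Table (Mid).First then
            High := Mid - 1;
         elsif C > Table (Mid).Last then
            Low := Mid + 1;
         else
            return True;
         end if;
      end loop;
      return False;
   end Is_In;

begin
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Some_Digits, 16#35#)));     --  TRUE
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Some_Digits, 16#41#)));     --  FALSE
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Some_Digits, 16#1D7FF#)));  --  TRUE
end Range_Lookup_Demo;
```

With 419 entries, a search of this kind needs at most nine probes per
character, and the lexer can skip it entirely for ASCII.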

---

Digits : constant Ranges :=
   (
    (16#30#, 16#39#), -- DIGIT ZERO .. DIGIT NINE
    (16#B2#, 16#B3#), -- SUPERSCRIPT TWO .. SUPERSCRIPT THREE
    (16#B9#, 16#B9#), -- SUPERSCRIPT ONE .. SUPERSCRIPT ONE
    (16#660#, 16#669#), -- ARABIC-INDIC DIGIT ZERO .. ARABIC-INDIC DIGIT NINE
    (16#6F0#, 16#6F9#), -- EXTENDED ARABIC-INDIC DIGIT ZERO .. EXTENDED ARABIC-INDIC DIGIT NINE
    (16#966#, 16#96F#), -- DEVANAGARI DIGIT ZERO .. DEVANAGARI DIGIT NINE
    (16#9E6#, 16#9EF#), -- BENGALI DIGIT ZERO .. BENGALI DIGIT NINE
    (16#A66#, 16#A6F#), -- GURMUKHI DIGIT ZERO .. GURMUKHI DIGIT NINE
    (16#AE6#, 16#AEF#), -- GUJARATI DIGIT ZERO .. GUJARATI DIGIT NINE
    (16#B66#, 16#B6F#), -- ORIYA DIGIT ZERO .. ORIYA DIGIT NINE
    (16#BE7#, 16#BEF#), -- TAMIL DIGIT ONE .. TAMIL DIGIT NINE
    (16#C66#, 16#C6F#), -- TELUGU DIGIT ZERO .. TELUGU DIGIT NINE
    (16#CE6#, 16#CEF#), -- KANNADA DIGIT ZERO .. KANNADA DIGIT NINE
    (16#D66#, 16#D6F#), -- MALAYALAM DIGIT ZERO .. MALAYALAM DIGIT NINE
    (16#E50#, 16#E59#), -- THAI DIGIT ZERO .. THAI DIGIT NINE
    (16#ED0#, 16#ED9#), -- LAO DIGIT ZERO .. LAO DIGIT NINE
    (16#F20#, 16#F29#), -- TIBETAN DIGIT ZERO .. TIBETAN DIGIT NINE
    (16#1040#, 16#1049#), -- MYANMAR DIGIT ZERO .. MYANMAR DIGIT NINE
    (16#1369#, 16#1371#), -- ETHIOPIC DIGIT ONE .. ETHIOPIC DIGIT NINE
    (16#17E0#, 16#17E9#), -- KHMER DIGIT ZERO .. KHMER DIGIT NINE
    (16#1810#, 16#1819#), -- MONGOLIAN DIGIT ZERO .. MONGOLIAN DIGIT NINE
    (16#2070#, 16#2070#), -- SUPERSCRIPT ZERO .. SUPERSCRIPT ZERO
    (16#2074#, 16#2079#), -- SUPERSCRIPT FOUR .. SUPERSCRIPT NINE
    (16#2080#, 16#2089#), -- SUBSCRIPT ZERO .. SUBSCRIPT NINE
    (16#FF10#, 16#FF19#), -- FULLWIDTH DIGIT ZERO .. FULLWIDTH DIGIT NINE
    (16#1D7CE#, 16#1D7FF#) -- MATHEMATICAL BOLD DIGIT ZERO .. MATHEMATICAL MONOSPACE DIGIT NINE
   );

---

Letters : constant Ranges :=
   (
    (16#41#, 16#5A#), -- LATIN CAPITAL LETTER A .. LATIN CAPITAL LETTER Z
    (16#61#, 16#7A#), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
    (16#AA#, 16#AA#), -- FEMININE ORDINAL INDICATOR .. FEMININE ORDINAL INDICATOR
    (16#B5#, 16#B5#), -- MICRO SIGN .. MICRO SIGN
    (16#BA#, 16#BA#), -- MASCULINE ORDINAL INDICATOR .. MASCULINE ORDINAL INDICATOR
    (16#C0#, 16#D6#), -- LATIN CAPITAL LETTER A WITH GRAVE .. LATIN CAPITAL LETTER O WITH DIAERESIS
    (16#D8#, 16#F6#), -- LATIN CAPITAL LETTER O WITH STROKE .. LATIN SMALL LETTER O WITH DIAERESIS
    (16#F8#, 16#2B8#), -- LATIN SMALL LETTER O WITH STROKE .. MODIFIER LETTER SMALL Y
    (16#2BB#, 16#2C1#), -- MODIFIER LETTER TURNED COMMA .. MODIFIER LETTER REVERSED GLOTTAL STOP
    (16#2D0#, 16#2D1#), -- MODIFIER LETTER TRIANGULAR COLON .. MODIFIER LETTER HALF TRIANGULAR COLON
    (16#2E0#, 16#2E4#), -- MODIFIER LETTER SMALL GAMMA .. MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
    (16#2EE#, 16#2EE#), -- MODIFIER LETTER DOUBLE APOSTROPHE .. MODIFIER LETTER DOUBLE APOSTROPHE
    (16#37A#, 16#37A#), -- GREEK YPOGEGRAMMENI .. GREEK YPOGEGRAMMENI
    (16#386#, 16#386#), -- GREEK CAPITAL LETTER ALPHA WITH TONOS .. GREEK CAPITAL LETTER ALPHA WITH TONOS
    (16#388#, 16#3F5#), -- GREEK CAPITAL LETTER EPSILON WITH TONOS .. GREEK LUNATE EPSILON SYMBOL
    (16#400#, 16#481#), -- CYRILLIC CAPITAL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER KOPPA
    (16#48A#, 16#559#), -- CYRILLIC CAPITAL LETTER SHORT I WITH TAIL .. ARMENIAN MODIFIER LETTER LEFT HALF RING
    (16#561#, 16#587#), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LIGATURE ECH YIWN
    (16#5D0#, 16#5F2#), -- HEBREW LETTER ALEF .. HEBREW LIGATURE YIDDISH DOUBLE YOD
    (16#621#, 16#64A#), -- ARABIC LETTER HAMZA .. ARABIC LETTER YEH
    (16#66E#, 16#66F#), -- ARABIC LETTER DOTLESS BEH .. ARABIC LETTER DOTLESS QAF
    (16#671#, 16#6D3#), -- ARABIC LETTER ALEF WASLA .. ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
    (16#6D5#, 16#6D5#), -- ARABIC LETTER AE .. ARABIC LETTER AE
    (16#6E5#, 16#6E6#), -- ARABIC SMALL WAW .. ARABIC SMALL YEH
    (16#6FA#, 16#6FC#), -- ARABIC LETTER SHEEN WITH DOT BELOW .. ARABIC LETTER GHAIN WITH DOT BELOW
    (16#710#, 16#710#), -- SYRIAC LETTER ALAPH .. SYRIAC LETTER ALAPH
    (16#712#, 16#72C#), -- SYRIAC LETTER BETH .. SYRIAC LETTER TAW
    (16#780#, 16#7A5#), -- THAANA LETTER HAA .. THAANA LETTER WAAVU
    (16#7B1#, 16#7B1#), -- THAANA LETTER NAA .. THAANA LETTER NAA
    (16#905#, 16#939#), -- DEVANAGARI LETTER A .. DEVANAGARI LETTER HA
    (16#93D#, 16#93D#), -- DEVANAGARI SIGN AVAGRAHA .. DEVANAGARI SIGN AVAGRAHA
    (16#950#, 16#950#), -- DEVANAGARI OM .. DEVANAGARI OM
    (16#958#, 16#961#), -- DEVANAGARI LETTER QA .. DEVANAGARI LETTER VOCALIC LL
    (16#985#, 16#9B9#), -- BENGALI LETTER A .. BENGALI LETTER HA
    (16#9DC#, 16#9E1#), -- BENGALI LETTER RRA .. BENGALI LETTER VOCALIC LL
    (16#9F0#, 16#9F1#), -- BENGALI LETTER RA WITH MIDDLE DIAGONAL .. BENGALI LETTER RA WITH LOWER DIAGONAL
    (16#A05#, 16#A39#), -- GURMUKHI LETTER A .. GURMUKHI LETTER HA
    (16#A59#, 16#A5E#), -- GURMUKHI LETTER KHHA .. GURMUKHI LETTER FA
    (16#A72#, 16#A74#), -- GURMUKHI IRI .. GURMUKHI EK ONKAR
    (16#A85#, 16#AB9#), -- GUJARATI LETTER A .. GUJARATI LETTER HA
    (16#ABD#, 16#ABD#), -- GUJARATI SIGN AVAGRAHA .. GUJARATI SIGN AVAGRAHA
    (16#AD0#, 16#AE0#), -- GUJARATI OM .. GUJARATI LETTER VOCALIC RR
    (16#B05#, 16#B39#), -- ORIYA LETTER A .. ORIYA LETTER HA
    (16#B3D#, 16#B3D#), -- ORIYA SIGN AVAGRAHA .. ORIYA SIGN AVAGRAHA
    (16#B5C#, 16#B61#), -- ORIYA LETTER RRA .. ORIYA LETTER VOCALIC LL
    (16#B83#, 16#BB9#), -- TAMIL SIGN VISARGA .. TAMIL LETTER HA
    (16#C05#, 16#C39#), -- TELUGU LETTER A .. TELUGU LETTER HA
    (16#C60#, 16#C61#), -- TELUGU LETTER VOCALIC RR .. TELUGU LETTER VOCALIC LL
    (16#C85#, 16#CB9#), -- KANNADA LETTER A .. KANNADA LETTER HA
    (16#CDE#, 16#CE1#), -- KANNADA LETTER FA .. KANNADA LETTER VOCALIC LL
    (16#D05#, 16#D39#), -- MALAYALAM LETTER A .. MALAYALAM LETTER HA
    (16#D60#, 16#D61#), -- MALAYALAM LETTER VOCALIC RR .. MALAYALAM LETTER VOCALIC LL
    (16#D85#, 16#DC6#), -- SINHALA LETTER AYANNA .. SINHALA LETTER FAYANNA
    (16#E01#, 16#E30#), -- THAI CHARACTER KO KAI .. THAI CHARACTER SARA A
    (16#E32#, 16#E33#), -- THAI CHARACTER SARA AA .. THAI CHARACTER SARA AM
    (16#E40#, 16#E46#), -- THAI CHARACTER SARA E .. THAI CHARACTER MAIYAMOK
    (16#E81#, 16#EB0#), -- LAO LETTER KO .. LAO VOWEL SIGN A
    (16#EB2#, 16#EB3#), -- LAO VOWEL SIGN AA .. LAO VOWEL SIGN AM
    (16#EBD#, 16#EC6#), -- LAO SEMIVOWEL SIGN NYO .. LAO KO LA
    (16#EDC#, 16#F00#), -- LAO HO NO .. TIBETAN SYLLABLE OM
    (16#F40#, 16#F6A#), -- TIBETAN LETTER KA .. TIBETAN LETTER FIXED-FORM RA
    (16#F88#, 16#F8B#), -- TIBETAN SIGN LCE TSA CAN .. TIBETAN SIGN GRU MED RGYINGS
    (16#1000#, 16#102A#), -- MYANMAR LETTER KA .. MYANMAR LETTER AU
    (16#1050#, 16#1055#), -- MYANMAR LETTER SHA .. MYANMAR LETTER VOCALIC LL
    (16#10A0#, 16#10F8#), -- GEORGIAN CAPITAL LETTER AN .. GEORGIAN LETTER ELIFI
    (16#1100#, 16#135A#), -- HANGUL CHOSEONG KIYEOK .. ETHIOPIC SYLLABLE FYA
    (16#13A0#, 16#166C#), -- CHEROKEE LETTER A .. CANADIAN SYLLABICS CARRIER TTSA
    (16#166F#, 16#1676#), -- CANADIAN SYLLABICS QAI .. CANADIAN SYLLABICS NNGAA
    (16#1681#, 16#169A#), -- OGHAM LETTER BEITH .. OGHAM LETTER PEITH
    (16#16A0#, 16#16EA#), -- RUNIC LETTER FEHU FEOH FE F .. RUNIC LETTER X
    (16#1700#, 16#1711#), -- TAGALOG LETTER A .. TAGALOG LETTER HA
    (16#1720#, 16#1731#), -- HANUNOO LETTER A .. HANUNOO LETTER HA
    (16#1740#, 16#1751#), -- BUHID LETTER A .. BUHID LETTER HA
    (16#1760#, 16#1770#), -- TAGBANWA LETTER A .. TAGBANWA LETTER SA
    (16#1780#, 16#17B3#), -- KHMER LETTER KA .. KHMER INDEPENDENT VOWEL QAU
    (16#17D7#, 16#17D7#), -- KHMER SIGN LEK TOO .. KHMER SIGN LEK TOO
    (16#17DC#, 16#17DC#), -- KHMER SIGN AVAKRAHASANYA .. KHMER SIGN AVAKRAHASANYA
    (16#1820#, 16#18A8#), -- MONGOLIAN LETTER A .. MONGOLIAN LETTER MANCHU ALI GALI BHA
    (16#1E00#, 16#1FBC#), -- LATIN CAPITAL LETTER A WITH RING BELOW .. GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
    (16#1FBE#, 16#1FBE#), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
    (16#1FC2#, 16#1FCC#), -- GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
    (16#1FD0#, 16#1FDB#), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK CAPITAL LETTER IOTA WITH OXIA
    (16#1FE0#, 16#1FEC#), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK CAPITAL LETTER RHO WITH DASIA
    (16#1FF2#, 16#1FFC#), -- GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
    (16#2071#, 16#2071#), -- SUPERSCRIPT LATIN SMALL LETTER I .. SUPERSCRIPT LATIN SMALL LETTER I
    (16#207F#, 16#207F#), -- SUPERSCRIPT LATIN SMALL LETTER N .. SUPERSCRIPT LATIN SMALL LETTER N
    (16#2102#, 16#2102#), -- DOUBLE-STRUCK CAPITAL C .. DOUBLE-STRUCK CAPITAL C
    (16#2107#, 16#2107#), -- EULER CONSTANT .. EULER CONSTANT
    (16#210A#, 16#2113#), -- SCRIPT SMALL G .. SCRIPT SMALL L
    (16#2115#, 16#2115#), -- DOUBLE-STRUCK CAPITAL N .. DOUBLE-STRUCK CAPITAL N
    (16#2119#, 16#211D#), -- DOUBLE-STRUCK CAPITAL P .. DOUBLE-STRUCK CAPITAL R
    (16#2124#, 16#2124#), -- DOUBLE-STRUCK CAPITAL Z .. DOUBLE-STRUCK CAPITAL Z
    (16#2126#, 16#2126#), -- OHM SIGN .. OHM SIGN
    (16#2128#, 16#2128#), -- BLACK-LETTER CAPITAL Z .. BLACK-LETTER CAPITAL Z
    (16#212A#, 16#212D#), -- KELVIN SIGN .. BLACK-LETTER CAPITAL C
    (16#212F#, 16#2131#), -- SCRIPT SMALL E .. SCRIPT CAPITAL F
    (16#2133#, 16#2139#), -- SCRIPT CAPITAL M .. INFORMATION SOURCE
    (16#213D#, 16#213F#), -- DOUBLE-STRUCK SMALL GAMMA .. DOUBLE-STRUCK CAPITAL PI
    (16#2145#, 16#2149#), -- DOUBLE-STRUCK ITALIC CAPITAL D .. DOUBLE-STRUCK ITALIC SMALL J
    (16#3005#, 16#3006#), -- IDEOGRAPHIC ITERATION MARK .. IDEOGRAPHIC CLOSING MARK
    (16#3031#, 16#3035#), -- VERTICAL KANA REPEAT MARK .. VERTICAL KANA REPEAT MARK LOWER HALF
    (16#303B#, 16#303C#), -- VERTICAL IDEOGRAPHIC ITERATION MARK .. MASU MARK
    (16#3041#, 16#3096#), -- HIRAGANA LETTER SMALL A .. HIRAGANA LETTER SMALL KE
    (16#309D#, 16#309F#), -- HIRAGANA ITERATION MARK .. HIRAGANA DIGRAPH YORI
    (16#30A1#, 16#30FA#), -- KATAKANA LETTER SMALL A .. KATAKANA LETTER VO
    (16#30FC#, 16#318E#), -- KATAKANA-HIRAGANA PROLONGED SOUND MARK .. HANGUL LETTER ARAEAE
    (16#31A0#, 16#31FF#), -- BOPOMOFO LETTER BU .. KATAKANA LETTER SMALL RO
    (16#3400#, 16#A48C#), -- <CJK Ideograph Extension A, First> .. YI SYLLABLE YYR
    (16#AC00#, 16#D7A3#), -- <Hangul Syllable, First> .. <Hangul Syllable, Last>
    (16#F900#, 16#FB1D#), -- CJK COMPATIBILITY IDEOGRAPH-F900 .. HEBREW LETTER YOD WITH HIRIQ
    (16#FB1F#, 16#FB28#), -- HEBREW LIGATURE YIDDISH YOD YOD PATAH .. HEBREW LETTER WIDE TAV
    (16#FB2A#, 16#FD3D#), -- HEBREW LETTER SHIN WITH SHIN DOT .. ARABIC LIGATURE ALEF WITH FATHATAN ISOLATED FORM
    (16#FD50#, 16#FDFB#), -- ARABIC LIGATURE TEH WITH JEEM WITH MEEM INITIAL FORM .. ARABIC LIGATURE JALLAJALALOUHOU
    (16#FE70#, 16#FEFC#), -- ARABIC FATHATAN ISOLATED FORM .. ARABIC LIGATURE LAM WITH ALEF FINAL FORM
    (16#FF21#, 16#FF3A#), -- FULLWIDTH LATIN CAPITAL LETTER A .. FULLWIDTH LATIN CAPITAL LETTER Z
    (16#FF41#, 16#FF5A#), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z
    (16#FF66#, 16#FFDC#), -- HALFWIDTH KATAKANA LETTER WO .. HALFWIDTH HANGUL LETTER I
    (16#10300#, 16#1031E#), -- OLD ITALIC LETTER A .. OLD ITALIC LETTER UU
    (16#10330#, 16#10349#), -- GOTHIC LETTER AHSA .. GOTHIC LETTER OTHAL
    (16#10400#, 16#1044D#), -- DESERET CAPITAL LETTER LONG I .. DESERET SMALL LETTER ENG
    (16#1D400#, 16#1D6C0#), -- MATHEMATICAL BOLD CAPITAL A .. MATHEMATICAL BOLD CAPITAL OMEGA
    (16#1D6C2#, 16#1D6DA#), -- MATHEMATICAL BOLD SMALL ALPHA .. MATHEMATICAL BOLD SMALL OMEGA
    (16#1D6DC#, 16#1D6FA#), -- MATHEMATICAL BOLD EPSILON SYMBOL .. MATHEMATICAL ITALIC CAPITAL OMEGA
    (16#1D6FC#, 16#1D714#), -- MATHEMATICAL ITALIC SMALL ALPHA .. MATHEMATICAL ITALIC SMALL OMEGA
    (16#1D716#, 16#1D734#), -- MATHEMATICAL ITALIC EPSILON SYMBOL .. MATHEMATICAL BOLD ITALIC CAPITAL OMEGA
    (16#1D736#, 16#1D74E#), -- MATHEMATICAL BOLD ITALIC SMALL ALPHA .. MATHEMATICAL BOLD ITALIC SMALL OMEGA
    (16#1D750#, 16#1D76E#), -- MATHEMATICAL BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA
    (16#1D770#, 16#1D788#), -- MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD SMALL OMEGA
    (16#1D78A#, 16#1D7A8#), -- MATHEMATICAL SANS-SERIF BOLD EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA
    (16#1D7AA#, 16#1D7C2#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL OMEGA
    (16#1D7C4#, 16#1D7C9#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL
    (16#20000#, 16#2FA1D#) -- <CJK Ideograph Extension B, First> .. CJK COMPATIBILITY IDEOGRAPH-2FA1D
   );

---

Spaces : constant Ranges :=
   (
    (16#20#, 16#20#), -- SPACE .. SPACE
    (16#A0#, 16#A0#), -- NO-BREAK SPACE .. NO-BREAK SPACE
    (16#1680#, 16#1680#), -- OGHAM SPACE MARK .. OGHAM SPACE MARK
    (16#2000#, 16#200B#), -- EN QUAD .. ZERO WIDTH SPACE
    (16#202F#, 16#202F#), -- NARROW NO-BREAK SPACE .. NARROW NO-BREAK SPACE
    (16#205F#, 16#205F#), -- MEDIUM MATHEMATICAL SPACE .. MEDIUM MATHEMATICAL SPACE
    (16#3000#, 16#3000#) -- IDEOGRAPHIC SPACE .. IDEOGRAPHIC SPACE
   );
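
Because the ranges in each table are sorted and do not overlap, a
classification test can be a binary search.  The following is a minimal
sketch, not part of the generated tables: the declarations of Code_Point,
Code_Range and Ranges, and the function name Contains, are assumptions, since
this message only shows the aggregates.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Range_Lookup is

   --  Assumed declarations; only the aggregates appear in the message.
   type Code_Point is range 0 .. 16#10_FFFF#;

   type Code_Range is record
      First, Last : Code_Point;
   end record;

   type Ranges is array (Positive range <>) of Code_Range;

   Spaces : constant Ranges :=
     ((16#20#, 16#20#),      --  SPACE
      (16#A0#, 16#A0#),      --  NO-BREAK SPACE
      (16#1680#, 16#1680#),  --  OGHAM SPACE MARK
      (16#2000#, 16#200B#),  --  EN QUAD .. ZERO WIDTH SPACE
      (16#202F#, 16#202F#),  --  NARROW NO-BREAK SPACE
      (16#205F#, 16#205F#),  --  MEDIUM MATHEMATICAL SPACE
      (16#3000#, 16#3000#)); --  IDEOGRAPHIC SPACE

   --  Binary search over sorted, non-overlapping ranges.
   function Contains (Table : Ranges; C : Code_Point) return Boolean is
      Low  : Integer := Table'First;
      High : Integer := Table'Last;
      Mid  : Integer;
   begin
      while Low <= High loop
         Mid := Low + (High - Low) / 2;
         if C < Table (Mid).First then
            High := Mid - 1;
         elsif C > Table (Mid).Last then
            Low := Mid + 1;
         else
            return True;
         end if;
      end loop;
      return False;
   end Contains;

begin
   Put_Line (Boolean'Image (Contains (Spaces, 16#2003#)));  --  TRUE
   Put_Line (Boolean'Image (Contains (Spaces, 16#41#)));    --  FALSE
end Range_Lookup;
```

The same function would work unchanged on the Digits and Letters tables, at a
cost of a handful of comparisons per character.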

---

Uppercase_Mapping : constant Mapping_Ranges :=
   (
    (16#61#, 16#7A#, -32), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
    (16#B5#, 16#B5#, 743), -- MICRO SIGN .. MICRO SIGN
    (16#E0#, 16#F6#, -32), -- LATIN SMALL LETTER A WITH GRAVE .. LATIN SMALL LETTER O WITH DIAERESIS
    (16#F8#, 16#FE#, -32), -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER THORN
    (16#FF#, 16#FF#, 121), -- LATIN SMALL LETTER Y WITH DIAERESIS .. LATIN SMALL LETTER Y WITH DIAERESIS
    (16#101#, 16#101#, -1), -- LATIN SMALL LETTER A WITH MACRON .. LATIN SMALL LETTER A WITH MACRON
    (16#103#, 16#103#, -1), -- LATIN SMALL LETTER A WITH BREVE .. LATIN SMALL LETTER A WITH BREVE
    (16#105#, 16#105#, -1), -- LATIN SMALL LETTER A WITH OGONEK .. LATIN SMALL LETTER A WITH OGONEK
    (16#107#, 16#107#, -1), -- LATIN SMALL LETTER C WITH ACUTE .. LATIN SMALL LETTER C WITH ACUTE
    (16#109#, 16#109#, -1), -- LATIN SMALL LETTER C WITH CIRCUMFLEX .. LATIN SMALL LETTER C WITH CIRCUMFLEX
    (16#10B#, 16#10B#, -1), -- LATIN SMALL LETTER C WITH DOT ABOVE .. LATIN SMALL LETTER C WITH DOT ABOVE
    (16#10D#, 16#10D#, -1), -- LATIN SMALL LETTER C WITH CARON .. LATIN SMALL LETTER C WITH CARON
    (16#10F#, 16#10F#, -1), -- LATIN SMALL LETTER D WITH CARON .. LATIN SMALL LETTER D WITH CARON
    (16#111#, 16#111#, -1), -- LATIN SMALL LETTER D WITH STROKE .. LATIN SMALL LETTER D WITH STROKE
    (16#113#, 16#113#, -1), -- LATIN SMALL LETTER E WITH MACRON .. LATIN SMALL LETTER E WITH MACRON
    (16#115#, 16#115#, -1), -- LATIN SMALL LETTER E WITH BREVE .. LATIN SMALL LETTER E WITH BREVE
    (16#117#, 16#117#, -1), -- LATIN SMALL LETTER E WITH DOT ABOVE .. LATIN SMALL LETTER E WITH DOT ABOVE
    (16#119#, 16#119#, -1), -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK
    (16#11B#, 16#11B#, -1), -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON
    (16#11D#, 16#11D#, -1), -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX
    (16#11F#, 16#11F#, -1), -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE
    (16#121#, 16#121#, -1), -- LATIN SMALL LETTER G WITH DOT ABOVE .. LATIN SMALL LETTER G WITH DOT ABOVE
    (16#123#, 16#123#, -1), -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA
    (16#125#, 16#125#, -1), -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX
    (16#127#, 16#127#, -1), -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE
    (16#129#, 16#129#, -1), -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE
    (16#12B#, 16#12B#, -1), -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON
    (16#12D#, 16#12D#, -1), -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE
    (16#12F#, 16#12F#, -1), -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK
    (16#131#, 16#131#, -232), -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I
    (16#133#, 16#133#, -1), -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ
    (16#135#, 16#135#, -1), -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX
    (16#137#, 16#137#, -1), -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA
    (16#13A#, 16#13A#, -1), -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE
    (16#13C#, 16#13C#, -1), -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA
    (16#13E#, 16#13E#, -1), -- LATIN SMALL LETTER L WITH CARON .. LATIN SMALL LETTER L WITH CARON
    (16#140#, 16#140#, -1), -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT
    (16#142#, 16#142#, -1), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE
    (16#144#, 16#144#, -1), -- LATIN SMALL LETTER N WITH ACUTE .. LATIN SMALL LETTER N WITH ACUTE
    (16#146#, 16#146#, -1), -- LATIN SMALL LETTER N WITH CEDILLA .. LATIN SMALL LETTER N WITH CEDILLA
    (16#148#, 16#148#, -1), -- LATIN SMALL LETTER N WITH CARON .. LATIN SMALL LETTER N WITH CARON
    (16#14B#, 16#14B#, -1), -- LATIN SMALL LETTER ENG .. LATIN SMALL LETTER ENG
    (16#14D#, 16#14D#, -1), -- LATIN SMALL LETTER O WITH MACRON .. LATIN SMALL LETTER O WITH MACRON
    (16#14F#, 16#14F#, -1), -- LATIN SMALL LETTER O WITH BREVE .. LATIN SMALL LETTER O WITH BREVE
    (16#151#, 16#151#, -1), -- LATIN SMALL LETTER O WITH DOUBLE ACUTE .. LATIN SMALL LETTER O WITH DOUBLE ACUTE
    (16#153#, 16#153#, -1), -- LATIN SMALL LIGATURE OE .. LATIN SMALL LIGATURE OE
    (16#155#, 16#155#, -1), -- LATIN SMALL LETTER R WITH ACUTE .. LATIN SMALL LETTER R WITH ACUTE
    (16#157#, 16#157#, -1), -- LATIN SMALL LETTER R WITH CEDILLA .. LATIN SMALL LETTER R WITH CEDILLA
    (16#159#, 16#159#, -1), -- LATIN SMALL LETTER R WITH CARON .. LATIN SMALL LETTER R WITH CARON
    (16#15B#, 16#15B#, -1), -- LATIN SMALL LETTER S WITH ACUTE .. LATIN SMALL LETTER S WITH ACUTE
    (16#15D#, 16#15D#, -1), -- LATIN SMALL LETTER S WITH CIRCUMFLEX .. LATIN SMALL LETTER S WITH CIRCUMFLEX
    (16#15F#, 16#15F#, -1), -- LATIN SMALL LETTER S WITH CEDILLA .. LATIN SMALL LETTER S WITH CEDILLA
    (16#161#, 16#161#, -1), -- LATIN SMALL LETTER S WITH CARON .. LATIN SMALL LETTER S WITH CARON
    (16#163#, 16#163#, -1), -- LATIN SMALL LETTER T WITH CEDILLA .. LATIN SMALL LETTER T WITH CEDILLA
    (16#165#, 16#165#, -1), -- LATIN SMALL LETTER T WITH CARON .. LATIN SMALL LETTER T WITH CARON
    (16#167#, 16#167#, -1), -- LATIN SMALL LETTER T WITH STROKE .. LATIN SMALL LETTER T WITH STROKE
    (16#169#, 16#169#, -1), -- LATIN SMALL LETTER U WITH TILDE .. LATIN SMALL LETTER U WITH TILDE
    (16#16B#, 16#16B#, -1), -- LATIN SMALL LETTER U WITH MACRON .. LATIN SMALL LETTER U WITH MACRON
    (16#16D#, 16#16D#, -1), -- LATIN SMALL LETTER U WITH BREVE .. LATIN SMALL LETTER U WITH BREVE
    (16#16F#, 16#16F#, -1), -- LATIN SMALL LETTER U WITH RING ABOVE .. LATIN SMALL LETTER U WITH RING ABOVE
    (16#171#, 16#171#, -1), -- LATIN SMALL LETTER U WITH DOUBLE ACUTE .. LATIN SMALL LETTER U WITH DOUBLE ACUTE
    (16#173#, 16#173#, -1), -- LATIN SMALL LETTER U WITH OGONEK .. LATIN SMALL LETTER U WITH OGONEK
    (16#175#, 16#175#, -1), -- LATIN SMALL LETTER W WITH CIRCUMFLEX .. LATIN SMALL LETTER W WITH CIRCUMFLEX
    (16#177#, 16#177#, -1), -- LATIN SMALL LETTER Y WITH CIRCUMFLEX .. LATIN SMALL LETTER Y WITH CIRCUMFLEX
    (16#17A#, 16#17A#, -1), -- LATIN SMALL LETTER Z WITH ACUTE .. LATIN SMALL LETTER Z WITH ACUTE
    (16#17C#, 16#17C#, -1), -- LATIN SMALL LETTER Z WITH DOT ABOVE .. LATIN SMALL LETTER Z WITH DOT ABOVE
    (16#17E#, 16#17E#, -1), -- LATIN SMALL LETTER Z WITH CARON .. LATIN SMALL LETTER Z WITH CARON
    (16#17F#, 16#17F#, -300), -- LATIN SMALL LETTER LONG S .. LATIN SMALL LETTER LONG S
    (16#183#, 16#183#, -1), -- LATIN SMALL LETTER B WITH TOPBAR .. LATIN SMALL LETTER B WITH TOPBAR
    (16#185#, 16#185#, -1), -- LATIN SMALL LETTER TONE SIX .. LATIN SMALL LETTER TONE SIX
    (16#188#, 16#188#, -1), -- LATIN SMALL LETTER C WITH HOOK .. LATIN SMALL LETTER C WITH HOOK
    (16#18C#, 16#18C#, -1), -- LATIN SMALL LETTER D WITH TOPBAR .. LATIN SMALL LETTER D WITH TOPBAR
    (16#192#, 16#192#, -1), -- LATIN SMALL LETTER F WITH HOOK .. LATIN SMALL LETTER F WITH HOOK
    (16#195#, 16#195#, 97), -- LATIN SMALL LETTER HV .. LATIN SMALL LETTER HV
    (16#199#, 16#199#, -1), -- LATIN SMALL LETTER K WITH HOOK .. LATIN SMALL LETTER K WITH HOOK
    (16#19E#, 16#19E#, 130), -- LATIN SMALL LETTER N WITH LONG RIGHT LEG .. LATIN SMALL LETTER N WITH LONG RIGHT LEG
    (16#1A1#, 16#1A1#, -1), -- LATIN SMALL LETTER O WITH HORN .. LATIN SMALL LETTER O WITH HORN
    (16#1A3#, 16#1A3#, -1), -- LATIN SMALL LETTER OI .. LATIN SMALL LETTER OI
    (16#1A5#, 16#1A5#, -1), -- LATIN SMALL LETTER P WITH HOOK .. LATIN SMALL LETTER P WITH HOOK
    (16#1A8#, 16#1A8#, -1), -- LATIN SMALL LETTER TONE TWO .. LATIN SMALL LETTER TONE TWO
    (16#1AD#, 16#1AD#, -1), -- LATIN SMALL LETTER T WITH HOOK .. LATIN SMALL LETTER T WITH HOOK
    (16#1B0#, 16#1B0#, -1), -- LATIN SMALL LETTER U WITH HORN .. LATIN SMALL LETTER U WITH HORN
    (16#1B4#, 16#1B4#, -1), -- LATIN SMALL LETTER Y WITH HOOK .. LATIN SMALL LETTER Y WITH HOOK
    (16#1B6#, 16#1B6#, -1), -- LATIN SMALL LETTER Z WITH STROKE .. LATIN SMALL LETTER Z WITH STROKE
    (16#1B9#, 16#1B9#, -1), -- LATIN SMALL LETTER EZH REVERSED .. LATIN SMALL LETTER EZH REVERSED
    (16#1BD#, 16#1BD#, -1), -- LATIN SMALL LETTER TONE FIVE .. LATIN SMALL LETTER TONE FIVE
    (16#1BF#, 16#1BF#, 56), -- LATIN LETTER WYNN .. LATIN LETTER WYNN
    (16#1C5#, 16#1C5#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    (16#1C6#, 16#1C6#, -2), -- LATIN SMALL LETTER DZ WITH CARON .. LATIN SMALL LETTER DZ WITH CARON
    (16#1C8#, 16#1C8#, -1), -- LATIN CAPITAL LETTER L WITH SMALL LETTER J .. LATIN CAPITAL LETTER L WITH SMALL LETTER J
    (16#1C9#, 16#1C9#, -2), -- LATIN SMALL LETTER LJ .. LATIN SMALL LETTER LJ
    (16#1CB#, 16#1CB#, -1), -- LATIN CAPITAL LETTER N WITH SMALL LETTER J .. LATIN CAPITAL LETTER N WITH SMALL LETTER J
    (16#1CC#, 16#1CC#, -2), -- LATIN SMALL LETTER NJ .. LATIN SMALL LETTER NJ
    (16#1CE#, 16#1CE#, -1), -- LATIN SMALL LETTER A WITH CARON .. LATIN SMALL LETTER A WITH CARON
    (16#1D0#, 16#1D0#, -1), -- LATIN SMALL LETTER I WITH CARON .. LATIN SMALL LETTER I WITH CARON
    (16#1D2#, 16#1D2#, -1), -- LATIN SMALL LETTER O WITH CARON .. LATIN SMALL LETTER O WITH CARON
    (16#1D4#, 16#1D4#, -1), -- LATIN SMALL LETTER U WITH CARON .. LATIN SMALL LETTER U WITH CARON
    (16#1D6#, 16#1D6#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
    (16#1D8#, 16#1D8#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
    (16#1DA#, 16#1DA#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND CARON .. LATIN SMALL LETTER U WITH DIAERESIS AND CARON
    (16#1DC#, 16#1DC#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE .. LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE
    (16#1DD#, 16#1DD#, -79), -- LATIN SMALL LETTER TURNED E .. LATIN SMALL LETTER TURNED E
    (16#1DF#, 16#1DF#, -1), -- LATIN SMALL LETTER A WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
    (16#1E1#, 16#1E1#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
    (16#1E3#, 16#1E3#, -1), -- LATIN SMALL LETTER AE WITH MACRON .. LATIN SMALL LETTER AE WITH MACRON
    (16#1E5#, 16#1E5#, -1), -- LATIN SMALL LETTER G WITH STROKE .. LATIN SMALL LETTER G WITH STROKE
    (16#1E7#, 16#1E7#, -1), -- LATIN SMALL LETTER G WITH CARON .. LATIN SMALL LETTER G WITH CARON
    (16#1E9#, 16#1E9#, -1), -- LATIN SMALL LETTER K WITH CARON .. LATIN SMALL LETTER K WITH CARON
    (16#1EB#, 16#1EB#, -1), -- LATIN SMALL LETTER O WITH OGONEK .. LATIN SMALL LETTER O WITH OGONEK
    (16#1ED#, 16#1ED#, -1), -- LATIN SMALL LETTER O WITH OGONEK AND MACRON .. LATIN SMALL LETTER O WITH OGONEK AND MACRON
    (16#1EF#, 16#1EF#, -1), -- LATIN SMALL LETTER EZH WITH CARON .. LATIN SMALL LETTER EZH WITH CARON
    (16#1F2#, 16#1F2#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    (16#1F3#, 16#1F3#, -2), -- LATIN SMALL LETTER DZ .. LATIN SMALL LETTER DZ
    (16#1F5#, 16#1F5#, -1), -- LATIN SMALL LETTER G WITH ACUTE .. LATIN SMALL LETTER G WITH ACUTE
    (16#1F9#, 16#1F9#, -1), -- LATIN SMALL LETTER N WITH GRAVE .. LATIN SMALL LETTER N WITH GRAVE
    (16#1FB#, 16#1FB#, -1), -- LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE .. LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
    (16#1FD#, 16#1FD#, -1), -- LATIN SMALL LETTER AE WITH ACUTE .. LATIN SMALL LETTER AE WITH ACUTE
    (16#1FF#, 16#1FF#, -1), -- LATIN SMALL LETTER O WITH STROKE AND ACUTE .. LATIN SMALL LETTER O WITH STROKE AND ACUTE
    (16#201#, 16#201#, -1), -- LATIN SMALL LETTER A WITH DOUBLE GRAVE .. LATIN SMALL LETTER A WITH DOUBLE GRAVE
    (16#203#, 16#203#, -1), -- LATIN SMALL LETTER A WITH INVERTED BREVE .. LATIN SMALL LETTER A WITH INVERTED BREVE
    (16#205#, 16#205#, -1), -- LATIN SMALL LETTER E WITH DOUBLE GRAVE .. LATIN SMALL LETTER E WITH DOUBLE GRAVE
    (16#207#, 16#207#, -1), -- LATIN SMALL LETTER E WITH INVERTED BREVE .. LATIN SMALL LETTER E WITH INVERTED BREVE
    (16#209#, 16#209#, -1), -- LATIN SMALL LETTER I WITH DOUBLE GRAVE .. LATIN SMALL LETTER I WITH DOUBLE GRAVE
    (16#20B#, 16#20B#, -1), -- LATIN SMALL LETTER I WITH INVERTED BREVE .. LATIN SMALL LETTER I WITH INVERTED BREVE
    (16#20D#, 16#20D#, -1), -- LATIN SMALL LETTER O WITH DOUBLE GRAVE .. LATIN SMALL LETTER O WITH DOUBLE GRAVE
    (16#20F#, 16#20F#, -1), -- LATIN SMALL LETTER O WITH INVERTED BREVE .. LATIN SMALL LETTER O WITH INVERTED BREVE
    (16#211#, 16#211#, -1), -- LATIN SMALL LETTER R WITH DOUBLE GRAVE .. LATIN SMALL LETTER R WITH DOUBLE GRAVE
    (16#213#, 16#213#, -1), -- LATIN SMALL LETTER R WITH INVERTED BREVE .. LATIN SMALL LETTER R WITH INVERTED BREVE
    (16#215#, 16#215#, -1), -- LATIN SMALL LETTER U WITH DOUBLE GRAVE .. LATIN SMALL LETTER U WITH DOUBLE GRAVE
    (16#217#, 16#217#, -1), -- LATIN SMALL LETTER U WITH INVERTED BREVE .. LATIN SMALL LETTER U WITH INVERTED BREVE
    (16#219#, 16#219#, -1), -- LATIN SMALL LETTER S WITH COMMA BELOW .. LATIN SMALL LETTER S WITH COMMA BELOW
    (16#21B#, 16#21B#, -1), -- LATIN SMALL LETTER T WITH COMMA BELOW .. LATIN SMALL LETTER T WITH COMMA BELOW
    (16#21D#, 16#21D#, -1), -- LATIN SMALL LETTER YOGH .. LATIN SMALL LETTER YOGH
    (16#21F#, 16#21F#, -1), -- LATIN SMALL LETTER H WITH CARON .. LATIN SMALL LETTER H WITH CARON
    (16#223#, 16#223#, -1), -- LATIN SMALL LETTER OU .. LATIN SMALL LETTER OU
    (16#225#, 16#225#, -1), -- LATIN SMALL LETTER Z WITH HOOK .. LATIN SMALL LETTER Z WITH HOOK
    (16#227#, 16#227#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE .. LATIN SMALL LETTER A WITH DOT ABOVE
    (16#229#, 16#229#, -1), -- LATIN SMALL LETTER E WITH CEDILLA .. LATIN SMALL LETTER E WITH CEDILLA
    (16#22B#, 16#22B#, -1), -- LATIN SMALL LETTER O WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
    (16#22D#, 16#22D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND MACRON .. LATIN SMALL LETTER O WITH TILDE AND MACRON
    (16#22F#, 16#22F#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE .. LATIN SMALL LETTER O WITH DOT ABOVE
    (16#231#, 16#231#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON
    (16#233#, 16#233#, -1), -- LATIN SMALL LETTER Y WITH MACRON .. LATIN SMALL LETTER Y WITH MACRON
    (16#253#, 16#253#, -210), -- LATIN SMALL LETTER B WITH HOOK .. LATIN SMALL LETTER B WITH HOOK
    (16#254#, 16#254#, -206), -- LATIN SMALL LETTER OPEN O .. LATIN SMALL LETTER OPEN O
    (16#256#, 16#257#, -205), -- LATIN SMALL LETTER D WITH TAIL .. LATIN SMALL LETTER D WITH HOOK
    (16#259#, 16#259#, -202), -- LATIN SMALL LETTER SCHWA .. LATIN SMALL LETTER SCHWA
    (16#25B#, 16#25B#, -203), -- LATIN SMALL LETTER OPEN E .. LATIN SMALL LETTER OPEN E
    (16#260#, 16#260#, -205), -- LATIN SMALL LETTER G WITH HOOK .. LATIN SMALL LETTER G WITH HOOK
    (16#263#, 16#263#, -207), -- LATIN SMALL LETTER GAMMA .. LATIN SMALL LETTER GAMMA
    (16#268#, 16#268#, -209), -- LATIN SMALL LETTER I WITH STROKE .. LATIN SMALL LETTER I WITH STROKE
    (16#269#, 16#269#, -211), -- LATIN SMALL LETTER IOTA .. LATIN SMALL LETTER IOTA
    (16#26F#, 16#26F#, -211), -- LATIN SMALL LETTER TURNED M .. LATIN SMALL LETTER TURNED M
    (16#272#, 16#272#, -213), -- LATIN SMALL LETTER N WITH LEFT HOOK .. LATIN SMALL LETTER N WITH LEFT HOOK
    (16#275#, 16#275#, -214), -- LATIN SMALL LETTER BARRED O .. LATIN SMALL LETTER BARRED O
    (16#280#, 16#280#, -218), -- LATIN LETTER SMALL CAPITAL R .. LATIN LETTER SMALL CAPITAL R
    (16#283#, 16#283#, -218), -- LATIN SMALL LETTER ESH .. LATIN SMALL LETTER ESH
    (16#288#, 16#288#, -218), -- LATIN SMALL LETTER T WITH RETROFLEX HOOK .. LATIN SMALL LETTER T WITH RETROFLEX HOOK
    (16#28A#, 16#28B#, -217), -- LATIN SMALL LETTER UPSILON .. LATIN SMALL LETTER V WITH HOOK
    (16#292#, 16#292#, -219), -- LATIN SMALL LETTER EZH .. LATIN SMALL LETTER EZH
    (16#3AC#, 16#3AC#, -38), -- GREEK SMALL LETTER ALPHA WITH TONOS .. GREEK SMALL LETTER ALPHA WITH TONOS
    (16#3AD#, 16#3AF#, -37), -- GREEK SMALL LETTER EPSILON WITH TONOS .. GREEK SMALL LETTER IOTA WITH TONOS
    (16#3B1#, 16#3C1#, -32), -- GREEK SMALL LETTER ALPHA .. GREEK SMALL LETTER RHO
    (16#3C2#, 16#3C2#, -31), -- GREEK SMALL LETTER FINAL SIGMA .. GREEK SMALL LETTER FINAL SIGMA
    (16#3C3#, 16#3CB#, -32), -- GREEK SMALL LETTER SIGMA .. GREEK SMALL LETTER UPSILON WITH DIALYTIKA
    (16#3CC#, 16#3CC#, -64), -- GREEK SMALL LETTER OMICRON WITH TONOS .. GREEK SMALL LETTER OMICRON WITH TONOS
    (16#3CD#, 16#3CE#, -63), -- GREEK SMALL LETTER UPSILON WITH TONOS .. GREEK SMALL LETTER OMEGA WITH TONOS
    (16#3D0#, 16#3D0#, -62), -- GREEK BETA SYMBOL .. GREEK BETA SYMBOL
    (16#3D1#, 16#3D1#, -57), -- GREEK THETA SYMBOL .. GREEK THETA SYMBOL
    (16#3D5#, 16#3D5#, -47), -- GREEK PHI SYMBOL .. GREEK PHI SYMBOL
    (16#3D6#, 16#3D6#, -54), -- GREEK PI SYMBOL .. GREEK PI SYMBOL
    (16#3D9#, 16#3D9#, -1), -- GREEK SMALL LETTER ARCHAIC KOPPA .. GREEK SMALL LETTER ARCHAIC KOPPA
    (16#3DB#, 16#3DB#, -1), -- GREEK SMALL LETTER STIGMA .. GREEK SMALL LETTER STIGMA
    (16#3DD#, 16#3DD#, -1), -- GREEK SMALL LETTER DIGAMMA .. GREEK SMALL LETTER DIGAMMA
    (16#3DF#, 16#3DF#, -1), -- GREEK SMALL LETTER KOPPA .. GREEK SMALL LETTER KOPPA
    (16#3E1#, 16#3E1#, -1), -- GREEK SMALL LETTER SAMPI .. GREEK SMALL LETTER SAMPI
    (16#3E3#, 16#3E3#, -1), -- COPTIC SMALL LETTER SHEI .. COPTIC SMALL LETTER SHEI
    (16#3E5#, 16#3E5#, -1), -- COPTIC SMALL LETTER FEI .. COPTIC SMALL LETTER FEI
    (16#3E7#, 16#3E7#, -1), -- COPTIC SMALL LETTER KHEI .. COPTIC SMALL LETTER KHEI
    (16#3E9#, 16#3E9#, -1), -- COPTIC SMALL LETTER HORI .. COPTIC SMALL LETTER HORI
    (16#3EB#, 16#3EB#, -1), -- COPTIC SMALL LETTER GANGIA .. COPTIC SMALL LETTER GANGIA
    (16#3ED#, 16#3ED#, -1), -- COPTIC SMALL LETTER SHIMA .. COPTIC SMALL LETTER SHIMA
    (16#3EF#, 16#3EF#, -1), -- COPTIC SMALL LETTER DEI .. COPTIC SMALL LETTER DEI
    (16#3F0#, 16#3F0#, -86), -- GREEK KAPPA SYMBOL .. GREEK KAPPA SYMBOL
    (16#3F1#, 16#3F1#, -80), -- GREEK RHO SYMBOL .. GREEK RHO SYMBOL
    (16#3F2#, 16#3F2#, -79), -- GREEK LUNATE SIGMA SYMBOL .. GREEK LUNATE SIGMA SYMBOL
    (16#3F5#, 16#3F5#, -96), -- GREEK LUNATE EPSILON SYMBOL .. GREEK LUNATE EPSILON SYMBOL
    (16#430#, 16#44F#, -32), -- CYRILLIC SMALL LETTER A .. CYRILLIC SMALL LETTER YA
    (16#450#, 16#45F#, -80), -- CYRILLIC SMALL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER DZHE
    (16#461#, 16#461#, -1), -- CYRILLIC SMALL LETTER OMEGA .. CYRILLIC SMALL LETTER OMEGA
    (16#463#, 16#463#, -1), -- CYRILLIC SMALL LETTER YAT .. CYRILLIC SMALL LETTER YAT
    (16#465#, 16#465#, -1), -- CYRILLIC SMALL LETTER IOTIFIED E .. CYRILLIC SMALL LETTER IOTIFIED E
    (16#467#, 16#467#, -1), -- CYRILLIC SMALL LETTER LITTLE YUS .. CYRILLIC SMALL LETTER LITTLE YUS
    (16#469#, 16#469#, -1), -- CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS .. CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS
    (16#46B#, 16#46B#, -1), -- CYRILLIC SMALL LETTER BIG YUS .. CYRILLIC SMALL LETTER BIG YUS
    (16#46D#, 16#46D#, -1), -- CYRILLIC SMALL LETTER IOTIFIED BIG YUS .. CYRILLIC SMALL LETTER IOTIFIED BIG YUS
    (16#46F#, 16#46F#, -1), -- CYRILLIC SMALL LETTER KSI .. CYRILLIC SMALL LETTER KSI
    (16#471#, 16#471#, -1), -- CYRILLIC SMALL LETTER PSI .. CYRILLIC SMALL LETTER PSI
    (16#473#, 16#473#, -1), -- CYRILLIC SMALL LETTER FITA .. CYRILLIC SMALL LETTER FITA
    (16#475#, 16#475#, -1), -- CYRILLIC SMALL LETTER IZHITSA .. CYRILLIC SMALL LETTER IZHITSA
    (16#477#, 16#477#, -1), -- CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT .. CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
    (16#479#, 16#479#, -1), -- CYRILLIC SMALL LETTER UK .. CYRILLIC SMALL LETTER UK
    (16#47B#, 16#47B#, -1), -- CYRILLIC SMALL LETTER ROUND OMEGA .. CYRILLIC SMALL LETTER ROUND OMEGA
    (16#47D#, 16#47D#, -1), -- CYRILLIC SMALL LETTER OMEGA WITH TITLO .. CYRILLIC SMALL LETTER OMEGA WITH TITLO
    (16#47F#, 16#47F#, -1), -- CYRILLIC SMALL LETTER OT .. CYRILLIC SMALL LETTER OT
    (16#481#, 16#481#, -1), -- CYRILLIC SMALL LETTER KOPPA .. CYRILLIC SMALL LETTER KOPPA
    (16#48B#, 16#48B#, -1), -- CYRILLIC SMALL LETTER SHORT I WITH TAIL .. CYRILLIC SMALL LETTER SHORT I WITH TAIL
    (16#48D#, 16#48D#, -1), -- CYRILLIC SMALL LETTER SEMISOFT SIGN .. CYRILLIC SMALL LETTER SEMISOFT SIGN
    (16#48F#, 16#48F#, -1), -- CYRILLIC SMALL LETTER ER WITH TICK .. CYRILLIC SMALL LETTER ER WITH TICK
    (16#491#, 16#491#, -1), -- CYRILLIC SMALL LETTER GHE WITH UPTURN .. CYRILLIC SMALL LETTER GHE WITH UPTURN
    (16#493#, 16#493#, -1), -- CYRILLIC SMALL LETTER GHE WITH STROKE .. CYRILLIC SMALL LETTER GHE WITH STROKE
    (16#495#, 16#495#, -1), -- CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
    (16#497#, 16#497#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DESCENDER .. CYRILLIC SMALL LETTER ZHE WITH DESCENDER
    (16#499#, 16#499#, -1), -- CYRILLIC SMALL LETTER ZE WITH DESCENDER .. CYRILLIC SMALL LETTER ZE WITH DESCENDER
    (16#49B#, 16#49B#, -1), -- CYRILLIC SMALL LETTER KA WITH DESCENDER .. CYRILLIC SMALL LETTER KA WITH DESCENDER
    (16#49D#, 16#49D#, -1), -- CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
    (16#49F#, 16#49F#, -1), -- CYRILLIC SMALL LETTER KA WITH STROKE .. CYRILLIC SMALL LETTER KA WITH STROKE
    (16#4A1#, 16#4A1#, -1), -- CYRILLIC SMALL LETTER BASHKIR KA .. CYRILLIC SMALL LETTER BASHKIR KA
    (16#4A3#, 16#4A3#, -1), -- CYRILLIC SMALL LETTER EN WITH DESCENDER .. CYRILLIC SMALL LETTER EN WITH DESCENDER
    (16#4A5#, 16#4A5#, -1), -- CYRILLIC SMALL LIGATURE EN GHE .. CYRILLIC SMALL LIGATURE EN GHE
    (16#4A7#, 16#4A7#, -1), -- CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
    (16#4A9#, 16#4A9#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN HA .. CYRILLIC SMALL LETTER ABKHASIAN HA
    (16#4AB#, 16#4AB#, -1), -- CYRILLIC SMALL LETTER ES WITH DESCENDER .. CYRILLIC SMALL LETTER ES WITH DESCENDER
    (16#4AD#, 16#4AD#, -1), -- CYRILLIC SMALL LETTER TE WITH DESCENDER .. CYRILLIC SMALL LETTER TE WITH DESCENDER
    (16#4AF#, 16#4AF#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U .. CYRILLIC SMALL LETTER STRAIGHT U
    (16#4B1#, 16#4B1#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE .. CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
    (16#4B3#, 16#4B3#, -1), -- CYRILLIC SMALL LETTER HA WITH DESCENDER .. CYRILLIC SMALL LETTER HA WITH DESCENDER
    (16#4B5#, 16#4B5#, -1), -- CYRILLIC SMALL LIGATURE TE TSE .. CYRILLIC SMALL LIGATURE TE TSE
    (16#4B7#, 16#4B7#, -1), -- CYRILLIC SMALL LETTER CHE WITH DESCENDER .. CYRILLIC SMALL LETTER CHE WITH DESCENDER
    (16#4B9#, 16#4B9#, -1), -- CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
    (16#4BB#, 16#4BB#, -1), -- CYRILLIC SMALL LETTER SHHA .. CYRILLIC SMALL LETTER SHHA
    (16#4BD#, 16#4BD#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE .. CYRILLIC SMALL LETTER ABKHASIAN CHE
    (16#4BF#, 16#4BF#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER .. CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
    (16#4C2#, 16#4C2#, -1), -- CYRILLIC SMALL LETTER ZHE WITH BREVE .. CYRILLIC SMALL LETTER ZHE WITH BREVE
    (16#4C4#, 16#4C4#, -1), -- CYRILLIC SMALL LETTER KA WITH HOOK .. CYRILLIC SMALL LETTER KA WITH HOOK
    (16#4C6#, 16#4C6#, -1), -- CYRILLIC SMALL LETTER EL WITH TAIL .. CYRILLIC SMALL LETTER EL WITH TAIL
    (16#4C8#, 16#4C8#, -1), -- CYRILLIC SMALL LETTER EN WITH HOOK .. CYRILLIC SMALL LETTER EN WITH HOOK
    (16#4CA#, 16#4CA#, -1), -- CYRILLIC SMALL LETTER EN WITH TAIL .. CYRILLIC SMALL LETTER EN WITH TAIL
    (16#4CC#, 16#4CC#, -1), -- CYRILLIC SMALL LETTER KHAKASSIAN CHE .. CYRILLIC SMALL LETTER KHAKASSIAN CHE
    (16#4CE#, 16#4CE#, -1), -- CYRILLIC SMALL LETTER EM WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL
    (16#4D1#, 16#4D1#, -1), -- CYRILLIC SMALL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER A WITH BREVE
    (16#4D3#, 16#4D3#, -1), -- CYRILLIC SMALL LETTER A WITH DIAERESIS .. CYRILLIC SMALL LETTER A WITH DIAERESIS
    (16#4D5#, 16#4D5#, -1), -- CYRILLIC SMALL LIGATURE A IE .. CYRILLIC SMALL LIGATURE A IE
    (16#4D7#, 16#4D7#, -1), -- CYRILLIC SMALL LETTER IE WITH BREVE .. CYRILLIC SMALL LETTER IE WITH BREVE
    (16#4D9#, 16#4D9#, -1), -- CYRILLIC SMALL LETTER SCHWA .. CYRILLIC SMALL LETTER SCHWA
    (16#4DB#, 16#4DB#, -1), -- CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS .. CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
    (16#4DD#, 16#4DD#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
    (16#4DF#, 16#4DF#, -1), -- CYRILLIC SMALL LETTER ZE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZE WITH DIAERESIS
    (16#4E1#, 16#4E1#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN DZE .. CYRILLIC SMALL LETTER ABKHASIAN DZE
    (16#4E3#, 16#4E3#, -1), -- CYRILLIC SMALL LETTER I WITH MACRON .. CYRILLIC SMALL LETTER I WITH MACRON
    (16#4E5#, 16#4E5#, -1), -- CYRILLIC SMALL LETTER I WITH DIAERESIS .. CYRILLIC SMALL LETTER I WITH DIAERESIS
    (16#4E7#, 16#4E7#, -1), -- CYRILLIC SMALL LETTER O WITH DIAERESIS .. CYRILLIC SMALL LETTER O WITH DIAERESIS
    (16#4E9#, 16#4E9#, -1), -- CYRILLIC SMALL LETTER BARRED O .. CYRILLIC SMALL LETTER BARRED O
    (16#4EB#, 16#4EB#, -1), -- CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS .. CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
    (16#4ED#, 16#4ED#, -1), -- CYRILLIC SMALL LETTER E WITH DIAERESIS .. CYRILLIC SMALL LETTER E WITH DIAERESIS
    (16#4EF#, 16#4EF#, -1), -- CYRILLIC SMALL LETTER U WITH MACRON .. CYRILLIC SMALL LETTER U WITH MACRON
    (16#4F1#, 16#4F1#, -1), -- CYRILLIC SMALL LETTER U WITH DIAERESIS .. CYRILLIC SMALL LETTER U WITH DIAERESIS
    (16#4F3#, 16#4F3#, -1), -- CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE .. CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
    (16#4F5#, 16#4F5#, -1), -- CYRILLIC SMALL LETTER CHE WITH DIAERESIS .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS
    (16#4F9#, 16#4F9#, -1), -- CYRILLIC SMALL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS
    (16#501#, 16#501#, -1), -- CYRILLIC SMALL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI DE
    (16#503#, 16#503#, -1), -- CYRILLIC SMALL LETTER KOMI DJE .. CYRILLIC SMALL LETTER KOMI DJE
    (16#505#, 16#505#, -1), -- CYRILLIC SMALL LETTER KOMI ZJE .. CYRILLIC SMALL LETTER KOMI ZJE
    (16#507#, 16#507#, -1), -- CYRILLIC SMALL LETTER KOMI DZJE .. CYRILLIC SMALL LETTER KOMI DZJE
    (16#509#, 16#509#, -1), -- CYRILLIC SMALL LETTER KOMI LJE .. CYRILLIC SMALL LETTER KOMI LJE
    (16#50B#, 16#50B#, -1), -- CYRILLIC SMALL LETTER KOMI NJE .. CYRILLIC SMALL LETTER KOMI NJE
    (16#50D#, 16#50D#, -1), -- CYRILLIC SMALL LETTER KOMI SJE .. CYRILLIC SMALL LETTER KOMI SJE
    (16#50F#, 16#50F#, -1), -- CYRILLIC SMALL LETTER KOMI TJE .. CYRILLIC SMALL LETTER KOMI TJE
    (16#561#, 16#586#, -48), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LETTER FEH
    (16#1E01#, 16#1E01#, -1), -- LATIN SMALL LETTER A WITH RING BELOW .. LATIN SMALL LETTER A WITH RING BELOW
    (16#1E03#, 16#1E03#, -1), -- LATIN SMALL LETTER B WITH DOT ABOVE .. LATIN SMALL LETTER B WITH DOT ABOVE
    (16#1E05#, 16#1E05#, -1), -- LATIN SMALL LETTER B WITH DOT BELOW .. LATIN SMALL LETTER B WITH DOT BELOW
    (16#1E07#, 16#1E07#, -1), -- LATIN SMALL LETTER B WITH LINE BELOW .. LATIN SMALL LETTER B WITH LINE BELOW
    (16#1E09#, 16#1E09#, -1), -- LATIN SMALL LETTER C WITH CEDILLA AND ACUTE .. LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
    (16#1E0B#, 16#1E0B#, -1), -- LATIN SMALL LETTER D WITH DOT ABOVE .. LATIN SMALL LETTER D WITH DOT ABOVE
    (16#1E0D#, 16#1E0D#, -1), -- LATIN SMALL LETTER D WITH DOT BELOW .. LATIN SMALL LETTER D WITH DOT BELOW
    (16#1E0F#, 16#1E0F#, -1), -- LATIN SMALL LETTER D WITH LINE BELOW .. LATIN SMALL LETTER D WITH LINE BELOW
    (16#1E11#, 16#1E11#, -1), -- LATIN SMALL LETTER D WITH CEDILLA .. LATIN SMALL LETTER D WITH CEDILLA
    (16#1E13#, 16#1E13#, -1), -- LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW
    (16#1E15#, 16#1E15#, -1), -- LATIN SMALL LETTER E WITH MACRON AND GRAVE .. LATIN SMALL LETTER E WITH MACRON AND GRAVE
    (16#1E17#, 16#1E17#, -1), -- LATIN SMALL LETTER E WITH MACRON AND ACUTE .. LATIN SMALL LETTER E WITH MACRON AND ACUTE
    (16#1E19#, 16#1E19#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW
    (16#1E1B#, 16#1E1B#, -1), -- LATIN SMALL LETTER E WITH TILDE BELOW .. LATIN SMALL LETTER E WITH TILDE BELOW
    (16#1E1D#, 16#1E1D#, -1), -- LATIN SMALL LETTER E WITH CEDILLA AND BREVE .. LATIN SMALL LETTER E WITH CEDILLA AND BREVE
    (16#1E1F#, 16#1E1F#, -1), -- LATIN SMALL LETTER F WITH DOT ABOVE .. LATIN SMALL LETTER F WITH DOT ABOVE
    (16#1E21#, 16#1E21#, -1), -- LATIN SMALL LETTER G WITH MACRON .. LATIN SMALL LETTER G WITH MACRON
    (16#1E23#, 16#1E23#, -1), -- LATIN SMALL LETTER H WITH DOT ABOVE .. LATIN SMALL LETTER H WITH DOT ABOVE
    (16#1E25#, 16#1E25#, -1), -- LATIN SMALL LETTER H WITH DOT BELOW .. LATIN SMALL LETTER H WITH DOT BELOW
    (16#1E27#, 16#1E27#, -1), -- LATIN SMALL LETTER H WITH DIAERESIS .. LATIN SMALL LETTER H WITH DIAERESIS
    (16#1E29#, 16#1E29#, -1), -- LATIN SMALL LETTER H WITH CEDILLA .. LATIN SMALL LETTER H WITH CEDILLA
    (16#1E2B#, 16#1E2B#, -1), -- LATIN SMALL LETTER H WITH BREVE BELOW .. LATIN SMALL LETTER H WITH BREVE BELOW
    (16#1E2D#, 16#1E2D#, -1), -- LATIN SMALL LETTER I WITH TILDE BELOW .. LATIN SMALL LETTER I WITH TILDE BELOW
    (16#1E2F#, 16#1E2F#, -1), -- LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
    (16#1E31#, 16#1E31#, -1), -- LATIN SMALL LETTER K WITH ACUTE .. LATIN SMALL LETTER K WITH ACUTE
    (16#1E33#, 16#1E33#, -1), -- LATIN SMALL LETTER K WITH DOT BELOW .. LATIN SMALL LETTER K WITH DOT BELOW
    (16#1E35#, 16#1E35#, -1), -- LATIN SMALL LETTER K WITH LINE BELOW .. LATIN SMALL LETTER K WITH LINE BELOW
    (16#1E37#, 16#1E37#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW .. LATIN SMALL LETTER L WITH DOT BELOW
    (16#1E39#, 16#1E39#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER L WITH DOT BELOW AND MACRON
    (16#1E3B#, 16#1E3B#, -1), -- LATIN SMALL LETTER L WITH LINE BELOW .. LATIN SMALL LETTER L WITH LINE BELOW
    (16#1E3D#, 16#1E3D#, -1), -- LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW
    (16#1E3F#, 16#1E3F#, -1), -- LATIN SMALL LETTER M WITH ACUTE .. LATIN SMALL LETTER M WITH ACUTE
    (16#1E41#, 16#1E41#, -1), -- LATIN SMALL LETTER M WITH DOT ABOVE .. LATIN SMALL LETTER M WITH DOT ABOVE
    (16#1E43#, 16#1E43#, -1), -- LATIN SMALL LETTER M WITH DOT BELOW .. LATIN SMALL LETTER M WITH DOT BELOW
    (16#1E45#, 16#1E45#, -1), -- LATIN SMALL LETTER N WITH DOT ABOVE .. LATIN SMALL LETTER N WITH DOT ABOVE
    (16#1E47#, 16#1E47#, -1), -- LATIN SMALL LETTER N WITH DOT BELOW .. LATIN SMALL LETTER N WITH DOT BELOW
    (16#1E49#, 16#1E49#, -1), -- LATIN SMALL LETTER N WITH LINE BELOW .. LATIN SMALL LETTER N WITH LINE BELOW
    (16#1E4B#, 16#1E4B#, -1), -- LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW
    (16#1E4D#, 16#1E4D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND ACUTE .. LATIN SMALL LETTER O WITH TILDE AND ACUTE
    (16#1E4F#, 16#1E4F#, -1), -- LATIN SMALL LETTER O WITH TILDE AND DIAERESIS .. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS
    (16#1E51#, 16#1E51#, -1), -- LATIN SMALL LETTER O WITH MACRON AND GRAVE .. LATIN SMALL LETTER O WITH MACRON AND GRAVE
    (16#1E53#, 16#1E53#, -1), -- LATIN SMALL LETTER O WITH MACRON AND ACUTE .. LATIN SMALL LETTER O WITH MACRON AND ACUTE
    (16#1E55#, 16#1E55#, -1), -- LATIN SMALL LETTER P WITH ACUTE .. LATIN SMALL LETTER P WITH ACUTE
    (16#1E57#, 16#1E57#, -1), -- LATIN SMALL LETTER P WITH DOT ABOVE .. LATIN SMALL LETTER P WITH DOT ABOVE
    (16#1E59#, 16#1E59#, -1), -- LATIN SMALL LETTER R WITH DOT ABOVE .. LATIN SMALL LETTER R WITH DOT ABOVE
    (16#1E5B#, 16#1E5B#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW .. LATIN SMALL LETTER R WITH DOT BELOW
    (16#1E5D#, 16#1E5D#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
    (16#1E5F#, 16#1E5F#, -1), -- LATIN SMALL LETTER R WITH LINE BELOW .. LATIN SMALL LETTER R WITH LINE BELOW
    (16#1E61#, 16#1E61#, -1), -- LATIN SMALL LETTER S WITH DOT ABOVE .. LATIN SMALL LETTER S WITH DOT ABOVE
    (16#1E63#, 16#1E63#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW .. LATIN SMALL LETTER S WITH DOT BELOW
    (16#1E65#, 16#1E65#, -1), -- LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE .. LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE
    (16#1E67#, 16#1E67#, -1), -- LATIN SMALL LETTER S WITH CARON AND DOT ABOVE .. LATIN SMALL LETTER S WITH CARON AND DOT ABOVE
    (16#1E69#, 16#1E69#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE .. LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
    (16#1E6B#, 16#1E6B#, -1), -- LATIN SMALL LETTER T WITH DOT ABOVE .. LATIN SMALL LETTER T WITH DOT ABOVE
    (16#1E6D#, 16#1E6D#, -1), -- LATIN SMALL LETTER T WITH DOT BELOW .. LATIN SMALL LETTER T WITH DOT BELOW
    (16#1E6F#, 16#1E6F#, -1), -- LATIN SMALL LETTER T WITH LINE BELOW .. LATIN SMALL LETTER T WITH LINE BELOW
    (16#1E71#, 16#1E71#, -1), -- LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW
    (16#1E73#, 16#1E73#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS BELOW .. LATIN SMALL LETTER U WITH DIAERESIS BELOW
    (16#1E75#, 16#1E75#, -1), -- LATIN SMALL LETTER U WITH TILDE BELOW .. LATIN SMALL LETTER U WITH TILDE BELOW
    (16#1E77#, 16#1E77#, -1), -- LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW
    (16#1E79#, 16#1E79#, -1), -- LATIN SMALL LETTER U WITH TILDE AND ACUTE .. LATIN SMALL LETTER U WITH TILDE AND ACUTE
    (16#1E7B#, 16#1E7B#, -1), -- LATIN SMALL LETTER U WITH MACRON AND DIAERESIS .. LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
    (16#1E7D#, 16#1E7D#, -1), -- LATIN SMALL LETTER V WITH TILDE .. LATIN SMALL LETTER V WITH TILDE
    (16#1E7F#, 16#1E7F#, -1), -- LATIN SMALL LETTER V WITH DOT BELOW .. LATIN SMALL LETTER V WITH DOT BELOW
    (16#1E81#, 16#1E81#, -1), -- LATIN SMALL LETTER W WITH GRAVE .. LATIN SMALL LETTER W WITH GRAVE
    (16#1E83#, 16#1E83#, -1), -- LATIN SMALL LETTER W WITH ACUTE .. LATIN SMALL LETTER W WITH ACUTE
    (16#1E85#, 16#1E85#, -1), -- LATIN SMALL LETTER W WITH DIAERESIS .. LATIN SMALL LETTER W WITH DIAERESIS
    (16#1E87#, 16#1E87#, -1), -- LATIN SMALL LETTER W WITH DOT ABOVE .. LATIN SMALL LETTER W WITH DOT ABOVE
    (16#1E89#, 16#1E89#, -1), -- LATIN SMALL LETTER W WITH DOT BELOW .. LATIN SMALL LETTER W WITH DOT BELOW
    (16#1E8B#, 16#1E8B#, -1), -- LATIN SMALL LETTER X WITH DOT ABOVE .. LATIN SMALL LETTER X WITH DOT ABOVE
    (16#1E8D#, 16#1E8D#, -1), -- LATIN SMALL LETTER X WITH DIAERESIS .. LATIN SMALL LETTER X WITH DIAERESIS
    (16#1E8F#, 16#1E8F#, -1), -- LATIN SMALL LETTER Y WITH DOT ABOVE .. LATIN SMALL LETTER Y WITH DOT ABOVE
    (16#1E91#, 16#1E91#, -1), -- LATIN SMALL LETTER Z WITH CIRCUMFLEX .. LATIN SMALL LETTER Z WITH CIRCUMFLEX
    (16#1E93#, 16#1E93#, -1), -- LATIN SMALL LETTER Z WITH DOT BELOW .. LATIN SMALL LETTER Z WITH DOT BELOW
    (16#1E95#, 16#1E95#, -1), -- LATIN SMALL LETTER Z WITH LINE BELOW .. LATIN SMALL LETTER Z WITH LINE BELOW
    (16#1E9B#, 16#1E9B#, -59), -- LATIN SMALL LETTER LONG S WITH DOT ABOVE .. LATIN SMALL LETTER LONG S WITH DOT ABOVE
    (16#1EA1#, 16#1EA1#, -1), -- LATIN SMALL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER A WITH DOT BELOW
    (16#1EA3#, 16#1EA3#, -1), -- LATIN SMALL LETTER A WITH HOOK ABOVE .. LATIN SMALL LETTER A WITH HOOK ABOVE
    (16#1EA5#, 16#1EA5#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
    (16#1EA7#, 16#1EA7#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
    (16#1EA9#, 16#1EA9#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
    (16#1EAB#, 16#1EAB#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
    (16#1EAD#, 16#1EAD#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
    (16#1EAF#, 16#1EAF#, -1), -- LATIN SMALL LETTER A WITH BREVE AND ACUTE .. LATIN SMALL LETTER A WITH BREVE AND ACUTE
    (16#1EB1#, 16#1EB1#, -1), -- LATIN SMALL LETTER A WITH BREVE AND GRAVE .. LATIN SMALL LETTER A WITH BREVE AND GRAVE
    (16#1EB3#, 16#1EB3#, -1), -- LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE .. LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
    (16#1EB5#, 16#1EB5#, -1), -- LATIN SMALL LETTER A WITH BREVE AND TILDE .. LATIN SMALL LETTER A WITH BREVE AND TILDE
    (16#1EB7#, 16#1EB7#, -1), -- LATIN SMALL LETTER A WITH BREVE AND DOT BELOW .. LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
    (16#1EB9#, 16#1EB9#, -1), -- LATIN SMALL LETTER E WITH DOT BELOW .. LATIN SMALL LETTER E WITH DOT BELOW
    (16#1EBB#, 16#1EBB#, -1), -- LATIN SMALL LETTER E WITH HOOK ABOVE .. LATIN SMALL LETTER E WITH HOOK ABOVE
    (16#1EBD#, 16#1EBD#, -1), -- LATIN SMALL LETTER E WITH TILDE .. LATIN SMALL LETTER E WITH TILDE
    (16#1EBF#, 16#1EBF#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    (16#1EC1#, 16#1EC1#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
    (16#1EC3#, 16#1EC3#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
    (16#1EC5#, 16#1EC5#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
    (16#1EC7#, 16#1EC7#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
    (16#1EC9#, 16#1EC9#, -1), -- LATIN SMALL LETTER I WITH HOOK ABOVE .. LATIN SMALL LETTER I WITH HOOK ABOVE
    (16#1ECB#, 16#1ECB#, -1), -- LATIN SMALL LETTER I WITH DOT BELOW .. LATIN SMALL LETTER I WITH DOT BELOW
    (16#1ECD#, 16#1ECD#, -1), -- LATIN SMALL LETTER O WITH DOT BELOW .. LATIN SMALL LETTER O WITH DOT BELOW
    (16#1ECF#, 16#1ECF#, -1), -- LATIN SMALL LETTER O WITH HOOK ABOVE .. LATIN SMALL LETTER O WITH HOOK ABOVE
    (16#1ED1#, 16#1ED1#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
    (16#1ED3#, 16#1ED3#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
    (16#1ED5#, 16#1ED5#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
    (16#1ED7#, 16#1ED7#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
    (16#1ED9#, 16#1ED9#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
    (16#1EDB#, 16#1EDB#, -1), -- LATIN SMALL LETTER O WITH HORN AND ACUTE .. LATIN SMALL LETTER O WITH HORN AND ACUTE
    (16#1EDD#, 16#1EDD#, -1), -- LATIN SMALL LETTER O WITH HORN AND GRAVE .. LATIN SMALL LETTER O WITH HORN AND GRAVE
    (16#1EDF#, 16#1EDF#, -1), -- LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
    (16#1EE1#, 16#1EE1#, -1), -- LATIN SMALL LETTER O WITH HORN AND TILDE .. LATIN SMALL LETTER O WITH HORN AND TILDE
    (16#1EE3#, 16#1EE3#, -1), -- LATIN SMALL LETTER O WITH HORN AND DOT BELOW .. LATIN SMALL LETTER O WITH HORN AND DOT BELOW
    (16#1EE5#, 16#1EE5#, -1), -- LATIN SMALL LETTER U WITH DOT BELOW .. LATIN SMALL LETTER U WITH DOT BELOW
    (16#1EE7#, 16#1EE7#, -1), -- LATIN SMALL LETTER U WITH HOOK ABOVE .. LATIN SMALL LETTER U WITH HOOK ABOVE
    (16#1EE9#, 16#1EE9#, -1), -- LATIN SMALL LETTER U WITH HORN AND ACUTE .. LATIN SMALL LETTER U WITH HORN AND ACUTE
    (16#1EEB#, 16#1EEB#, -1), -- LATIN SMALL LETTER U WITH HORN AND GRAVE .. LATIN SMALL LETTER U WITH HORN AND GRAVE
    (16#1EED#, 16#1EED#, -1), -- LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE
    (16#1EEF#, 16#1EEF#, -1), -- LATIN SMALL LETTER U WITH HORN AND TILDE .. LATIN SMALL LETTER U WITH HORN AND TILDE
    (16#1EF1#, 16#1EF1#, -1), -- LATIN SMALL LETTER U WITH HORN AND DOT BELOW .. LATIN SMALL LETTER U WITH HORN AND DOT BELOW
    (16#1EF3#, 16#1EF3#, -1), -- LATIN SMALL LETTER Y WITH GRAVE .. LATIN SMALL LETTER Y WITH GRAVE
    (16#1EF5#, 16#1EF5#, -1), -- LATIN SMALL LETTER Y WITH DOT BELOW .. LATIN SMALL LETTER Y WITH DOT BELOW
    (16#1EF7#, 16#1EF7#, -1), -- LATIN SMALL LETTER Y WITH HOOK ABOVE .. LATIN SMALL LETTER Y WITH HOOK ABOVE
    (16#1EF9#, 16#1EF9#, -1), -- LATIN SMALL LETTER Y WITH TILDE .. LATIN SMALL LETTER Y WITH TILDE
    (16#1F00#, 16#1F07#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI
    (16#1F10#, 16#1F15#, 8), -- GREEK SMALL LETTER EPSILON WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA
    (16#1F20#, 16#1F27#, 8), -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI
    (16#1F30#, 16#1F37#, 8), -- GREEK SMALL LETTER IOTA WITH PSILI .. GREEK SMALL LETTER IOTA WITH DASIA AND PERISPOMENI
    (16#1F40#, 16#1F45#, 8), -- GREEK SMALL LETTER OMICRON WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA
    (16#1F51#, 16#1F51#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA .. GREEK SMALL LETTER UPSILON WITH DASIA
    (16#1F53#, 16#1F53#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA
    (16#1F55#, 16#1F55#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA
    (16#1F57#, 16#1F57#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI
    (16#1F60#, 16#1F67#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
    (16#1F70#, 16#1F71#, 74), -- GREEK SMALL LETTER ALPHA WITH VARIA .. GREEK SMALL LETTER ALPHA WITH OXIA
    (16#1F72#, 16#1F75#, 86), -- GREEK SMALL LETTER EPSILON WITH VARIA .. GREEK SMALL LETTER ETA WITH OXIA
    (16#1F76#, 16#1F77#, 100), -- GREEK SMALL LETTER IOTA WITH VARIA .. GREEK SMALL LETTER IOTA WITH OXIA
    (16#1F78#, 16#1F79#, 128), -- GREEK SMALL LETTER OMICRON WITH VARIA .. GREEK SMALL LETTER OMICRON WITH OXIA
    (16#1F7A#, 16#1F7B#, 112), -- GREEK SMALL LETTER UPSILON WITH VARIA .. GREEK SMALL LETTER UPSILON WITH OXIA
    (16#1F7C#, 16#1F7D#, 126), -- GREEK SMALL LETTER OMEGA WITH VARIA .. GREEK SMALL LETTER OMEGA WITH OXIA
    (16#1F80#, 16#1F87#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
    (16#1F90#, 16#1F97#, 8), -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
    (16#1FA0#, 16#1FA7#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
    (16#1FB0#, 16#1FB1#, 8), -- GREEK SMALL LETTER ALPHA WITH VRACHY .. GREEK SMALL LETTER ALPHA WITH MACRON
    (16#1FB3#, 16#1FB3#, 9), -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
    (16#1FBE#, 16#1FBE#, -7205), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
    (16#1FC3#, 16#1FC3#, 9), -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
    (16#1FD0#, 16#1FD1#, 8), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH MACRON
    (16#1FE0#, 16#1FE1#, 8), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK SMALL LETTER UPSILON WITH MACRON
    (16#1FE5#, 16#1FE5#, 7), -- GREEK SMALL LETTER RHO WITH DASIA .. GREEK SMALL LETTER RHO WITH DASIA
    (16#1FF3#, 16#1FF3#, 9), -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
    (16#FF41#, 16#FF5A#, -32), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z
    (16#10428#, 16#1044D#, -40) -- DESERET SMALL LETTER LONG I .. DESERET SMALL LETTER ENG
   );
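
Each triple above can be read as (first code point, last code point, offset),
where adding the offset to a code point in the range gives its upper case
counterpart; for example 16#430# .. 16#44F# with -32 lands on
16#410# .. 16#42F#. The following is only a sketch of a lookup under that
reading. The names Code_Point, Code_Offset, Case_Range, Case_Table and
To_Upper are hypothetical (the actual declarations are not shown in this part
of the listing), and only a handful of entries are copied over.

```ada
with Ada.Text_IO;

procedure Range_Table_Demo is

   --  Hypothetical types; the real declarations are not part of this excerpt.
   type Code_Point  is range 0 .. 16#10FFFF#;
   type Code_Offset is range -16#10FFFF# .. 16#10FFFF#;

   type Case_Range is record
      First, Last : Code_Point;    --  inclusive range of lower case letters
      Offset      : Code_Offset;   --  added to a code point to get upper case
   end record;

   type Case_Table is array (Positive range <>) of Case_Range;

   --  A few entries copied from the table above, kept in ascending order.
   Table : constant Case_Table :=
     ((16#3D1#,   16#3D1#,   -57),
      (16#430#,   16#44F#,   -32),
      (16#1F00#,  16#1F07#,  8),
      (16#1FBE#,  16#1FBE#,  -7205),
      (16#10428#, 16#1044D#, -40));

   --  Binary search over the ranges; a code point that falls in no range
   --  is returned unchanged.
   function To_Upper (C : Code_Point) return Code_Point is
      Low  : Natural := Table'First;
      High : Natural := Table'Last;
      Mid  : Positive;
   begin
      while Low <= High loop
         Mid := Low + (High - Low) / 2;
         if C < Table (Mid).First then
            High := Mid - 1;
         elsif C > Table (Mid).Last then
            Low := Mid + 1;
         else
            return Code_Point (Code_Offset (C) + Table (Mid).Offset);
         end if;
      end loop;
      return C;
   end To_Upper;

begin
   --  16#431# (CYRILLIC SMALL LETTER BE) maps to 16#411# = 1041.
   Ada.Text_IO.Put_Line (Code_Point'Image (To_Upper (16#431#)));
   --  16#1FBE# (GREEK PROSGEGRAMMENI) maps to 16#399# = 921.
   Ada.Text_IO.Put_Line (Code_Point'Image (To_Upper (16#1FBE#)));
   --  16#20# (SPACE) is in no range and comes back as 32.
   Ada.Text_IO.Put_Line (Code_Point'Image (To_Upper (16#20#)));
end Range_Table_Demo;
```

A binary search is used here only because the entries are sorted by code
point; a linear scan would give the same results.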

*************************************************************

From: Randy Brukardt
Sent: Wednesday, November 27, 2002  11:01 AM

Thanks for doing this. Where are you finding the information that you are
using to do this? A quick search of the net didn't turn up anything
machine-readable...

*************************************************************

From: Michael F. Yoder
Sent: Wednesday, November 27, 2002  12:20 PM

The root link is www.unicode.org and the "latest version" link goes to

http://www.unicode.org/unicode/reports/tr28/

The "this version" link at the top goes to a page with some relevant
stuff. The page with the machine-readable files for V3.2 is:
http://www.unicode.org/Public/UNIDATA/ . The current organization seems
to be harder to navigate than it used to be; I'm unsure why.

N.B. version 3.2 of Unicode claims to be "fully synchronized" with ISO
10646, so it is strongly preferable to earlier versions.

*************************************************************

From: Robert I. Eachus
Sent: Tuesday, March 18, 2003  12:56 AM

I hate to reopen the character set can of worms, but I think we need to
do it.  In effect Latin 1 is being replaced by Latin 9 (ISO 8859-15).
Latin 9 adds the Euro sign, OE Ligatures, and S and Z with caron, and
capital Y with diaeresis to Latin 1, removing the currency symbol, broken
bar, some accents and the vulgar fractions.  See
http://www.cs.tut.fi/~jkorpela/latin9.html for a fuller explanation.

Latin 9 is slowly being adopted.  Of course some countries in the Euro
zone are already using a "localized" version of Latin 1 with the
currency sign representation looking suspiciously like a Euro symbol.
So we could decide to leave this issue to Ada-1Z or whatever.

However, I think that at the least we should add a Latin9 package to Ada
with the correct character names.  What else should or could be done?

One possibility would be to redefine Ada.Characters.Handling to
correctly treat seven new codes as lower or upper case characters.  I
would much prefer to go for the whole nine yards so we never need to do
this again.  Add an enumeration type Sets to Ada Characters, or if you
prefer Character_Sets.  It should enumerate all the ISO 8859 character
sets.  (If you want to be clever, we could start with ISO646 so that
Sets'Pos(N) = ISO 8859-N.) In any case we should allow implementations
to extend the type.  This would allow both for new ISO 8859 character
sets, and for Unicode, EBCDIC, IBM code pages, and so on.

Now add procedure Set_Default_Character_Set, and function
Current_Character_Set to Ada.Characters.Handling.  (Or if you prefer to
Ada.Characters.)  As far as I am concerned the only required behavior
for Set_Default_Character_Set should be to accept an argument of Latin_1.
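
A rough sketch of what such an interface might look like follows. Everything
in it beyond the names Sets, Set_Default_Character_Set and
Current_Character_Set is an assumption made for illustration: the package
name Character_Sets, the literal names, the ISO_8859_12 placeholder (part 12
was never published, so a placeholder is one way to keep Sets'Pos aligned with
the part number), and the Unsupported_Set exception. It is written as a
nested package so that it compiles as a single unit.

```ada
with Ada.Text_IO;

procedure Character_Sets_Demo is

   package Character_Sets is
      --  ISO_646 comes first so that Sets'Pos (ISO_8859_N) = N.
      --  An implementation would be free to append further literals.
      type Sets is
        (ISO_646,
         ISO_8859_1,  ISO_8859_2,  ISO_8859_3,  ISO_8859_4,
         ISO_8859_5,  ISO_8859_6,  ISO_8859_7,  ISO_8859_8,
         ISO_8859_9,  ISO_8859_10, ISO_8859_11, ISO_8859_12,
         ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16);

      Latin_1 : constant Sets := ISO_8859_1;
      Latin_9 : constant Sets := ISO_8859_15;

      --  Assumption: an unsupported set is reported with an exception.
      Unsupported_Set : exception;

      procedure Set_Default_Character_Set (Set : in Sets);
      function Current_Character_Set return Sets;
   end Character_Sets;

   package body Character_Sets is
      Current : Sets := Latin_1;

      procedure Set_Default_Character_Set (Set : in Sets) is
      begin
         --  Minimal behavior: only Latin_1 has to be accepted.
         if Set /= Latin_1 then
            raise Unsupported_Set;
         end if;
         Current := Set;
      end Set_Default_Character_Set;

      function Current_Character_Set return Sets is
      begin
         return Current;
      end Current_Character_Set;
   end Character_Sets;

   use Character_Sets;

begin
   Set_Default_Character_Set (Latin_1);
   Ada.Text_IO.Put_Line (Sets'Image (Current_Character_Set));

   begin
      Set_Default_Character_Set (Latin_9);
   exception
      when Unsupported_Set =>
         Ada.Text_IO.Put_Line ("Latin_9 is not supported by this sketch");
   end;
end Character_Sets_Demo;
```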

It is probably a day or two of work to modify the functions in
Ada.Characters.Handling to support all the current ISO 8859 mappings.  It
is at least ten times harder to actually test all possible combinations
of character set and Ada.Characters.Handling functions.  It could be
another five to ten times that to add tests to the validation suite,
with very little practical effect.  This is why I favor a minimalist
approach to the requirements.  (National bodies can of course require
supporting other values for Character_Sets.  For example, the Japanese
national body could require Shift-JIS support if they felt like it,
without requiring that compilers that comply with the Japanese national
standard be incompatible with ISO 8652, without the ARG spending all of
its time on character set issues.)

What about names in Ada programs?  ARGH!  If your compiler is written in
Ada and uses Ada.Characters.Handling, modifying the compiler is not a
problem.  Defining what it means to compile a program written using a
non-Latin-1 character set threatens to expand clause 2 (Lexical
Elements) to the size of a small telephone directory.  I would prefer to
just modify 2.1 to direct people to ISO 10646-1, which is the size of a
large telephone directory, plus currently five amendments, for the
meaning of lexical elements in non-Latin_1 source representations, and
let national bodies decide what they want to define locally.

*************************************************************

From: Pascal Leroy
Sent: Tuesday, March 18, 2003  2:15 AM

> I hate to reopen the character set can of worms, but I think we need to
> do it.  In effect Latin 1 is being replaced by Latin 9 (ISO 8859-15).
> Latin 9 adds the Euro sign, OE Ligatures, and S and Z with caron, and
> capital Y with diaeresis to Latin 1, removing the currency symbol, broken
> bar, some accents and the vulgar fractions.  See
> http://www.cs.tut.fi/~jkorpela/latin9.html for a fuller explanation.

This issue was discussed at some length as part of AI 285/01 (of which I am
the editor).  It is clear that adding support for Latin-9 in Ada.Characters
(and children) is relatively straightforward.  However there is the much
nastier question of type Standard.Character, (which has pretty much to
remain Latin-1 if you don't want to introduce awful incompatibilities) and
of the interactions between what happens at compile-time and what happens at
run-time.  Consider for instance the call:

    Ada.Characters.Latin_9.Handling.Is_Letter ('¦')

It has pretty much to return True (that's an S-caron in Latin-9), but that's
certainly surprising!  This amounts to breaking the Character abstraction
and interpreting characters as bytes/code points, which is likely to lead to
confusion in an Ada program that would deal with character sets having
different encodings.
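
The effect can be sketched with the code position alone. In the block below,
Is_Letter_Latin_9 is a hypothetical stand-in for the Latin-9 function
discussed above (no such language-defined unit exists); it treats the seven
positions where Latin-9 places a letter that Latin-1 does not have as letters
and defers to Ada.Characters.Handling otherwise.

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Latin_9_Letter_Demo is

   --  Hypothetical Latin-9 classification: positions 16#A6#, 16#A8#,
   --  16#B4#, 16#B8# and 16#BC# .. 16#BE# hold letters in ISO 8859-15.
   function Is_Letter_Latin_9 (Item : Character) return Boolean is
      Code : constant Natural := Character'Pos (Item);
   begin
      case Code is
         when 16#A6# | 16#A8# | 16#B4# | 16#B8# | 16#BC# .. 16#BE# =>
            return True;
         when others =>
            return Ada.Characters.Handling.Is_Letter (Item);
      end case;
   end Is_Letter_Latin_9;

   --  Position 16#A6#: BROKEN BAR in Latin-1, S WITH CARON in Latin-9.
   C : constant Character := Character'Val (16#A6#);

begin
   --  The same Character value is classified in two different ways.
   Ada.Text_IO.Put_Line
     (Boolean'Image (Ada.Characters.Handling.Is_Letter (C)));  --  FALSE
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_Letter_Latin_9 (C)));                  --  TRUE
end Latin_9_Letter_Demo;
```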

Another interesting example is mentioned in the minutes of the Bedford
meeting (http://www.ada-auth.org/ai-files/minutes/min-0210.html#AI285):
"Consider the enumeration identifier "ÿ" (latin small letter y diaeresis).
E'Image(ÿ) = "ÿ" in Latin-1 (there is no upper case version), but "Y" in
Latin-9 (there is an upper case version). So we would need the identifier
semantics to be changed depending on the character set. Pascal claims that
this is important to reading French."

After giving it more thought, I have come to the conclusion that the entire
Latin-9 approach is misguided because:

1 - There is relatively little support in software out there for this
encoding (heck, I am even reading that some mail gateways bounce back
messages that use Latin-9 as their character encoding).  Most of the editors
that I have played with just go to Unicode when you type the Euro sign.
That provides support for this new character without causing endless
compatibility nightmares.

2 - I have gone through a similar "code point shuffle" mess at the
beginning of the 80s: at the time we only had 7 bits per character (as you
probably remember, the 8th bit was often used for parity) and some genius
had the idea of encoding the French accented characters using the code points
normally assigned to [, ], \, and the like.  I have written thousands of
lines of Pascal where an array indexing looked like ArrçIè (instead of
Arr[I]) just because of this silliness. What was painful-but-tolerable 20
years ago is just not going to fly nowadays: I am ready to bet that the
world will go Unicode before it goes Latin-9.

Therefore, the latest version of AI 285 proposes to go to Unicode for the
text representation of programs, relying on the categorization work done by
the Unicode people so that we don't have to argue endlessly about which
characters can appear in identifiers, etc.  And it entirely ignores Latin-9,
or any other Latin-N for that matter.

*************************************************************

From: Robert I. Eachus
Sent: Tuesday, March 18, 2003  2:15 AM

Pascal Leroy wrote:

> This issue was discussed at some length as part of AI 285/01 (of which I am
> the editor).  It is clear that adding support for Latin-9 in Ada.Characters
> (and children) is relatively straightforward.  However there is the much
> nastier question of type Standard.Character, (which has pretty much to
> remain Latin-1 if you don't want to introduce awful incompatibilities) and
> of the interactions between what happens at compile-time and what happens at
> run-time.

I thought we had an AI on the subject, but searching for Latin in the
title didn't find it.  I see what happened is that the name of the AI
was changed.  (I don't want to make work for Randy, and this may be a
rare occurrence or may not be.  Perhaps a set of links to "old" names
somewhere.)

So I guess that the title of the original post is correct, because as I
see it, the issue of Latin 9 support is completely separate from the
issues with 16 and 32 bit character sets.

Now to pull some magic by quoting from rev 1.4 of AI 285:

An implementation is allowed to provide a library package named
Ada.Characters.Latin_9.  This package shall be identical to
Ada.Characters.Latin_1, except for the following differences:

- It doesn't declare the constants Currency_Sign, Broken_Bar, Diaeresis,
Acute, Cedilla, Fraction_One_Quarter, Fraction_One_Half, and
Fraction_Three_Quarter.

- It declares the following constants:

     Euro_Sign : constant Character := '€'; -- Character'Val (164)
     UC_S_Caron : constant Character := 'S'; -- Character'Val (166)
     LC_S_Caron : constant Character := 's'; -- Character'Val (168)
     UC_Z_Caron : constant Character := 'Ž'; -- Character'Val (180)
     LC_Z_Caron : constant Character := 'ž'; -- Character'Val (184)
     UC_OE_Diphthong : constant Character := 'O'; -- Character'Val (188)
     LC_OE_Diphthong : constant Character := 'o'; -- Character'Val (189)
     UC_Y_Diaeresis : constant Character := 'Y'; -- Character'Val (190)

In Netscape 7.01, with the encoding set to Latin-1, this displays
(correctly) the Latin 9 representations!  As does OpenOffice.org,
Notepad and so on.  Now let me abstract from the Ada.Characters.Latin 1

     Currency_Sign : constant Character := '';  --Character'Val(164)
     Broken_Bar    : constant Character := '¦';  --Character'Val(166)
     Diaeresis     : constant Character := '"';  --Character'Val(168)
     Acute         : constant Character := ''';  --Character'Val(180)
     Cedilla       : constant Character := ',';  --Character'Val(184)
     Fraction_One_Quarter : constant Character := '¼'; --Character'Val(188)
     Fraction_One_Half : constant Character := '½';  --Character'Val(189)
     Fraction_Three_Quarters : constant Character := '_'; --Character'Val(190)

How can this work?  Easy, other standards, in particular ISO/IEC 2022,
http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=22747
specify control characters and escape sequences which can be used with
Latin 1--or any ISO 8859 character set--to access characters from other
sets.

> Consider for instance the call:
 >
 >     Ada.Characters.Latin_9.Handling.Is_Letter ('¦')
 >
> It has pretty much to return True (that's an S-caron in Latin-9), but that's
> certainly surprising!  This amounts to breaking the Character abstraction
> and interpreting characters as bytes/code points, which is likely to lead to
> confusion in an Ada program that would deal with character sets having
> different encodings.

Why would you expect this call to work?  You could argue that a compiler
"should" raise Program_Error or Constraint_Error, but I would expect any
reasonable compiler to object at compile time to an invalid character
literal.  Remember, notationally Ada is written in Unicode/ISO 10646
BMP, however it is represented.  In context, that call is illegal and

     Ada.Characters.Latin_9.Handling.Is_Letter ('S')

is legal and should return true. (Assuming we recommend having or
allowing a package Ada.Characters.Latin_9.Handling.)  But this
discussion has done a lot to convince me that the best solution is to
add a function Ada.Characters.Current_Set to Ada.Characters.  The
required work is trivial for compilers that want to stay in the Latin 1
only world, and for compilers that do want to implement support for
other 8-bit character sets, they really have to do most of the same work
anyway.

To repeat my proposal:

Add an enumeration type Sets to Ada Characters, or if you prefer
Character_Sets.  It should enumerate all the ISO 8859 character sets.
(If you want to be clever, we could start with ISO646 so that
Sets'Pos(N) = ISO 8859-N.) In any case we should allow implementations
to extend the type.  This would allow both for new ISO 8859 character
sets, and for Unicode, EBCDIC, IBM code pages, and so on.

Now add procedure Set_Default_Character_Set, and function
Current_Character_Set to Ada.Characters.Handling.  (Or if you prefer to
Ada.Characters.)  As far as I am concerned the only required behavior
for Set_Default_Character_Set should be to accept an argument of Latin_1.

We should probably also add a library pragma to change the default
mapping of Character.  (Compilers will probably accept command line
setting of character mappings, but I think that a standard pragma would
help standardization.)
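
A minimal sketch of what the query and set operations described above might
look like follows.  It is written as a self-contained procedure rather than
as a child of Ada.Characters.  The names Character_Sets,
Current_Character_Set and Set_Default_Character_Set follow this message;
the abbreviated list of enumeration literals and the exception
Unsupported_Character_Set are assumptions made for illustration, since the
proposal does not say which exception would be raised.

```ada
with Ada.Text_IO;

procedure Character_Set_Demo is

   --  Abbreviated for illustration; the proposal would enumerate every
   --  ISO 8859 part and let implementations extend the list.
   type Character_Sets is (ISO_646, Latin_1, Latin_2, Latin_9);

   --  Assumption: the proposal does not name an exception.
   Unsupported_Character_Set : exception;

   Current : Character_Sets := Latin_1;

   function Current_Character_Set return Character_Sets is
   begin
      return Current;
   end Current_Character_Set;

   procedure Set_Default_Character_Set (Set : Character_Sets) is
   begin
      --  Minimal behaviour: only Latin_1 has to be accepted.
      if Set /= Latin_1 then
         raise Unsupported_Character_Set;
      end if;
      Current := Set;
   end Set_Default_Character_Set;

begin
   Set_Default_Character_Set (Latin_1);
   Ada.Text_IO.Put_Line
     ("Current set: " & Character_Sets'Image (Current_Character_Set));

   begin
      Set_Default_Character_Set (Latin_9);
   exception
      when Unsupported_Character_Set =>
         Ada.Text_IO.Put_Line ("Latin_9 is not supported here");
   end;
end Character_Set_Demo;
```

An implementation that only cares about Latin 1 would need nothing more
than this; one that supports other 8-bit sets would replace the test in
Set_Default_Character_Set with its own list.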

If my proposal is accepted:

      Ada.Characters.Handling.Is_Letter ('¦')

should return false when Ada.Characters.Current_Set is Latin_1, and

     Ada.Characters.Handling.Is_Letter ('S')

should return true when Ada.Characters.Current_Set is Latin_9. The
behavior of

     Ada.Characters.Handling.Is_Letter (Character'Val(166))

should depend on the current value of Ada.Characters.Current_Set. What
happens in the other cases will at best be implementation defined.  In
other words, if your program contains a (Unicode/BMP or UTF8) character
literal that is not in a supported character set, I expect
Program_Error; if a character in a literal is not a legal literal for
Character, it is an error, just like other misspellings of literals.

> Another interesting example is mentioned in the minutes of the Bedford
> meeting (http://www.ada-auth.org/ai-files/minutes/min-0210.html#AI285):
> "Consider the enumeration identifier "ÿ" (latin small letter y diaeresis).
> E'Image(ÿ) = "ÿ" in Latin-1 (there is no upper case version), but "Y" in
> Latin-9 (there is an upper case version). So we would need the identifier
> semantics to be changed depending on the character set. Pascal claims that
> this is important to reading French."

Exactly why I think a way is needed for the programmer to be able to
determine what the actual character set mapping is.  Almost no burden
for compilers that support Latin 1 only, and not that much additional
for compilers that do support other 8-bit mappings.  (Actually, I may be
wrong but I think all currently validated compilers accept source in
non-Latin 1 character sets.)

> After giving it more thought, I have come to the conclusion that the entire
> Latin-9 approach is misguided because:
>
> 1 - There is relatively little support in software out there for this
> encoding (heck, I am even reading that some mail gateways bounce back
> messages that use Latin-9 as their character encoding).  Most of the editors
> that I have played with just go to Unicode when you type the Euro sign.
> That provides support for this new character without causing endless
> compatibility nightmares.
>
> 2 - I have gone through a similar "code point shuffle" mess at the
> beginning of the 80s: at the time we only had 7 bits per character (as you
> probably remember, the 8th bit was often used for parity) and some genius
> had the idea of encoding the French accented characters using the code points
> normally assigned to [, ], \, and the like.  I have written thousands of
> lines of Pascal where an array indexing looked like ArrçIè (instead of
> Arr[I]) just because of this silliness. What was painful-but-tolerable 20
> years ago is just not going to fly nowadays: I am ready to bet that the
> world will go Unicode before it goes Latin-9.
>
> Therefore, the latest version of AI 285 proposes to go to Unicode for the
> text representation of programs, relying on the categorization work done by
> the Unicode people so that we don't have to argue endlessly about which
> characters can appear in identifiers, etc.  And it entirely ignores Latin-9,
> or any other Latin-N for that matter.

Couldn't agree more.  The right solution is not to switch from Latin 1
to any other character set as a standard, but to supply a standard
method for localization, and keep with the assumption of current
Unicode/BMP for Wide_Character and for (notational) source.  Does any
implementor see a problem implementing the above recommendation?

We could also go to the extreme of adding another optional annex dealing
with character representation issues, but I think we all agree that the
ARG should stay away from piecemeal character set bindings.  On the
other hand, I can see having a standard Wide_Character categorization,
and allowing other characterizations to fall out from that.  But let's
keep that discussion in AI-285.

*************************************************************

From: Randy Brukardt
Sent: Tuesday, March 18, 2003  6:04 PM

> (Actually, I may be
> wrong but I think all currently validated compilers accept source in
> non-Latin 1 character sets.)

Since the only currently validated compilers are from Rational and DDC-I,
that isn't saying much at all. You have to at least talk about widely-used
compilers, but then you get into definitional problems.

*************************************************************

From: Robert I. Eachus
Sent: Tuesday, March 18, 2003  6:51 PM

We are in the standards business.  I think that this is an area where a
small extension to the standard will be very helpful in providing
portability.  But we can't really worry about the cost of conformity for
nonstandardized compilers. ;-)

That is why I think that a definition which names the various character
sets should be standardized:

type Character_Sets is (ISO_646, Latin_1, Latin_2,...Latin_Greek...);

This would help standardize the way that non-Latin 1 character sets are
named for compatibility.  But I think we should stay out of the business
of defining which characters are which for Latin_Greek, etc.  That is
ISO/IEC JTC1/SC2's job, and I think they do it pretty well.

Now suppose my proposal is accepted and, say, GNAT chooses to support the
function Ada.Characters.Current_Set in a useful manner.  However, ACT
sees no demand for Ada.Characters.Set_Default_Character_Set to do
anything useful, and therefore raises an exception if you try to change
the value.  (In other words Ada.Characters.Set_Default_Character_Set
(Ada.Characters.Current_Set) does not raise an exception, but actually
trying to change the value does.)

Some other vendor may have a customer who requires Latin_Hebrew support,
but could care less about Latin 9.  Fine.  Assigning Ada names to the
various 8859 character sets is in our area of competence.  Deciding
which sets compiler vendors support should be left up to their customers.

Is this useful progress towards standardization? Sure.  Is arguing over
whether there is demand for Linear_B support way out of the way of
anything that the ARG wants to get involved in?  Obviously.  Or worse,
whether a variable named with the Greek Alpha, should match a Latin A?
Arggh! (If you think that is bad what about CJK unification?  Do we want
to get into political cat fights about whether or not a Japanese Kanji
code point matches a (Korean) Hangul character with a different
appearance?  Please! Anything but that...)

That is why I think we should be in the business of defining how to
change character sets, but should stay well out of the politics of
whether, say, compilers purchased by the Canadian government must
support Latin 9.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, March 19, 2003  3:54 AM

> In Netscape 7.01, with the encoding set to Latin-1, this displays
> (correctly) the Latin 9 representations!  As does OpenOffice.org,
> Notepad and so on.  Now let me abstract from the Ada.Characters.Latin 1

In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as
soon as you type a non-Latin-1 character.  So I am not sure what your
point is.  (Didn't check the other software packages that you mention.)

> Add an enumeration type Sets to Ada Characters, or if you prefer
> Character_Sets.  It should enumerate all the ISO 8859 character sets.
> ...
> Now add procedure Set_Default_Character_Set, and function
> Current_Character_Set to Ada.Characters.Handling.
> ...
> We should probably also add a library pragma to change the default
> mapping of Character.  (Compilers will probably accept command line
> setting of character mappings, but I think that a standard pragma would
> help standardization.)

I understand the usefulness of a pragma, but I don't really understand
what sense it makes to change the default character set (whatever that
is) at run-time.  Consider the case where you compile a program in
Latin-9 mode, and it has an enumeration literal with an S-caron in it.
Then at run-time you switch to Latin-1.  Would the 'Image attribute now
return a string including a broken bar?  That would be very strange.

I can imagine why a program might want to juggle with different
character encodings (by withing different Latin_N units) but it seems to
me that the default character set has to be fixed at compilation time.

Anyway none of this changes my opinion that the Latin-N sets are far too
unimportant to spend precious ARG time on them.

> Or worse,
> whether a variable named with the Greek Alpha, should match a Latin A?
> Arggh! (If you think that is bad what about CJK unification?  Do we want
> to get into political cat fights about whether or not a Japanese Kanji
> code point matches a (Korean) Hangul character with a different
> appearance?  Please! Anything but that...)

As a matter of fact, the current AI 285 does exactly that, and I don't
see this as a political cat fight.  The idea is to just follow what the
Unicode folks are doing (and I suppose _they_ do quite a bit of
political cat fight).  So to answer your questions, a Latin A is not the
same thing as a Greek Alpha or a Cyrillic A.  And at this point the
kanjis and hanguls are not letters, so they are not allowed in
identifiers.  When the Unicode people decide that ideograms are letters,
we will update the definition in Ada.

*************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, March 19, 2003  4:19 AM

> I understand the usefulness of a pragma, but I don't really understand
> what sense it makes to change the default character set (whatever that
> is) at run-time.  Consider the case where you compile a program in
> Latin-9 mode, and it has an enumeration literal with an S-caron in it.
> Then at run-time you switch to Latin-1.  Would the 'Image attribute now
> return a string including a broken bar?  That would be very strange.
>
And if you go that way, you may want different tasks to use different
encodings.... Did I hear "can of worms" ?

*************************************************************

From: Robert I. Eachus
Sent: Wednesday, March 19, 2003  4:43 PM

First, let me get this out of the way.  I really like UTF-8, and for
that matter UTF-16.  I would also love to put real Unicode/BMP support
into Chapter (Clause) 2 and elsewhere in the RM. I would like to see a
(standard) Wide_Text_IO that supported UTF-1. But it is a lot of work.

However, even if users do eventually migrate toward 16-bit and 32-bit
character standards, we currently have an 8-bit character type in the
standard.  My reason for arguing for a minimal AI in this area is
that I think that it would "clear the decks" forever in the 8-bit area,
and let us concentrate on enhancing 16-bit support in the future.

Pascal Leroy wrote:

> In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as
> soon as you type a non-Latin-1 character.  So I am not sure what your
> point is.  (Didn't check the other software packages that you mention.)

I guess you missed the point.  Windows actually uses a superset of Latin
1 that contains all the Latin 9 characters with different code-points.
Windows also has IANA-registered extended versions of some other Latin
sets.  (These are Windows-1291 et. seq.) See the MIME and HTML standards
for more details.  Notepad and other applications may switch to Unicode
internally when you enter non-Latin 1 (or non-Windows 1291) characters.
  But if you cut-and-paste into a text document from one with a
different mapping, most PC software seems to use ISO 2022 control
characters to avoid having to reprocess the entire document.  This can
be done as long as you use at most three ISO 8859 (or Windows) font
variants.

> I understand the usefulness of a pragma, but I don't really understand
> what sense it makes to change the default character set (whatever that
> is) at run-time.

> I can imagine why a program might want to juggle with different
> character encodings (by withing different Latin_N units) but it seems to
> me that the default character set has to be fixed at compilation time.

You may be right which is why I gave that hypothetical GNAT example.  I
think it would be almost trivial for them to support a current character
set enquiry function, but a procedure to change the character set at
run-time might take a lot more work.

Where you would want to be able to change the default character set at
run-time would be for things like Character to UTF-8 encoders and decoders.

 > Consider the case where you compile a program in Latin-9 mode, and
 > it has an enumeration literal with an S-caron in it. Then at
 > run-time you switch to Latin-1.  Would the 'Image attribute now
 > return a string including a broken bar?  That would be very strange.

Why?  The character or string literal gets translated from Latin 9 to
Character at compile time.  Then you conceptually remap all Character
and String values when you change the default character set at run-time.
If you convert the literal from Latin 9 to UTF-8 or Unicode at compile
time, then try to convert back with a default character set of Latin 1,
you can and should expect a Constraint_Error.
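
The round trip described above can be sketched as follows.  Both
Latin_9_To_Wide and To_Latin_1 are hypothetical helpers written for
illustration; they are not existing library subprograms.  The first decodes
a Latin-9 code point to its ISO 10646 position, and the second converts back
under a Latin-1 default, raising Constraint_Error when no Latin-1 character
exists.

```ada
with Ada.Text_IO;

procedure Remap_Demo is

   --  Hypothetical helper: decode a Latin-9 code point to ISO 10646.
   --  Only the eight positions that differ from Latin-1 need a table.
   function Latin_9_To_Wide (Item : Character) return Wide_Character is
   begin
      case Item is
         when Character'Val (164) => return Wide_Character'Val (16#20AC#);
         when Character'Val (166) => return Wide_Character'Val (16#0160#);
         when Character'Val (168) => return Wide_Character'Val (16#0161#);
         when Character'Val (180) => return Wide_Character'Val (16#017D#);
         when Character'Val (184) => return Wide_Character'Val (16#017E#);
         when Character'Val (188) => return Wide_Character'Val (16#0152#);
         when Character'Val (189) => return Wide_Character'Val (16#0153#);
         when Character'Val (190) => return Wide_Character'Val (16#0178#);
         when others =>
            return Wide_Character'Val (Character'Pos (Item));
      end case;
   end Latin_9_To_Wide;

   --  Hypothetical helper: convert back assuming a Latin-1 default.
   function To_Latin_1 (Item : Wide_Character) return Character is
   begin
      if Wide_Character'Pos (Item) > 255 then
         raise Constraint_Error;
      end if;
      return Character'Val (Wide_Character'Pos (Item));
   end To_Latin_1;

   W : constant Wide_Character := Latin_9_To_Wide (Character'Val (166));
   C : Character;

begin
   Ada.Text_IO.Put_Line
     ("Code point:" & Integer'Image (Wide_Character'Pos (W)));
   C := To_Latin_1 (W);  --  S caron has no Latin-1 equivalent
   Ada.Text_IO.Put_Line
     ("Converted:" & Integer'Image (Character'Pos (C)));
exception
   when Constraint_Error =>
      Ada.Text_IO.Put_Line ("Constraint_Error: no Latin-1 equivalent");
end Remap_Demo;
```

The S-caron decodes to position 352 (16#0160#), and the attempt to bring it
back into a Latin-1 Character ends in the Constraint_Error handler.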

> Anyway none of this changes my opinion that the Latin-N sets are far too
> unimportant to spend precious ARG time on them.

In one sense, as I said I agree.  But I think that since we do have
compilers around that support remapping of Character, a standard way of
querying that setting is needed for standardization.  As I indicated, I
can easily be convinced that a way of setting the default mapping at
run-time is a bit too much.

Certainly though, the same issues will come up with respect to
Wide_Character if and when compilers support different Wide_Character
mappings.  In the Wide_Character case determining at run-time what the
actual mapping is may be important, but I certainly agree that requiring
support for changing the Wide_Character mapping at run-time (say from
Shift-JIS to Unicode) would be extreme.

Remember that all that my current proposal requires is that changing
from Latin 1 to Latin 1 succeed. I agree that anything else should be
left outside the scope of the (ISO) standard.  I have no trouble with
leaving the procedure to change the default character set out
altogether, or making it optional.

> As a matter of fact, the current AI 285 does exactly that, and I don't
> see this as a political cat fight.  The idea is to just follow what the
> Unicode folks are doing (and I suppose _they_ do quite a bit of
> political cat fight).  So to answer your questions, a Latin A is not the
> same thing as a Greek Alpha or a Cyrillic A.  And at this point the
> kanjis and hanguls are not letters, so they are not allowed in
> identifiers.  When the Unicode people decide that ideograms are letters,
> we will update the definition in Ada.

Exactly my point, except that I think we officially follow ISO 10646 not
Unicode.  So in theory we should update to Unicode 3.2 compatibility
when DIS 10646(2003) is accepted.  (Those battles come closer to
vendettas than cat fights.  The major battles are Japanese vs. Korean,
Chinese vs. Japanese, Russian vs. Georgian, Greeks vs. Macedonians, and
francophones vs. everybody.  Did I miss anyone?)

If any other ARG--or CRG--members really care about all this, you too
can join the madness in Prague next week.
(http://www.unicode.org/iuc/iuc23/ ;-)

*************************************************************

From: Randy Brukardt
Sent: Wednesday, March 19, 2003  7:25 PM

> Pascal Leroy wrote:
>
> > In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as
> > soon as you type a non-Latin-1 character.  So I am not sure what your
> > point is.  (Didn't check the other software packages that you mention.)
>
> I guess you missed the point.  Windows actually uses a superset of Latin
> 1 that contains all the Latin 9 characters with different code-points.
> Windows also has IANA-registered extended versions of some other Latin
> sets.  (These are Windows-1291 et. seq.) See the MIME and HTML standards
> for more details.  Notepad and other applications may switch to Unicode
> internally when you enter non-Latin 1 (or non-Windows 1291) characters.

Humm, the messages you are sending are encoded as "Windows-1252", which is
the standard Windows character set. That hardly proves anything at all
(other than that Windows doesn't use Latin-1 itself). (I checked this out in
the spam filter.)

>   But if you cut-and-paste into a text document from one with a
> different mapping, most PC software seems to use ISO 2022 control
> characters to avoid having to reprocess the entire document. This can
> be done as long as you use at most three ISO 8859 (or Windows) font
> variants.

Nope, it doesn't change the text at all (if it's in the standard Windows
character set, which most everything is). And if you paste it into the DOS
box (which uses the OEM character set - which is how I edit the AIs with my
circa-1986 text editor), it just gets converted to the nearest equivalents.
For instance, I get a capital Y for UC_Y_Diaeresis (which, BTW, is how your
note will appear in the !appendix to AI-285).

Generalizations about Windows are almost always wrong. :-)

*************************************************************

From: Robert I. Eachus
Sent: Thursday, March 20, 2003  12:20 AM

Randy Brukardt wrote:

> Humm, the messages you are sending are encoded as "Windows-1252", which is
> the standard Windows character set. That hardly proves anything at all
> (other than that Windows doesn't use Latin-1 itself). (I checked this out in
> the spam filter.)

(Sorry 1291 et. seq. instead of 1251 et. seq. was a typo.)

I guess I shouldn't be surprised that 1252 has succeeded 1251 as the
"standard" Windows binding in the US, but I hadn't noticed.  But that
more clearly makes my point.  Users might want to be able to use 8-bit
bindings that the ARG as a group should have little or no interest in.
But there is the IANA registry, and I think we can bind to a pointer to
those names with little difficulty, and leave it to compiler vendors and
others to do the "proper" binding to the character set they want to use.
  We should in no way require compilers to reject S or o (S-caron or the
oe ligature) in a name.  But we should fix that through references to
the Unicode & ISO/IEC 10646 standards, and let compiler vendors support
the 8-bit sets their users want to use. (Including 8-bit standards like
Shift-JIS and UTF-8.)

> Nope, it doesn't change the text at all (if it's in the standard Windows
> character set, which most everything is).

Oh, there are those who would make you pay dearly for those comments,
unless you meant Unicode as the "standard" Windows character set.  But
the reality is that there is NO standard 8-bit character set for
Windows, versions for different countries use different character sets.

> And if you paste it into the DOS box (which uses the OEM character set -
 > which is how I edit the AIs with my circa-1986 text editor), it just gets
 > converted to the nearest equivalents. For instance, I get a capital Y for
 > UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix
 > to AI-285).

Ouch, does that mean I should write the proposal up as a new draft AI,
so people can read it?

> Generalizations about Windows are almost always wrong. :-)

I have learned the hard way that generalizations about preferred
character sets are ALWAYS wrong.

*************************************************************

From: Randy Brukardt
Sent: Thursday, March 20, 2003  5:51 PM

> Randy Brukardt wrote:
>
> > Humm, the messages you are sending are encoded as "Windows-1252", which is
> > the standard Windows character set. That hardly proves anything at all
> > (other than that Windows doesn't use Latin-1 itself). (I checked this out in
> > the spam filter.)
>
> (Sorry 1291 et. seq. instead of 1251 et. seq. was a typo.)
>
> I guess I shouldn't be surprised that 1252 has succeeded 1251 as the
> "standard" Windows binding in the US, but I hadn't noticed.

FYI, that's confused. 1251 is "Cyrillic", while 1252 is "Western European".

...
>   We should in no way require compilers to reject S or o (S-caron or the
> oe ligature) in a name.  But we should fix that through references to
> the Unicode & ISO/IEC 10646 standards, and let compiler
> vendors support the 8-bit sets their users want to use. (Including 8-bit
> standards like Shift-JIS and UTF-8.)

Which is exactly what Pascal has proposed.

But it should be pointed out that this is a very pervasive change. It means
that the representation for names at runtime (in things like the tables for
'Image, for 'External_Tag, for exception information) has to be changed (at the
very least to UTF-8). For Janus/Ada, where most of the runtime code that deals
with those things is written in assembler, such a change will be very
expensive. And that will be true to some extent or other for all compilers.

> > Nope, it doesn't change the text at all (if it's in the standard Windows
> > character set, which most everything is).
>
> Oh, there are those who would make you pay dearly for those comments,
> unless you meant Unicode as the "standard" Windows character
> set.  But
> the reality is that there is NO standard 8-bit character set for
> Windows, versions for different countries use different
> character sets.

Of course. I should have said "standard US Windows character set"; didn't mean
to imply that it is the same for everyone.

> > And if you paste it into the DOS box (which uses the OEM character set -
>  > which is how I edit the AIs with my circa-1986 text editor), it just gets
>  > converted to the nearest equivalents. For instance, I get a capital Y for
>  > UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix
>  > to AI-285).
>
> Ouch, does that mean I should write the proposal up as a new draft AI,
> so people can read it?

Nope, AIs go through the same text editor. Using non-7-bit characters in AIs is
strongly discouraged. (If we wanted to start using HTML for AIs, then perhaps a
little more flexibility could be allowed.)

> > Generalizations about Windows are almost always wrong. :-)
>
> I have learned the hard way that generalizations about preferred
> character sets are ALWAYS wrong.

Correct. The less the standard says about character sets, the better. Your
proposal seems to require a lot of additional verbiage and support to solve a
problem that doesn't seem to actually exist. The Unicode/ISO 10646 problem does
exist, but once we support that fully, compilers can support anything they want
without us getting in the way.

(It would be nice to have a way to convert to and from UTF-8 in Ada programs.
But that's one of many things that are "easy enough to write yourself", so
it's hard to say if it is worth adding anything for that.)
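
As an indication of how much work "write it yourself" involves, here is a
minimal sketch of the encoding direction only.  To_UTF_8 is a hypothetical
function, not an existing library routine.  It assumes the input is limited
to the BMP (as Wide_Character is) and does not treat surrogate code points
specially.

```ada
with Ada.Text_IO;

procedure UTF_8_Demo is

   --  Hypothetical encoder: each Wide_Character becomes one, two or
   --  three bytes, stored one byte per Character of the result.
   function To_UTF_8 (Item : Wide_String) return String is
      Result : String (1 .. 3 * Item'Length);
      Last   : Natural := 0;
      Code   : Natural;

      procedure Append (Byte : Natural) is
      begin
         Last := Last + 1;
         Result (Last) := Character'Val (Byte);
      end Append;
   begin
      for I in Item'Range loop
         Code := Wide_Character'Pos (Item (I));
         if Code < 16#80# then
            Append (Code);
         elsif Code < 16#800# then
            Append (16#C0# + Code / 64);
            Append (16#80# + Code mod 64);
         else
            Append (16#E0# + Code / 4096);
            Append (16#80# + (Code / 64) mod 64);
            Append (16#80# + Code mod 64);
         end if;
      end loop;
      return Result (1 .. Last);
   end To_UTF_8;

   Euro    : constant Wide_String := (1 => Wide_Character'Val (16#20AC#));
   Encoded : constant String := To_UTF_8 (Euro);

begin
   --  Print the encoded bytes as decimal values.
   for I in Encoded'Range loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (Encoded (I))));
   end loop;
   Ada.Text_IO.New_Line;
end UTF_8_Demo;
```

For the Euro sign (16#20AC#) this yields the three bytes 226, 130 and 172,
that is, E2 82 AC.  Decoding, with its validity checks, is the larger half
of the job and is not shown.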

*************************************************************

From: Robert Dewar
Sent: Saturday, March 22, 2003  11:38 AM

I find all this discussion of character sets going way off target. All we are
talking about here is some predefined names for some of the characters, nothing
more and nothing less.

*************************************************************

From: Robert I. Eachus
Sent: Sunday, March 23, 2003  11:43 AM

I'm confused.  I am certainly proposing using the IANA registry for the
names of character sets, and a way for programmers to determine which
set is in use.

As I understand the 16-bit character set issues, in addition to character
names, there is characterization in terms of 2.2 Lexical Elements for
non-Latin 1 characters. In other words which characters can be used in
names and numeric literals.

I suspect that what Robert is referring to is the fact that if someone
uses a non-Latin 1 eight-bit set by a command-line argument, the names
won't match the characters as displayed.  If so, I am actually
recommending the permission for implementors to 'fix' more than that.

For example, I don't think we should require that implementations
support the Windows 1252 character set, but it would be nice to allow
implementations which choose to do so to get it right.  Some of that
will follow at compile time if implementations map Windows 1252 to the
appropriate Unicode/BMP characters.  But it would also be nice to allow
the Ada.Characters hierarchy and Ada.Text_IO features to be used with
non-Latin 1 character sets.

The keyword in the previous paragraph is "allow."  As I said, I think we
can go a bit farther and provide a standard way to determine the current
character set mapping.  But there is no reason for us to say you must
support these character sets (other than Latin 1!), but must not support
these other sets.  This really has to do with code points in the 00 to
1F and 80 to 9F ranges being printable characters instead of control
characters.

*************************************************************

!topic To_Ada conversion in case of wchar_t'Size > 16
!reference RM95 B.3(58), RM95 B.3(60)
!from Vadim Godunko 2003-01-21
!discussion

At least one C library implementation (glibc) uses 32-bit values for the
wchar_t type. In this case the behavior of the conversion functions To_Ada
is not determined.

I propose that those functions must return Wide_Character'Val (16#FFFD#)
(the replacement character) if the value of Item is outside of the BMP.
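
A minimal sketch of the proposed behavior follows.  Because the actual
representation of Interfaces.C.wchar_t is implementation-defined, the sketch
uses a hypothetical 32-bit modular type Wchar_32 as a stand-in, and
To_Ada_Sketch is likewise an illustrative name rather than the real
Interfaces.C.To_Ada.

```ada
with Ada.Text_IO;

procedure Replacement_Demo is

   --  Assumption: a 32-bit stand-in for a wide C character.
   type Wchar_32 is mod 2 ** 32;

   Replacement : constant Wide_Character := Wide_Character'Val (16#FFFD#);

   --  Hypothetical conversion: values outside the BMP map to the
   --  replacement character instead of failing.
   function To_Ada_Sketch (Item : Wchar_32) return Wide_Character is
   begin
      if Item > 16#FFFF# then
         return Replacement;
      else
         return Wide_Character'Val (Item);
      end if;
   end To_Ada_Sketch;

begin
   Ada.Text_IO.Put_Line
     ("Inside BMP: "
      & Integer'Image (Wide_Character'Pos (To_Ada_Sketch (16#20AC#))));
   Ada.Text_IO.Put_Line
     ("Outside BMP:"
      & Integer'Image (Wide_Character'Pos (To_Ada_Sketch (16#1D11E#))));
end Replacement_Demo;
```

A value inside the BMP is returned unchanged, while 16#1D11E# comes back as
position 65533 (16#FFFD#).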

*************************************************************

From: Kiyoshi Ishihata
Sent: Monday, July 28, 2003  10:01 AM

> The next meeting of Japanese SC22 is on July 18.  After that, I will
> send you a brief report about our thought, but please understand that
> this may be a tentative position.

Sorry for the delay.  I summarized our discussion as follows.
If our position is accepted, the AI would need to go through a major
rewrite.

===============================================================================

(1) Do not refer to Unicode

The current AI frequently refers to Unicode and the Web site of the
Unicode Consortium.  It is not appropriate in ISO/IEC context.  Simply
changing the word "Unicode" to "ISO/IEC 10646" is not enough, since
the two systems differ more than you might think.

Characters of 10646 and Unicode are identical, or at least intended to
be identical.  Their code positions are the same.  However, the following
Unicode products mentioned in this AI do not exist in the 10646 world.

   Character categorization
       http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html

   Recommendation of character repertoire for identifiers
       http://www.unicode.org/unicode/reports/tr15/tr15-22.html

   Normalization Form KC
       http://www.unicode.org/unicode/reports/tr15/tr15-22.html

   Full case folding
       http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt

   Characters terminating lines
       http://www.unicode.org/reports/tr13

These specifications are defined by the Unicode Consortium, and should
not be regarded as internationally agreed standards.  We do not agree
with the idea to define Ada rules based on these reports.

Yes, some languages like Java and C# refer to these Unicode reports.
But, they are "imported" languages in ISO/IEC standardization processes,
different from Ada in this respect.

(2) Recommendation of character repertoire for identifiers -- TR 10176

The following Technical Report has been published by JTC1.

   ISO/IEC TR 10176:2003
   Information technology -- Guidelines for the preparation of
   programming language standards (Fourth edition)

This report contains a table of characters which should be made usable
in identifiers (Annex A).  We believe that this document and ISO/IEC 10646
itself are the only possible references in the Ada standard.

The table of characters for identifiers is supposed to be identical to
the above mentioned Unicode report.  In fact, the term "Unicode Character
Database" does appear in Annex A of the TR.  The TR has been frequently
revised, probably following the requests from Unicode people.

Therefore, changing the reference from Unicode to TR 10176 does not
significantly change the definition of character repertoire.

The TR defines the character repertoire by enumerating all allowed
characters.  In a formal sense, it does not depend upon the concept
"character categorization".  Although Annex A of the TR gives categorization
of each allowed character, the categorization itself does not contribute
to the definition.

A possible demerit of referring to 10176 is the lack of timeliness of
revisions.  In the future, Unicode reports will be promptly revised.
Compared to this, the revision process of 10176 would be slower.
However, since 10176 is the only possible reference in the ISO/IEC world,
we have no other options.

Note that some languages including C++ and Cobol define characters
for identifiers based on the recommendation of TR 10176.

(3) No national variants of numeric literals

We do not want to extend the character repertoire for numeric literals.

Identifiers are used to denote entities, and often words or phrases of
natural languages are used to compose identifiers.  Therefore, it is quite
beneficial to use one's mother tongue in spelling identifiers.

Numeric literals are much more universal.  They denote numeric values,
and, unlike identifiers, do not have culture sensitive nuances in them.
At least we Japanese are happy to write numeric literals only using
ASCII characters.

We understand that there is no Unicode report recommending to extend
characters for numeric literals.

Numeric literals denote values, and their values should be computed
from the value denoted by each digit.  If national variants are allowed,
people not knowing other countries' characters cannot compute the values
of literals.  On the other hand, identifiers can be recognized by people
of other countries through the process of pattern matching of characters.
This is not easy, but anyway is possible.

In summary, we believe that extending the characters allowed in numeric
literals would do more harm than good.

Of course, national variants of number representation may be useful in
Input-Output.  But, this is a different issue.

(4) No normalization

We cannot refer to "Normalization Form KC" which is a Unicode term.
Neither 10646 nor 10176 provides substitute for this concept.  Therefore
we cannot introduce the character normalization process.

This is not bad, we think.  For example, the character "letter A with
umlaut" is regarded different from the combination of two characters
"letter A" and "umlaut".  But, in the first place, it is not a good idea
to have two representations of a single conceptual character.  People
would try to define their own canonical representation of characters.
Regarding "A with umlaut" and "A"+"umlaut" pair as different would not
be a severe burden for them.

Implementations would be much easier, since they can resort to simple
byte-to-byte comparison.

You say

> This is to ensure that identifiers which look visually the same are
> considered as identical, even if they are composed of different characters.

but this principle is not strongly enforced.  The obvious example is
Latin A and Greek Alpha.  They look identical but are distinguished
in identifiers.  We think that they are inherently different characters
and there are no reasons to consider them the same in identifiers.
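
The distinction drawn in this item can be checked with a small sketch. It
uses Python's standard unicodedata module (an assumption of the example; the
AI itself says nothing about Python) and the helper name same_after_nfkc is
invented for illustration. It suggests that Normalization Form KC unifies
the two spellings of "A with umlaut" but leaves Latin A and Greek Alpha
distinct.

```python
import unicodedata

precomposed = "\u00C4"  # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
latin_a = "A"           # U+0041
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA


def same_after_nfkc(a, b):
    """Compare two strings after Normalization Form KC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(precomposed == decomposed)                 # False: code points differ
print(same_after_nfkc(precomposed, decomposed))  # True: same after NFKC
print(same_after_nfkc(latin_a, greek_alpha))     # False: still distinct
```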

(5) Uppercase-lowercase correspondence

In Ada, we must have one particular normalization process, which is
the uppercase-lowercase correspondence.  10176 does not say anything
on this topic, so we have to devise some feasible definition.

One possible way of definition is to utilize character names defined
in ISO/IEC 10646.  We can see the obvious correspondence between
"Latin capital letter A" and "Latin small letter A".  We do not know
whether this can easily be implemented in Ada compilers or not.

We notice that there are cases not covered by this simple correspondence.
For example, German "SS" corresponds to two lowercase sequences.  One
is the string "ss", and the other is the es-zett character.  We feel that
such complicated cases should be untouched in this time frame, waiting for
the future standardization of appropriate ISO/IEC standards or technical
reports.
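
A name-based correspondence of this kind might look like the sketch below.
It is written in Python only for illustration and relies on the
unicodedata module, whose character names are assumed here to match the
10646 names; the function lower_by_name is hypothetical. The last lines
show the German case that such a one-to-one rule does not cover.

```python
import unicodedata


def lower_by_name(ch):
    """Find a lowercase partner by swapping CAPITAL for SMALL in the name.

    Hypothetical rule for illustration; returns ch unchanged when the name
    has no "CAPITAL LETTER" part or no matching small letter exists.
    """
    name = unicodedata.name(ch, "")
    if " CAPITAL LETTER " not in name:
        return ch
    small = name.replace(" CAPITAL LETTER ", " SMALL LETTER ", 1)
    try:
        return unicodedata.lookup(small)
    except KeyError:
        return ch


print(lower_by_name("A"))                 # LATIN SMALL LETTER A
print(lower_by_name("\u0391") == "\u03B1")  # Greek Alpha -> alpha: True

# The es-zett case is not one-to-one: uppercasing yields two characters,
# and lowercasing those does not give the original character back.
print("\u00DF".upper())                       # SS
print("\u00DF".upper().lower() == "\u00DF")   # False
```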

(6) Miscellaneous

> "JTC 1/SC 22 believes that programming languages should offer the appropriate
> support for ISO/IEC 10646, and the Unicode character set where appropriate."

I would like to have the reference information attached to this sentence.
This is "Resolution 02-24:  Recommendation on Coded Character Sets Support"
of SC22 2002 plenary.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, July 30, 2003  8:43 AM

Thank you for the extensive feedback.  I will obviously need to give more
thought to your comments, and we will need to discuss them at a meeting.

However, clearly the most contentious issue is that of eliminating references
to Unicode.  As I am sure you realize, Unicode has much more technical "meat"
than 10646.  So the good thing about relying on the Unicode database and
similar documents is that we can just say "the Unicode folks did the work for
us, we trust that they know what they are doing".  After all, the Unicode
consortium has invested numerous man-years in their recommendations, and we
don't have the resources or the expertise to do similar work.

As I see it, we have three options:

1 - Do nothing, keep the language as it is.

2 - Base support of 16- and 32-bit characters on Unicode.

3 - Base support of 16- and 32-bit characters on 10646.

Evidently option #1 is easier, and frankly as a vendor I have not seen a lot of
interest for the existing 16-bit character support, so adding a sizeable
implementation complexity is quite hard to justify from an economical point of
view.  The problem with this option is that it might make SC22 unhappy.

Option #2 is the simplest technically, as we can merely reference the Unicode
documents, and avoid having to dig into the properties of each character.  But
as you point out, it is not kosher for an ISO standard to reference a non-ISO
document.  So politically it is probably not going to work.

Option #3 is evidently ISO-compliant, but 10646 says very little regarding the
properties of characters (other than their names and code points).  I realize
that 10176 has a list of allowed characters, but then it's a TR so it has
relatively little teeth.  Of course we could just do what 10176 does in its
annex A, i.e. list all the characters that we allow (and the case-conversion
tables, and possibly the normalization tables) but that would add 50 pages of
gibberish to the RM.  The problem with this option is that it would take a lot
of work, and it would probably degenerate into a cat fight about how case
conversion or normalization or whatever ought to work.

At this point I am going to consult with Jim to see how he thinks we should
proceed.  If need be I'll refer the issue to WG9 to get guidance.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, August 6, 2003  10:20 AM

>(4) No normalization

>We cannot refer to "Normalization Form KC" which is a Unicode term.
>Neither 10646 nor 10176 provides substitute for this concept.  Therefore
>we cannot introduce the character normalization process.

>This is not bad, we think.  For example, the character "letter A with
>umlaut" is regarded different from the combination of two characters
>"letter A" and "umlaut".  But, in the first place, it is not a good idea
>to have two representations of a single conceptual character.  People
>would try to define their own canonical representation of characters.
>Regarding "A with umlaut" and "A"+"umlaut" pair as different would not
>be a severe burden for them.

>Implementations would be much easier, since they can resort to simple
>byte-to-byte comparison.

I have given more thought to normalization, and I believe that it is important
for practical use.

Ignore for a moment the issue of referencing Unicode.  Assume that we have no
difficulties in describing normalization.  The question is: is normalization
good for users?

The problem I see is that when using a Unicode editor you have generally no
idea how it represents a character internally.  When you type "letter A with
umlaut" it may represent this with a single character or as "letter A" +
"umlaut" and that's hidden from the user of the editor.  That's true regardless
of whether you typed one or two characters on your keyboard.

Now imagine the situation where two people write distinct compilation units
with different editors (or maybe with different settings in a single editor).
You might end up with the situation where the declaration of an entity has (in
the file stored on disk) "letter A" + "umlaut" and the usage has "letter A with
umlaut" (or vice-versa).  And that would be invisible to the user because both
editors would merely display Ä.  In this situation, in order to avoid utmost
confusion and bewilderment, I think it is necessary to specify that the
compiler treats the two sequences the same.  That's the purpose of
normalization.
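
The two-editor scenario can be reproduced with a short sketch. It is written
in Python purely for illustration; the identifier, the table layout and the
helper name key are assumptions, and the choice of NFKC followed by case
folding is one plausible reading of the AI, not a statement of what a
compiler must do.

```python
import unicodedata


def key(identifier):
    """Lookup key for an identifier: NFKC, then case folding (assumed rule)."""
    return unicodedata.normalize("NFKC", identifier).casefold()


declaration = "Z\u00E4hler"  # editor 1 stores a precomposed a-umlaut
usage = "Za\u0308hler"       # editor 2 stores "a" + combining diaeresis

# Raw comparison: the usage does not find the declaration.
raw_table = {declaration: "declared in unit 1"}
print(usage in raw_table)              # False

# Normalized comparison: both spellings map to the same key.
normalized_table = {key(declaration): "declared in unit 1"}
print(key(usage) in normalized_table)  # True
```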

As Unicode editing is going to become more and more common in years to come,
and editors will undoubtedly become more and more fancy, I think it's important
to deal with usability issues like this one.  Incidentally, it seems to me that
this issue is particularly important for Korean Hangul.

*************************************************************

From: Pascal Leroy
Sent: Friday, August 8, 2003  4:39 AM

Kiyoshi said:
> In short, I do not agree.

Fine.  Let's get the discussion started, then.  (I am not sending this on the
ARG mailing list as I don't want to start an endless chatter about whether we
should be doing this at all, etc.  At some point I'll want Randy to record our
discussion, though, to make sure it gets appended to the AI.)

> (1) design of character code
>
> I believe that a single logical character should not have two
> different representations.  If 10646 or Unicode have two
> representations for A with umlaut, it is the fault of the
> character code system.  It should be remedied.

In an ideal world, you are evidently right.

(Irrelevant comment here: in an ideal world, Latin-1, the character set for
Western European languages, would be suitable for writing French.)

The Unicode folks explain (and I agree with them) that the "right"
representation is "letter A"+"umlaut".  The reason is that you have many
diacritical signs used by existing languages (mostly based on the Latin
alphabet) and that assigning code points to all the combinations is impractical
(code points are a scarce resource, especially if you want commonly used
languages to remain in the BMP).  Unicode currently has more than 110
diacritical signs.  The Western European languages only use very few of these,
and they mostly combine with vowels, but still that consumes most of the upper
half of Latin-1.  Greek and Vietnamese, among others, can combine two
diacriticals, and that's a sizeable number of code points.

Now the Unicode documents explain that there are marginal languages (they
mention Navajo) which make complex use of diacriticals and would require many
more code points, for a very small community of users.  Using combining
diacriticals is the right way to go for these languages.  And of course there
may be particular applications where people want to create unanticipated
combinations of characters (when I was trying to learn Chinese many many years
ago, my textbook had a diacritical on top of each ideogram to indicate the
tone; that would seem like a perfect application of combining diacriticals).

Finally, there is the issue of fonts: developing a font that contains all the
combinations like "letter A with umlaut" is expensive, and the resulting font
is bulky.  Again, combining diacriticals are better.

So why assign code points to characters like "letter A with umlaut" in the
first place?  I suppose that the answer is compatibility to some extent
(Latin-1 existed before Unicode, and you have to support files coded with
Latin-1 with as little perturbation as possible), and political catfight to
some extent (if German has specific code points, why not do the same for Polish
or Greek; if you do it for Greek, why not Macedonian? etc., etc., ad nauseam).
There may also have been a concern, when Unicode started, that uniformly using
combining diacriticals would require more complex text handling algorithms,
which would have been too costly for the computers of the time.

> (2) role of programming language
>
> Let's denote "A with umlaut" by A", and the sequence "A" and
> umlaut by A+.  If one likes to search the character "A with
> umlaut", he must perform two search operations, one with A"
> and then with A+. This is very tedious, and if the target
> string contains many such special characters, the operation
> is nearly impossible.

I entirely agree, but I would view this as a bug.  If you search for "letter A
with umlaut", it should actually catch both representations (this is not hard to
do, just normalize during search).  I noticed that Internet Explorer behaves as
you describe, and this is really a pain as the two sequences look exactly the
same.

> So, my opinion is that normalization is not a role of
> programming languages or compilers.  It should be performed
> in some lower layers in order to maximize the convenience of
> text file handling.

Unfortunately, there are different forms of normalization, and Unicode
recommends distinct normalization depending on whether the programming language
is case-sensitive or not (I am not exactly sure why; I need to study this).  So
the file system cannot do the normalization, it has to be done by a programming
language tool, and the only one for which we can impose a behavior is the
compiler.

*************************************************************

From: Pascal Leroy
Sent: Tuesday, August 5, 2003  6:33 AM

> At this point I am going to consult with Jim to see how he thinks we
> should proceed.  If need be I'll refer the issue to WG9 to get guidance.

I have talked to Jim.  He said that, from a procedural point of view, there is
no intrinsic problem in referencing the Unicode standard in an ISO standard.
All we need to do is some paperwork to justify the decision.  Of course, one or
several countries could always vote "no" on the amendment because they don't
like references to Unicode, but at least there is no procedural impossibility.

I have also talked to John Benito, the convener of WG14 (C language).  The C
folks are in the process of doing more-or-less what we are doing, only it's
part of a technical report, not of an amendment.  They have been running into
the exact same issue, i.e. opposition at the SC22 level from a number of
countries which don't want to see references to Unicode (Japan, Canada, Norway
and Germany are the countries he named).  John believes that it is nearly
impossible to properly integrate 16- and 32-bit characters in his standard
without referencing Unicode.  His plan is to try to convince the SC22
delegations of the aforementioned countries that this issue is moot because
Unicode and 10646:2003 should be indistinguishable (and 10646:2003 references
Unicode anyway).

*************************************************************

From: Christophe Grein
Sent: Thursday, July 15, 2004  5:03 PM

It appears to me that in the document 04-06-03  AI95-00285/07, the category
identifier_letter is no longer defined.

Paragraphs 2.1(4-14) were replaced with new wording that does not include
identifier_letter, yet this category is used to define "identifier".

I think

   identifier_extend ::= identifier_start |           <-------
                         mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         other_format

is meant.

*************************************************************

