!standard A.3.2(49)                                  03-06-12  AI95-00285/04
!class amendment 02-01-23
!status work item 02-09-24
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Support for 16-bit and 32-bit characters

!summary

Support is added for program text using the entire set of characters from
ISO/IEC 10646, and for operating on characters outside of the BMP at run-time.

!problem

SC22 directed its working groups to provide support for the ISO/IEC 10646
character set:

   "JTC 1/SC 22 believes that programming languages should offer the
   appropriate support for ISO/IEC 10646, and the Unicode character set where
   appropriate."

Moreover, the working draft of ISO/IEC 10646:2003 makes use of planes other
than the BMP.

!proposal

The essence of this proposal is to allow the source of the program to be
written using 16-bit characters (from the BMP) or 32-bit characters. Also, it
makes it possible to operate on 32-bit characters at run-time.

The main difficulty in supporting characters beyond Row 00 of the BMP in the
program text is to define how identifiers and literals are built (which
characters are letters, digits, etc.) and to define the lower/upper case
equivalence rules. Fortunately, the people developing ISO/IEC 10646 have
already done most of the work for us, so it's only a matter of defining how we
want to piggyback on their categorization and conversion rules.

ISO/IEC 10646 defines a "character database" which describes all the
properties of each character. The most important property for our purposes is
the "General Category". General categories are disjoint.
The following categories are of interest for describing Ada program text:

   - Letter, Uppercase -- e.g., LATIN CAPITAL LETTER A
   - Letter, Lowercase -- e.g., LATIN SMALL LETTER A
   - Letter, Titlecase -- e.g., LATIN CAPITAL LETTER L WITH SMALL LETTER J
   - Letter, Modifier -- e.g., MODIFIER LETTER APOSTROPHE
   - Letter, Other -- e.g., HEBREW LETTER ALEF
   - Mark, Non-Spacing -- e.g., COMBINING GRAVE ACCENT
   - Mark, Spacing Combining -- e.g., MUSICAL SYMBOL COMBINING AUGMENTATION DOT
   - Number, Decimal Digit -- e.g., DIGIT ZERO
   - Number, Letter -- e.g., ROMAN NUMERAL TWO
   - Other, Control -- e.g., NULL
   - Other, Format -- e.g., ACTIVATE ARABIC FORM SHAPING
   - Other, Private Use -- e.g.,
   - Other, Surrogate -- e.g.,
   - Punctuation, Connector -- e.g., LOW LINE
   - Separator, Space -- e.g., SPACE
   - Separator, Line -- e.g., LINE SEPARATOR
   - Separator, Paragraph -- e.g., PARAGRAPH SEPARATOR

(See http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html for
details on the categorization.)

In paragraph 2.1 we define a non-terminal of the grammar for each of the above
categories, e.g., letter_uppercase, letter_lowercase, etc.

The characters in the category other_format are effectively ignored in most
lexical elements, with the exception that they are illegal in string_literals
and character_literals.

Throughout the syntax rules, we specify which characters are allowed for the
lexical elements. For instance, the E in the exponent part of a numeric
literal may not be a "Greek Capital Letter Epsilon", even though a capital E
and a capital epsilon look very much the same. Similar considerations apply to
the extended digits, the point, etc. So this means that we are not changing
which characters may be used to build numeric_literals, based_literals, and so
on.
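As an illustration only (Python is used here because its standard library exposes the character database; it is not part of the proposal), the following sketch looks up the General Category of a character and maps it to the nonterminal names used in this AI. The names ada_category and is_exponent_letter are hypothetical helpers introduced for this example. The database consulted is whatever Unicode version the Python runtime ships, not necessarily 3.2.0.

```python
import unicodedata

# Two-letter General Category codes for the categories listed above,
# mapped to the nonterminal names this proposal introduces.
CATEGORY_NAMES = {
    "Lu": "letter_uppercase",
    "Ll": "letter_lowercase",
    "Lt": "letter_titlecase",
    "Lm": "letter_modifier",
    "Lo": "letter_other",
    "Mn": "mark_non_spacing",
    "Mc": "mark_spacing_combining",
    "Nd": "number_decimal_digit",
    "Nl": "number_letter",
    "Cc": "other_control",
    "Cf": "other_format",
    "Co": "other_private_use",
    "Cs": "other_surrogate",
    "Pc": "punctuation_connector",
    "Zs": "separator_space",
    "Zl": "separator_line",
    "Zp": "separator_paragraph",
}


def ada_category(ch):
    """Return the nonterminal name for ch, or None if its category is not listed."""
    return CATEGORY_NAMES.get(unicodedata.category(ch))


def is_exponent_letter(ch):
    """Hypothetical helper: the exponent marker is matched by code point,
    not by General Category (Ada accepts E in either case)."""
    return ch in ("E", "e")


latin_e = "E"             # LATIN CAPITAL LETTER E
greek_epsilon = "\u0395"  # GREEK CAPITAL LETTER EPSILON

# Both characters fall in the same category...
print(ada_category(latin_e), ada_category(greek_epsilon))
# ...but only the Latin letter may introduce an exponent.
print(is_exponent_letter(latin_e), is_exponent_letter(greek_epsilon))
```

The point of the sketch is that category membership alone cannot decide whether a character is acceptable in a numeric literal; the syntax rules have to name specific characters.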
ISO/IEC 10646 proposes to define identifiers for programming languages as
follows (see
http://www.unicode.org/unicode/reports/tr15/tr15-22.html#Programming_Language_Identifiers):

   identifier ::= identifier_start {identifier_start | identifier_extend}

   identifier_start ::= letter_uppercase |
                        letter_lowercase |
                        letter_titlecase |
                        letter_modifier |
                        letter_other |
                        number_letter

   identifier_extend ::= mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         punctuation_connector |
                         other_format

This definition was made with C in mind, and is not exactly appropriate for
Ada, as it would allow consecutive underlines. Because the underline is the
only character of Row 00 of the BMP which is a punctuation_connector, it seems
sensible to remain close to the existing syntax rules of 2.3(2-3), and to use
the following definitions:

   identifier_start ::= letter_uppercase |
                        letter_lowercase |
                        letter_titlecase |
                        letter_modifier |
                        letter_other |
                        number_letter

   identifier_extend ::= identifier_letter |
                         mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         other_format

   identifier ::= identifier_start {[punctuation_connector] identifier_extend}

ISO/IEC 10646 recommends that, before storing or comparing identifiers, the
following transformations be applied:

   o Characters in category other_format are filtered out.

   o For languages which have case insensitive identifiers, Normalization
     Form KC is applied (see
     http://www.unicode.org/unicode/reports/tr15/tr15-22.html#Specification).
     This is to ensure that identifiers which look visually the same are
     considered as identical, even if they are composed of different
     characters.

   o _Full_ case folding, as described in the table
     http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt, is used
     to find the uppercase version of each character.

ISO/IEC 10646 doesn't provide guidance for the composition of numeric
literals, but it is apparent that we can use the character categories above.
So we define: numeral ::= number_decimal_digit {[punctuation_connector] numeral_extend} numeral_extend ::= number_decimal_digit | other_format Again, characters in category other_format (and punctuation_connector) are ignored when computing the value of a decimal literal. The numerical value of each character that is a number_decimal_digit is defined by the field "Decimal digit value" of the ISO/IEC 10646 character database. The definition and role of format_effectors is modified to include the characters at positions 16#85#, 16#2028# and 16#2029#. These characters may be used to terminate lines, as recommended by http://www.unicode.org/reports/tr13. We are not changing the definition of character_literals and string_literals. In particular, we _do_not_ apply Normalization Form KC to such literals. This means in particular that two string literals which look alike may not compare equal. Also note that characters in category other_format are forbidden in character_literals and string_literals, because their sole purpose is to affect the presentation of characters. If a program needs to operate on these characters, it can do that by using Wide_Wide_Character'Val (...). Private use characters are not considered to be graphic characters (even though for some applications they may actually turn out to be graphic). The reason is that we wouldn't be able to define the normalization and case folding rules for these characters, so it seems better to disallow them, except in comments where they cannot do any harm. We are removing 3.5.2(5) since an implementation may want to provide a nonstandard mode where the set of graphic characters is not a proper subset of that defined in ISO/IEC 10646, for instance to deal with private use characters. We don't want to prevent implementations from doing anything useful. This paragraph has no force anyway, since in a non-standard mode an implementation may do pretty much what it likes. 
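The numeral rule given earlier in this proposal (a number_decimal_digit, followed by any number of numeral_extend characters, each optionally preceded by a single punctuation_connector) can be sketched as follows. This is an illustration in Python, not part of the proposal; numeral_value is a hypothetical name, and the category and digit-value lookups come from the Unicode version bundled with the Python runtime.

```python
import unicodedata


def numeral_value(text):
    """Check text against the proposed numeral syntax and return its value.

    numeral        ::= number_decimal_digit {[punctuation_connector] numeral_extend}
    numeral_extend ::= number_decimal_digit | other_format
    """
    if not text or unicodedata.category(text[0]) != "Nd":
        raise ValueError("a numeral must start with a number_decimal_digit")

    digits = [text[0]]
    i = 1
    while i < len(text):
        # An optional punctuation_connector must be followed by a numeral_extend.
        if unicodedata.category(text[i]) == "Pc":
            i += 1
            if i == len(text):
                raise ValueError("a numeral may not end with a connector")
        category = unicodedata.category(text[i])
        if category == "Nd":
            digits.append(text[i])
        elif category != "Cf":
            # other_format characters are accepted but carry no value;
            # anything else (including a second connector) is rejected.
            raise ValueError("unexpected character in numeral")
        i += 1

    # Connectors and format characters were dropped above; each remaining
    # digit contributes its Decimal Digit Value from the character database.
    value = 0
    for digit in digits:
        value = value * 10 + unicodedata.decimal(digit)
    return value


print(numeral_value("1_000"))         # ASCII digits with an underline
print(numeral_value("\u0663\u0664"))  # ARABIC-INDIC DIGIT THREE, DIGIT FOUR
```

Under this reading, a doubled underline such as "1__0" is rejected because the second connector is not a numeral_extend.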
In order to represent 32-bit characters at run-time, we add new declarations to Standard. We also provide the following new predefined packages for 32-bit characters: Ada.Strings.Wide_Wide_Bounded Ada.Strings.Wide_Wide_Fixed Ada.Strings.Wide_Wide_Maps Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants Ada.Strings.Wide_Wide_Unbounded Ada.Wide_Wide_Text_IO Ada.Wide_Wide_Text_IO.Text_Streams Ada.Wide_Wide_Text_IO.Complex_IO Ada.Wide_Wide_Text_IO.Editing These packages are similar to their Wide_ equivalents, with Wide_Wide_ substituted for Wide_ everywhere. In addition the following declaration is present in Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants: Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set; It contains each Wide_Wide_Character value in the BMP of ISO/IEC 10646. The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width are also provided. Their definition is similar to that of Wide_Image, Wide_Value and Wide_Width, respectively, with Wide_Character and Wide_String replaced by Wide_Wide_Character and Wide_Wide_String. <> There is a specific problem with spaces, and it is unclear what is the right thing to do. The dynamic semantics of a number of operations (attribute Value, procedures Get in Text_IO, procedures Trim in the string packages, etc.) are defined in terms of "space" and "blank". A space is the character at position 16#20# and a blank is either a space or a horizontal tabulation. For the purposes of this AI, it would be more consistent to replace space by separator_space, and let Get skip any separator_space, and Value and Trim trim leading and trailing separator_space. For instance, in a program operating on ideographs, it would be nice to skip/trim any Ideographic Space. Unfortunately, this would be an incompatibility. In the case of Value and Get, the incompatibility would only show up in cases which currently raise Constraint_Error, so it is probably acceptable. 
But in the case of Trim, this would be a silent change of the dynamic semantics... <> Do we need types corresponding to Wide_Wide_Character and Wide_Wide_String in Interfaces.C? What does C do about 32-bit characters? !wording In (32) change: ... Character, [and Wide_Character]{Wide_Character and Wide_Wide_Character} ... In (34) change: ... String [and Wide_String]{, Wide_String and Wide_Wide_String} ... Add after 1.1.4(14): The nonterminals of the grammar, including reserved words and components of lexical elements, are exclusively made of the characters whose Code Point is between 16#20# and 16#7E#, inclusively. For example, the character E in the definition of exponent is the character whose name is "Latin Capital Letter E", not "Greek Capital Letter Epsilon". Replace 2.1(1) by: The characters whose Code Point is 16#FFFE# or 16#FFFF# are not allowed anywhere in the text of a program. The characters in categories other_control, other_private_use and other_surrogate are only allowed in comments. Delete 2.1(2-3). Replace 2.1(4-14) by: The character repertoire for the text of an Ada program consists of the collection of characters described by the ISO/IEC 10646 Universal Multiple-Octet Coded Character Set [Author's note: I am actually using Unicode 3.2.0]. The coded representation for these characters is implementation defined (it need not be a representation defined within ISO/IEC 10646). The description of the language definition in this International Standard uses the fields Code Point, Character Name, General Category, Decimal Digit Value and Unicode 1.0 Name of the character database defined by ISO/IEC 10646. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified. The categories of characters are defined as follows: letter_uppercase Any character whose General Category is defined by ISO/IEC 10646 to be "Letter, Uppercase". 
letter_lowercase Any character whose General Category is defined by ISO/IEC 10646 to be "Letter, Lowercase". letter_titlecase Any character whose General Category is defined by ISO/IEC 10646 to be "Letter, Titlecase". letter_modifier Any character whose General Category is defined by ISO/IEC 10646 to be "Letter, Modifier". letter_other Any character whose General Category is defined by ISO/IEC 10646 to be "Letter, Other". mark_non_spacing Any character whose General Category is defined by ISO/IEC 10646 to be "Mark, Non-Spacing". mark_spacing_combining Any character whose General Category is defined by ISO/IEC 10646 to be "Mark, Spacing Combining". number_decimal_digit Any character whose General Category is defined by ISO/IEC 10646 to be "Number, Decimal Digit". number_letter Any character whose General Category is defined by ISO/IEC 10646 to be "Number, Letter". other_control Any character whose General Category is defined by ISO/IEC 10646 to be "Other, Control". other_format Any character whose General Category is defined by ISO/IEC 10646 to be "Other, Format". other_private_use Any character whose General Category is defined by ISO/IEC 10646 to be "Other, Private Use". other_surrogate Any character whose General Category is defined by ISO/IEC 10646 to be "Other, Surrogate". punctuation_connector Any character whose General Category is defined by ISO/IEC 10646 to be "Punctuation, Connector". separator_space Any character whose General Category is defined by ISO/IEC 10646 to be "Separator, Space". separator_line Any character whose General Category is defined by ISO/IEC 10646 to be "Separator, Line". separator_paragraph Any character whose General Category is defined by ISO/IEC 10646 to be "Separator, Paragraph". format_effector The characters whose Unicode 1.0 name is "Character Tabulation", "Line Tabulation", "Carriage Return (CR)", "Line Feed (LF)", "Form Feed (FF)" and "Next Line (NEL)", and the characters in categories separator_line and separator_paragraph. 
graphic_character
   Any character which is not in the categories other_control,
   other_private_use, other_surrogate, other_format, format_effector, and
   whose Code Point is neither 16#FFFE# nor 16#FFFF#. (This includes all the
   characters that have not yet been classified by ISO/IEC 10646.)

Delete 2.1(15).

Add after 2.1(16):

Documentation Requirement

As the ISO/IEC 10646 character set is constantly evolving (in particular by
the addition of new languages), an implementation shall document to which
version of ISO/IEC 10646 it conforms.

Delete 2.1(17).

Replace 2.2(3-5) by:

In some cases an explicit _separator_ is required to separate adjacent lexical
elements. A separator is any of a separator_space, a format_effector or the
end of a line, as follows:

   o A separator_space is a separator except within a comment, a
     string_literal, or a character_literal.

   o Character Tabulation is a separator except within a comment.

Replace 2.2(8-9) by:

A delimiter is either one of the characters whose Character Name is:

   o Ampersand
   o Apostrophe
   o Left Parenthesis
   o Right Parenthesis
   o Asterisk
   o Plus Sign
   o Comma
   o Hyphen-Minus
   o Full Stop
   o Solidus
   o Colon
   o Semicolon
   o Less-Than Sign
   o Equals Sign
   o Greater-Than Sign
   o Vertical Line

Replace 2.3(2-3) by:

   identifier_start ::= letter_uppercase |
                        letter_lowercase |
                        letter_titlecase |
                        letter_modifier |
                        letter_other |
                        number_letter

   identifier_extend ::= identifier_letter |
                         mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         other_format

   identifier ::= identifier_start {[punctuation_connector] identifier_extend}

Replace 2.3(5) by:

Two identifiers are considered the same if they consist of the same sequence
of characters after applying the following transformations (in this order):

   o The characters in category other_format are eliminated.

   o Normalization Form KC of ISO/IEC 10646 is applied to the identifier.

   o Full case folding, as defined by ISO/IEC 10646, is applied to obtain the
     uppercase version of each character.
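The three transformations above can be sketched as a comparison key. This is an illustration in Python, not part of the wording; identifier_key and same_identifier are hypothetical names. One assumption to note: Python's str.casefold applies full case folding, which yields a folded (mostly lowercase) form rather than an uppercase one. For deciding whether two identifiers are the same this makes no difference, since both sides are folded the same way.

```python
import unicodedata


def identifier_key(identifier):
    """Return a key such that two identifiers are 'the same' iff their keys match."""
    # Step 1: eliminate characters in category other_format (Cf).
    without_format = "".join(
        ch for ch in identifier if unicodedata.category(ch) != "Cf"
    )
    # Step 2: apply Normalization Form KC.
    normalized = unicodedata.normalize("NFKC", without_format)
    # Step 3: apply full case folding.
    return normalized.casefold()


def same_identifier(left, right):
    return identifier_key(left) == identifier_key(right)


# Case differences are ignored, as for Ada identifiers today.
print(same_identifier("Count", "COUNT"))
# Precomposed e-acute versus e followed by COMBINING ACUTE ACCENT.
print(same_identifier("caf\u00e9", "cafe\u0301"))
# Latin E and Greek capital epsilon remain distinct identifiers.
print(same_identifier("E", "\u0395"))
```

Full (as opposed to simple) case folding matters for characters whose folded form has a different length, such as LATIN SMALL LETTER SHARP S, which folds to "ss".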
Replace 2.4.1(3) by:

   numeral ::= number_decimal_digit {[punctuation_connector] numeral_extend}

   numeral_extend ::= number_decimal_digit | other_format

Replace 2.4.1(6) by:

In determining the meaning of a numeric_literal, the following transformations
are applied:

   o The characters in categories punctuation_connector and other_format are
     eliminated.

   o The numerical value of each character in category number_decimal_digit
     is given by its Decimal Digit Value.

Replace 2.4.2(8) by:

In determining the meaning of a based_literal, the following transformations
are applied:

   o The characters in categories punctuation_connector and other_format are
     eliminated.

   o The numerical value of each character in category number_decimal_digit
     is given by its Decimal Digit Value.

   o The numerical values of the letters A through F are 10 through 15,
     respectively.

Add after 2.6(7):

No modification is performed on the sequence of characters in a
string_literal. In particular, Normalization Form KC is _not_ applied.
Therefore, two strings which look alike may not compare equal.

Replace 3.5(28-29) by:

S'Wide_Wide_Image
   S'Wide_Wide_Image denotes a function with the following specification:

      function S'Wide_Wide_Image (Arg : S'Base) return Wide_Wide_String;

Add after 3.5(34):

S'Wide_Image
   S'Wide_Image denotes a function with the following specification:

      function S'Wide_Image (Arg : S'Base) return Wide_String;

   The function returns an image of the value of Arg as a Wide_String. The
   lower bound of the result is one. The image has the same sequence of
   characters as defined for S'Wide_Wide_Image if all the graphic characters
   are defined in Wide_Character; otherwise the sequence of characters is
   implementation defined (but no shorter than that of S'Wide_Wide_Image for
   the same value of Arg).

Replace 3.5(37) by:

   The function returns an image of the value of Arg as a String.
   The lower bound of the result is one. The image has the same sequence of
   characters as defined for S'Wide_Wide_Image if all the graphic characters
   are defined in Character; otherwise the sequence of characters is
   implementation defined (but no shorter than that of S'Wide_Wide_Image for
   the same value of Arg).

Add after 3.5(37):

S'Wide_Wide_Width
   S'Wide_Wide_Width denotes the maximum length of a Wide_Wide_String
   returned by S'Wide_Wide_Image over all the values of S. It denotes zero
   for a subtype that has a null range. Its type is universal_integer.

Replace 3.5(40-45) by:

S'Wide_Wide_Value
   S'Wide_Wide_Value denotes a function with the following specification:

      function S'Wide_Wide_Value (Arg : Wide_Wide_String) return S'Base;

   This function returns a value given an image of the value as a
   Wide_Wide_String, ignoring any leading or trailing spaces.

   For the evaluation of a call on S'Wide_Wide_Value for an enumeration
   subtype S, if the sequence of characters of the parameter (ignoring
   leading and trailing spaces) has the syntax of an enumeration literal and
   if it corresponds to a literal of the type of S (or corresponds to the
   result of S'Wide_Image for a nongraphic character of the type), the result
   is the corresponding enumeration value; otherwise Constraint_Error is
   raised.

   For the evaluation of a call on S'Wide_Wide_Value for an integer subtype
   S, if the sequence of characters of the parameter (ignoring leading and
   trailing spaces) has the syntax of an integer literal, with an optional
   leading sign character (plus or minus for a signed type; only plus for a
   modular type), and the corresponding numeric value belongs to the base
   range of the type of S, then that value is the result; otherwise
   Constraint_Error is raised.
   For the evaluation of a call on S'Wide_Wide_Value for a real subtype S, if
   the sequence of characters of the parameter (ignoring leading and trailing
   spaces) has the syntax of one of the following:

Add after 3.5(51):

S'Wide_Value
   S'Wide_Value denotes a function with the following specification:

      function S'Wide_Value (Arg : Wide_String) return S'Base;

   This function returns a value given an image of the value as a
   Wide_String, ignoring any leading or trailing spaces.

   For the evaluation of a call on S'Wide_Value for an enumeration subtype S,
   if the sequence of characters of the parameter (ignoring leading and
   trailing spaces) has the syntax of an enumeration literal and if it
   corresponds to a literal of the type of S (or corresponds to the result of
   S'Wide_Image for a value of the type), the result is the corresponding
   enumeration value; otherwise Constraint_Error is raised. For a numeric
   subtype S, the evaluation of a call on S'Wide_Value with Arg of type
   Wide_String is equivalent to a call on S'Wide_Wide_Value for a
   corresponding Arg of type Wide_Wide_String.

At the end of 3.5(55) change:

... to a call on [S'Wide_Value]{S'Wide_Wide_Value} for a corresponding Arg of
type [Wide_String]{Wide_Wide_String}.

In 3.5(56) change:

... {Wide_Wide_Value,} Wide_Value, Value, {Wide_Wide_Image,} Wide_Image, and
Image ...

In 3.5(59) change:

... as [does]{do} S'Wide_Value (S'Wide_Image (V)) {and S'Wide_Wide_Value
(S'Wide_Wide_Image (V))} ...

In the middle of 3.5.2(2), change:

... the attributes [(Wide_)Image and (Wide_)Value]{Image, Wide_Image,
Wide_Wide_Image, Value, Wide_Value and Wide_Wide_Value}

Add after 3.5.2(3):

The predefined type Wide_Wide_Character is a character type whose values
correspond to the 2147483648 code points of the ISO/IEC 10646 character set.
Each of the graphic_characters has a corresponding character_literal in
Wide_Wide_Character. The first 65536 values of Wide_Wide_Character have the
same character_literal or language-defined name as defined for Wide_Character.
In types Wide_Character and Wide_Wide_Character, the characters whose Code
Points are 16#FFFE# and 16#FFFF# are assigned the language-defined names FFFE
and FFFF. The other characters whose Code Point is larger than 16#FF# and
which are not graphic_characters have language-defined names which are formed
by appending to the string "Character_" the representation of their Code Point
in hexadecimal as four extended digits (in the case of Wide_Character) or
eight extended digits (in the case of Wide_Wide_Character). As with other
language-defined names, these names are usable only with the attributes
(Wide_)Wide_Image and (Wide_)Wide_Value; they are not usable as enumeration
literals.

In 3.5.2(4) change:

... Character [and Wide_Character]{, Wide_Character and Wide_Wide_Character}
...

Delete 3.5.2(5).

Replace 3.6.3(2) by:

There are three predefined string types, String, Wide_String and
Wide_Wide_String, each indexed by the value of the predefined subtype
Positive; these are declared in the visible part of package Standard:

Replace 3.6.3(4) by:

   type String is array (Positive range <>) of Character;
   type Wide_String is array (Positive range <>) of Wide_Character;
   type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;

Fix the list in A(2) [Author's note: I hope it's auto-generated...]

Add in the middle of A.1(36)

   -- The declaration of type Wide_Wide_Character is based on the full
   -- ISO/IEC 10646 character set. The first 2 ** 16 positions have the same
   -- contents as type Wide_Character. See 3.5.2.
   type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF, ...);

Add after A.1(42):

   type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
   pragma Pack (Wide_Wide_String);
   -- The predefined operators for this type correspond to those for String.

Replace the beginning of A.1(49) by:

In each of the types Character [and Wide_Character]{, Wide_Character and
Wide_Wide_Character} ...

In A.3(1) change:

...
with Wide_Character {and Wide_Wide_Character} data ... In A.3.2(13) change: ... between {Wide_Wide_Character, } Wide_Character ... Add after A.3.2(14): function Is_Character (Item : in Wide_Wide_Character) return Boolean; function Is_String (Item : in Wide_Wide_String) return Boolean; function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean; function Is_Wide_String (Item : in Wide_Wide_String) return Boolean; Add after A.3.2(16): function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character; function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String; Add after A.3.2(18): function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character; function To_Wide_String (Item : in Wide_Wide_String; Substitute : in Wide_Character := ' ') return Wide_String; function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character; function To_Wide_Wide_String (Item : in String) return Wide_Wide_String; function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character; function To_Wide_Wide_String (Item : in Wide_String) return Wide_Wide_String; Replace A.3.2(42-48) by: The following functions test Wide_Wide_Character or Wide_Character values for membership in Wide_Character or Character, or convert between corresponding characters of Wide_Wide_Character, Wide_Character and Character. function Is_Character (Item : in Wide_Character) return Boolean; Returns True if Wide_Character'Pos(Item) <= Character'Pos(Character'Last). function Is_Character (Item : in Wide_Wide_Character) return Boolean; Returns True if Wide_Wide_Character'Pos(Item) <= Character'Pos(Character'Last). function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean; Returns True if Wide_Wide_Character'Pos(Item) <= Wide_Character'Pos(Wide_Character'Last). 
function Is_String (Item : in Wide_String) return Boolean; function Is_String (Item : in Wide_Wide_String) return Boolean; Returns True if Is_Character(Item(I)) is True for each I in Item'Range. function Is_Wide_String (Item : in Wide_Wide_String) return Boolean; Returns True if Is_Wide_Character(Item(I)) is True for each I in Item'Range. function To_Character (Item : in Wide_Character; Substitute : in Character := ' ') return Character; function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character; Returns the Character corresponding to Item if Is_Character(Item), and returns the Substitute Character otherwise. function To_Wide_Character (Item : in Character) return Wide_Character; Returns the Wide_Character X such that Character'Pos(Item) = Wide_Character'Pos (X). function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character; Returns the Wide_Character corresponding to Item if Is_Wide_Character(Item), and returns the Substitute Wide_Character otherwise. function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character; Returns the Wide_Wide_Character X such that Character'Pos(Item) = Wide_Wide_Character'Pos (X). function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character; Returns the Wide_Wide_Character X such that Wide_Character'Pos(Item) = Wide_Wide_Character'Pos (X). function To_String (Item : in Wide_String; Substitute : in Character := ' ') return String; function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String; Returns the String whose range is 1..Item'Length and each of whose elements is given by To_Character of the corresponding element in Item. function To_Wide_String (Item : in String) return Wide_String; Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item. 
   function To_Wide_String (Item       : in Wide_Wide_String;
                            Substitute : in Wide_Character := ' ')
      return Wide_String;

Returns the Wide_String whose range is 1..Item'Length and each of whose
elements is given by To_Wide_Character of the corresponding element in Item
with the given Substitute Wide_Character.

   function To_Wide_Wide_String (Item : in String) return Wide_Wide_String;
   function To_Wide_Wide_String (Item : in Wide_String)
      return Wide_Wide_String;

Returns the Wide_Wide_String whose range is 1..Item'Length and each of whose
elements is given by To_Wide_Wide_Character of the corresponding element in
Item.

Delete A.3.2(49).

In A.4(1) change:

... both String [and Wide_String]{, Wide_String and Wide_Wide_String} ...

Add after A.4.1(4):

   Wide_Wide_Space : constant Wide_Wide_Character := ' ';

Add after A.4.7 a new section, A.4.8:

A.4.8 Wide_Wide_String Handling

Facilities for handling strings of Wide_Wide_Character elements are found in
the packages Strings.Wide_Wide_Maps, Strings.Wide_Wide_Fixed,
Strings.Wide_Wide_Bounded, Strings.Wide_Wide_Unbounded, and
Strings.Wide_Wide_Maps.Wide_Wide_Constants. They provide the same
string-handling operations as the corresponding packages for strings of
Character elements.

Static Semantics

The package Strings.Wide_Wide_Maps has the following declaration.
   package Ada.Strings.Wide_Wide_Maps is
      pragma Preelaborate(Wide_Wide_Maps);

      -- Representation for a set of Wide_Wide_Character values:
      type Wide_Wide_Character_Set is private;

      Null_Set : constant Wide_Wide_Character_Set;

      type Wide_Wide_Character_Range is
         record
            Low  : Wide_Wide_Character;
            High : Wide_Wide_Character;
         end record;
      -- Represents Wide_Wide_Character range Low..High

      type Wide_Wide_Character_Ranges is
         array (Positive range <>) of Wide_Wide_Character_Range;

      function To_Set (Ranges : in Wide_Wide_Character_Ranges)
         return Wide_Wide_Character_Set;

      function To_Set (Span : in Wide_Wide_Character_Range)
         return Wide_Wide_Character_Set;

      function To_Ranges (Set : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Ranges;

      function "=" (Left, Right : in Wide_Wide_Character_Set) return Boolean;

      function "not" (Right : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Set;
      function "and" (Left, Right : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Set;
      function "or" (Left, Right : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Set;
      function "xor" (Left, Right : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Set;
      function "-" (Left, Right : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Set;

      function Is_In (Element : in Wide_Wide_Character;
                      Set     : in Wide_Wide_Character_Set)
         return Boolean;

      function Is_Subset (Elements : in Wide_Wide_Character_Set;
                          Set      : in Wide_Wide_Character_Set)
         return Boolean;

      function "<=" (Left  : in Wide_Wide_Character_Set;
                     Right : in Wide_Wide_Character_Set)
         return Boolean renames Is_Subset;

      -- Alternative representation for a set of Wide_Wide_Character values:
      subtype Wide_Wide_Character_Sequence is Wide_Wide_String;

      function To_Set (Sequence : in Wide_Wide_Character_Sequence)
         return Wide_Wide_Character_Set;

      function To_Set (Singleton : in Wide_Wide_Character)
         return Wide_Wide_Character_Set;

      function To_Sequence (Set : in Wide_Wide_Character_Set)
         return Wide_Wide_Character_Sequence;

      -- Representation for a Wide_Wide_Character to Wide_Wide_Character
      -- mapping:
      type Wide_Wide_Character_Mapping is private;

      function Value (Map     : in Wide_Wide_Character_Mapping;
                      Element : in Wide_Wide_Character)
         return Wide_Wide_Character;

      Identity : constant Wide_Wide_Character_Mapping;

      function To_Mapping (From, To : in Wide_Wide_Character_Sequence)
         return Wide_Wide_Character_Mapping;

      function To_Domain (Map : in Wide_Wide_Character_Mapping)
         return Wide_Wide_Character_Sequence;

      function To_Range (Map : in Wide_Wide_Character_Mapping)
         return Wide_Wide_Character_Sequence;

      type Wide_Wide_Character_Mapping_Function is
         access function (From : in Wide_Wide_Character)
            return Wide_Wide_Character;

   private
      ... -- not specified by the language
   end Ada.Strings.Wide_Wide_Maps;

The context clause for each of the packages Strings.Wide_Wide_Fixed,
Strings.Wide_Wide_Bounded, and Strings.Wide_Wide_Unbounded identifies
Strings.Wide_Wide_Maps instead of Strings.Maps.

For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded,
and Strings.Maps.Constants the corresponding wide wide string package has the
same contents except that

   o Wide_Wide_Space replaces Space
   o Wide_Wide_Character replaces Character
   o Wide_Wide_String replaces String
   o Wide_Wide_Character_Set replaces Character_Set
   o Wide_Wide_Character_Mapping replaces Character_Mapping
   o Wide_Wide_Character_Mapping_Function replaces Character_Mapping_Function
   o Wide_Wide_Maps replaces Maps
   o Bounded_Wide_Wide_String replaces Bounded_String
   o Null_Bounded_Wide_Wide_String replaces Null_Bounded_String
   o To_Bounded_Wide_Wide_String replaces To_Bounded_String
   o To_Wide_Wide_String replaces To_String
   o Unbounded_Wide_Wide_String replaces Unbounded_String
   o Null_Unbounded_Wide_Wide_String replaces Null_Unbounded_String
   o Wide_Wide_String_Access replaces String_Access
   o To_Unbounded_Wide_Wide_String replaces To_Unbounded_String

The following additional declaration is present in
Strings.Wide_Wide_Maps.Wide_Wide_Constants:

   Wide_Character_Set : constant
Wide_Wide_Maps.Wide_Wide_Character_Set; -- Contains each Wide_Wide_Character value WWC such that -- Characters.Handling.Is_Wide_Character (WWC) is True [Author's note: the preceding comment is missing ".Handling" in A.4.7(46).] NOTES If a null Wide_Wide_Character_Mapping_Function is passed to any of the Wide_Wide_String handling subprograms, Constraint_Error is propagated. In A.6(1) change: ... packages Text_IO [and Wide_Text_IO]{, Wide_Text_IO and Wide_Wide_Text_IO} ... In A.7(4) change: ... data, [and] Wide_Text_IO for Wide_Character and Wide_String data {, and Wide_Wide_Text_IO for Wide_Wide_Character and Wide_Wide_String data} ... In A.7(10) change: ... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, and Stream_IO ... In A.7(13) change: ... Direct_IO, Text_IO [and Wide_Text_IO]{, Wide_Text_IO and Wide_Wide_Text_IO} ... In A.7(15) change: ... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, and Stream_IO ... Replace A.11 by: A.11 Wide Text Input-Output and Wide Wide Text Input-Output The packages Wide_Text_IO and Wide_Wide_Text_IO provide facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters (or wide wide characters) grouped into lines, and as a sequence of lines grouped into pages. Static Semantics The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String. Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO. 
The specification of package Wide_Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Wide_Character, and any occurrence of String is replaced by Wide_Wide_String. Nongeneric equivalents of Wide_Wide_Text_IO.Integer_IO and Wide_Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Wide_Text_IO, Ada.Long_Integer_Wide_Wide_Text_IO, Ada.Float_Wide_Wide_Text_IO, Ada.Long_Float_Wide_Wide_Text_IO. In A.12(1) change: ... Text_IO.Text_Streams [and Wide_Text_IO.Text_Streams]{, Wide_Text_IO.Text_Streams and Wide_Wide_Text_IO.Text_Streams} ... Add a new section after A.12.3: A.12.4 The Package Wide_Wide_Text_IO.Text_Streams The package Wide_Wide_Text_IO.Text_Streams provides a function for treating a wide wide text file as a stream. Static Semantics The library package Wide_Wide_Text_IO.Text_Streams has the following declaration: with Ada.Streams; package Ada.Wide_Wide_Text_IO.Text_Streams is type Stream_Access is access all Streams.Root_Stream_Type'Class; function Stream (File : in File_Type) return Stream_Access; end Ada.Wide_Wide_Text_IO.Text_Streams; The Stream function has the same effect as the corresponding function in Streams.Stream_IO. At the beginning of C.5(7) change: If the pragma applies to an enumeration type, then the semantics of the Wide_Wide_Image and Wide_Wide_Value attributes are implementation defined for that type; the semantics of Image, Wide_Image, Value and Wide_Value are still defined in terms of Wide_Wide_Image and Wide_Wide_Value... In F(4) change: ... Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing and Wide_Wide_Text_IO.Editing} ... In F.3(1) change: ... Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing and Wide_Wide_Text_IO.Editing} ... 
At the beginning of F.3(1) change: The child packages Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing and Wide_Wide_Text_IO.Editing}... Add at the end of F.3(6): ... For Wide_Wide_Text_IO.Editing their types are Wide_Wide_String and Wide_Wide_Character, respectively. In F.3(19) change: ... Text_IO.Decimal_IO [and Wide_Text_IO.Decimal_IO]{, Wide_Text_IO.Decimal_IO and Wide_Wide_Text_IO.Decimal_IO} In F.3(20) change: ... Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing and Wide_Wide_Text_IO.Editing} ... Add a new section after F.3.4: F.3.5 The Package Wide_Wide_Text_IO.Editing Static Semantics The child package Wide_Wide_Text_IO.Editing has the same contents as Text_IO.Editing, except that: o each occurrence of Character is replaced by Wide_Wide_Character, o each occurrence of Text_IO is replaced by Wide_Wide_Text_IO, o the subtype of Default_Currency is Wide_Wide_String rather than String, and o each occurrence of String in the generic package Decimal_Output is replaced by Wide_Wide_String. NOTES Each of the functions Wide_Wide_Text_IO.Editing.Valid, To_Picture, and Pic_String has String (versus Wide_Wide_String) as its parameter or result subtype, since a picture String is not localizable. Add a new section after G.1.4: G.1.5 The Package Wide_Wide_Text_IO.Complex_IO Static Semantics Implementations shall also provide the generic library package Wide_Wide_Text_IO.Complex_IO. Its declaration is obtained from that of Text_IO.Complex_IO by systematically replacing Text_IO by Wide_Wide_Text_IO and String by Wide_Wide_String; the description of its behavior is obtained by additionally replacing references to particular characters (commas, parentheses, etc.) by those for the corresponding wide wide characters. In H.4(20) change: ... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, or Stream_IO ... Fix annex K. [Author's note: I'm pretty sure it's auto-generated...] !discussion See proposal.
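To illustrate the Strings.Wide_Wide_Maps and Wide_Wide_Text_IO wording above, here is a minimal sketch (not part of the proposed wording) of how a program might use character sets and mappings over characters outside the BMP. It assumes an implementation that provides Wide_Wide_Character, Ada.Strings.Wide_Wide_Maps and Ada.Wide_Wide_Text_IO as proposed. The procedure name Wide_Wide_Demo, the constant names, and the choice of code points 16#1D400# .. 16#1D419# (MATHEMATICAL BOLD CAPITAL A .. Z) are assumptions made only for this example.

```ada
with Ada.Strings.Wide_Wide_Maps;
with Ada.Wide_Wide_Text_IO;

--  Hypothetical demonstration procedure; the name is not from the proposal.
procedure Wide_Wide_Demo is
   use Ada.Strings.Wide_Wide_Maps;
   package WWIO renames Ada.Wide_Wide_Text_IO;

   --  Illustrative code points outside the BMP (mathematical bold capitals).
   Bold_A : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1D400#);
   Bold_B : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1D401#);
   Bold_C : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1D402#);
   Bold_Z : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1D419#);

   --  A set built from a single range Low .. High.
   Bold_Capitals : constant Wide_Wide_Character_Set :=
     To_Set (Wide_Wide_Character_Range'(Low => Bold_A, High => Bold_Z));

   --  A mapping from three bold capitals to plain Latin capitals;
   --  every other character maps to itself.
   To_Plain : constant Wide_Wide_Character_Mapping :=
     To_Mapping (From => (Bold_A, Bold_B, Bold_C), To => "ABC");
begin
   --  Membership tests: Bold_B lies in the range, plain 'A' does not.
   WWIO.Put_Line (Boolean'Wide_Wide_Image (Is_In (Bold_B, Bold_Capitals)));
   WWIO.Put_Line (Boolean'Wide_Wide_Image (Is_In ('A', Bold_Capitals)));

   --  Mapped and unmapped elements: Bold_C maps to 'C', 'x' to itself.
   WWIO.Put (Value (To_Plain, Bold_C));
   WWIO.Put (Value (To_Plain, 'x'));
   WWIO.New_Line;
end Wide_Wide_Demo;
```

The sketch prints only Boolean images and BMP characters, so it does not depend on how an implementation encodes characters beyond the BMP on output; as noted elsewhere in this AI, the external representation is left to the implementation.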
!example !ACATS test !appendix From: Gary Dismukes Sent: Tuesday, January 15, 2002 4:14 PM Ben Brosgol recently pointed out to us (ACT) the introduction of a variant of the Latin 1 character set that is designated Latin 9. A web page describing Latin 9 can be viewed at: http://www.cs.tut.fi/~jkorpela/latin9.html Here's the summary blurb on that page describing the relatively minor differences between Latin 1 and Latin 9: ISO Latin 9 as compared with ISO Latin 1 The ISO Latin 9 (ISO 8859-15) character set differs from the well-known ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro sign and some national letters used e.g. in French and Finnish have been introduced and some rarely used special characters omitted. We've added a new package to the GNAT library named Ada.Characters.Latin_9, analogous to Ada.Characters.Latin_1, to define character constants for this new character set. Robert Dewar asked me to post the following remarks from him re Latin-9 and Ada.Characters.Handling: ---------- Note that the Ada package Latin-1 did not exactly follow the official names of all characters, and I have copied its abbreviated naming style for the new characters in Latin-9. I have a gripe with the RM here. The setup for Ada.Characters.Latin_1 is to have separate packages for separate character sets, which makes perfectly good sense: 27 An implementation may provide additional packages as children of Ada.Characters, to declare names for the symbols of the local character set or other character sets. But for Characters.Handling, we have the odd statement: 49 If an implementation provides a localized definition of Character or Wide_Character, then the effects of the subprograms in Characters.Handling should reflect the localizations. See also 3.5.2. which implies that some mysterious transformation happens on this package (under what circumstances?)
I think this is a bad idea for two reasons: a) it requires specialized mechanisms in the compiler, and it seems odd for the meaning of this package to depend on some compiler switch etc. b) it precludes handling multiple character sets in the same program, whereas the design for Ada.Characters.Latin_1 etc. seems to accommodate this. My recommendation is that an implementation generate separate packages, called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling being a renaming of Ada.Characters.Handling_Latin_1 perhaps?) Robert Dewar ************************************************************* From: Pascal Leroy Sent: Tuesday, January 15, 2002 5:05 PM > The ISO Latin 9 (ISO 8859-15) character set differs from the well-known > ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro > sign and some national letters used e.g. in French and Finnish have been > introduced and some rarely used special characters omitted. Oh boy, good to see that the OE and oe ligatures are now available, and that we now can write French without having to use Unicode! ************************************************************* From: John Barnes Sent: Wednesday, January 16, 2002 1:44 AM Better put that on the agenda for the next ARG. Ada 2005 should use Latin 9 rather than Latin 1. A minor change. Might be a few incompatibilities. ************************************************************* From: Pascal Leroy Sent: Wednesday, January 16, 2002 12:53 PM As I mentioned in a mail yesterday, the fact that you can use Latin 9 to write French makes it look very interesting to me. On the other hand, it is not too useful for Ada to support Latin 9 if the OSes don't: if I emit the character OE and it prints out as 1/4 on my screen, I didn't gain much.
So while I agree that we should consider supporting Latin 9 _in_addition_ to Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I am ready to bet that we will still have Latin 1 OSes ten years from now. ************************************************************* From: John Barnes Sent: Thursday, January 17, 2002 1:33 AM It was somewhat of a jokey suggestion as I am sure you are aware. Indeed I had a big problem when writing my book and displaying the type Character. I wrote it in QuarkXpress on a PC and it was fine. The publishers moved it to a Mac before printing and some characters came out wrong. One of them came out as a picture of an apple. Moreover, someone had bitten a lump out of it. So much for standards I thought. But supporting Latin-9 would be nice. All those adverts on the Paris Metro for eating an oeuf can then be printed properly. ************************************************************* From: Bob Duff Sent: Thursday, January 17, 2002 1:14 PM > Indeed I had a big problem when writing my book and > displaying the type Character. I had a great deal of trouble writing the part of the Reference Manual where type Character lives. I think Randy had some trouble with the updated RM, too. At least we didn't try to show type Wide_Character in its full glory. ;-) 7-bit ascii will live forever, I suppose. ************************************************************* From: Bob Duff Sent: Wednesday, January 16, 2002 2:15 PM > Ben Brosgol recently pointed out to us (ACT) the introduction of a > variant of the Latin 1 character set that is designated Latin 9. The nice thing about standards is that there are so many to choose from. ;-) > My recommendation is that an implementation generate separate packages, > called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling > being a renaming of Ada.Characters.Handling_Latin_1 perhaps?) That makes sense. 
But I think the RM statement you complain about is envisioning a nonstandard version of Standard.[Wide_]Character, which is a separate issue. I don't see that as a big deal -- if you don't think it's a good idea, don't implement any such thing. I tend to agree that compiler switches and the like shouldn't normally be meddling with the semantics of packages Standard and Characters.Handling without a very good reason. ************************************************************* From: Florian Weimer Sent: Friday, January 18, 2002 6:58 AM > But I think the RM statement you complain about is envisioning a > nonstandard version of Standard.[Wide_]Character, which is a separate > issue. If you use Latin 9 for Standard.Character, this is certainly a non-standard version, and Ada.Characters.Handling has to be modified to remain useful. ************************************************************* From: Florian Weimer Sent: Friday, January 18, 2002 6:58 AM > Better put that on the agenda for the next ARG. Ada 2005 > should use Latin 9 rather than Latin 1. A minor change. > Might be a few incompatibilities. I disagree. With Latin 9, the mapping from Character to Wide_Character is less straightforward, and this could have unexpected results. OTOH, it seems that Wide_Character is not widely used (unless you are forced to do so by ASIS), so this might not matter much. In addition, we really should add Wide_Wide_Character (which covers the sixteen additional planes), or make Wide_Character itself wider. Otherwise, using Unicode with standard Ada will be rather painful. ************************************************************* From: Florian Weimer Sent: Saturday, April 20, 2002 3:18 AM ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits, and 10646-2:2001 allocates characters outside the Basic Multilingual Plane.
Not too long ago, quite a few people assumed that characters beyond the BMP would be interesting only for rather esoteric scholarly use (Linear B is a perfect example). However, we now have got at least two different sets of code positions outside the BMP which will see more widespread use eventually: the mathematical alphabets and Plane 14 Language Tags (which are required to make some Japanese people happy who fear that Japanese characters are rendered using Chinese glyphs). Therefore, I think Ada 200X should somehow support characters outside the BMP. A few random thoughts (sorry, I'm probably not using strict ISO 10646 terminology): * Several major vendors have adopted ISO 10646-1:1993 early, using a 16 bit representation for characters (i.e. wchar_t in C is 16 bits). These vendors include Sun (Java) and Microsoft (Windows), and probably most proprietary UNIX vendors. These vendor implementations now cover the code positions beyond the BMP using UTF-16, which uses surrogate pairs (a single character is represented using two 16 bit values from reserved ranges in the BMP). UTF-16 has got a few drawbacks: the ordering (in terms of UCS code positions) is no longer lexicographic (which leads us to such brain damage as CESU-8), dealing with individual characters is complicated, and you cannot implement the C wide character functions properly. For Ada, numerous changes would be required if we want to expose the UTF-16 representation to programmers, for example by declaring Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no longer be arrays of characters indexed by position).
GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS characters in a single 32 bit value, that is, UTF-32), and while this is certainly not the "industry standard" (it is encouraged by ISO 9899:1999, though), I really hope we can use this approach (UTF-32 internal representation) for Ada, as it simplifies things considerably, especially if we want to add character properties support (see below). * We could add Wide_Wide_Character and Wide_Wide_String types to package Standard (and extending the Ada.Strings hierarchy), which are encoded in UTF-32. I don't know if this is necessary. IIRC, Robert Dewar once said that the only applications using Wide_Character are based on ASIS, where using Wide_Character is not really voluntary. Maybe it is possible to bump Wide_Character'Size to 32 bits instead, without really breaking backwards compatibility. Of course, we would need a way to convert UTF-32 strings to UTF-16 strings and vice versa (the UTF-16 string type could become a second-class citizen, though, without full support in the Ada.Strings hierarchy). * External representation of UCS characters is rapidly moving towards UTF-8 (especially in Internet standards). Ada should provide an interface for converting between the wide string type(s) and UTF-8 octet sequences. It should be possible to use string literals where UTF-8 strings are expected. * Supporting higher levels of Unicode (e.g. accessing the character properties database, normalization forms) would be interesting, too. Such documents will eventually follow in the ISO 10646 series, but I don't know if the ISO standard will be ready for Ada 200X. Currently, only the Unicode Consortium has standardized or documented issues like character properties or terminal behavior in detail. I don't know how ISO reacts if ISO standards refer to competing standardization efforts.
IEEE POSIX.1 (and probably, or already, ISO POSIX) standardizes the BSD sockets interface, and not OSI, so maybe this isn't an issue. In any case, this point is mostly a library issue which can be addressed by a community implementation effort, it does not require changes in the Ada language (adding Wide_Wide_Character does, for example). ************************************************************* From: Pascal Leroy Sent: Monday, April 22, 2002 8:32 AM > ISO 10636-1:2000 extends the Universal Character Set beyond 16 bits, > and 10646-2:2001 allocates characters outside the Basic Multilingual > Plane. > > Therefore, I think Ada 200X should somehow support characters outside > the BMP. The normalization of new character sets (both as part of 10646 and of 8859) was actually discussed at the last ARG meeting, and I was given an action item to somehow integrate them in the language, probably as some kind of amendment AI. > A few random thoughts (sorry, I'm probably not using strict ISO 10646 > terminology): > > * Several major vendors have adopted ISO 10646-1:1993 early, using a > 16 bit representation for characters (i.e. wchar_t in C is 16 > bits). Which is fine as it maps directly to Ada's wide character. I still think that we want to retain the capacity of using 16-bit blobs to represent characters in the BMP, as 99.5% of practical applications will only need the BMP. > For Ada, numerous changes would be required if we want to expose the > UTF-16 representation to programmers, for example by declaring > Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no > longer be arrays of characters indexed by position). Changes to Wide_Character and Wide_String are pretty much out of the question. 
On the other hand, the type that is intended for interfacing with C is Interfaces.C.wchar_array, and it would be straightforward to provide (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C compiler does). > I really hope we can use this approach (UTF-32 > internal representation) for Ada, as it simplifies things > considerably, especially if we want to add character properties > support (see below). I would think that we would want to use UCS-4, since it's an ISO standard. Moreover, UTF-32 has a number of consistency rules (eg code points below 16#10ffff#) which seem irrelevant for internal manipulation of strings. > * We could add Wide_Wide_Character and Wide_Wide_String types to > pacakge Standard (and extending the Ada.Strings hierarchy), which > are encoded in UTF-32. Wide_Wide_ types seem like the natural way to add this capability to the language, except that some compilers may not be quite prepared to deal with enumeration types with 2 ** 32 literals (ours isn't). > (the UTF-16 string type could become a > second-class citizen, though, without full support in the Ada.Strings > hierarchy). As far as I can tell, there is no support for UTF-16, only for UCS-2. Anyway, I don't think it is reasonable to force applications to go to the full 32-bit overhead just because they use, say, the french OE ligature. > * External representation of UCS characters is rapidly moving > towards UTF-8 (especially in Internet standards). > > Ada should provide an interface for converting between the wide string > type(s) and UTF-8 octet sequences. It should be possible to use > string literals where UTF-8 strings are expected. External representation is best handled by Text_IO and friends, typically by using a form parameter to specify the encoding (and there are many more encodings than just UCS and UTF). 
The ARG won't get into the business of specifying the details of the form parameter, so this is something that will remain non-portable for the foreseeable future. (Where do we stop? Do we want to require all validated compilers to support UTF-8? What about the chinese Big5 or the JIS encodings?) > * Supporting higher levels of Unicode (e.g. accessing the character > properties database, normalization forms) would be interesting, > too. We certainly don't want to get into that business. The designers of Ada 95 wisely decided to lump all of the characters in the range 16#0100# .. 16#FFFD# into the category special_character, so that they don't have to decide which is a letter, a number, etc. Similarly they didn't provide classification functions or upper/lower conversions for wide characters. This seems reasonable if we don't want to have to amend Ada each time a bunch of characters are added to 10646. ************************************************************* From: Nick Roberts Sent: Wednesday, April 24, 2002 7:31 PM > Therefore, I think Ada 200X should somehow support characters outside > the BMP. I agree. > GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS > characters in a single 32 bit value, that is, UTF-32), and while this is > certainly not the "industry standard" (it is encouraged by ISO 9899:1999, > though), I really hope we can use this approach (UTF-32 internal > representation) for Ada, as it simplifies things considerably, especially > if we want to add character properties support (see below). I agree very strongly! > * We could add Wide_Wide_Character and Wide_Wide_String types to > pacakge Standard (and extending the Ada.Strings hierarchy), which > are encoded in UTF-32. I must say I would prefer the identifiers Universal_Character and Universal_String. I see the logic of Wide_Wide_ but it seems clumsy! > I don't know if this is necessary. 
IIRC, Robert Dewar once said that > the only applications using Wide_Character are based on ASIS, where > using Wide_Character is not really voluntary. Maybe it is possible > to bump Wide_Character'Size to 32 bits instead, without really > breaking backwards compatibility. I disagree with this idea. > Of course, we would need a way to convert UTF-32 strings to UTF-16 > strings and vice versa (the UTF-16 string type could become a > second-class citizen, though, without full support in the Ada.Strings > hierarchy). Possibly these support packages should be in an optional annex. > * External representation of UCS characters is rapidly moving > towards UTF-8 (especially in Internet standards). > > Ada should provide an interface for converting between the wide string > type(s) and UTF-8 octet sequences. It should be possible to use string > literals where UTF-8 strings are expected. > > * Supporting higher levels of Unicode (e.g. accessing the character > properties database, normalization forms) would be interesting, > too. Again, perhaps all this should really be in (or moved into) an optional annex. ************************************************************* From: Robert Dewar Sent: Wednesday, April 24, 2002 9:50 PM I suspect that the work on wide_wide_character will in practice turn out to be nearly useless in the short or medium term. We certainly put in a lot of work in GNAT in implementing wide character with many different representation schemes, but this feature has been very little used (ASIS being the main use :-). In practice I think the 16-bit character type defined in Ada now will be adequate for almost all use, and I see no reason in requiring implementations to go beyond this in the absence of real market demand. Yes, it's fun to talk about character set issues (after all I was chair of the CRG, so I appreciate this), but there is no point in increasing implementation burdens unless it's really valuable.
I would just give clear permission for an implementation to add additional character types in standard (indeed that permission exists today in Ada 95), and leave it at that. ************************************************************* From: John Barnes Sent: Thursday, April 25, 2002 1:46 AM The BSI is looking at character set issues across languages and your message reminded me of the CRG. Was there ever a final report that I could refer to? ************************************************************* From: Robert Dewar Sent: Thursday, April 26, 2002 10:25 PM I think there was a final report, perhaps Jim could track it down. ************************************************************* From: Randy Brukardt Sent: Thursday, April 25, 2002 3:44 PM > We certainly put in a lot of work in GNAT in implementing wide > character with many different representation schemes, but this > feature has been very little used (ASIS being the main use :-). To add another data point: Claw was designed so that a wide character version could be easily created. But we've never implemented that version, mainly because we've never had a paying customer ask for it. So I have to wonder how important "Really_Wide_Character" would be. ************************************************************* From: Florian Weimer Sent: Saturday, May 18, 2002 5:41 AM > I suspect that the work on wide_wide_character will in practice turn > out to be nearly useless in the short or medium term. Using Ada for internationalized applications on GNU systems (using GNU facilities) almost requires 32 bit Wide_Wide_Character support, since GNU uses a 32 bit wchar_t internally. (See a similar discussion on the GCC development list.) ************************************************************* From: Robert Dewar Sent: Saturday, May 18, 2002 7:32 AM We have seen zero demand for such functionality, so would not invest any time at all in either design or implementation work here. 
If such a feature is added to Ada, I would definitely suggest it be optional. ************************************************************* From: Florian Weimer Sent: Saturday, May 18, 2002 6:00 AM >> * Several major vendors have adopted ISO 10646-1:1993 early, using a >> 16 bit representation for characters (i.e. wchar_t in C is 16 >> bits). > > Which is fine as it maps directly to Ada's wide character. I still think > that we want to retain the capacity of using 16-bit blobs to represent > characters in the BMP, as 99.5% of practical applications will only need the > BMP. Quite a few people have already changed their minds about the 99.5% figure (mathematical characters and Plane 14 Language being the reason). Maybe it's true for the character count, but I doubt it for the application count. > Changes to Wide_Character and Wide_String are pretty much out of the > question. Okay, accepted. > On the other hand, the type that is intended for interfacing with > C is Interfaces.C.wchar_array, and it would be straightforward to provide > (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit > Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C > compiler does). I doubt that C compilers can use UTF-16 for wchar_t. You cannot apply iswlower() to a single surrogate character. :-/ > I would think that we would want to use UCS-4, since it's an ISO standard. > Moreover, UTF-32 has a number of consistency rules (eg code points below > 16#10ffff#) which seem irrelevant for internal manipulation of strings. Yes, UCS-4 is indeed the correct encoding form to use. >> * We could add Wide_Wide_Character and Wide_Wide_String types to >> pacakge Standard (and extending the Ada.Strings hierarchy), which >> are encoded in UTF-32. > > Wide_Wide_ types seem like the natural way to add this capability to the > language, except that some compilers may not be quite prepared to deal with > enumeration types with 2 ** 32 literals (ours isn't). 
Ah, this could be a problem indeed, together with the large universal_integer returned by Wide_Wide_Character'Pos. >> (the UTF-16 string type could become a >> second-class citizen, though, without full support in the Ada.Strings >> hierarchy). > > As far as I can tell, there is no support for UTF-16, only for UCS-2. At the moment, yes, but I think we need some UTF-16 support, too, because many operating system interfaces use it. > Anyway, I don't think it is reasonable to force applications to go to the > full 32-bit overhead just because they use, say, the french OE ligature. Most people apparently refuse to use Wide_Character, too, for the same reason. They either go for ISO 8859-15 or Windows 1252, or don't use the OE ligature at all. > External representation is best handled by Text_IO and friends, typically by > using a form parameter to specify the encoding (and there are many more > encodings than just UCS and UTF). There was a recent discussion to add other I/O facilities. UTF-8 is becoming more and more common in the Internet context, and often, you can determine the encoding of a file only after reading the first couple of lines (think of a MIME-encoded mail message). Furthermore, UTF-8 already plays an important role in interacting with other libraries (not written in Ada). > (Where do we stop? Do we want to require all validated compilers to > support UTF-8? Yes, why not? Why shall all compilers support ISO 8859-1? Why UCS-2? > What about the chinese Big5 or the JIS encodings?) If there is support for UCS-4, handling these encodings could be performed by a mechanism similar to POSIX iconv(). ************************************************************* From: Robert Dewar Sent: Saturday, May 18, 2002 7:43 AM > Yes, why not? Why shall all compilers support ISO 8859-1? Why UCS-2? Why not = because there is no real demand. Especially this time around we need to be very careful not to require things that no one is really interested in. 
If we do this, the vendors will simply ignore any new standard. In fact I think if there is a new standard, it will only be implemented as a result of direct customer interest in features in this standard. The value of formal conformance and validation has largely disappeared from the Ada marketplace at this stage (in terms of customer demand). That's not to say that the Ada marketplace is not very vital and dynamic, we get dozens of requests for enhancements from our users every month, but there is precious little intersection between the things users seem to need and want and these kind of discussions. In GNAT, we put a lot of effort into implementing multiple character sets (we just added the new Latin set with the Euro symbol, because customers needed that for example). Some of it has been useful (like this Euro addition), but mostly these features are of entertainment and advertising value only. In fact the only serious user that we have for Wide_Character and Wide_String is us (from ASIS :-) One thing to remember here is that very little is needed in the way of language support for fancy character sets (most of the effort in GNAT for example for 8-bit sets is in csets, which gives proper case mapping for identifiers, and it is easy enough to add new tables to this -- someone contributed a new Cyrillic table just a few months ago). Most of the issues are representational issues, and the Ada standard has nothing to say about source representation (and this should not change in any new standard). ************************************************************* From: Pascal Leroy Sent: Tuesday, May 21, 2002 4:03 AM > > Which is fine as it maps directly to Ada's wide character. I still think > > that we want to retain the capacity of using 16-bit blobs to represent > > characters in the BMP, as 99.5% of practical applications will only need the > > BMP. 
>
> Quite a few people have already changed their minds about the 99.5%
> figure (mathematical characters and Plane 14 Language being the
> reason). Maybe it's true for the character count, but I doubt it for
> the application count.

Remember, we are talking Ada applications here. There are probably many applications out there that deal with mathematical symbols or with Tengwar, but I doubt that they are written in Ada.

> > External representation is best handled by Text_IO and friends, typically by
> > using a form parameter to specify the encoding (and there are many more
> > encodings than just UCS and UTF).
>
> There was a recent discussion to add other I/O facilities. UTF-8 is
> becoming more and more common in the Internet context, and often, you
> can determine the encoding of a file only after reading the first
> couple of lines (think of a MIME-encoded mail message). Furthermore,
> UTF-8 already plays an important role in interacting with other
> libraries (not written in Ada).

Maybe we need a predefined unit to convert UCS-2 to/from UTF-8. But then such conversion functions could easily be written by the user, too, or provided by some public domain stuff.

> > (Where do we stop? Do we want to require all validated compilers to
> > support UTF-8?
>
> Yes, why not? Why shall all compilers support ISO 8859-1? Why UCS-2?

You don't sell many compilers if you don't support 8859-1. As for UCS-2, well, that's pretty much the default representation of wide characters anyway.

Other than that, it would seem that we should let the market decide. Speaking for Rational, we have had wide character support for about 7 years, and I don't recall seeing a single bug report or request for enhancement on this topic. This may indicate that our technology is perfect, but there are other explanations ;-)
(As a matter of fact we probably have very few licenses installed in countries where 8859-1 is not sufficient to write the native language -- ignoring the problem with the OE ligature in French.)

One option would be to add Wide_Wide_Character in a new annex, and let users decide if they want their vendors to support this annex. Of course, chances are that nobody would care, in which case that would be a lot of standardization effort for nothing.

*************************************************************

From: Robert Dewar
Sent: Tuesday, May 21, 2002 4:39 AM

I agree with everything Pascal had to say about wide character. We do have one Japanese customer using wide characters, and as I mentioned earlier, ASIS uses wide strings to represent source texts, but other than that, we have heard very little about wide strings.

The only real input we have got from customers on character set issues was the request to support Latin-9 with the new Euro symbol and we got contributed tables for Cyrillic from a Russian enthusiast (not a customer, but it seemed a harmless addition :-)

*************************************************************

From: Florian Weimer
Sent: Tuesday, May 21, 2002 1:42 PM

> I agree with everything Pascal had to say about wide character. We do have
> one Japanese customer using wide characters, and as I mentioned earlier,
> ASIS uses wide strings to represent source texts, but other than that,
> we have heard very little about wide strings.

I guess this customer doesn't use Wide_Character in the way it was intended (for storing ISO 10646 code positions), so this example is a bit dubious.

> The only real input we have got from customers on character set
> issues was the request to support Latin-9 with the new Euro symbol

Even in this rather innocent case, Wide_Character is no longer using UCS-2 with GNAT.

*************************************************************

From: Michael F. Yoder
Sent: Monday, October 21, 2002 10:58 AM

This is one of the items on my homework list.

UTF = UCS Transformation Format. UCS = Universal Multiple-Octet Coded Character Set. I guess the MOC is silent. :-)

UTF-8 encodes 31-bit values as sequences of 8-bit values, as follows.

0xxxxxxx                       encodes itself (the coding is ASCII-compatible)
110xxxxx 10Y                   encodes xxxxxY where Y stands for yyyyyy
1110xxxx 10Y 10Z               encodes xxxxYZ
11110xxx 10Y 10Z 10U           encodes xxxYZU
111110xx 10Y 10Z 10U 10V       encodes xxYZUV
1111110x 10Y 10Z 10U 10V 10W   encodes xYZUVW

The octets 11111110 and 11111111 aren't used in the encoding. So, excepting these 2, octets starting with 11 are headers, those starting with 10 are trailers, and those starting with 0 are singletons.

It's forbidden to use the redundant encodings (you must use the shortest encoding allowed). There are security reasons for this, aside from the fact that doing so breaks the string search property mentioned below.

The encoding is self-synchronizing: if you start in the middle of a string of octets, you skip octets of the form 10xxxxxx to get to the next start of character.

If the encoding is proper, string searches for an encoded pattern within an encoded string will work as desired to yield all occurrences of the pattern. (For case-folded searches and the like this only works if the string is mapped before being converted to UTF-8.)

*************************************************************

From: Robert Dewar
Sent: Monday, October 21, 2002 11:03 AM

Is anyone using UTF-8 encoding with Ada? We have some customers using wide character encodings but none to our knowledge uses UTF-8.

*************************************************************

From: Robert A. Duff
Sent: Monday, October 21, 2002 11:43 AM

> It's forbidden to use the redundant encodings (you must use the shortest
> encoding allowed). There are security reasons for this,...

I'm curious: why is that? (Not quite curious enough to go RTFM. ;-))

>... aside from the
> fact that doing so breaks the string search property mentioned below.

Yes, I understand that.

*************************************************************

From: Michael F. Yoder
Sent: Monday, October 21, 2002 1:15 PM

This problem is one my previous employer is having to deal with. Basically, it's that redundant encodings can be used to sneak things past filters if the redundant encodings aren't rejected; if redundant encodings are allowed, writing (say) a regular expression that will match exactly all possible encoded forms is a pain, is error-prone, and is probably significantly slower to check.

Here's a contrived case. A program reads a command, and if it's the special command 'shazam' it checks the user's authorization; otherwise it passes on the command unmodified, because all other commands are safe. If there's a redundant encoding of 'shazam' that the filter misses, an unauthorized user can bypass the checking if he can arrange to supply that encoding.

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002 5:46 PM

This is the easy part of my homework. The identifier character ranges are defined in terms of multiple character categories (see below), so I can't get the harder part without a little coding. This is using Unicode version 3.2.

A "space" is itself a normative category. It is anything in the range U+2000 to U+200B, plus 5 other scattered characters.

A "separator" is any space plus the two characters "Line Separator" U+2028 and "Paragraph Separator" U+2029. These are each in a normative category containing just 1 value.

A "decimal digit" is itself a normative category. There are 25 ranges of these, 23 including the digits 0 through 9 and 2 with only the digits 1 through 9. (These two scripts use the ASCII zero rather than encoding a separate one.) Five of these ranges are above U+FFFF, that is, out of the BMP (their character descriptions all start with "mathematical").
The digits 1 through 9 in these scripts don't in general look much like our 1 through 9.

The rules for identifiers say (I'm condensing and interpreting) that the syntax for identifiers should start with their basic definition and fiddle it as appropriate to include extra characters (for Ada, that means underscore). Their basic definition is

identifier ::= id-start { id-start | id-extend }

id-start is any letter (which come in 5 subcategories) or a "letter number." There are a lot of letters outside the BMP, including the large range "CJK Ideograph Extension B."

id-extend is decimal digits plus nonspacing marks, spacing combining marks, connector punctuation, and formatting codes.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 7:19 PM

I am completely confused: why exactly are we discussing this? Can you be clear as to the goals of this discussion?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 2:50 PM

I know I don't count, :-) but I've had several requests to extend my spam filter to support UTF-8 encodings. Because I'm not asking for any money for the filter, and I haven't had any significant amount of UTF-8 mail, I haven't done anything about it yet. But it seems likely that I will need to do this at some point (I've seen occasional UTF-8 encoded mail, but not enough good mail that handling it manually is a problem.)

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 4:29 PM

Oh sure, UTF-8 encoded spam is common indeed, but that was not what I was talking about (unless you have some spam messages written in Ada source code :-)

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 4:59 PM

I think you misunderstand. I have written an anti-spam plugin for the IMS mailserver that I use.
It is written in Ada, of course, and I've had requests for it to be able to handle UTF-8 encoded mail. For me, it's fine to treat such mail as all spam, but that is not true for some of the other users of it. (I've made it available to the community of IMS users, as they have made many useful plugins available that I have been using for years.)

In order to properly support UTF-8 mail, I'd need at least to convert the search patterns (in Latin-1, of course) into UTF-8. I'd also need to verify that the rules that Mike noted are followed (a common trick of spammers is to violate basic encoding rules, as most decoders don't check. But the illegal encodings tend to get ignored by filters, because they don't match exactly. That was one of the prime reasons I wrote the plugin in the first place, because a lot of spam is now coming encoded in one way or another, and thus is not picked up by a plain text scan).

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 7:17 PM

Oh! I was confused then, I thought this was something to do with Ada.

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 7:46 PM

Of course it has to do with Ada. You asked "Is anyone using UTF-8 encoding with Ada?" And I answered that I have an Ada program that needs to process UTF-8 text (but doesn't yet). And I tried to explain what the program is and why it needs to process UTF-8 text and what support from Ada would be valuable.

Perhaps I should have just answered your original question "Yes"? :-)

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 8:09 PM

Sorry, when I said "using UTF-8 encoding with Ada", I was talking about language features for wide character representation. The fact that your program is in Ada does not seem to be particularly informative.
I am completely confused here, what ARG-related language problem is this thread addressing?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 8:32 PM

As I recall, one of the facets of UTF-8 support in Ada would be the equivalent of Ada.Characters.Handling for UTF-8 represented Strings. Those operations would be valuable for this application, particularly To_Wide_String (UTF_8_String) or To_UTF_8_String (String).

A UTF-8 Text_IO would also be valuable, although I'd find that overkill for this application (usually the text has to be decoded to UTF-8 from some 7-bit representation anyway).

I'm not sure where else UTF-8 would appear in the standard. Source representation and external file representations are outside of the scope of the standard. The regular string operations seem to work for most (all?) operations. Everything else seems to already be covered by the existing wide character support.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 8:45 PM

Well, harmless I suppose, but I doubt worth the effort. Again, I would generate packages on the basis of packages that exist, have proved useful and are actually widely used. It seems a mistake to get into the "here's a neat idea for a package that would help with something I happen to be doing".

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002 5:46 PM

> I am completely confused here, what ARG-related language
> problem is this thread addressing?

Kiyoshi Ishihata stated at the last meeting that there was an interest in some countries in being able to write programs as much as possible in native languages, the primary deficit in this regard being that identifiers are entirely in Latin-1 characters.
He didn't specify which countries to my recollection, but Japan, Russia, China, and India are obvious cases where the commonly used scripts are disjoint from Latin-1.

The information being supplied is exploratory in nature: the idea is to find out how hard it would be to extend existing compilers so as to satisfy all the national groups at once, and whether and to what extent the ARG should be involved in specifying standards for such extensions.

There was a separate issue involving the fact that ISO 10646-n (I forget what n is) now has mapped characters outside the BMP. This had to happen, given that the code now maps some 70,000 Han characters.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 8:54 PM

Well I would just allow arbitrary wide characters in identifiers, why not, it does not cause any problems. GNAT has implemented an option for this forever.

I would specify that there is no upper/lower case equivalence in this case, since otherwise you get into a huge mess that is simply not worth the effort.

*************************************************************

From: Tucker Taft
Sent: Thursday, October 24, 2002 10:10 PM

I suggest you read the ARG minutes when they are available. Kiyoshi indicated specifically that they wanted to restrict usage to characters that "make sense" as identifier characters. I will admit I was in your camp that the simplest is to just allow anything. However, I will leave it to Kiyoshi to explain his reasoning. He certainly knows more than I do about the requirements. You should perhaps discuss it directly with Kiyoshi if you don't agree.

Mike indicated that UTF-8 encoding makes it easy to support even very wide characters in identifiers, because it provides a canonical representation, as a stream of bytes. We asked him to share his knowledge in this area, so we didn't all have to become experts in ISO-10646 to evaluate the implementation issues in this area.
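
To illustrate the canonical byte-stream property mentioned above, the following is a minimal sketch in Ada of the shortest-form UTF-8 encoding described earlier in this thread (the original 31-bit scheme with up to six octets). All names here (UTF8_Sketch, Code_Point, Octets, Length, Encode, Decode, Is_Shortest, Check) are hypothetical and are not part of any proposal; checks on the framing of header and trailer octets are omitted for brevity.

```ada
with Interfaces; use Interfaces;

procedure UTF8_Sketch is

   --  Hypothetical names throughout; an illustration, not proposed RM text.
   subtype Code_Point is Unsigned_32 range 0 .. 16#7FFF_FFFF#;
   type Octets is array (Positive range <>) of Unsigned_8;

   --  Number of octets in the shortest encoding of C.
   function Length (C : Code_Point) return Positive is
   begin
      if C < 16#80# then
         return 1;
      elsif C < 16#800# then
         return 2;
      elsif C < 16#1_0000# then
         return 3;
      elsif C < 16#20_0000# then
         return 4;
      elsif C < 16#400_0000# then
         return 5;
      else
         return 6;
      end if;
   end Length;

   --  Encode C, always choosing the shortest form.
   function Encode (C : Code_Point) return Octets is
      Header : constant array (2 .. 6) of Unsigned_8 :=
        (16#C0#, 16#E0#, 16#F0#, 16#F8#, 16#FC#);
      N      : constant Positive := Length (C);
      Result : Octets (1 .. N);
      Rest   : Unsigned_32 := C;
   begin
      if N = 1 then
         return (1 => Unsigned_8 (C));
      end if;
      --  Fill the trailers (10yyyyyy) from the last octet backwards.
      for I in reverse 2 .. N loop
         Result (I) := 16#80# or Unsigned_8 (Rest and 16#3F#);
         Rest := Shift_Right (Rest, 6);
      end loop;
      Result (1) := Header (N) or Unsigned_8 (Rest);
      return Result;
   end Encode;

   --  Decode one sequence of 1 to 6 octets; framing is assumed valid.
   function Decode (B : Octets) return Code_Point is
      Masks : constant array (1 .. 6) of Unsigned_8 :=
        (16#7F#, 16#1F#, 16#0F#, 16#07#, 16#03#, 16#01#);
      V : Unsigned_32 := Unsigned_32 (B (B'First) and Masks (B'Length));
   begin
      for I in B'First + 1 .. B'Last loop
         V := Shift_Left (V, 6) or Unsigned_32 (B (I) and 16#3F#);
      end loop;
      return V;
   end Decode;

   --  A sequence is in shortest form iff re-encoding reproduces it.
   function Is_Shortest (B : Octets) return Boolean is
   begin
      return Encode (Decode (B)) = B;
   end Is_Shortest;

   procedure Check (Condition : Boolean) is
   begin
      if not Condition then
         raise Program_Error;
      end if;
   end Check;

begin
   Check (Encode (16#41#) = Octets'(1 => 16#41#));
   Check (Encode (16#E9#) = Octets'(16#C3#, 16#A9#));
   Check (Encode (16#20AC#) = Octets'(16#E2#, 16#82#, 16#AC#));
   Check (Encode (16#1D7CE#) = Octets'(16#F0#, 16#9D#, 16#9F#, 16#8E#));
   Check (Is_Shortest (Octets'(16#C3#, 16#A9#)));
   --  16#C0# 16#80# is a redundant two-octet encoding of code point 0.
   Check (not Is_Shortest (Octets'(16#C0#, 16#80#)));
end UTF8_Sketch;
```

The last check is the case that matters for filters: the pair C0 80 decodes to the same value as the single octet 00, so a scanner that compares octets would miss it unless redundant forms are rejected first.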
*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 10:29 PM

Here are my notes on the Wide_Character in identifiers issue, which will be turned into the minutes.

"What about full source representation of the language in Wide_Character? Kiyoshi reports that there is a push in SC22 to allow full wide characters in identifiers. How do you define which characters are letters? How do you define case equivalence? Mike says just use "letter" in the character standard. But this is likely to be very complex in the compiler and in the run-time. Tucker suggests that anything outside of row 00 be treated as a letter. Kiyoshi says that would not be acceptable to Japan, which is preparing a standard for which characters are allowed in identifiers."

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 4:11 AM

> I suggest you read the ARG minutes when they are available. Kiyoshi
> indicated specifically that they wanted to restrict usage to
> characters that "make sense" as identifier characters. I will admit
> I was in your camp that the simplest is to just allow anything.
> However, I will leave it to Kiyoshi to explain his reasoning.
> He certainly knows more than I do about the requirements. You
> should perhaps discuss it directly with Kiyoshi if you don't agree.

I would leave such restrictions up to either local coding standards, enforced e.g. by ASIS tools, or enforced by compiler restrictions. Getting into what makes sense in different languages is way way out of scope (I speak as the former chair of the CRG; character issues are very difficult to deal with). In the context of the CRG work, we spent ages discussing the issue of whether E and E-acute should be equivalent in identifiers, and came to the conclusion that the answer might be different in different languages. There is no point in adding a huge nationally dependent mess here.
Indeed I would consider in the ISO standard saying specifically that national bodies are welcome to devise local sub-standards for identifiers and character set requirements and leave it at that.

I perfectly well understand where Kiyoshi is coming from. I am sure he feels as strongly that only certain characters be used as Jean Ichbiah felt about the E/E-acute issue. But it just is not practical for the international standard to get into the business of deciding what are and what are not useful identifier names in all the languages of the world, or even just for the P members :-)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 4:16 AM

OK, so great, very appropriate, there can be a Japanese National standard that specifies that for Ada compilers to meet this standard, there must be a mode in which identifiers are only allowed to contain bla bla characters.

Other countries in the world are free to devise similar national standards but I fail to see why they should be a matter for an international standard.

What would be marginally useful in the international standard would be to devise a general framework for those national standards, and make it clear that it is an acceptable thing for Ada compilers to implement one or more of these standards. Frankly I think that the standard already does that, but it would be fine to make it explicit. GNAT for example allows lots of localization of identifier character sets, e.g. Latin-2, Cyrillic etc.

*************************************************************

From: Pascal Leroy
Sent: Friday, October 25, 2002 6:54 AM

> But it just is not practical for the international
> standard to get into the business of deciding what are and what are not
> useful identifier names in all the languages of the world...

It has certainly never been the intent to have the ARG discuss the identifier characters for all the languages in the world.
However, there is an ISO working group in charge of developing and maintaining the ISO 10646 standard, and the intent was to piggyback on the work done there. 10646 defines precisely what is a character (and so yes, E and E-acute are distinct, as are uppercase A and uppercase alpha, even though they really look the same), what is a letter, a digit, how the uppercase/lowercase conversions work, etc. I see no reason why the Ada standard couldn't use these definitions. (And Mike gave us a feeling of what this would look like, and it doesn't seem unreasonably complicated to me.)

Note that Java does exactly that, and defines letters and digits in a way which is derived from Unicode (itself a close approximation to 10646). I don't see why Ada would lag behind in this area: it would not be a big implementation effort, and it would improve usability of the language.

I don't buy the notion that national bodies have a role to play here (except of course that they probably want to influence 10646). It's already hard to define one language standard and ensure that it's implemented with a minimum of consistency, I don't see how users or implementers could live with the coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".

Pascal

PS: Note that the E vs. E-acute discussion is moot, since this is already settled by Latin-1 and yes, they are different.

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 7:55 PM

> I don't buy the notion that national bodies have a role to play here (except
> of course that they probably want to influence 10646). It's already hard to
> define one language standard and ensure that it's implemented with a minimum
> of consistency, I don't see how users or implementers could live with the
> coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".
Well GNAT implements lots of different localized character sets, and no one seems to have dropped dead :-)

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002 9:13 AM

> Kiyoshi Ishihata stated at the last meeting that there was an interest
> in some countries in being able to write programs as much as possible in
> native languages, the primary deficit in this regard being that
> identifiers are entirely in Latin-1 characters.

Yes, but it was also mentioned at the meeting that SC22 is trying to get programming languages to do something-or-other related to this. I.e. allow 31-bit characters in identifiers, and have some uniformity across programming languages about which characters are allowed in identifiers. I suppose WG9 is supposed to "obey" SC22 on this point?

By the way, let's mention the AI number being discussed in these messages, so we don't get the "What the heck are you talking about?" kinds of messages from Robert or others who might have missed part of the discussion. ;-) I believe Pascal raised the issue many months ago, and it has an AI number, and one can presumably search for that AI number in the meeting minutes (once Randy publishes them).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 8:32 PM

I tried, I could not find the AI number on this one.

Of course if there are uniform rules at the SC22 level, then it is fine to adopt them in Ada. I just think it is not something we should expend our own very limited resources on.

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002 8:59 PM

This was discussed as part of AI-285, which started life as an AI about Latin-9. That discussion took up the entire afternoon of the third day of the meeting.

These other issues came up since it was felt that better Wide_Character support would (might?)
make it unnecessary for the standard to directly deal with Latin-9. (Implementations still would have to, in all likelihood.)

There are a lot of notes in this area, and I haven't gotten that far in the minutes yet. So my summary might be suspect... (And I haven't posted the mail yet, either, but it's likely that it will all go on AI-285.)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 9:12 PM

> This was discussed as part of AI-285, which started life as an AI about
> Latin-9. That discussion took up the entire afternoon of the third day of
> the meeting.

Be careful not to be eaten alive by character discussions. It was quite intentional that we banned discussion of these issues from the main group in the Ada 9X effort and shoveled them off to the CRG. Spending one of six sessions on this issue alone to me says that things are already getting out of control :-) I quite understand how this happens (remember I was chair of the CRG!)

> > These other issues came up since it was felt that better Wide_Character
> > support would (might?) make it unnecessary for the standard to directly deal
> > with Latin-9. (Implementations still would have to, in all likelihood.)

Well of course in practice Latin-9 is barely interesting, it just introduces a different name for the Euro character. But for sure most computing with Ada will be done using Latin-9 whatever the Ada standard says :-)

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002 10:14 PM

Well, it sounds worse than it is. The afternoon session of the last day is typically short. We didn't get back from lunch until about 2:15, and we adjourned at 3:28. Still, I probably would have dozed off during this discussion if I hadn't been taking notes...

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002 9:19 AM

I agree that the ARG should not spend time thinking about characters. And we should not add all kinds of verbiage about character sets to the RM. But if there is a character-set standard that can be simply referred to, why not. Apparently, there *is* a definition of which 31-bit characters are "letters". I thought the intent was to simply refer to that definition (which of course changes from year to year).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 8:45 PM

Probably that's reasonable, although I worry that this will generate a lot of busy work in implementations for extraordinarily little gain.

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002 9:58 AM

Yes. The purpose of Mike Yoder's "homework assignment" was to determine how difficult it is to write the "Is_Letter" function that the Ada lexer would need. And a case conversion routine, I guess. And how inefficient these would have to be. (People at the meeting were concerned about huge character-set tables having to be in the compiler.)

I'm not at all interested in these character set issues. If folks can make an AI that is trivial to implement (efficiently), and invokes all character-set junk by reference to other standards, then I suppose it's OK with me.

[ Insert my usual rant about what's important, here. ;-) ]

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002 10:14 AM

I agree with Bob in all respects, including the parenthetical comment.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, November 27, 2002 4:27 AM

During the last meeting we discussed the possibility of allowing any Unicode character (er, I mean, ISO 10646) in Ada source.
Some people were concerned that the classification tables and the uppercase translation tables would be huge and complex to produce. Mike Y provided some input on this topic a while back, but since I (and probably other people) prefer to see the real tables, I spent a couple of hours writing a little Ada program to parse the Unicode database and spit out aggregates for these tables.

I am attaching to this message three classification tables (letters, digits, and spaces) as well as the table that converts to uppercase. The latter is the largest one, and it only has 419 entries, for a total of 5028 bytes. And that's with a representation that is not particularly compact: a more space-efficient representation could be obtained for instance by storing the ranges as (First, Length) instead of (First, Last).

The tables would change slightly depending on the rules that we choose (e.g. for the syntax of identifiers) but their size would not be substantially modified.

This demonstrates two things:

1 - The tables are easy to produce from the Unicode database.
2 - The tables are small.

---

Digits : constant Ranges := (
   (16#30#, 16#39#), -- DIGIT ZERO .. DIGIT NINE
   (16#B2#, 16#B3#), -- SUPERSCRIPT TWO .. SUPERSCRIPT THREE
   (16#B9#, 16#B9#), -- SUPERSCRIPT ONE .. SUPERSCRIPT ONE
   (16#660#, 16#669#), -- ARABIC-INDIC DIGIT ZERO .. ARABIC-INDIC DIGIT NINE
   (16#6F0#, 16#6F9#), -- EXTENDED ARABIC-INDIC DIGIT ZERO .. EXTENDED ARABIC-INDIC DIGIT NINE
   (16#966#, 16#96F#), -- DEVANAGARI DIGIT ZERO .. DEVANAGARI DIGIT NINE
   (16#9E6#, 16#9EF#), -- BENGALI DIGIT ZERO .. BENGALI DIGIT NINE
   (16#A66#, 16#A6F#), -- GURMUKHI DIGIT ZERO .. GURMUKHI DIGIT NINE
   (16#AE6#, 16#AEF#), -- GUJARATI DIGIT ZERO .. GUJARATI DIGIT NINE
   (16#B66#, 16#B6F#), -- ORIYA DIGIT ZERO .. ORIYA DIGIT NINE
   (16#BE7#, 16#BEF#), -- TAMIL DIGIT ONE .. TAMIL DIGIT NINE
   (16#C66#, 16#C6F#), -- TELUGU DIGIT ZERO .. TELUGU DIGIT NINE
   (16#CE6#, 16#CEF#), -- KANNADA DIGIT ZERO .. KANNADA DIGIT NINE
   (16#D66#, 16#D6F#), -- MALAYALAM DIGIT ZERO .. MALAYALAM DIGIT NINE
   (16#E50#, 16#E59#), -- THAI DIGIT ZERO .. THAI DIGIT NINE
   (16#ED0#, 16#ED9#), -- LAO DIGIT ZERO .. LAO DIGIT NINE
   (16#F20#, 16#F29#), -- TIBETAN DIGIT ZERO .. TIBETAN DIGIT NINE
   (16#1040#, 16#1049#), -- MYANMAR DIGIT ZERO .. MYANMAR DIGIT NINE
   (16#1369#, 16#1371#), -- ETHIOPIC DIGIT ONE .. ETHIOPIC DIGIT NINE
   (16#17E0#, 16#17E9#), -- KHMER DIGIT ZERO .. KHMER DIGIT NINE
   (16#1810#, 16#1819#), -- MONGOLIAN DIGIT ZERO .. MONGOLIAN DIGIT NINE
   (16#2070#, 16#2070#), -- SUPERSCRIPT ZERO .. SUPERSCRIPT ZERO
   (16#2074#, 16#2079#), -- SUPERSCRIPT FOUR .. SUPERSCRIPT NINE
   (16#2080#, 16#2089#), -- SUBSCRIPT ZERO .. SUBSCRIPT NINE
   (16#FF10#, 16#FF19#), -- FULLWIDTH DIGIT ZERO .. FULLWIDTH DIGIT NINE
   (16#1D7CE#, 16#1D7FF#) -- MATHEMATICAL BOLD DIGIT ZERO .. MATHEMATICAL MONOSPACE DIGIT NINE
   );

---

Letters : constant Ranges := (
   (16#41#, 16#5A#), -- LATIN CAPITAL LETTER A .. LATIN CAPITAL LETTER Z
   (16#61#, 16#7A#), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
   (16#AA#, 16#AA#), -- FEMININE ORDINAL INDICATOR .. FEMININE ORDINAL INDICATOR
   (16#B5#, 16#B5#), -- MICRO SIGN .. MICRO SIGN
   (16#BA#, 16#BA#), -- MASCULINE ORDINAL INDICATOR .. MASCULINE ORDINAL INDICATOR
   (16#C0#, 16#D6#), -- LATIN CAPITAL LETTER A WITH GRAVE .. LATIN CAPITAL LETTER O WITH DIAERESIS
   (16#D8#, 16#F6#), -- LATIN CAPITAL LETTER O WITH STROKE .. LATIN SMALL LETTER O WITH DIAERESIS
   (16#F8#, 16#2B8#), -- LATIN SMALL LETTER O WITH STROKE .. MODIFIER LETTER SMALL Y
   (16#2BB#, 16#2C1#), -- MODIFIER LETTER TURNED COMMA .. MODIFIER LETTER REVERSED GLOTTAL STOP
   (16#2D0#, 16#2D1#), -- MODIFIER LETTER TRIANGULAR COLON .. MODIFIER LETTER HALF TRIANGULAR COLON
   (16#2E0#, 16#2E4#), -- MODIFIER LETTER SMALL GAMMA .. MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
   (16#2EE#, 16#2EE#), -- MODIFIER LETTER DOUBLE APOSTROPHE .. MODIFIER LETTER DOUBLE APOSTROPHE
   (16#37A#, 16#37A#), -- GREEK YPOGEGRAMMENI .. GREEK YPOGEGRAMMENI
   (16#386#, 16#386#), -- GREEK CAPITAL LETTER ALPHA WITH TONOS .. GREEK CAPITAL LETTER ALPHA WITH TONOS
   (16#388#, 16#3F5#), -- GREEK CAPITAL LETTER EPSILON WITH TONOS .. GREEK LUNATE EPSILON SYMBOL
   (16#400#, 16#481#), -- CYRILLIC CAPITAL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER KOPPA
   (16#48A#, 16#559#), -- CYRILLIC CAPITAL LETTER SHORT I WITH TAIL .. ARMENIAN MODIFIER LETTER LEFT HALF RING
   (16#561#, 16#587#), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LIGATURE ECH YIWN
   (16#5D0#, 16#5F2#), -- HEBREW LETTER ALEF .. HEBREW LIGATURE YIDDISH DOUBLE YOD
   (16#621#, 16#64A#), -- ARABIC LETTER HAMZA .. ARABIC LETTER YEH
   (16#66E#, 16#66F#), -- ARABIC LETTER DOTLESS BEH .. ARABIC LETTER DOTLESS QAF
   (16#671#, 16#6D3#), -- ARABIC LETTER ALEF WASLA .. ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
   (16#6D5#, 16#6D5#), -- ARABIC LETTER AE .. ARABIC LETTER AE
   (16#6E5#, 16#6E6#), -- ARABIC SMALL WAW .. ARABIC SMALL YEH
   (16#6FA#, 16#6FC#), -- ARABIC LETTER SHEEN WITH DOT BELOW .. ARABIC LETTER GHAIN WITH DOT BELOW
   (16#710#, 16#710#), -- SYRIAC LETTER ALAPH .. SYRIAC LETTER ALAPH
   (16#712#, 16#72C#), -- SYRIAC LETTER BETH .. SYRIAC LETTER TAW
   (16#780#, 16#7A5#), -- THAANA LETTER HAA .. THAANA LETTER WAAVU
   (16#7B1#, 16#7B1#), -- THAANA LETTER NAA .. THAANA LETTER NAA
   (16#905#, 16#939#), -- DEVANAGARI LETTER A .. DEVANAGARI LETTER HA
   (16#93D#, 16#93D#), -- DEVANAGARI SIGN AVAGRAHA .. DEVANAGARI SIGN AVAGRAHA
   (16#950#, 16#950#), -- DEVANAGARI OM .. DEVANAGARI OM
   (16#958#, 16#961#), -- DEVANAGARI LETTER QA .. DEVANAGARI LETTER VOCALIC LL
   (16#985#, 16#9B9#), -- BENGALI LETTER A .. BENGALI LETTER HA
   (16#9DC#, 16#9E1#), -- BENGALI LETTER RRA .. BENGALI LETTER VOCALIC LL
   (16#9F0#, 16#9F1#), -- BENGALI LETTER RA WITH MIDDLE DIAGONAL .. BENGALI LETTER RA WITH LOWER DIAGONAL
   (16#A05#, 16#A39#), -- GURMUKHI LETTER A .. GURMUKHI LETTER HA
   (16#A59#, 16#A5E#), -- GURMUKHI LETTER KHHA .. GURMUKHI LETTER FA
   (16#A72#, 16#A74#), -- GURMUKHI IRI .. GURMUKHI EK ONKAR
   (16#A85#, 16#AB9#), -- GUJARATI LETTER A .. GUJARATI LETTER HA
   (16#ABD#, 16#ABD#), -- GUJARATI SIGN AVAGRAHA .. GUJARATI SIGN AVAGRAHA
   (16#AD0#, 16#AE0#), -- GUJARATI OM .. GUJARATI LETTER VOCALIC RR
   (16#B05#, 16#B39#), -- ORIYA LETTER A .. ORIYA LETTER HA
   (16#B3D#, 16#B3D#), -- ORIYA SIGN AVAGRAHA .. ORIYA SIGN AVAGRAHA
   (16#B5C#, 16#B61#), -- ORIYA LETTER RRA .. ORIYA LETTER VOCALIC LL
   (16#B83#, 16#BB9#), -- TAMIL SIGN VISARGA .. TAMIL LETTER HA
   (16#C05#, 16#C39#), -- TELUGU LETTER A .. TELUGU LETTER HA
   (16#C60#, 16#C61#), -- TELUGU LETTER VOCALIC RR .. TELUGU LETTER VOCALIC LL
   (16#C85#, 16#CB9#), -- KANNADA LETTER A .. KANNADA LETTER HA
   (16#CDE#, 16#CE1#), -- KANNADA LETTER FA .. KANNADA LETTER VOCALIC LL
   (16#D05#, 16#D39#), -- MALAYALAM LETTER A .. MALAYALAM LETTER HA
   (16#D60#, 16#D61#), -- MALAYALAM LETTER VOCALIC RR .. MALAYALAM LETTER VOCALIC LL
   (16#D85#, 16#DC6#), -- SINHALA LETTER AYANNA .. SINHALA LETTER FAYANNA
   (16#E01#, 16#E30#), -- THAI CHARACTER KO KAI .. THAI CHARACTER SARA A
   (16#E32#, 16#E33#), -- THAI CHARACTER SARA AA .. THAI CHARACTER SARA AM
   (16#E40#, 16#E46#), -- THAI CHARACTER SARA E .. THAI CHARACTER MAIYAMOK
   (16#E81#, 16#EB0#), -- LAO LETTER KO .. LAO VOWEL SIGN A
   (16#EB2#, 16#EB3#), -- LAO VOWEL SIGN AA .. LAO VOWEL SIGN AM
   (16#EBD#, 16#EC6#), -- LAO SEMIVOWEL SIGN NYO .. LAO KO LA
   (16#EDC#, 16#F00#), -- LAO HO NO .. TIBETAN SYLLABLE OM
   (16#F40#, 16#F6A#), -- TIBETAN LETTER KA .. TIBETAN LETTER FIXED-FORM RA
   (16#F88#, 16#F8B#), -- TIBETAN SIGN LCE TSA CAN .. TIBETAN SIGN GRU MED RGYINGS
   (16#1000#, 16#102A#), -- MYANMAR LETTER KA .. MYANMAR LETTER AU
   (16#1050#, 16#1055#), -- MYANMAR LETTER SHA .. MYANMAR LETTER VOCALIC LL
   (16#10A0#, 16#10F8#), -- GEORGIAN CAPITAL LETTER AN .. GEORGIAN LETTER ELIFI
   (16#1100#, 16#135A#), -- HANGUL CHOSEONG KIYEOK .. ETHIOPIC SYLLABLE FYA
   (16#13A0#, 16#166C#), -- CHEROKEE LETTER A .. CANADIAN SYLLABICS CARRIER TTSA
   (16#166F#, 16#1676#), -- CANADIAN SYLLABICS QAI .. CANADIAN SYLLABICS NNGAA
   (16#1681#, 16#169A#), -- OGHAM LETTER BEITH .. OGHAM LETTER PEITH
   (16#16A0#, 16#16EA#), -- RUNIC LETTER FEHU FEOH FE F .. RUNIC LETTER X
   (16#1700#, 16#1711#), -- TAGALOG LETTER A .. TAGALOG LETTER HA
   (16#1720#, 16#1731#), -- HANUNOO LETTER A .. HANUNOO LETTER HA
   (16#1740#, 16#1751#), -- BUHID LETTER A .. BUHID LETTER HA
   (16#1760#, 16#1770#), -- TAGBANWA LETTER A .. TAGBANWA LETTER SA
   (16#1780#, 16#17B3#), -- KHMER LETTER KA .. KHMER INDEPENDENT VOWEL QAU
   (16#17D7#, 16#17D7#), -- KHMER SIGN LEK TOO .. KHMER SIGN LEK TOO
   (16#17DC#, 16#17DC#), -- KHMER SIGN AVAKRAHASANYA .. KHMER SIGN AVAKRAHASANYA
   (16#1820#, 16#18A8#), -- MONGOLIAN LETTER A .. MONGOLIAN LETTER MANCHU ALI GALI BHA
   (16#1E00#, 16#1FBC#), -- LATIN CAPITAL LETTER A WITH RING BELOW .. GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
   (16#1FBE#, 16#1FBE#), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
   (16#1FC2#, 16#1FCC#), -- GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
   (16#1FD0#, 16#1FDB#), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK CAPITAL LETTER IOTA WITH OXIA
   (16#1FE0#, 16#1FEC#), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK CAPITAL LETTER RHO WITH DASIA
   (16#1FF2#, 16#1FFC#), -- GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
   (16#2071#, 16#2071#), -- SUPERSCRIPT LATIN SMALL LETTER I .. SUPERSCRIPT LATIN SMALL LETTER I
   (16#207F#, 16#207F#), -- SUPERSCRIPT LATIN SMALL LETTER N .. SUPERSCRIPT LATIN SMALL LETTER N
   (16#2102#, 16#2102#), -- DOUBLE-STRUCK CAPITAL C .. DOUBLE-STRUCK CAPITAL C
   (16#2107#, 16#2107#), -- EULER CONSTANT .. EULER CONSTANT
   (16#210A#, 16#2113#), -- SCRIPT SMALL G .. SCRIPT SMALL L
   (16#2115#, 16#2115#), -- DOUBLE-STRUCK CAPITAL N .. DOUBLE-STRUCK CAPITAL N
   (16#2119#, 16#211D#), -- DOUBLE-STRUCK CAPITAL P .. DOUBLE-STRUCK CAPITAL R
   (16#2124#, 16#2124#), -- DOUBLE-STRUCK CAPITAL Z .. DOUBLE-STRUCK CAPITAL Z
   (16#2126#, 16#2126#), -- OHM SIGN ..
OHM SIGN (16#2128#, 16#2128#), -- BLACK-LETTER CAPITAL Z .. BLACK-LETTER CAPITAL Z (16#212A#, 16#212D#), -- KELVIN SIGN .. BLACK-LETTER CAPITAL C (16#212F#, 16#2131#), -- SCRIPT SMALL E .. SCRIPT CAPITAL F (16#2133#, 16#2139#), -- SCRIPT CAPITAL M .. INFORMATION SOURCE (16#213D#, 16#213F#), -- DOUBLE-STRUCK SMALL GAMMA .. DOUBLE-STRUCK CAPITAL PI (16#2145#, 16#2149#), -- DOUBLE-STRUCK ITALIC CAPITAL D .. DOUBLE-STRUCK ITALIC SMALL J (16#3005#, 16#3006#), -- IDEOGRAPHIC ITERATION MARK .. IDEOGRAPHIC CLOSING MARK (16#3031#, 16#3035#), -- VERTICAL KANA REPEAT MARK .. VERTICAL KANA REPEAT MARK LOWER HALF (16#303B#, 16#303C#), -- VERTICAL IDEOGRAPHIC ITERATION MARK .. MASU MARK (16#3041#, 16#3096#), -- HIRAGANA LETTER SMALL A .. HIRAGANA LETTER SMALL KE (16#309D#, 16#309F#), -- HIRAGANA ITERATION MARK .. HIRAGANA DIGRAPH YORI (16#30A1#, 16#30FA#), -- KATAKANA LETTER SMALL A .. KATAKANA LETTER VO (16#30FC#, 16#318E#), -- KATAKANA-HIRAGANA PROLONGED SOUND MARK .. HANGUL LETTER ARAEAE (16#31A0#, 16#31FF#), -- BOPOMOFO LETTER BU .. KATAKANA LETTER SMALL RO (16#3400#, 16#A48C#), -- .. YI SYLLABLE YYR (16#AC00#, 16#D7A3#), -- .. (16#F900#, 16#FB1D#), -- CJK COMPATIBILITY IDEOGRAPH-F900 .. HEBREW LETTER YOD WITH HIRIQ (16#FB1F#, 16#FB28#), -- HEBREW LIGATURE YIDDISH YOD YOD PATAH .. HEBREW LETTER WIDE TAV (16#FB2A#, 16#FD3D#), -- HEBREW LETTER SHIN WITH SHIN DOT .. ARABIC LIGATURE ALEF WITH FATHATAN ISOLATED FORM (16#FD50#, 16#FDFB#), -- ARABIC LIGATURE TEH WITH JEEM WITH MEEM INITIAL FORM .. ARABIC LIGATURE JALLAJALALOUHOU (16#FE70#, 16#FEFC#), -- ARABIC FATHATAN ISOLATED FORM .. ARABIC LIGATURE LAM WITH ALEF FINAL FORM (16#FF21#, 16#FF3A#), -- FULLWIDTH LATIN CAPITAL LETTER A .. FULLWIDTH LATIN CAPITAL LETTER Z (16#FF41#, 16#FF5A#), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z (16#FF66#, 16#FFDC#), -- HALFWIDTH KATAKANA LETTER WO .. HALFWIDTH HANGUL LETTER I (16#10300#, 16#1031E#), -- OLD ITALIC LETTER A .. 
OLD ITALIC LETTER UU (16#10330#, 16#10349#), -- GOTHIC LETTER AHSA .. GOTHIC LETTER OTHAL (16#10400#, 16#1044D#), -- DESERET CAPITAL LETTER LONG I .. DESERET SMALL LETTER ENG (16#1D400#, 16#1D6C0#), -- MATHEMATICAL BOLD CAPITAL A .. MATHEMATICAL BOLD CAPITAL OMEGA (16#1D6C2#, 16#1D6DA#), -- MATHEMATICAL BOLD SMALL ALPHA .. MATHEMATICAL BOLD SMALL OMEGA (16#1D6DC#, 16#1D6FA#), -- MATHEMATICAL BOLD EPSILON SYMBOL .. MATHEMATICAL ITALIC CAPITAL OMEGA (16#1D6FC#, 16#1D714#), -- MATHEMATICAL ITALIC SMALL ALPHA .. MATHEMATICAL ITALIC SMALL OMEGA (16#1D716#, 16#1D734#), -- MATHEMATICAL ITALIC EPSILON SYMBOL .. MATHEMATICAL BOLD ITALIC CAPITAL OMEGA (16#1D736#, 16#1D74E#), -- MATHEMATICAL BOLD ITALIC SMALL ALPHA .. MATHEMATICAL BOLD ITALIC SMALL OMEGA (16#1D750#, 16#1D76E#), -- MATHEMATICAL BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA (16#1D770#, 16#1D788#), -- MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD SMALL OMEGA (16#1D78A#, 16#1D7A8#), -- MATHEMATICAL SANS-SERIF BOLD EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA (16#1D7AA#, 16#1D7C2#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL OMEGA (16#1D7C4#, 16#1D7C9#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL (16#20000#, 16#2FA1D#) -- .. CJK COMPATIBILITY IDEOGRAPH-2FA1D ); --- Spaces : constant Ranges := ( (16#20#, 16#20#), -- SPACE .. SPACE (16#A0#, 16#A0#), -- NO-BREAK SPACE .. NO-BREAK SPACE (16#1680#, 16#1680#), -- OGHAM SPACE MARK .. OGHAM SPACE MARK (16#2000#, 16#200B#), -- EN QUAD .. ZERO WIDTH SPACE (16#202F#, 16#202F#), -- NARROW NO-BREAK SPACE .. NARROW NO-BREAK SPACE (16#205F#, 16#205F#), -- MEDIUM MATHEMATICAL SPACE .. MEDIUM MATHEMATICAL SPACE (16#3000#, 16#3000#) -- IDEOGRAPHIC SPACE .. IDEOGRAPHIC SPACE ); --- Uppercase_Mapping : constant Mapping_Ranges := ( (16#61#, 16#7A#, -32), -- LATIN SMALL LETTER A .. 
LATIN SMALL LETTER Z (16#B5#, 16#B5#, 743), -- MICRO SIGN .. MICRO SIGN (16#E0#, 16#F6#, -32), -- LATIN SMALL LETTER A WITH GRAVE .. LATIN SMALL LETTER O WITH DIAERESIS (16#F8#, 16#FE#, -32), -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER THORN (16#FF#, 16#FF#, 121), -- LATIN SMALL LETTER Y WITH DIAERESIS .. LATIN SMALL LETTER Y WITH DIAERESIS (16#101#, 16#101#, -1), -- LATIN SMALL LETTER A WITH MACRON .. LATIN SMALL LETTER A WITH MACRON (16#103#, 16#103#, -1), -- LATIN SMALL LETTER A WITH BREVE .. LATIN SMALL LETTER A WITH BREVE (16#105#, 16#105#, -1), -- LATIN SMALL LETTER A WITH OGONEK .. LATIN SMALL LETTER A WITH OGONEK (16#107#, 16#107#, -1), -- LATIN SMALL LETTER C WITH ACUTE .. LATIN SMALL LETTER C WITH ACUTE (16#109#, 16#109#, -1), -- LATIN SMALL LETTER C WITH CIRCUMFLEX .. LATIN SMALL LETTER C WITH CIRCUMFLEX (16#10B#, 16#10B#, -1), -- LATIN SMALL LETTER C WITH DOT ABOVE .. LATIN SMALL LETTER C WITH DOT ABOVE (16#10D#, 16#10D#, -1), -- LATIN SMALL LETTER C WITH CARON .. LATIN SMALL LETTER C WITH CARON (16#10F#, 16#10F#, -1), -- LATIN SMALL LETTER D WITH CARON .. LATIN SMALL LETTER D WITH CARON (16#111#, 16#111#, -1), -- LATIN SMALL LETTER D WITH STROKE .. LATIN SMALL LETTER D WITH STROKE (16#113#, 16#113#, -1), -- LATIN SMALL LETTER E WITH MACRON .. LATIN SMALL LETTER E WITH MACRON (16#115#, 16#115#, -1), -- LATIN SMALL LETTER E WITH BREVE .. LATIN SMALL LETTER E WITH BREVE (16#117#, 16#117#, -1), -- LATIN SMALL LETTER E WITH DOT ABOVE .. LATIN SMALL LETTER E WITH DOT ABOVE (16#119#, 16#119#, -1), -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK (16#11B#, 16#11B#, -1), -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON (16#11D#, 16#11D#, -1), -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX (16#11F#, 16#11F#, -1), -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE (16#121#, 16#121#, -1), -- LATIN SMALL LETTER G WITH DOT ABOVE .. 
LATIN SMALL LETTER G WITH DOT ABOVE (16#123#, 16#123#, -1), -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA (16#125#, 16#125#, -1), -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX (16#127#, 16#127#, -1), -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE (16#129#, 16#129#, -1), -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE (16#12B#, 16#12B#, -1), -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON (16#12D#, 16#12D#, -1), -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE (16#12F#, 16#12F#, -1), -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK (16#131#, 16#131#, -232), -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I (16#133#, 16#133#, -1), -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ (16#135#, 16#135#, -1), -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX (16#137#, 16#137#, -1), -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA (16#13A#, 16#13A#, -1), -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE (16#13C#, 16#13C#, -1), -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA (16#13E#, 16#13E#, -1), -- LATIN SMALL LETTER L WITH CARON .. LATIN SMALL LETTER L WITH CARON (16#140#, 16#140#, -1), -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT (16#142#, 16#142#, -1), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE (16#144#, 16#144#, -1), -- LATIN SMALL LETTER N WITH ACUTE .. LATIN SMALL LETTER N WITH ACUTE (16#146#, 16#146#, -1), -- LATIN SMALL LETTER N WITH CEDILLA .. LATIN SMALL LETTER N WITH CEDILLA (16#148#, 16#148#, -1), -- LATIN SMALL LETTER N WITH CARON .. LATIN SMALL LETTER N WITH CARON (16#14B#, 16#14B#, -1), -- LATIN SMALL LETTER ENG .. LATIN SMALL LETTER ENG (16#14D#, 16#14D#, -1), -- LATIN SMALL LETTER O WITH MACRON .. 
LATIN SMALL LETTER O WITH MACRON (16#14F#, 16#14F#, -1), -- LATIN SMALL LETTER O WITH BREVE .. LATIN SMALL LETTER O WITH BREVE (16#151#, 16#151#, -1), -- LATIN SMALL LETTER O WITH DOUBLE ACUTE .. LATIN SMALL LETTER O WITH DOUBLE ACUTE (16#153#, 16#153#, -1), -- LATIN SMALL LIGATURE OE .. LATIN SMALL LIGATURE OE (16#155#, 16#155#, -1), -- LATIN SMALL LETTER R WITH ACUTE .. LATIN SMALL LETTER R WITH ACUTE (16#157#, 16#157#, -1), -- LATIN SMALL LETTER R WITH CEDILLA .. LATIN SMALL LETTER R WITH CEDILLA (16#159#, 16#159#, -1), -- LATIN SMALL LETTER R WITH CARON .. LATIN SMALL LETTER R WITH CARON (16#15B#, 16#15B#, -1), -- LATIN SMALL LETTER S WITH ACUTE .. LATIN SMALL LETTER S WITH ACUTE (16#15D#, 16#15D#, -1), -- LATIN SMALL LETTER S WITH CIRCUMFLEX .. LATIN SMALL LETTER S WITH CIRCUMFLEX (16#15F#, 16#15F#, -1), -- LATIN SMALL LETTER S WITH CEDILLA .. LATIN SMALL LETTER S WITH CEDILLA (16#161#, 16#161#, -1), -- LATIN SMALL LETTER S WITH CARON .. LATIN SMALL LETTER S WITH CARON (16#163#, 16#163#, -1), -- LATIN SMALL LETTER T WITH CEDILLA .. LATIN SMALL LETTER T WITH CEDILLA (16#165#, 16#165#, -1), -- LATIN SMALL LETTER T WITH CARON .. LATIN SMALL LETTER T WITH CARON (16#167#, 16#167#, -1), -- LATIN SMALL LETTER T WITH STROKE .. LATIN SMALL LETTER T WITH STROKE (16#169#, 16#169#, -1), -- LATIN SMALL LETTER U WITH TILDE .. LATIN SMALL LETTER U WITH TILDE (16#16B#, 16#16B#, -1), -- LATIN SMALL LETTER U WITH MACRON .. LATIN SMALL LETTER U WITH MACRON (16#16D#, 16#16D#, -1), -- LATIN SMALL LETTER U WITH BREVE .. LATIN SMALL LETTER U WITH BREVE (16#16F#, 16#16F#, -1), -- LATIN SMALL LETTER U WITH RING ABOVE .. LATIN SMALL LETTER U WITH RING ABOVE (16#171#, 16#171#, -1), -- LATIN SMALL LETTER U WITH DOUBLE ACUTE .. LATIN SMALL LETTER U WITH DOUBLE ACUTE (16#173#, 16#173#, -1), -- LATIN SMALL LETTER U WITH OGONEK .. LATIN SMALL LETTER U WITH OGONEK (16#175#, 16#175#, -1), -- LATIN SMALL LETTER W WITH CIRCUMFLEX .. 
LATIN SMALL LETTER W WITH CIRCUMFLEX (16#177#, 16#177#, -1), -- LATIN SMALL LETTER Y WITH CIRCUMFLEX .. LATIN SMALL LETTER Y WITH CIRCUMFLEX (16#17A#, 16#17A#, -1), -- LATIN SMALL LETTER Z WITH ACUTE .. LATIN SMALL LETTER Z WITH ACUTE (16#17C#, 16#17C#, -1), -- LATIN SMALL LETTER Z WITH DOT ABOVE .. LATIN SMALL LETTER Z WITH DOT ABOVE (16#17E#, 16#17E#, -1), -- LATIN SMALL LETTER Z WITH CARON .. LATIN SMALL LETTER Z WITH CARON (16#17F#, 16#17F#, -300), -- LATIN SMALL LETTER LONG S .. LATIN SMALL LETTER LONG S (16#183#, 16#183#, -1), -- LATIN SMALL LETTER B WITH TOPBAR .. LATIN SMALL LETTER B WITH TOPBAR (16#185#, 16#185#, -1), -- LATIN SMALL LETTER TONE SIX .. LATIN SMALL LETTER TONE SIX (16#188#, 16#188#, -1), -- LATIN SMALL LETTER C WITH HOOK .. LATIN SMALL LETTER C WITH HOOK (16#18C#, 16#18C#, -1), -- LATIN SMALL LETTER D WITH TOPBAR .. LATIN SMALL LETTER D WITH TOPBAR (16#192#, 16#192#, -1), -- LATIN SMALL LETTER F WITH HOOK .. LATIN SMALL LETTER F WITH HOOK (16#195#, 16#195#, 97), -- LATIN SMALL LETTER HV .. LATIN SMALL LETTER HV (16#199#, 16#199#, -1), -- LATIN SMALL LETTER K WITH HOOK .. LATIN SMALL LETTER K WITH HOOK (16#19E#, 16#19E#, 130), -- LATIN SMALL LETTER N WITH LONG RIGHT LEG .. LATIN SMALL LETTER N WITH LONG RIGHT LEG (16#1A1#, 16#1A1#, -1), -- LATIN SMALL LETTER O WITH HORN .. LATIN SMALL LETTER O WITH HORN (16#1A3#, 16#1A3#, -1), -- LATIN SMALL LETTER OI .. LATIN SMALL LETTER OI (16#1A5#, 16#1A5#, -1), -- LATIN SMALL LETTER P WITH HOOK .. LATIN SMALL LETTER P WITH HOOK (16#1A8#, 16#1A8#, -1), -- LATIN SMALL LETTER TONE TWO .. LATIN SMALL LETTER TONE TWO (16#1AD#, 16#1AD#, -1), -- LATIN SMALL LETTER T WITH HOOK .. LATIN SMALL LETTER T WITH HOOK (16#1B0#, 16#1B0#, -1), -- LATIN SMALL LETTER U WITH HORN .. LATIN SMALL LETTER U WITH HORN (16#1B4#, 16#1B4#, -1), -- LATIN SMALL LETTER Y WITH HOOK .. LATIN SMALL LETTER Y WITH HOOK (16#1B6#, 16#1B6#, -1), -- LATIN SMALL LETTER Z WITH STROKE .. 
LATIN SMALL LETTER Z WITH STROKE (16#1B9#, 16#1B9#, -1), -- LATIN SMALL LETTER EZH REVERSED .. LATIN SMALL LETTER EZH REVERSED (16#1BD#, 16#1BD#, -1), -- LATIN SMALL LETTER TONE FIVE .. LATIN SMALL LETTER TONE FIVE (16#1BF#, 16#1BF#, 56), -- LATIN LETTER WYNN .. LATIN LETTER WYNN (16#1C5#, 16#1C5#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (16#1C6#, 16#1C6#, -2), -- LATIN SMALL LETTER DZ WITH CARON .. LATIN SMALL LETTER DZ WITH CARON (16#1C8#, 16#1C8#, -1), -- LATIN CAPITAL LETTER L WITH SMALL LETTER J .. LATIN CAPITAL LETTER L WITH SMALL LETTER J (16#1C9#, 16#1C9#, -2), -- LATIN SMALL LETTER LJ .. LATIN SMALL LETTER LJ (16#1CB#, 16#1CB#, -1), -- LATIN CAPITAL LETTER N WITH SMALL LETTER J .. LATIN CAPITAL LETTER N WITH SMALL LETTER J (16#1CC#, 16#1CC#, -2), -- LATIN SMALL LETTER NJ .. LATIN SMALL LETTER NJ (16#1CE#, 16#1CE#, -1), -- LATIN SMALL LETTER A WITH CARON .. LATIN SMALL LETTER A WITH CARON (16#1D0#, 16#1D0#, -1), -- LATIN SMALL LETTER I WITH CARON .. LATIN SMALL LETTER I WITH CARON (16#1D2#, 16#1D2#, -1), -- LATIN SMALL LETTER O WITH CARON .. LATIN SMALL LETTER O WITH CARON (16#1D4#, 16#1D4#, -1), -- LATIN SMALL LETTER U WITH CARON .. LATIN SMALL LETTER U WITH CARON (16#1D6#, 16#1D6#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER U WITH DIAERESIS AND MACRON (16#1D8#, 16#1D8#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE (16#1DA#, 16#1DA#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND CARON .. LATIN SMALL LETTER U WITH DIAERESIS AND CARON (16#1DC#, 16#1DC#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE .. LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE (16#1DD#, 16#1DD#, -79), -- LATIN SMALL LETTER TURNED E .. LATIN SMALL LETTER TURNED E (16#1DF#, 16#1DF#, -1), -- LATIN SMALL LETTER A WITH DIAERESIS AND MACRON .. 
LATIN SMALL LETTER A WITH DIAERESIS AND MACRON (16#1E1#, 16#1E1#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON (16#1E3#, 16#1E3#, -1), -- LATIN SMALL LETTER AE WITH MACRON .. LATIN SMALL LETTER AE WITH MACRON (16#1E5#, 16#1E5#, -1), -- LATIN SMALL LETTER G WITH STROKE .. LATIN SMALL LETTER G WITH STROKE (16#1E7#, 16#1E7#, -1), -- LATIN SMALL LETTER G WITH CARON .. LATIN SMALL LETTER G WITH CARON (16#1E9#, 16#1E9#, -1), -- LATIN SMALL LETTER K WITH CARON .. LATIN SMALL LETTER K WITH CARON (16#1EB#, 16#1EB#, -1), -- LATIN SMALL LETTER O WITH OGONEK .. LATIN SMALL LETTER O WITH OGONEK (16#1ED#, 16#1ED#, -1), -- LATIN SMALL LETTER O WITH OGONEK AND MACRON .. LATIN SMALL LETTER O WITH OGONEK AND MACRON (16#1EF#, 16#1EF#, -1), -- LATIN SMALL LETTER EZH WITH CARON .. LATIN SMALL LETTER EZH WITH CARON (16#1F2#, 16#1F2#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z (16#1F3#, 16#1F3#, -2), -- LATIN SMALL LETTER DZ .. LATIN SMALL LETTER DZ (16#1F5#, 16#1F5#, -1), -- LATIN SMALL LETTER G WITH ACUTE .. LATIN SMALL LETTER G WITH ACUTE (16#1F9#, 16#1F9#, -1), -- LATIN SMALL LETTER N WITH GRAVE .. LATIN SMALL LETTER N WITH GRAVE (16#1FB#, 16#1FB#, -1), -- LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE .. LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE (16#1FD#, 16#1FD#, -1), -- LATIN SMALL LETTER AE WITH ACUTE .. LATIN SMALL LETTER AE WITH ACUTE (16#1FF#, 16#1FF#, -1), -- LATIN SMALL LETTER O WITH STROKE AND ACUTE .. LATIN SMALL LETTER O WITH STROKE AND ACUTE (16#201#, 16#201#, -1), -- LATIN SMALL LETTER A WITH DOUBLE GRAVE .. LATIN SMALL LETTER A WITH DOUBLE GRAVE (16#203#, 16#203#, -1), -- LATIN SMALL LETTER A WITH INVERTED BREVE .. LATIN SMALL LETTER A WITH INVERTED BREVE (16#205#, 16#205#, -1), -- LATIN SMALL LETTER E WITH DOUBLE GRAVE .. LATIN SMALL LETTER E WITH DOUBLE GRAVE (16#207#, 16#207#, -1), -- LATIN SMALL LETTER E WITH INVERTED BREVE .. 
LATIN SMALL LETTER E WITH INVERTED BREVE (16#209#, 16#209#, -1), -- LATIN SMALL LETTER I WITH DOUBLE GRAVE .. LATIN SMALL LETTER I WITH DOUBLE GRAVE (16#20B#, 16#20B#, -1), -- LATIN SMALL LETTER I WITH INVERTED BREVE .. LATIN SMALL LETTER I WITH INVERTED BREVE (16#20D#, 16#20D#, -1), -- LATIN SMALL LETTER O WITH DOUBLE GRAVE .. LATIN SMALL LETTER O WITH DOUBLE GRAVE (16#20F#, 16#20F#, -1), -- LATIN SMALL LETTER O WITH INVERTED BREVE .. LATIN SMALL LETTER O WITH INVERTED BREVE (16#211#, 16#211#, -1), -- LATIN SMALL LETTER R WITH DOUBLE GRAVE .. LATIN SMALL LETTER R WITH DOUBLE GRAVE (16#213#, 16#213#, -1), -- LATIN SMALL LETTER R WITH INVERTED BREVE .. LATIN SMALL LETTER R WITH INVERTED BREVE (16#215#, 16#215#, -1), -- LATIN SMALL LETTER U WITH DOUBLE GRAVE .. LATIN SMALL LETTER U WITH DOUBLE GRAVE (16#217#, 16#217#, -1), -- LATIN SMALL LETTER U WITH INVERTED BREVE .. LATIN SMALL LETTER U WITH INVERTED BREVE (16#219#, 16#219#, -1), -- LATIN SMALL LETTER S WITH COMMA BELOW .. LATIN SMALL LETTER S WITH COMMA BELOW (16#21B#, 16#21B#, -1), -- LATIN SMALL LETTER T WITH COMMA BELOW .. LATIN SMALL LETTER T WITH COMMA BELOW (16#21D#, 16#21D#, -1), -- LATIN SMALL LETTER YOGH .. LATIN SMALL LETTER YOGH (16#21F#, 16#21F#, -1), -- LATIN SMALL LETTER H WITH CARON .. LATIN SMALL LETTER H WITH CARON (16#223#, 16#223#, -1), -- LATIN SMALL LETTER OU .. LATIN SMALL LETTER OU (16#225#, 16#225#, -1), -- LATIN SMALL LETTER Z WITH HOOK .. LATIN SMALL LETTER Z WITH HOOK (16#227#, 16#227#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE .. LATIN SMALL LETTER A WITH DOT ABOVE (16#229#, 16#229#, -1), -- LATIN SMALL LETTER E WITH CEDILLA .. LATIN SMALL LETTER E WITH CEDILLA (16#22B#, 16#22B#, -1), -- LATIN SMALL LETTER O WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER O WITH DIAERESIS AND MACRON (16#22D#, 16#22D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND MACRON .. LATIN SMALL LETTER O WITH TILDE AND MACRON (16#22F#, 16#22F#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE .. 
LATIN SMALL LETTER O WITH DOT ABOVE (16#231#, 16#231#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON (16#233#, 16#233#, -1), -- LATIN SMALL LETTER Y WITH MACRON .. LATIN SMALL LETTER Y WITH MACRON (16#253#, 16#253#, -210), -- LATIN SMALL LETTER B WITH HOOK .. LATIN SMALL LETTER B WITH HOOK (16#254#, 16#254#, -206), -- LATIN SMALL LETTER OPEN O .. LATIN SMALL LETTER OPEN O (16#256#, 16#257#, -205), -- LATIN SMALL LETTER D WITH TAIL .. LATIN SMALL LETTER D WITH HOOK (16#259#, 16#259#, -202), -- LATIN SMALL LETTER SCHWA .. LATIN SMALL LETTER SCHWA (16#25B#, 16#25B#, -203), -- LATIN SMALL LETTER OPEN E .. LATIN SMALL LETTER OPEN E (16#260#, 16#260#, -205), -- LATIN SMALL LETTER G WITH HOOK .. LATIN SMALL LETTER G WITH HOOK (16#263#, 16#263#, -207), -- LATIN SMALL LETTER GAMMA .. LATIN SMALL LETTER GAMMA (16#268#, 16#268#, -209), -- LATIN SMALL LETTER I WITH STROKE .. LATIN SMALL LETTER I WITH STROKE (16#269#, 16#269#, -211), -- LATIN SMALL LETTER IOTA .. LATIN SMALL LETTER IOTA (16#26F#, 16#26F#, -211), -- LATIN SMALL LETTER TURNED M .. LATIN SMALL LETTER TURNED M (16#272#, 16#272#, -213), -- LATIN SMALL LETTER N WITH LEFT HOOK .. LATIN SMALL LETTER N WITH LEFT HOOK (16#275#, 16#275#, -214), -- LATIN SMALL LETTER BARRED O .. LATIN SMALL LETTER BARRED O (16#280#, 16#280#, -218), -- LATIN LETTER SMALL CAPITAL R .. LATIN LETTER SMALL CAPITAL R (16#283#, 16#283#, -218), -- LATIN SMALL LETTER ESH .. LATIN SMALL LETTER ESH (16#288#, 16#288#, -218), -- LATIN SMALL LETTER T WITH RETROFLEX HOOK .. LATIN SMALL LETTER T WITH RETROFLEX HOOK (16#28A#, 16#28B#, -217), -- LATIN SMALL LETTER UPSILON .. LATIN SMALL LETTER V WITH HOOK (16#292#, 16#292#, -219), -- LATIN SMALL LETTER EZH .. LATIN SMALL LETTER EZH (16#3AC#, 16#3AC#, -38), -- GREEK SMALL LETTER ALPHA WITH TONOS .. GREEK SMALL LETTER ALPHA WITH TONOS (16#3AD#, 16#3AF#, -37), -- GREEK SMALL LETTER EPSILON WITH TONOS .. 
GREEK SMALL LETTER IOTA WITH TONOS (16#3B1#, 16#3C1#, -32), -- GREEK SMALL LETTER ALPHA .. GREEK SMALL LETTER RHO (16#3C2#, 16#3C2#, -31), -- GREEK SMALL LETTER FINAL SIGMA .. GREEK SMALL LETTER FINAL SIGMA (16#3C3#, 16#3CB#, -32), -- GREEK SMALL LETTER SIGMA .. GREEK SMALL LETTER UPSILON WITH DIALYTIKA (16#3CC#, 16#3CC#, -64), -- GREEK SMALL LETTER OMICRON WITH TONOS .. GREEK SMALL LETTER OMICRON WITH TONOS (16#3CD#, 16#3CE#, -63), -- GREEK SMALL LETTER UPSILON WITH TONOS .. GREEK SMALL LETTER OMEGA WITH TONOS (16#3D0#, 16#3D0#, -62), -- GREEK BETA SYMBOL .. GREEK BETA SYMBOL (16#3D1#, 16#3D1#, -57), -- GREEK THETA SYMBOL .. GREEK THETA SYMBOL (16#3D5#, 16#3D5#, -47), -- GREEK PHI SYMBOL .. GREEK PHI SYMBOL (16#3D6#, 16#3D6#, -54), -- GREEK PI SYMBOL .. GREEK PI SYMBOL (16#3D9#, 16#3D9#, -1), -- GREEK SMALL LETTER ARCHAIC KOPPA .. GREEK SMALL LETTER ARCHAIC KOPPA (16#3DB#, 16#3DB#, -1), -- GREEK SMALL LETTER STIGMA .. GREEK SMALL LETTER STIGMA (16#3DD#, 16#3DD#, -1), -- GREEK SMALL LETTER DIGAMMA .. GREEK SMALL LETTER DIGAMMA (16#3DF#, 16#3DF#, -1), -- GREEK SMALL LETTER KOPPA .. GREEK SMALL LETTER KOPPA (16#3E1#, 16#3E1#, -1), -- GREEK SMALL LETTER SAMPI .. GREEK SMALL LETTER SAMPI (16#3E3#, 16#3E3#, -1), -- COPTIC SMALL LETTER SHEI .. COPTIC SMALL LETTER SHEI (16#3E5#, 16#3E5#, -1), -- COPTIC SMALL LETTER FEI .. COPTIC SMALL LETTER FEI (16#3E7#, 16#3E7#, -1), -- COPTIC SMALL LETTER KHEI .. COPTIC SMALL LETTER KHEI (16#3E9#, 16#3E9#, -1), -- COPTIC SMALL LETTER HORI .. COPTIC SMALL LETTER HORI (16#3EB#, 16#3EB#, -1), -- COPTIC SMALL LETTER GANGIA .. COPTIC SMALL LETTER GANGIA (16#3ED#, 16#3ED#, -1), -- COPTIC SMALL LETTER SHIMA .. COPTIC SMALL LETTER SHIMA (16#3EF#, 16#3EF#, -1), -- COPTIC SMALL LETTER DEI .. COPTIC SMALL LETTER DEI (16#3F0#, 16#3F0#, -86), -- GREEK KAPPA SYMBOL .. GREEK KAPPA SYMBOL (16#3F1#, 16#3F1#, -80), -- GREEK RHO SYMBOL .. GREEK RHO SYMBOL (16#3F2#, 16#3F2#, -79), -- GREEK LUNATE SIGMA SYMBOL .. 
GREEK LUNATE SIGMA SYMBOL (16#3F5#, 16#3F5#, -96), -- GREEK LUNATE EPSILON SYMBOL .. GREEK LUNATE EPSILON SYMBOL (16#430#, 16#44F#, -32), -- CYRILLIC SMALL LETTER A .. CYRILLIC SMALL LETTER YA (16#450#, 16#45F#, -80), -- CYRILLIC SMALL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER DZHE (16#461#, 16#461#, -1), -- CYRILLIC SMALL LETTER OMEGA .. CYRILLIC SMALL LETTER OMEGA (16#463#, 16#463#, -1), -- CYRILLIC SMALL LETTER YAT .. CYRILLIC SMALL LETTER YAT (16#465#, 16#465#, -1), -- CYRILLIC SMALL LETTER IOTIFIED E .. CYRILLIC SMALL LETTER IOTIFIED E (16#467#, 16#467#, -1), -- CYRILLIC SMALL LETTER LITTLE YUS .. CYRILLIC SMALL LETTER LITTLE YUS (16#469#, 16#469#, -1), -- CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS .. CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS (16#46B#, 16#46B#, -1), -- CYRILLIC SMALL LETTER BIG YUS .. CYRILLIC SMALL LETTER BIG YUS (16#46D#, 16#46D#, -1), -- CYRILLIC SMALL LETTER IOTIFIED BIG YUS .. CYRILLIC SMALL LETTER IOTIFIED BIG YUS (16#46F#, 16#46F#, -1), -- CYRILLIC SMALL LETTER KSI .. CYRILLIC SMALL LETTER KSI (16#471#, 16#471#, -1), -- CYRILLIC SMALL LETTER PSI .. CYRILLIC SMALL LETTER PSI (16#473#, 16#473#, -1), -- CYRILLIC SMALL LETTER FITA .. CYRILLIC SMALL LETTER FITA (16#475#, 16#475#, -1), -- CYRILLIC SMALL LETTER IZHITSA .. CYRILLIC SMALL LETTER IZHITSA (16#477#, 16#477#, -1), -- CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT .. CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT (16#479#, 16#479#, -1), -- CYRILLIC SMALL LETTER UK .. CYRILLIC SMALL LETTER UK (16#47B#, 16#47B#, -1), -- CYRILLIC SMALL LETTER ROUND OMEGA .. CYRILLIC SMALL LETTER ROUND OMEGA (16#47D#, 16#47D#, -1), -- CYRILLIC SMALL LETTER OMEGA WITH TITLO .. CYRILLIC SMALL LETTER OMEGA WITH TITLO (16#47F#, 16#47F#, -1), -- CYRILLIC SMALL LETTER OT .. CYRILLIC SMALL LETTER OT (16#481#, 16#481#, -1), -- CYRILLIC SMALL LETTER KOPPA .. CYRILLIC SMALL LETTER KOPPA (16#48B#, 16#48B#, -1), -- CYRILLIC SMALL LETTER SHORT I WITH TAIL .. 
CYRILLIC SMALL LETTER SHORT I WITH TAIL (16#48D#, 16#48D#, -1), -- CYRILLIC SMALL LETTER SEMISOFT SIGN .. CYRILLIC SMALL LETTER SEMISOFT SIGN (16#48F#, 16#48F#, -1), -- CYRILLIC SMALL LETTER ER WITH TICK .. CYRILLIC SMALL LETTER ER WITH TICK (16#491#, 16#491#, -1), -- CYRILLIC SMALL LETTER GHE WITH UPTURN .. CYRILLIC SMALL LETTER GHE WITH UPTURN (16#493#, 16#493#, -1), -- CYRILLIC SMALL LETTER GHE WITH STROKE .. CYRILLIC SMALL LETTER GHE WITH STROKE (16#495#, 16#495#, -1), -- CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK (16#497#, 16#497#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DESCENDER .. CYRILLIC SMALL LETTER ZHE WITH DESCENDER (16#499#, 16#499#, -1), -- CYRILLIC SMALL LETTER ZE WITH DESCENDER .. CYRILLIC SMALL LETTER ZE WITH DESCENDER (16#49B#, 16#49B#, -1), -- CYRILLIC SMALL LETTER KA WITH DESCENDER .. CYRILLIC SMALL LETTER KA WITH DESCENDER (16#49D#, 16#49D#, -1), -- CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE (16#49F#, 16#49F#, -1), -- CYRILLIC SMALL LETTER KA WITH STROKE .. CYRILLIC SMALL LETTER KA WITH STROKE (16#4A1#, 16#4A1#, -1), -- CYRILLIC SMALL LETTER BASHKIR KA .. CYRILLIC SMALL LETTER BASHKIR KA (16#4A3#, 16#4A3#, -1), -- CYRILLIC SMALL LETTER EN WITH DESCENDER .. CYRILLIC SMALL LETTER EN WITH DESCENDER (16#4A5#, 16#4A5#, -1), -- CYRILLIC SMALL LIGATURE EN GHE .. CYRILLIC SMALL LIGATURE EN GHE (16#4A7#, 16#4A7#, -1), -- CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK (16#4A9#, 16#4A9#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN HA .. CYRILLIC SMALL LETTER ABKHASIAN HA (16#4AB#, 16#4AB#, -1), -- CYRILLIC SMALL LETTER ES WITH DESCENDER .. CYRILLIC SMALL LETTER ES WITH DESCENDER (16#4AD#, 16#4AD#, -1), -- CYRILLIC SMALL LETTER TE WITH DESCENDER .. CYRILLIC SMALL LETTER TE WITH DESCENDER (16#4AF#, 16#4AF#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U .. 
CYRILLIC SMALL LETTER STRAIGHT U (16#4B1#, 16#4B1#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE .. CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE (16#4B3#, 16#4B3#, -1), -- CYRILLIC SMALL LETTER HA WITH DESCENDER .. CYRILLIC SMALL LETTER HA WITH DESCENDER (16#4B5#, 16#4B5#, -1), -- CYRILLIC SMALL LIGATURE TE TSE .. CYRILLIC SMALL LIGATURE TE TSE (16#4B7#, 16#4B7#, -1), -- CYRILLIC SMALL LETTER CHE WITH DESCENDER .. CYRILLIC SMALL LETTER CHE WITH DESCENDER (16#4B9#, 16#4B9#, -1), -- CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE (16#4BB#, 16#4BB#, -1), -- CYRILLIC SMALL LETTER SHHA .. CYRILLIC SMALL LETTER SHHA (16#4BD#, 16#4BD#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE .. CYRILLIC SMALL LETTER ABKHASIAN CHE (16#4BF#, 16#4BF#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER .. CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER (16#4C2#, 16#4C2#, -1), -- CYRILLIC SMALL LETTER ZHE WITH BREVE .. CYRILLIC SMALL LETTER ZHE WITH BREVE (16#4C4#, 16#4C4#, -1), -- CYRILLIC SMALL LETTER KA WITH HOOK .. CYRILLIC SMALL LETTER KA WITH HOOK (16#4C6#, 16#4C6#, -1), -- CYRILLIC SMALL LETTER EL WITH TAIL .. CYRILLIC SMALL LETTER EL WITH TAIL (16#4C8#, 16#4C8#, -1), -- CYRILLIC SMALL LETTER EN WITH HOOK .. CYRILLIC SMALL LETTER EN WITH HOOK (16#4CA#, 16#4CA#, -1), -- CYRILLIC SMALL LETTER EN WITH TAIL .. CYRILLIC SMALL LETTER EN WITH TAIL (16#4CC#, 16#4CC#, -1), -- CYRILLIC SMALL LETTER KHAKASSIAN CHE .. CYRILLIC SMALL LETTER KHAKASSIAN CHE (16#4CE#, 16#4CE#, -1), -- CYRILLIC SMALL LETTER EM WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL (16#4D1#, 16#4D1#, -1), -- CYRILLIC SMALL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER A WITH BREVE (16#4D3#, 16#4D3#, -1), -- CYRILLIC SMALL LETTER A WITH DIAERESIS .. CYRILLIC SMALL LETTER A WITH DIAERESIS (16#4D5#, 16#4D5#, -1), -- CYRILLIC SMALL LIGATURE A IE .. CYRILLIC SMALL LIGATURE A IE (16#4D7#, 16#4D7#, -1), -- CYRILLIC SMALL LETTER IE WITH BREVE .. 
CYRILLIC SMALL LETTER IE WITH BREVE (16#4D9#, 16#4D9#, -1), -- CYRILLIC SMALL LETTER SCHWA .. CYRILLIC SMALL LETTER SCHWA (16#4DB#, 16#4DB#, -1), -- CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS .. CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS (16#4DD#, 16#4DD#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZHE WITH DIAERESIS (16#4DF#, 16#4DF#, -1), -- CYRILLIC SMALL LETTER ZE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZE WITH DIAERESIS (16#4E1#, 16#4E1#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN DZE .. CYRILLIC SMALL LETTER ABKHASIAN DZE (16#4E3#, 16#4E3#, -1), -- CYRILLIC SMALL LETTER I WITH MACRON .. CYRILLIC SMALL LETTER I WITH MACRON (16#4E5#, 16#4E5#, -1), -- CYRILLIC SMALL LETTER I WITH DIAERESIS .. CYRILLIC SMALL LETTER I WITH DIAERESIS (16#4E7#, 16#4E7#, -1), -- CYRILLIC SMALL LETTER O WITH DIAERESIS .. CYRILLIC SMALL LETTER O WITH DIAERESIS (16#4E9#, 16#4E9#, -1), -- CYRILLIC SMALL LETTER BARRED O .. CYRILLIC SMALL LETTER BARRED O (16#4EB#, 16#4EB#, -1), -- CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS .. CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS (16#4ED#, 16#4ED#, -1), -- CYRILLIC SMALL LETTER E WITH DIAERESIS .. CYRILLIC SMALL LETTER E WITH DIAERESIS (16#4EF#, 16#4EF#, -1), -- CYRILLIC SMALL LETTER U WITH MACRON .. CYRILLIC SMALL LETTER U WITH MACRON (16#4F1#, 16#4F1#, -1), -- CYRILLIC SMALL LETTER U WITH DIAERESIS .. CYRILLIC SMALL LETTER U WITH DIAERESIS (16#4F3#, 16#4F3#, -1), -- CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE .. CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE (16#4F5#, 16#4F5#, -1), -- CYRILLIC SMALL LETTER CHE WITH DIAERESIS .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS (16#4F9#, 16#4F9#, -1), -- CYRILLIC SMALL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS (16#501#, 16#501#, -1), -- CYRILLIC SMALL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI DE (16#503#, 16#503#, -1), -- CYRILLIC SMALL LETTER KOMI DJE .. CYRILLIC SMALL LETTER KOMI DJE (16#505#, 16#505#, -1), -- CYRILLIC SMALL LETTER KOMI ZJE .. 
CYRILLIC SMALL LETTER KOMI ZJE (16#507#, 16#507#, -1), -- CYRILLIC SMALL LETTER KOMI DZJE .. CYRILLIC SMALL LETTER KOMI DZJE (16#509#, 16#509#, -1), -- CYRILLIC SMALL LETTER KOMI LJE .. CYRILLIC SMALL LETTER KOMI LJE (16#50B#, 16#50B#, -1), -- CYRILLIC SMALL LETTER KOMI NJE .. CYRILLIC SMALL LETTER KOMI NJE (16#50D#, 16#50D#, -1), -- CYRILLIC SMALL LETTER KOMI SJE .. CYRILLIC SMALL LETTER KOMI SJE (16#50F#, 16#50F#, -1), -- CYRILLIC SMALL LETTER KOMI TJE .. CYRILLIC SMALL LETTER KOMI TJE (16#561#, 16#586#, -48), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LETTER FEH (16#1E01#, 16#1E01#, -1), -- LATIN SMALL LETTER A WITH RING BELOW .. LATIN SMALL LETTER A WITH RING BELOW (16#1E03#, 16#1E03#, -1), -- LATIN SMALL LETTER B WITH DOT ABOVE .. LATIN SMALL LETTER B WITH DOT ABOVE (16#1E05#, 16#1E05#, -1), -- LATIN SMALL LETTER B WITH DOT BELOW .. LATIN SMALL LETTER B WITH DOT BELOW (16#1E07#, 16#1E07#, -1), -- LATIN SMALL LETTER B WITH LINE BELOW .. LATIN SMALL LETTER B WITH LINE BELOW (16#1E09#, 16#1E09#, -1), -- LATIN SMALL LETTER C WITH CEDILLA AND ACUTE .. LATIN SMALL LETTER C WITH CEDILLA AND ACUTE (16#1E0B#, 16#1E0B#, -1), -- LATIN SMALL LETTER D WITH DOT ABOVE .. LATIN SMALL LETTER D WITH DOT ABOVE (16#1E0D#, 16#1E0D#, -1), -- LATIN SMALL LETTER D WITH DOT BELOW .. LATIN SMALL LETTER D WITH DOT BELOW (16#1E0F#, 16#1E0F#, -1), -- LATIN SMALL LETTER D WITH LINE BELOW .. LATIN SMALL LETTER D WITH LINE BELOW (16#1E11#, 16#1E11#, -1), -- LATIN SMALL LETTER D WITH CEDILLA .. LATIN SMALL LETTER D WITH CEDILLA (16#1E13#, 16#1E13#, -1), -- LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW (16#1E15#, 16#1E15#, -1), -- LATIN SMALL LETTER E WITH MACRON AND GRAVE .. LATIN SMALL LETTER E WITH MACRON AND GRAVE (16#1E17#, 16#1E17#, -1), -- LATIN SMALL LETTER E WITH MACRON AND ACUTE .. LATIN SMALL LETTER E WITH MACRON AND ACUTE (16#1E19#, 16#1E19#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW .. 
LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW (16#1E1B#, 16#1E1B#, -1), -- LATIN SMALL LETTER E WITH TILDE BELOW .. LATIN SMALL LETTER E WITH TILDE BELOW (16#1E1D#, 16#1E1D#, -1), -- LATIN SMALL LETTER E WITH CEDILLA AND BREVE .. LATIN SMALL LETTER E WITH CEDILLA AND BREVE (16#1E1F#, 16#1E1F#, -1), -- LATIN SMALL LETTER F WITH DOT ABOVE .. LATIN SMALL LETTER F WITH DOT ABOVE (16#1E21#, 16#1E21#, -1), -- LATIN SMALL LETTER G WITH MACRON .. LATIN SMALL LETTER G WITH MACRON (16#1E23#, 16#1E23#, -1), -- LATIN SMALL LETTER H WITH DOT ABOVE .. LATIN SMALL LETTER H WITH DOT ABOVE (16#1E25#, 16#1E25#, -1), -- LATIN SMALL LETTER H WITH DOT BELOW .. LATIN SMALL LETTER H WITH DOT BELOW (16#1E27#, 16#1E27#, -1), -- LATIN SMALL LETTER H WITH DIAERESIS .. LATIN SMALL LETTER H WITH DIAERESIS (16#1E29#, 16#1E29#, -1), -- LATIN SMALL LETTER H WITH CEDILLA .. LATIN SMALL LETTER H WITH CEDILLA (16#1E2B#, 16#1E2B#, -1), -- LATIN SMALL LETTER H WITH BREVE BELOW .. LATIN SMALL LETTER H WITH BREVE BELOW (16#1E2D#, 16#1E2D#, -1), -- LATIN SMALL LETTER I WITH TILDE BELOW .. LATIN SMALL LETTER I WITH TILDE BELOW (16#1E2F#, 16#1E2F#, -1), -- LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE (16#1E31#, 16#1E31#, -1), -- LATIN SMALL LETTER K WITH ACUTE .. LATIN SMALL LETTER K WITH ACUTE (16#1E33#, 16#1E33#, -1), -- LATIN SMALL LETTER K WITH DOT BELOW .. LATIN SMALL LETTER K WITH DOT BELOW (16#1E35#, 16#1E35#, -1), -- LATIN SMALL LETTER K WITH LINE BELOW .. LATIN SMALL LETTER K WITH LINE BELOW (16#1E37#, 16#1E37#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW .. LATIN SMALL LETTER L WITH DOT BELOW (16#1E39#, 16#1E39#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER L WITH DOT BELOW AND MACRON (16#1E3B#, 16#1E3B#, -1), -- LATIN SMALL LETTER L WITH LINE BELOW .. LATIN SMALL LETTER L WITH LINE BELOW (16#1E3D#, 16#1E3D#, -1), -- LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW .. 
LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW (16#1E3F#, 16#1E3F#, -1), -- LATIN SMALL LETTER M WITH ACUTE .. LATIN SMALL LETTER M WITH ACUTE (16#1E41#, 16#1E41#, -1), -- LATIN SMALL LETTER M WITH DOT ABOVE .. LATIN SMALL LETTER M WITH DOT ABOVE (16#1E43#, 16#1E43#, -1), -- LATIN SMALL LETTER M WITH DOT BELOW .. LATIN SMALL LETTER M WITH DOT BELOW (16#1E45#, 16#1E45#, -1), -- LATIN SMALL LETTER N WITH DOT ABOVE .. LATIN SMALL LETTER N WITH DOT ABOVE (16#1E47#, 16#1E47#, -1), -- LATIN SMALL LETTER N WITH DOT BELOW .. LATIN SMALL LETTER N WITH DOT BELOW (16#1E49#, 16#1E49#, -1), -- LATIN SMALL LETTER N WITH LINE BELOW .. LATIN SMALL LETTER N WITH LINE BELOW (16#1E4B#, 16#1E4B#, -1), -- LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW (16#1E4D#, 16#1E4D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND ACUTE .. LATIN SMALL LETTER O WITH TILDE AND ACUTE (16#1E4F#, 16#1E4F#, -1), -- LATIN SMALL LETTER O WITH TILDE AND DIAERESIS .. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS (16#1E51#, 16#1E51#, -1), -- LATIN SMALL LETTER O WITH MACRON AND GRAVE .. LATIN SMALL LETTER O WITH MACRON AND GRAVE (16#1E53#, 16#1E53#, -1), -- LATIN SMALL LETTER O WITH MACRON AND ACUTE .. LATIN SMALL LETTER O WITH MACRON AND ACUTE (16#1E55#, 16#1E55#, -1), -- LATIN SMALL LETTER P WITH ACUTE .. LATIN SMALL LETTER P WITH ACUTE (16#1E57#, 16#1E57#, -1), -- LATIN SMALL LETTER P WITH DOT ABOVE .. LATIN SMALL LETTER P WITH DOT ABOVE (16#1E59#, 16#1E59#, -1), -- LATIN SMALL LETTER R WITH DOT ABOVE .. LATIN SMALL LETTER R WITH DOT ABOVE (16#1E5B#, 16#1E5B#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW .. LATIN SMALL LETTER R WITH DOT BELOW (16#1E5D#, 16#1E5D#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER R WITH DOT BELOW AND MACRON (16#1E5F#, 16#1E5F#, -1), -- LATIN SMALL LETTER R WITH LINE BELOW .. LATIN SMALL LETTER R WITH LINE BELOW (16#1E61#, 16#1E61#, -1), -- LATIN SMALL LETTER S WITH DOT ABOVE .. 
LATIN SMALL LETTER S WITH DOT ABOVE (16#1E63#, 16#1E63#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW .. LATIN SMALL LETTER S WITH DOT BELOW (16#1E65#, 16#1E65#, -1), -- LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE .. LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE (16#1E67#, 16#1E67#, -1), -- LATIN SMALL LETTER S WITH CARON AND DOT ABOVE .. LATIN SMALL LETTER S WITH CARON AND DOT ABOVE (16#1E69#, 16#1E69#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE .. LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE (16#1E6B#, 16#1E6B#, -1), -- LATIN SMALL LETTER T WITH DOT ABOVE .. LATIN SMALL LETTER T WITH DOT ABOVE (16#1E6D#, 16#1E6D#, -1), -- LATIN SMALL LETTER T WITH DOT BELOW .. LATIN SMALL LETTER T WITH DOT BELOW (16#1E6F#, 16#1E6F#, -1), -- LATIN SMALL LETTER T WITH LINE BELOW .. LATIN SMALL LETTER T WITH LINE BELOW (16#1E71#, 16#1E71#, -1), -- LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW (16#1E73#, 16#1E73#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS BELOW .. LATIN SMALL LETTER U WITH DIAERESIS BELOW (16#1E75#, 16#1E75#, -1), -- LATIN SMALL LETTER U WITH TILDE BELOW .. LATIN SMALL LETTER U WITH TILDE BELOW (16#1E77#, 16#1E77#, -1), -- LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW (16#1E79#, 16#1E79#, -1), -- LATIN SMALL LETTER U WITH TILDE AND ACUTE .. LATIN SMALL LETTER U WITH TILDE AND ACUTE (16#1E7B#, 16#1E7B#, -1), -- LATIN SMALL LETTER U WITH MACRON AND DIAERESIS .. LATIN SMALL LETTER U WITH MACRON AND DIAERESIS (16#1E7D#, 16#1E7D#, -1), -- LATIN SMALL LETTER V WITH TILDE .. LATIN SMALL LETTER V WITH TILDE (16#1E7F#, 16#1E7F#, -1), -- LATIN SMALL LETTER V WITH DOT BELOW .. LATIN SMALL LETTER V WITH DOT BELOW (16#1E81#, 16#1E81#, -1), -- LATIN SMALL LETTER W WITH GRAVE .. LATIN SMALL LETTER W WITH GRAVE (16#1E83#, 16#1E83#, -1), -- LATIN SMALL LETTER W WITH ACUTE .. LATIN SMALL LETTER W WITH ACUTE (16#1E85#, 16#1E85#, -1), -- LATIN SMALL LETTER W WITH DIAERESIS .. 
LATIN SMALL LETTER W WITH DIAERESIS (16#1E87#, 16#1E87#, -1), -- LATIN SMALL LETTER W WITH DOT ABOVE .. LATIN SMALL LETTER W WITH DOT ABOVE (16#1E89#, 16#1E89#, -1), -- LATIN SMALL LETTER W WITH DOT BELOW .. LATIN SMALL LETTER W WITH DOT BELOW (16#1E8B#, 16#1E8B#, -1), -- LATIN SMALL LETTER X WITH DOT ABOVE .. LATIN SMALL LETTER X WITH DOT ABOVE (16#1E8D#, 16#1E8D#, -1), -- LATIN SMALL LETTER X WITH DIAERESIS .. LATIN SMALL LETTER X WITH DIAERESIS (16#1E8F#, 16#1E8F#, -1), -- LATIN SMALL LETTER Y WITH DOT ABOVE .. LATIN SMALL LETTER Y WITH DOT ABOVE (16#1E91#, 16#1E91#, -1), -- LATIN SMALL LETTER Z WITH CIRCUMFLEX .. LATIN SMALL LETTER Z WITH CIRCUMFLEX (16#1E93#, 16#1E93#, -1), -- LATIN SMALL LETTER Z WITH DOT BELOW .. LATIN SMALL LETTER Z WITH DOT BELOW (16#1E95#, 16#1E95#, -1), -- LATIN SMALL LETTER Z WITH LINE BELOW .. LATIN SMALL LETTER Z WITH LINE BELOW (16#1E9B#, 16#1E9B#, -59), -- LATIN SMALL LETTER LONG S WITH DOT ABOVE .. LATIN SMALL LETTER LONG S WITH DOT ABOVE (16#1EA1#, 16#1EA1#, -1), -- LATIN SMALL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER A WITH DOT BELOW (16#1EA3#, 16#1EA3#, -1), -- LATIN SMALL LETTER A WITH HOOK ABOVE .. LATIN SMALL LETTER A WITH HOOK ABOVE (16#1EA5#, 16#1EA5#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE (16#1EA7#, 16#1EA7#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE (16#1EA9#, 16#1EA9#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE (16#1EAB#, 16#1EAB#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE (16#1EAD#, 16#1EAD#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW (16#1EAF#, 16#1EAF#, -1), -- LATIN SMALL LETTER A WITH BREVE AND ACUTE .. 
LATIN SMALL LETTER A WITH BREVE AND ACUTE (16#1EB1#, 16#1EB1#, -1), -- LATIN SMALL LETTER A WITH BREVE AND GRAVE .. LATIN SMALL LETTER A WITH BREVE AND GRAVE (16#1EB3#, 16#1EB3#, -1), -- LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE .. LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE (16#1EB5#, 16#1EB5#, -1), -- LATIN SMALL LETTER A WITH BREVE AND TILDE .. LATIN SMALL LETTER A WITH BREVE AND TILDE (16#1EB7#, 16#1EB7#, -1), -- LATIN SMALL LETTER A WITH BREVE AND DOT BELOW .. LATIN SMALL LETTER A WITH BREVE AND DOT BELOW (16#1EB9#, 16#1EB9#, -1), -- LATIN SMALL LETTER E WITH DOT BELOW .. LATIN SMALL LETTER E WITH DOT BELOW (16#1EBB#, 16#1EBB#, -1), -- LATIN SMALL LETTER E WITH HOOK ABOVE .. LATIN SMALL LETTER E WITH HOOK ABOVE (16#1EBD#, 16#1EBD#, -1), -- LATIN SMALL LETTER E WITH TILDE .. LATIN SMALL LETTER E WITH TILDE (16#1EBF#, 16#1EBF#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (16#1EC1#, 16#1EC1#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE (16#1EC3#, 16#1EC3#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE (16#1EC5#, 16#1EC5#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE (16#1EC7#, 16#1EC7#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW (16#1EC9#, 16#1EC9#, -1), -- LATIN SMALL LETTER I WITH HOOK ABOVE .. LATIN SMALL LETTER I WITH HOOK ABOVE (16#1ECB#, 16#1ECB#, -1), -- LATIN SMALL LETTER I WITH DOT BELOW .. LATIN SMALL LETTER I WITH DOT BELOW (16#1ECD#, 16#1ECD#, -1), -- LATIN SMALL LETTER O WITH DOT BELOW .. LATIN SMALL LETTER O WITH DOT BELOW (16#1ECF#, 16#1ECF#, -1), -- LATIN SMALL LETTER O WITH HOOK ABOVE .. LATIN SMALL LETTER O WITH HOOK ABOVE (16#1ED1#, 16#1ED1#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE .. 
LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE (16#1ED3#, 16#1ED3#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE (16#1ED5#, 16#1ED5#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE (16#1ED7#, 16#1ED7#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE (16#1ED9#, 16#1ED9#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW (16#1EDB#, 16#1EDB#, -1), -- LATIN SMALL LETTER O WITH HORN AND ACUTE .. LATIN SMALL LETTER O WITH HORN AND ACUTE (16#1EDD#, 16#1EDD#, -1), -- LATIN SMALL LETTER O WITH HORN AND GRAVE .. LATIN SMALL LETTER O WITH HORN AND GRAVE (16#1EDF#, 16#1EDF#, -1), -- LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE (16#1EE1#, 16#1EE1#, -1), -- LATIN SMALL LETTER O WITH HORN AND TILDE .. LATIN SMALL LETTER O WITH HORN AND TILDE (16#1EE3#, 16#1EE3#, -1), -- LATIN SMALL LETTER O WITH HORN AND DOT BELOW .. LATIN SMALL LETTER O WITH HORN AND DOT BELOW (16#1EE5#, 16#1EE5#, -1), -- LATIN SMALL LETTER U WITH DOT BELOW .. LATIN SMALL LETTER U WITH DOT BELOW (16#1EE7#, 16#1EE7#, -1), -- LATIN SMALL LETTER U WITH HOOK ABOVE .. LATIN SMALL LETTER U WITH HOOK ABOVE (16#1EE9#, 16#1EE9#, -1), -- LATIN SMALL LETTER U WITH HORN AND ACUTE .. LATIN SMALL LETTER U WITH HORN AND ACUTE (16#1EEB#, 16#1EEB#, -1), -- LATIN SMALL LETTER U WITH HORN AND GRAVE .. LATIN SMALL LETTER U WITH HORN AND GRAVE (16#1EED#, 16#1EED#, -1), -- LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE (16#1EEF#, 16#1EEF#, -1), -- LATIN SMALL LETTER U WITH HORN AND TILDE .. LATIN SMALL LETTER U WITH HORN AND TILDE (16#1EF1#, 16#1EF1#, -1), -- LATIN SMALL LETTER U WITH HORN AND DOT BELOW .. 
LATIN SMALL LETTER U WITH HORN AND DOT BELOW (16#1EF3#, 16#1EF3#, -1), -- LATIN SMALL LETTER Y WITH GRAVE .. LATIN SMALL LETTER Y WITH GRAVE (16#1EF5#, 16#1EF5#, -1), -- LATIN SMALL LETTER Y WITH DOT BELOW .. LATIN SMALL LETTER Y WITH DOT BELOW (16#1EF7#, 16#1EF7#, -1), -- LATIN SMALL LETTER Y WITH HOOK ABOVE .. LATIN SMALL LETTER Y WITH HOOK ABOVE (16#1EF9#, 16#1EF9#, -1), -- LATIN SMALL LETTER Y WITH TILDE .. LATIN SMALL LETTER Y WITH TILDE (16#1F00#, 16#1F07#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI (16#1F10#, 16#1F15#, 8), -- GREEK SMALL LETTER EPSILON WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA (16#1F20#, 16#1F27#, 8), -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI (16#1F30#, 16#1F37#, 8), -- GREEK SMALL LETTER IOTA WITH PSILI .. GREEK SMALL LETTER IOTA WITH DASIA AND PERISPOMENI (16#1F40#, 16#1F45#, 8), -- GREEK SMALL LETTER OMICRON WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA (16#1F51#, 16#1F51#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA .. GREEK SMALL LETTER UPSILON WITH DASIA (16#1F53#, 16#1F53#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA (16#1F55#, 16#1F55#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA (16#1F57#, 16#1F57#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI (16#1F60#, 16#1F67#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI (16#1F70#, 16#1F71#, 74), -- GREEK SMALL LETTER ALPHA WITH VARIA .. GREEK SMALL LETTER ALPHA WITH OXIA (16#1F72#, 16#1F75#, 86), -- GREEK SMALL LETTER EPSILON WITH VARIA .. GREEK SMALL LETTER ETA WITH OXIA (16#1F76#, 16#1F77#, 100), -- GREEK SMALL LETTER IOTA WITH VARIA .. 
GREEK SMALL LETTER IOTA WITH OXIA (16#1F78#, 16#1F79#, 128), -- GREEK SMALL LETTER OMICRON WITH VARIA .. GREEK SMALL LETTER OMICRON WITH OXIA (16#1F7A#, 16#1F7B#, 112), -- GREEK SMALL LETTER UPSILON WITH VARIA .. GREEK SMALL LETTER UPSILON WITH OXIA (16#1F7C#, 16#1F7D#, 126), -- GREEK SMALL LETTER OMEGA WITH VARIA .. GREEK SMALL LETTER OMEGA WITH OXIA (16#1F80#, 16#1F87#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI (16#1F90#, 16#1F97#, 8), -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI (16#1FA0#, 16#1FA7#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI (16#1FB0#, 16#1FB1#, 8), -- GREEK SMALL LETTER ALPHA WITH VRACHY .. GREEK SMALL LETTER ALPHA WITH MACRON (16#1FB3#, 16#1FB3#, 9), -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI (16#1FBE#, 16#1FBE#, -7205), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI (16#1FC3#, 16#1FC3#, 9), -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI (16#1FD0#, 16#1FD1#, 8), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH MACRON (16#1FE0#, 16#1FE1#, 8), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK SMALL LETTER UPSILON WITH MACRON (16#1FE5#, 16#1FE5#, 7), -- GREEK SMALL LETTER RHO WITH DASIA .. GREEK SMALL LETTER RHO WITH DASIA (16#1FF3#, 16#1FF3#, 9), -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI (16#FF41#, 16#FF5A#, -32), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z (16#10428#, 16#1044D#, -40) -- DESERET SMALL LETTER LONG I .. DESERET SMALL LETTER ENG ); ************************************************************* From: Randy Brukardt Sent: Wednesday, November 27, 2002 11:01 AM Thanks for doing this. 
Where are you finding the information that you are using to do this? A quick search of the net didn't turn up anything machine-readable...

*************************************************************

From: Michael F. Yoder
Sent: Wednesday, November 27, 2002 12:20 PM

The root link is www.unicode.org and the "latest version" link goes to http://www.unicode.org/unicode/reports/tr28/ . The "this version" link at the top goes to a page with some relevant stuff. The page with the machine-readable files for V3.2 is: http://www.unicode.org/Public/UNIDATA/ . The current organization seems to be harder to navigate than it used to be; I'm unsure why.

N.B. version 3.2 of Unicode claims to be "fully synchronized" with ISO 10646, so it is strongly preferable to earlier versions.

*************************************************************

From: Robert I. Eachus
Sent: Tuesday, March 18, 2003 12:56 AM

I hate to reopen the character set can of worms, but I think we need to do it. In effect Latin 1 is being replaced by Latin 9 (ISO 8859-15). Latin 9 adds the Euro sign, OE Ligatures, and S and Z with caron, and capital Y with diaeresis to Latin 1, removing the currency symbol, broken bar, some accents and the vulgar fractions. See http://www.cs.tut.fi/~jkorpela/latin9.html for a fuller explanation.

Latin 9 is slowly being adopted. Of course some countries in the Euro zone are already using a "localized" version of Latin 1 with the currency sign representation looking suspiciously like a Euro symbol. So we could decide to leave this issue to Ada-1Z or whatever. However, I think that at the least we should add a Latin9 package to Ada with the correct character names.

What else should or could be done? One possibility would be to redefine Ada.Characters.Handling to correctly treat seven new codes as lower or upper case characters. I would much prefer to go for the whole nine yards so we never need to do this again.
Add an enumeration type Sets to Ada.Characters, or if you prefer Character_Sets. It should enumerate all the ISO 8859 character sets. (If you want to be clever, we could start with ISO646 so that Sets'Pos(N) = ISO 8859-N.) In any case we should allow implementations to extend the type. This would allow both for new ISO 8859 character sets, and for Unicode, EBCDIC, IBM code pages, and so on.

Now add procedure Set_Default_Character_Set, and function Current_Character_Set to Ada.Characters.Handling. (Or if you prefer to Ada.Characters.) As far as I am concerned the only required behavior for Set_Default_Character_Set should be to accept an argument of Latin_1.

It is probably a day or two of work to modify the functions in Ada.Characters.Handling to support all the current ISO 8859 mappings. It is at least ten times harder to actually test all possible combinations of character set and Ada.Characters.Handling functions. It could be another five to ten times that to add tests to the validation suite, with very little practical effect. This is why I favor a minimalist approach to the requirements. (National bodies can of course require supporting other values for Character_Sets. For example, the Japanese national body could require Shift-JIS support if they felt like it, without requiring that compilers that comply with the Japanese national standard be incompatible with ISO 8652, and without the ARG spending all of its time on character set issues.)

What about names in Ada programs? ARGH! If your compiler is written in Ada and uses Ada.Characters.Handling, modifying the compiler is not a problem. Defining what it means to compile a program written using a non-Latin-1 character set threatens to expand clause 2 (Lexical Elements) to the size of a small telephone directory.
I would prefer to just modify 2.1 to direct people to ISO 10646-1, which is the size of a large telephone directory, plus currently five amendments, for the meaning of lexical elements in non-Latin_1 source representations, and let national bodies decide what they want to define locally.

*************************************************************

From: Pascal Leroy
Sent: Tuesday, March 18, 2003 2:15 AM

> I hate to reopen the character set can of worms, but I think we need to
> do it. In effect Latin 1 is being replaced by Latin 9 (ISO 8859-15).
> Latin 9 adds the Euro sign, OE Ligatures, and S and Z with caron, and
> capital Y with diaeresis to Latin 1, removing the currency symbol, broken
> bar, some accents and the vulgar fractions. See
> http://www.cs.tut.fi/~jkorpela/latin9.html for a fuller explanation.

This issue was discussed at some length as part of AI 285/01 (of which I am the editor). It is clear that adding support for Latin-9 in Ada.Characters (and children) is relatively straightforward. However there is the much nastier question of type Standard.Character, (which has pretty much to remain Latin-1 if you don't want to introduce awful incompatibilities) and of the interactions between what happens at compile-time and what happens at run-time.

Consider for instance the call:

   Ada.Characters.Latin_9.Handling.Is_Letter ('¦')

It has pretty much to return True (that's an S-caron in Latin-9), but that's certainly surprising! This amounts to breaking the Character abstraction and interpreting characters as bytes/code points, which is likely to lead to confusion in an Ada program that would deal with character sets having different encodings.

Another interesting example is mentioned in the minutes of the Bedford meeting (http://www.ada-auth.org/ai-files/minutes/min-0210.html#AI285):

"Consider the enumeration identifier "ÿ" (latin small letter y diaeresis).
E'Image(ÿ) = "ÿ" in Latin-1 (there is no upper case version), but "Ÿ" in Latin-9 (there is an upper case version). So we would need the identifier semantics to be changed depending on the character set. Pascal claims that this is important to reading French."

After giving it more thought, I have come to the conclusion that the entire Latin-9 approach is misguided because:

1 - There is relatively little support in software out there for this encoding (heck, I am even reading that some mail gateways bounce back messages that use Latin-9 as their character encoding). Most of the editors that I have played with just go to Unicode when you type the Euro sign. That provides support for this new character without causing endless compatibility nightmares.

2 - I have gone through a similar "code point shuffle" mess at the beginning of the 80s: at the time we only had 7 bits per character (as you probably remember, the 8th bit was often used for parity) and some genius had invented to encode the French accented characters using the code points normally assigned to [, ], \, and the like. I have written thousands of lines of Pascal where an array indexing looked like Arr‡IŠ (instead of Arr[I]) just because of this silliness. What was painful-but-tolerable 20 years ago is just not going to fly nowadays: I am ready to bet that the world will go Unicode before it goes Latin-9.

Therefore, the latest version of AI 285 proposes to go to Unicode for the text representation of programs, relying on the categorization work done by the Unicode people so that we don't have to argue endlessly about which characters can appear in identifiers, etc. And it entirely ignores Latin-9, or any other Latin-N for that matter.

*************************************************************

From: Robert I. Eachus
Sent: Tuesday, March 18, 2003 2:15 AM

Pascal Leroy wrote:

> This issue was discussed at some length as part of AI 285/01 (of which I am
> the editor).
> It is clear that adding support for Latin-9 in Ada.Characters
> (and children) is relatively straightforward. However there is the much
> nastier question of type Standard.Character, (which has pretty much to
> remain Latin-1 if you don't want to introduce awful incompatibilities) and
> of the interactions between what happens at compile-time and what happens at
> run-time.

I thought we had an AI on the subject, but searching for Latin in the title didn't find it. I see what happened is that the name of the AI was changed. (I don't want to make work for Randy, and this may be a rare occurrence or may not be. Perhaps a set of links to "old" names somewhere.) So I guess that the title of the original post is correct, because as I see it, the issue of Latin 9 support is completely separate from the issues with 16 and 32 bit character sets.

Now to pull some magic by quoting from rev 1.4 of AI 285:

An implementation is allowed to provide a library package named Ada.Characters.Latin_9. This package shall be identical to Ada.Characters.Latin_1, except for the following differences:

- It doesn't declare the constants Currency_Sign, Broken_Bar, Diaeresis, Acute, Cedilla, Fraction_One_Quarter, Fraction_One_Half, and Fraction_Three_Quarters.

- It declares the following constants:

   Euro_Sign       : constant Character := '€'; -- Character'Val (164)
   UC_S_Caron      : constant Character := 'Š'; -- Character'Val (166)
   LC_S_Caron      : constant Character := 'š'; -- Character'Val (168)
   UC_Z_Caron      : constant Character := 'Ž'; -- Character'Val (180)
   LC_Z_Caron      : constant Character := 'ž'; -- Character'Val (184)
   UC_OE_Diphthong : constant Character := 'Œ'; -- Character'Val (188)
   LC_OE_Diphthong : constant Character := 'œ'; -- Character'Val (189)
   UC_Y_Diaeresis  : constant Character := 'Ÿ'; -- Character'Val (190)

In Netscape 7.01, with the encoding set to Latin-1, this displays (correctly) the Latin 9 representations! As does OpenOffice.org, Notepad and so on.
Now let me abstract from Ada.Characters.Latin_1:

   Currency_Sign           : constant Character := '¤'; --Character'Val(164)
   Broken_Bar              : constant Character := '¦'; --Character'Val(166)
   Diaeresis               : constant Character := '¨'; --Character'Val(168)
   Acute                   : constant Character := '´'; --Character'Val(180)
   Cedilla                 : constant Character := '¸'; --Character'Val(184)
   Fraction_One_Quarter    : constant Character := '¼'; --Character'Val(188)
   Fraction_One_Half       : constant Character := '½'; --Character'Val(189)
   Fraction_Three_Quarters : constant Character := '¾'; --Character'Val(190)

How can this work? Easy, other standards, in particular ISO/IEC 2022, http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=22747 specify control characters and escape sequences which can be used with Latin 1--or any ISO 8859 character set--to access characters from other sets.

> Consider for instance the call:
>
> Ada.Characters.Latin_9.Handling.Is_Letter ('¦')
>
> It has pretty much to return True (that's an S-caron in Latin-9), but that's
> certainly surprising! This amounts to breaking the Character abstraction
> and interpreting characters as bytes/code points, which is likely to lead to
> confusion in an Ada program that would deal with character sets having
> different encodings.

Why would you expect this call to work? You could argue that a compiler "should" raise Program_Error or Constraint_Error, but I would expect any reasonable compiler to object at compile time to an invalid character literal. Remember, notationally Ada is written in Unicode/ISO 10646 BMP, however it is represented. In context, that call is illegal and Ada.Characters.Latin_9.Handling.Is_Letter ('Š') is legal and should return true. (Assuming we recommend having or allowing a package Ada.Characters.Latin_9.Handling.)

But this discussion has done a lot to convince me that the best solution is to add a function Ada.Characters.Current_Set to Ada.Characters.
The required work is trivial for compilers that want to stay in the Latin 1 only world, and for compilers that do want to implement support for other 8-bit character sets, they really have to do most of the same work anyway.

To repeat my proposal:

Add an enumeration type Sets to Ada.Characters, or if you prefer Character_Sets. It should enumerate all the ISO 8859 character sets. (If you want to be clever, we could start with ISO646 so that Sets'Pos(N) = ISO 8859-N.) In any case we should allow implementations to extend the type. This would allow both for new ISO 8859 character sets, and for Unicode, EBCDIC, IBM code pages, and so on.

Now add procedure Set_Default_Character_Set, and function Current_Character_Set to Ada.Characters.Handling. (Or if you prefer to Ada.Characters.) As far as I am concerned the only required behavior for Set_Default_Character_Set should be to accept an argument of Latin_1.

We should probably also add a library pragma to change the default mapping of Character. (Compilers will probably accept command line setting of character mappings, but I think that a standard pragma would help standardization.)

If my proposal is accepted:

   Ada.Characters.Handling.Is_Letter ('¦')

should return false when Ada.Characters.Current_Set is Latin_1, and

   Ada.Characters.Handling.Is_Letter ('Š')

should return true when Ada.Characters.Current_Set is Latin_9. The behavior of

   Ada.Characters.Handling.Is_Letter (Character'Val(166))

should depend on the current value of Ada.Characters.Current_Set. What happens in the other cases will at best be implementation defined. In other words, if your program contains a (Unicode/BMP or UTF8) character literal that is not in a supported character set, I expect Program_Error; if a character in a literal is not a legal literal for Character, it is an error, just like other misspellings of literals.
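As an illustration only, the dependence of Is_Letter on the current set for Character'Val(166) can be sketched as below. The type Character_Sets, the variable Selected_Set (standing in for the proposed Current_Set function) and the local Is_Letter wrapper are hypothetical names chosen for this sketch; none of them exists in Ada.Characters, and only Latin_1 and Latin_9 are modelled.

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Current_Set_Sketch is

   --  Hypothetical: this type is not part of Ada.Characters.  It only
   --  models the two sets discussed in the message above.
   type Character_Sets is (Latin_1, Latin_9);

   --  Hypothetical stand-in for the proposed Current_Set function.
   Selected_Set : Character_Sets := Latin_1;

   --  Hypothetical wrapper: classify C under the selected set.
   function Is_Letter (C : Character) return Boolean is
   begin
      case Selected_Set is
         when Latin_1 =>
            return Ada.Characters.Handling.Is_Letter (C);
         when Latin_9 =>
            case C is
               --  The seven positions where Latin 9 has a letter and
               --  Latin 1 has a symbol.
               when Character'Val (166) | Character'Val (168)
                  | Character'Val (180) | Character'Val (184)
                  | Character'Val (188) | Character'Val (189)
                  | Character'Val (190) =>
                  return True;
               --  Every other position is classified as in Latin 1.
               when others =>
                  return Ada.Characters.Handling.Is_Letter (C);
            end case;
      end case;
   end Is_Letter;

   C : constant Character := Character'Val (166);

begin
   Selected_Set := Latin_1;
   Ada.Text_IO.Put_Line ("Latin_1: " & Boolean'Image (Is_Letter (C)));

   Selected_Set := Latin_9;
   Ada.Text_IO.Put_Line ("Latin_9: " & Boolean'Image (Is_Letter (C)));
end Current_Set_Sketch;
```

The same code point is reported as a non-letter under Latin_1 (broken bar) and as a letter under Latin_9 (S with caron), which is the behavior the proposal asks for.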
> Another interesting example is mentioned in the minutes of the Bedford
> meeting (http://www.ada-auth.org/ai-files/minutes/min-0210.html#AI285):
> "Consider the enumeration identifier "ÿ" (latin small letter y diaeresis).
> E'Image(ÿ) = "ÿ" in Latin-1 (there is no upper case version), but "Ÿ" in
> Latin-9 (there is an upper case version). So we would need the identifier
> semantics to be changed depending on the character set. Pascal claims that
> this is important to reading French."

Exactly why I think a way is needed for the programmer to be able to determine what the actual character set mapping is. Almost no burden for compilers that support Latin 1 only, and not that much additional for compilers that do support other 8-bit mappings. (Actually, I may be wrong but I think all currently validated compilers accept source in non-Latin 1 character sets.)

> After giving it more thought, I have come to the conclusion that the entire
> Latin-9 approach is misguided because:
>
> 1 - There is relatively little support in software out there for this
> encoding (heck, I am even reading that some mail gateways bounce back
> messages that use Latin-9 as their character encoding). Most of the editors
> that I have played with just go to Unicode when you type the Euro sign.
> That provides support for this new character without causing endless
> compatibility nightmares.
>
> 2 - I have gone through a similar "code point shuffle" mess at the
> beginning of the 80s: at the time we only had 7 bits per character (as you
> probably remember, the 8th bit was often used for parity) and some genius
> had invented to encode the French accented characters using the code points
> normally assigned to [, ], \, and the like. I have written thousands of
> lines of Pascal where an array indexing looked like Arr‡IŠ (instead of
> Arr[I]) just because of this silliness.
> What was painful-but-tolerable 20
> years ago is just not going to fly nowadays: I am ready to bet that the
> world will go Unicode before it goes Latin-9.
>
> Therefore, the latest version of AI 285 proposes to go to Unicode for the
> text representation of programs, relying on the categorization work done by
> the Unicode people so that we don't have to argue endlessly about which
> characters can appear in identifiers, etc. And it entirely ignores Latin-9,
> or any other Latin-N for that matter.

Couldn't agree more. The right solution is not to switch from Latin 1 to any other character set as a standard, but to supply a standard method for localization, and keep with the assumption of current Unicode/BMP for Wide_Character and for (notational) source.

Does any implementor see a problem implementing the above recommendation?

We could also go to the extreme of adding another optional annex dealing with character representation issues, but I think we all agree that the ARG should stay away from piecemeal character set bindings. On the other hand, I can see having a standard Wide_Character categorization, and allowing other characterizations to fall out from that. But let's keep that discussion in AI-285.

*************************************************************

From: Randy Brukardt
Sent: Tuesday, March 18, 2003 6:04 PM

> (Actually, I may be
> wrong but I think all currently validated compilers accept source in
> non-Latin 1 character sets.)

Since the only currently validated compilers are from Rational and DDC-I, that isn't saying much at all. You have to at least talk about widely-used compilers, but then you get into definitional problems.

*************************************************************

From: Robert I. Eachus
Sent: Tuesday, March 18, 2003 6:51 PM

We are in the standards business. I think that this is an area where a small extension to the standard will be very helpful in providing portability.
But we can't really worry about the cost of conformity for non-standardized compilers. ;-)

That is why I think that a definition which names the various character sets should be standardized:

   type Character_Sets is (ISO_646, Latin_1, Latin_2,...Latin_Greek...);

This would help standardize the way that non-Latin 1 character sets are named for compatibility. But I think we should stay out of the business of defining which characters are which for Latin_Greek, etc. That is ISO/IEC JTC1/SC2's job, and I think they do it pretty well.

Now suppose my proposal is accepted and, say, GNAT chooses to support the function Ada.Characters.Current_Set in a useful manner. However, ACT sees no demand for Ada.Characters.Set_Default_Character_Set to do anything useful, and therefore raises an exception if you try to change the value. (In other words Ada.Characters.Set_Default_Character_Set (Ada.Characters.Current_Set) does not raise an exception, but actually trying to change the value does.) Some other vendor may have a customer who requires Latin_Hebrew support, but could care less about Latin 9. Fine.

Assigning Ada names to the various 8859 character sets is in our area of competence. Deciding which sets compiler vendors support should be left up to their customers. Is this useful progress towards standardization? Sure. Is arguing over whether there is demand for Linear_B support way out of the way of anything that the ARG wants to get involved in? Obviously. Or worse, whether a variable named with the Greek Alpha, should match a Latin A? Arggh! (If you think that is bad what about CJK unification? Do we want to get into political cat fights about whether or not a Japanese Kanji code point matches a (Korean) Hangul character with a different appearance? Please! Anything but that...)
That is why I think we should be in the business of defining how to change character sets, but should stay well out of the politics of whether, say, compilers purchased by the Canadian government must support Latin 9.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, March 19, 2003 3:54 AM

> In Netscape 7.01, with the encoding set to Latin-1, this displays
> (correctly) the Latin 9 representations! As does OpenOffice.org,
> Notepad and so on. Now let me abstract from the Ada.Characters.Latin 1

In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as soon as you type a non-Latin-1 character. So I am not sure what your point is. (Didn't check the other software packages that you mention.)

> Add an enumeration type Sets to Ada Characters, or if you prefer
> Character_Sets. It should enumerate all the ISO 8859 character sets.
> ...
> Now add procedure Set_Default_Character_Set, and function
> Current_Character_Set to Ada.Characters.Handling.
> ...
> We should probably also add a library pragma to change the default
> mapping of Character. (Compilers will probably accept command line
> setting of character mappings, but I think that a standard pragma would
> help standardization.)

I understand the usefulness of a pragma, but I don't really understand what sense it makes to change the default character set (whatever that is) at run-time. Consider the case where you compile a program in Latin-9 mode, and it has an enumeration literal with an S-caron in it. Then at run-time you switch to Latin-1. Would the 'Image attribute now return a string including a broken bar? That would be very strange.

I can imagine why a program might want to juggle with different character encodings (by withing different Latin_N units) but it seems to me that the default character set has to be fixed at compilation time.
Anyway none of this changes my opinion that the Latin-N sets are far too unimportant to spend precious ARG time on them.

> Or worse,
> whether a variable named with the Greek Alpha, should match a Latin A?
> Arggh! (If you think that is bad what about CJK unification? Do we want
> to get into political cat fights about whether or not a Japanese Kanji
> code point matches a (Korean) Hangul character with a different
> appearance? Please! Anything but that...)

As a matter of fact, the current AI 285 does exactly that, and I don't see this as a political cat fight. The idea is to just follow what the Unicode folks are doing (and I suppose _they_ do quite a bit of political cat fight). So to answer your questions, a Latin A is not the same thing as a Greek Alpha or a Cyrillic A. And at this point the kanjis and hanguls are not letters, so they are not allowed in identifiers. When the Unicode people decide that ideograms are letters, we will update the definition in Ada.

*************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, March 19, 2003 4:19 AM

> I understand the usefulness of a pragma, but I don't really understand
> what sense it makes to change the default character set (whatever that
> is) at run-time. Consider the case where you compile a program in
> Latin-9 mode, and it has an enumeration literal with an S-caron in it.
> Then at run-time you switch to Latin-1. Would the 'Image attribute now
> return a string including a broken bar? That would be very strange.
>

And if you go that way, you may want different tasks to use different encodings.... Did I hear "can of worms" ?

*************************************************************

From: Robert I. Eachus
Sent: Wednesday, March 19, 2003 4:43 PM

First, let me get this out of the way. I really like UTF-8, and for that matter UTF-16. I would also love to put real Unicode/BMP support into Chapter (Clause) 2 and elsewhere in the RM.
I would like to see a (standard) Wide_Text_IO that supported UTF-8. But it is a lot of work.

However, even if users do eventually migrate toward 16-bit and 32-bit character standards, we currently have an 8-bit character type in the standard. My reasons behind arguing for a minimal AI in this area is that I think that it would "clear the decks" forever in the 8-bit area, and let us concentrate on enhancing 16-bit support in the future.

Pascal Leroy wrote:

> In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as
> soon as you type a non-Latin-1 character. So I am not sure what your
> point is. (Didn't check the other software packages that you mention.)

I guess you missed the point. Windows actually uses a superset of Latin 1 that contains all the Latin 9 characters with different code-points. Windows also has IANA-registered extended versions of some other Latin sets. (These are Windows-1291 et. seq.) See the MIME and HTML standards for more details. Notepad and other applications may switch to Unicode internally when you enter non-Latin 1 (or non-Windows 1291) characters. But if you cut-and-paste into a text document from one with a different mapping, most PC software seems to use ISO 2022 control characters to avoid having to reprocess the entire document. This can be done as long as you use at most three ISO 8859 (or Windows) font variants.

> I understand the usefulness of a pragma, but I don't really understand
> what sense it makes to change the default character set (whatever that
> is) at run-time.
> I can imagine why a program might want to juggle with different
> character encodings (by withing different Latin_N units) but it seems to
> me that the default character set has to be fixed at compilation time.

You may be right which is why I gave that hypothetical GNAT example.
I think it would be almost trivial for them to support a current character set enquiry function, but a procedure to change the character set at run-time might take a lot more work. Where you would want to be able to change the default character set at run-time would be for things like Character to UTF-8 encoders and decoders.

> Consider the case where you compile a program in Latin-9 mode, and
> it has an enumeration literal with an S-caron in it. Then at
> run-time you switch to Latin-1. Would the 'Image attribute now
> return a string including a broken bar? That would be very strange.

Why? The character or string literal gets translated from Latin 9 to Character at compile time. Then you conceptually remap all Character and String values when you change the default character set at run-time. If you convert the literal from Latin 9 to UTF-8 or Unicode at compile time, then try to convert back with a default character set of Latin 1, you can and should expect a Constraint_Error.

> Anyway none of this changes my opinion that the Latin-N sets are far too
> unimportant to spend precious ARG time on them.

In one sense, as I said I agree. But I think that since we do have compilers around that support remapping of Character, a standard way of querying that setting is needed for standardization. As I indicated, I can easily be convinced that a way of setting the default mapping at run-time is a bit too much. Certainly though, the same issues will come up with respect to Wide_Character if and when compilers support different Wide_Character mappings. In the Wide_Character case determining at run-time what the actual mapping is may be important, but I certainly agree that requiring support for changing the Wide_Character mapping at run-time (say from Shift-JIS to Unicode) would be extreme.

Remember that all that my current proposal requires is that changing from Latin 1 to Latin 1 succeed. I agree that anything else should be left outside the scope of the (ISO) standard.
I have no trouble with leaving the procedure to change the default character set out altogether, or making it optional.

> As a matter of fact, the current AI 285 does exactly that, and I don't
> see this as a political cat fight. The idea is to just follow what the
> Unicode folks are doing (and I suppose _they_ do quite a bit of
> political cat fight). So to answer your questions, a Latin A is not the
> same thing as a Greek Alpha or a Cyrillic A. And at this point the
> kanjis and hanguls are not letters, so they are not allowed in
> identifiers. When the Unicode people decide that ideograms are letters,
> we will update the definition in Ada.

Exactly my point, except that I think we officially follow ISO 10646 not Unicode. So in theory we should update to Unicode 3.2 compatibility when DIS 10646(2003) is accepted. (Those battles come closer to vendettas than cat fights. The major battles are Japanese vs. Korean, Chinese vs. Japanese, Russian vs. Georgian, Greeks vs. Macedonians, and francophones vs. everybody. Did I miss anyone?)

If any other ARG--or CRG--members really care about all this, you too can join the madness in Prague next week. (http://www.unicode.org/iuc/iuc23/ ;-)

*************************************************************

From: Randy Brukardt
Sent: Wednesday, March 19, 2003 7:25 PM

> Pascal Leroy wrote:
>
> > In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as
> > soon as you type a non-Latin-1 character. So I am not sure what your
> > point is. (Didn't check the other software packages that you mention.)
>
> I guess you missed the point. Windows actually uses a superset of Latin
> 1 that contains all the Latin 9 characters with different code-points.
> Windows also has IANA-registered extended versions of some other Latin
> sets. (These are Windows-1291 et. seq.) See the MIME and HTML standards
> for more details.
> Notepad and other applications may switch to Unicode
> internally when you enter non-Latin 1 (or non-Windows 1291) characters.

Humm, the messages you are sending are encoded as "Windows-1252", which is the standard Windows character set. That hardly proves anything at all (other than that Windows doesn't use Latin-1 itself). (I checked this out in the spam filter.)

> But if you cut-and-paste into a text document from one with a
> different mapping, most PC software seems to use ISO 2022 control
> characters to avoid having to reprocess the entire document. This can
> be done as long as you use at most three ISO 8859 (or Windows) font
> variants.

Nope, it doesn't change the text at all (if it's in the standard Windows character set, which most everything is).

And if you paste it into the DOS box (which uses the OEM character set - which is how I edit the AIs with my circa-1986 text editor), it just gets converted to the nearest equivalents. For instance, I get a capital Y for UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix to AI-285).

Generalizations about Windows are almost always wrong. :-)

*************************************************************

From: Robert I. Eachus
Sent: Thursday, March 20, 2003 12:20 AM

Randy Brukardt wrote:

> Humm, the messages you are sending are encoded as "Windows-1252", which is
> the standard Windows character set. That hardly proves anything at all
> (other than that Windows doesn't use Latin-1 itself). (I checked this out in
> the spam filter.)

(Sorry 1291 et. seq. instead of 1251 et. seq. was a typo.)

I guess I shouldn't be surprised that 1252 has succeeded 1251 as the "standard" Windows binding in the US, but I hadn't noticed. But that more clearly makes my point. Users might want to be able to use 8-bit bindings that the ARG as a group should have little or no interest in.
But there is the IANA registry, and I think we can bind to a pointer to those names with little difficulty, and leave it to compiler vendors and others to do the "proper" binding to the character set they want to use.

We should in no way require compilers to reject S or o (S-caron or the oe ligature) in a name. But we should fix that through references to the Unicode & ISO/IEC 10646 standards, and let compiler vendors support the 8-bit sets their users want to use. (Including 8-bit standards like Shift-JIS and UTF-8.)

> Nope, it doesn't change the text at all (if it's in the standard Windows
> character set, which most everything is).

Oh, there are those who would make you pay dearly for those comments, unless you meant Unicode as the "standard" Windows character set. But the reality is that there is NO standard 8-bit character set for Windows, versions for different countries use different character sets.

> And if you paste it into the DOS box (which uses the OEM character set -
> which is how I edit the AIs with my circa-1986 text editor), it just gets
> converted to the nearest equivalents. For instance, I get a capital Y for
> UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix
> to AI-285).

Ouch, does that mean I should write the proposal up as a new draft AI, so people can read it?

> Generalizations about Windows are almost always wrong. :-)

I have learned the hard way that generalizations about preferred character sets are ALWAYS wrong.

*************************************************************

From: Randy Brukardt
Sent: Thursday, March 20, 2003 5:51 PM

> Randy Brukardt wrote:
>
> > Humm, the messages you are sending are encoded as "Windows-1252", which is
> > the standard Windows character set. That hardly proves anything at all
> > (other than that Windows doesn't use Latin-1 itself). (I checked this out in
> > the spam filter.)
>
> (Sorry 1291 et. seq. instead of 1251 et. seq. was a typo.)
>
> I guess I shouldn't be surprised that 1252 has succeeded 1251 as the
> "standard" Windows binding in the US, but I hadn't noticed.

FYI, that's confused. 1251 is "Cyrillic", while 1252 is "Western European".

...

> We should in no way require compilers to reject S or o (S-caron or the
> oe ligature) in a name. But we should fix that through references to
> the Unicode & ISO/IEC 10646 standards, and let compiler
> vendors support the 8-bit sets their users want to use. (Including 8-bit
> standards like Shift-JIS and UTF-8.)

Which is exactly what Pascal has proposed. But it should be pointed out that this is a very pervasive change. It means that the representation for names at runtime (in things like the tables for 'Image, for 'External_Tag, for exception information) has to be changed (at the very least to UTF-8). For Janus/Ada, where most of the runtime code that deals with those things is written in assembler, such a change will be very expensive. And that will be true to some extent or other for all compilers.

> > Nope, it doesn't change the text at all (if it's in the standard Windows
> > character set, which most everything is).
>
> Oh, there are those who would make you pay dearly for those comments,
> unless you meant Unicode as the "standard" Windows character
> set. But
> the reality is that there is NO standard 8-bit character set for
> Windows, versions for different countries use different
> character sets.

Of course. I should have said "standard US Windows character set"; didn't mean to imply that it is the same for everyone.

> > And if you paste it into the DOS box (which uses the OEM character set -
> > which is how I edit the AIs with my circa-1986 text editor), it just gets
> > converted to the nearest equivalents. For instance, I get a capital Y for
> > UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix
> > to AI-285).
>
> Ouch, does that mean I should write the proposal up as a new draft AI,
> so people can read it?
Nope, AIs go through the same text editor. Using non-7-bit characters in AIs is strongly discouraged. (If we wanted to start using HTML for AIs, then perhaps a little more flexibility could be allowed.)

> > Generalizations about Windows are almost always wrong. :-)
>
> I have learned the hard way that generalizations about preferred
> character sets are ALWAYS wrong.

Correct. The less the standard says about character sets, the better. Your proposal seems to require a lot of additional verbiage and support to solve a problem that doesn't seem to actually exist. The Unicode/ISO 10646 problem does exist, but once we support that fully, compilers can support anything they want without us getting in the way.

(It would be nice to have a way to convert to and from UTF-8 in Ada programs. But, that's one of many things that are "easy enough to write yourself", so it's hard to say if it is worth adding anything for that.)

*************************************************************

From: Robert Dewar
Sent: Saturday, March 22, 2003 11:38 AM

I find all this discussion of character sets going way off target. All we are talking about here is some predefined names for some of the characters, nothing more and nothing less.

*************************************************************

From: Robert I. Eachus
Sent: Sunday, March 23, 2003 11:43 AM

I'm confused. I am certainly proposing using the IANA registry for the names of character sets, and a way for programmers to determine which set is in use. As I understand the 16-bit character set issues, in addition to character names, there is characterization in terms of 2.2 Lexical Elements for non-Latin 1 characters. In other words which characters can be used in names and numeric literals.

I suspect that what Robert is referring to is the fact that if someone uses a non-Latin 1 eight-bit set by a command-line argument, that the names won't match the characters as displayed.
If so, I am actually recommending the permission for implementors to 'fix' more than that. For example, I don't think we should require that implementations support the Windows 1252 character set, but it would be nice to allow implementations which choose to do so to get it right. Some of that will follow at compile time if implementations map Windows 1252 to the appropriate Unicode/BMP characters. But it would also be nice to allow the Ada.Characters hierarchy and Ada.Text_IO features to be used with non-Latin 1 character sets.

The keyword in the previous paragraph is "allow." As I said, I think we can go a bit farther and provide a standard way to determine the current character set mapping. But there is no reason for us to say you must support these character sets (other than Latin 1!), but must not support these other sets. This really has to do with code points in the 00 to 1F and 80 to 9F ranges being printable characters instead of control characters.

*************************************************************

!topic To_Ada conversion in case of wchar_t'Size > 16
!reference RM95 B.3(58), RM95 B.3(60)
!from Vadim Godunko 2003-01-21
!discussion

At least one C library implementation (glibc) uses 32-bit values for the wchar_t type. In this case the behavior of the conversion functions To_Ada is not determined. I propose that those functions must return the Wide_Character'Val (16#FFFD#) value (Replacement character) if the value of Item is outside of the BMP.

*************************************************************
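The behavior proposed in the last message can be sketched as follows. This is an illustration under stated assumptions, not the text of Interfaces.C: the package Wchar_Sketch and the type Code_Point_32 are hypothetical, with Code_Point_32 standing in for a 32-bit wchar_t value so that the example does not depend on how a particular compiler declares Interfaces.C.wchar_t.

```ada
--  Sketch only. Wchar_Sketch and Code_Point_32 are hypothetical
--  names; Code_Point_32 stands in for a 32-bit C wchar_t value.
package Wchar_Sketch is

   type Code_Point_32 is mod 2 ** 32;

   --  REPLACEMENT CHARACTER, as named in the proposal.
   Replacement : constant Wide_Character := Wide_Character'Val (16#FFFD#);

   function To_Ada (Item : Code_Point_32) return Wide_Character;

end Wchar_Sketch;

package body Wchar_Sketch is

   function To_Ada (Item : Code_Point_32) return Wide_Character is
   begin
      --  Anything above 16#FFFF# lies outside the BMP and cannot be
      --  represented in a 16-bit Wide_Character.
      if Item > 16#FFFF# then
         return Replacement;
      end if;
      return Wide_Character'Val (Item);
   end To_Ada;

end Wchar_Sketch;
```

For example, To_Ada (16#1D11E#) yields Replacement, while To_Ada (16#0041#) yields Wide_Character'Val (16#41#). The sketch does not treat surrogate code points specially, since the proposal only addresses values outside the BMP.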