!standard 1.1.4(15)                                  05-11-15  AI95-00395/11
!standard 2.1(01)
!standard 2.1(03)
!standard 2.1(04)
!standard 2.1(14)
!standard 2.3(02)
!standard 2.3(03)
!standard 2.3(04)
!standard 2.3(05)
!standard 2.3(06)
!standard 2.9(02)
!standard 3.5.2(03)
!standard 6.1(10)
!standard A.1(36)
!standard A.3.1(00)
!standard A.3.1(02)
!standard A.3.2(02)
!standard A.3.2(13-18)
!standard A.3.2(42-48)
!standard A.3.4(01)
!standard A.4.7(46)
!standard A.4.7(48)
!standard A.4.8(01)
!standard J.14(00)
!class amendment 05-01-25
!status Amendment 200Y 05-02-25
!status ARG Approved 10-0-1  05-04-17
!status work item 05-01-25
!status received 05-01-25
!priority High
!difficulty Easy
!subject Various clarifications regarding 16- and 32-bit characters

!summary

(See proposal.)

!problem

1 - The characters in category other_format are generally not displayed. The
syntax rule for identifier would make it possible to have an identifier that
includes two underlines separated by an other_format, which would visually
look like two underlines. Similarly for trailing underlines, or for
identifiers that would look like reserved words. Is this intended? (No.)

2 - The character at position 16#AD#, SOFT HYPHEN, is in category
other_format. It was allowed in Ada 95 in literals, but the current wording
means that it's no longer allowed, which introduces an incompatibility. Is
this intended? (No.)

3 - Many places in normative text talk about "upper case" without
qualification. This is somewhat ambiguous in the Unicode world.

4 - The definition of the image of non-graphic wide characters results in
long strings like "Character_12345678". This increases the Width attribute
for Wide_String and Wide_Wide_String for no good reason.

5 - AI-302-3 defines Ada.Strings.Wide_Hash, Ada.Strings.Wide_Fixed.Wide_Hash,
Ada.Strings.Wide_Bounded.Wide_Hash, and Ada.Strings.Wide_Unbounded.Wide_Hash;
there should be double wide versions of these as well. Similarly, the
addition of AI-362 to A.4.7 needs to be made in A.4.8.

6 - Ada.Strings.Wide_Maps.Wide_Constants and
Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants define Upper_Case_Map and
Lower_Case_Map. What is their effect?

7 - Wide_Wide_Character has 2**31 values. Therefore its size is 31. Because
Wide_Wide_String is a packed array, each of its components should occupy 31
bits. Is this intended? (No.)

8 - ISO/IEC 10646:2003 reserves positions 16#FFFE# and 16#FFFF# of *each
plane*, but the AARM only mentions the BMP. Is this intended? (No.)

9 - Ada compilers will have a mechanism for locale-independent case folding
and character classification. It seems wrong to not allow Ada users to use
these facilities.

!proposal

1 - After removing the other_format characters, an identifier must not
violate the "usual" rules about underlines. It must not be a reserved word,
either. Also, other_format characters are allowed (but ignored) in reserved
words, and in "special" attribute designators.

Note that we must phrase the wording to only allow ASCII characters in
identifiers, to avoid oddities like "if" written with a Turkish dotless-i, or
"access" written with a German sharp-s.

2 - The incompatibility doesn't seem justified. While Unicode recommends that
other_format characters be ignored in identifiers, it doesn't say anything
about other constructs. ECMA C#, which we used as a guideline in resolving
some of the character issues, allows them in string literals. Hopefully
decent program editors will provide a way to display these characters.

Note that some languages allow any character in string literals. We do not
want to go that far; in particular, we do not want to allow control
characters. They have been disallowed for 20 years, and there is no
indication that users have had any problem with that. We are just avoiding an
incompatibility.

We must also specify the effect of other_format characters in operator
symbols.
We are following the rule that other_format characters work in operator
symbols just like in normal text: they are allowed (and ignored) for
operators that are reserved words, and disallowed in other operators.

3 - We are not going to fix all these places. Currently we only have a rule
in 2.3, but it surely doesn't cover all the occurrences of "upper case", so
it would be better to have a blanket statement somewhere in section 1.

4 - Change the language-defined names to keep the current value of Width
(which is 12).

5 - Add Ada.Strings.Wide_Wide_Hash, Ada.Strings.Wide_Wide_Fixed.Wide_Wide_Hash,
Ada.Strings.Wide_Wide_Bounded.Wide_Wide_Hash, and
Ada.Strings.Wide_Wide_Unbounded.Wide_Wide_Hash to A.4.8's list of functions.
Add A.4.7(46.1/2) to A.4.8.

6 - Ada.Strings.Wide_Maps.Wide_Constants defines Upper_Case_Map and
Lower_Case_Map in terms of Ada.Strings.Maps.Constants. A.4.7(48) makes it
clear that this is intended. Changing their definition would be inconsistent
with Ada 95 - programs would behave differently with no compile-time
indication.

Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants is defined in terms of
Ada.Strings.Wide_Maps.Wide_Constants, but the note A.4.7(48) was not carried
over. So it is unclear whether these are just copies (which is what the
normative wording implies) or whether they cover the full range. Covering the
full range seems inconsistent with Wide_Maps.

Case folding is not necessarily 1-to-1; therefore, these mappings are
inappropriate for 32-bit characters anyway. Therefore, we stay consistent
with Wide_Constants and add text to A.4.8. This text should be normative, and
A.4.7 should be changed similarly.

7 - There are two ways to fix this issue: add a size clause for 32 bits; or
add another 2**31 literals to Wide_Wide_Character. The former has the
drawback that some operations involving low-level programming
(Unchecked_Conversion, C interfacing) may become erroneous.
The latter has the drawback that Wide_Wide_Character does not properly model
the 10646 character set, and therefore programmers who care about
internationalization have to deal with the 2**31 extra values; in particular,
a signed integer type on a 32-bit machine cannot hold the Pos of a
Wide_Wide_Character. This AI was written for the first option.

8 - Add wording to cover positions 16#FFFE# and 16#FFFF# of each plane. We
are also removing the language-defined names FFFE and FFFF.

9 - Full case mapping and wide character categorization require hefty
run-time tables, so it would be inappropriate to add that to
Ada.Characters.Handling. However, the addition of new operations dealing with
Wide_Wide_Characters in Ada.Characters.Handling is problematic, as it makes
some calls (those that use literals) ambiguous. So we are moving the
conversion functions into a new child package, Ada.Characters.Conversions,
and making the existing conversion functions in Ada.Characters.Handling
obsolescent. We are also adding packages Ada.Wide_Characters and
Ada.Wide_Wide_Characters as umbrellas for implementation-defined (or
user-defined) operations on Wide_ and Wide_Wide_Characters and Strings.

!wording

1 - Change 2.3(2-4) to read:

    identifier ::= identifier_start {identifier_start | identifier_extend}

    identifier_start ::=
        letter_uppercase
      | letter_lowercase
      | letter_titlecase
      | letter_modifier
      | letter_other
      | number_letter

    identifier_extend ::=
        mark_non_spacing
      | mark_spacing_combining
      | number_decimal
      | punctuation_connector
      | other_format

    After eliminating the characters in category other_format, an identifier
    shall not contain two consecutive characters in category
    punctuation_connector, or end with a character in that category.

Add before 2.3(6):

    After applying these transformations, an identifier shall not be
    identical to a reserved word (in upper case).

Replace the introductory sentence of 2.9(2) with:

    The following are the reserved words.
    Within a program, some or all of the letters of a reserved word may be in
    upper case, and one or more characters in category other_format may be
    inserted within or at the end of the reserved word.

2 - Change 2.1(14) (as modified by AI95-00285) to read:

    graphic_character
        Any character that is not in the categories other_control,
        other_private_use, other_surrogate, format_effector, and whose code
        position is neither 16#FFFE# nor 16#FFFF#.

Change 6.1(10) to read:

    The sequence of characters in an operator_symbol shall form a reserved
    word, a delimiter, or compound delimiter that corresponds to an operator
    belonging to one of the six categories of operators defined in clause
    4.5.

    AARM Note: The "sequence of characters" of the string literal of the
    operator is a technical term (see 2.6), and does not include the
    surrounding quote characters. As defined in 2.2, lexical elements are
    "formed" from a sequence of characters. Spaces are not allowed, and upper
    and lower case is not significant. See 2.2 and 2.9 for rules related to
    the use of other_format characters in delimiters and reserved words.

3 - Add before 1.1.4(15):

    When this International Standard mentions the conversion of some
    character or sequence of characters to upper case, it means the character
    or sequence of characters obtained by using locale-independent full case
    folding, as defined by documents referenced in the note in section 1 of
    ISO/IEC 10646:2003.

    AARM Note: For sequences of characters, case folding is applied to the
    sequence, not to individual characters. It sometimes can make a
    difference.

Change the second paragraph added after 2.3(5) by AI95-00285 to read:

    o The remaining sequence of characters is converted to upper case.

4 - In the second paragraph added after 3.5.2(3) by AI95-00285, replace:

    ... the string "Character_" ...

by:

    ... the string "Hex_" ...
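
As an illustration of item 4 (not part of the proposed wording), the sketch
below builds a name of the new form: "Hex_" followed by the code position as
eight hexadecimal digits. Such a name is 4 + 8 = 12 characters long, whereas
a name of the earlier form, such as "Character_12345678", is 18 characters
long. The type Code_Position and the function Hex_Name are hypothetical names
introduced only for this example.

```ada
with Ada.Text_IO;

procedure Hex_Name_Demo is

   --  Hypothetical type: the code positions covered by Wide_Wide_Character.
   type Code_Position is range 0 .. 16#7FFF_FFFF#;

   --  Hypothetical helper, not part of the proposed wording: returns "Hex_"
   --  followed by the code position as eight hexadecimal digits.
   function Hex_Name (Position : Code_Position) return String is
      Hex_Digits : constant String (1 .. 16) := "0123456789ABCDEF";
      Result     : String (1 .. 12) := "Hex_00000000";
      Rest       : Code_Position := Position;
   begin
      --  Fill the eight digit slots from right to left.
      for Index in reverse 5 .. 12 loop
         Result (Index) := Hex_Digits (Integer (Rest mod 16) + 1);
         Rest := Rest / 16;
      end loop;
      return Result;
   end Hex_Name;

begin
   Ada.Text_IO.Put_Line (Hex_Name (16#FFFE#));        --  Hex_0000FFFE
   Ada.Text_IO.Put_Line (Hex_Name (16#0001_2345#));   --  Hex_00012345

   --  "Hex_" (4 characters) plus 8 digits is always 12 characters long.
   pragma Assert (Hex_Name (Code_Position'Last)'Length = 12);
end Hex_Name_Demo;
```

Whether an implementation computes these names this way is not specified
here; the sketch only shows why the length of the names is bounded by 12.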

5 - In A.4.8(1) (introduced by AI95-00285), add Strings.Wide_Wide_Hash,
Strings.Wide_Wide_Fixed.Wide_Wide_Hash,
Strings.Wide_Wide_Bounded.Wide_Wide_Hash, and
Strings.Wide_Wide_Unbounded.Wide_Wide_Hash functions.

In A.4.8(28) (introduced by AI95-00285), add Strings.Hash,
Strings.Fixed.Hash, Strings.Bounded.Hash, and Strings.Unbounded.Hash
functions.

Add after A.4.8(45) (introduced by AI95-00285):

    Pragma Pure is replaced by pragma Preelaborate in
    Strings.Wide_Wide_Maps.Wide_Wide_Constants.

6 - Change A.4.7(48) into normative text.

Add at the end of A.4.8 (introduced by AI95-00302):

    Each Wide_Wide_Character_Set constant in the package
    Strings.Wide_Wide_Maps.Wide_Wide_Constants contains no values outside the
    Character portion of Wide_Wide_Character. Similarly, each
    Wide_Wide_Character_Mapping constant in this package is the identity
    mapping when applied to any element outside the Character portion of
    Wide_Wide_Character.

7 - Add in A.1, after the declaration of Wide_Wide_Character:

    for Wide_Wide_Character'Size use 32;

8 - Change 2.1(1) (as modified by AI95-00285) to read:

    The character repertoire for the text of an Ada program consists of the
    entire coding space described by the ISO/IEC 10646:2003 Universal
    Multiple-Octet Coded Character Set. This coding space is organized in
    *planes*, each plane comprising 65536 characters.

Change the paragraph inserted after 2.1(3) by AI95-00285 to read:

    A *character* is defined by this International Standard for each cell in
    the coding space described by ISO/IEC 10646:2003, regardless of whether
    or not ISO/IEC 10646:2003 allocates a character to that cell.

Change the second sentence of 2.1(4) (as modified by AI95-00285) to read:

    A character whose relative code position in its plane is 16#FFFE# or
    16#FFFF# is not allowed anywhere in the text of a program.

Change 2.1(14) to read (that's in addition to the changes for this paragraph
above):

    graphic_character
        Any character that is not in the categories other_control,
        other_private_use, other_surrogate, format_effector, and whose
        relative code position in its plane is neither 16#FFFE# nor 16#FFFF#.

In 3.5.2(3), remove:

    "The last 2 values of Wide_Character correspond to the nongraphic
    positions FFFE and FFFF of the BMP, and are assigned the language-defined
    names FFFE and FFFF. As with the other language-defined names for
    nongraphic characters, the names FFFE and FFFF are usable only with the
    attributes (Wide_)Image and (Wide_)Value; they are not usable as
    enumeration literals."

In A.1(36) replace:

    type Wide_Character is (nul, soh, ..., Hex_0000FFFF);

9 - Change A.3.1(0) to read:

    A.3.1 The Packages Characters, Wide_Characters, and Wide_Wide_Characters

Insert after A.3.1(2):

    The library package Wide_Characters has the following declaration:

        package Ada.Wide_Characters is
           pragma Pure (Wide_Characters);
        end Ada.Wide_Characters;

    The library package Wide_Wide_Characters has the following declaration:

        package Ada.Wide_Wide_Characters is
           pragma Pure (Wide_Wide_Characters);
        end Ada.Wide_Wide_Characters;

    Implementation Advice

    If an implementation chooses to provide implementation-defined operations
    on Wide_Character or Wide_String (such as case mapping, classification,
    collating and sorting, etc.) it should do so by providing child units of
    Wide_Characters. Similarly if it chooses to provide
    implementation-defined operations on Wide_Wide_Character or
    Wide_Wide_String it should do so by providing child units of
    Wide_Wide_Characters.

Add before A.3.2(2):

    with Ada.Characters.Conversions;

Replace A.3.2(13) by:

    -- The functions Is_Character, Is_String, To_Character, To_String,
    -- To_Wide_Character, and To_Wide_String are obsolescent; see J.14.

Delete A.3.2(14-18).

Delete A.3.2(42-48).

Add section J.14:

    J.14 Character and Wide_Character Conversion Functions

    The following declarations exist in the declaration of package
    Ada.Characters.Handling:

        function Is_Character (Item : in Wide_Character) return Boolean
           renames Conversions.Is_Character;
        function Is_String (Item : in Wide_String) return Boolean
           renames Conversions.Is_String;

        function To_Character (Item       : in Wide_Character;
                               Substitute : in Character := ' ')
           return Character
           renames Conversions.To_Character;

        function To_String (Item       : in Wide_String;
                            Substitute : in Character := ' ')
           return String
           renames Conversions.To_String;

        function To_Wide_Character (Item : in Character) return Wide_Character
           renames Conversions.To_Wide_Character;

        function To_Wide_String (Item : in String) return Wide_String
           renames Conversions.To_Wide_String;

Add section A.3.4:

    A.3.4 The Package Characters.Conversions

    The library package Ada.Characters.Conversions has the following
    declaration:

        package Ada.Characters.Conversions is
           pragma Pure (Conversions);

           function Is_Character (Item : in Wide_Character) return Boolean;
           function Is_String (Item : in Wide_String) return Boolean;
           function Is_Character (Item : in Wide_Wide_Character)
              return Boolean;
           function Is_String (Item : in Wide_Wide_String) return Boolean;
           function Is_Wide_Character (Item : in Wide_Wide_Character)
              return Boolean;
           function Is_Wide_String (Item : in Wide_Wide_String)
              return Boolean;

           function To_Wide_Character (Item : in Character)
              return Wide_Character;
           function To_Wide_String (Item : in String) return Wide_String;
           function To_Wide_Wide_Character (Item : in Character)
              return Wide_Wide_Character;
           function To_Wide_Wide_String (Item : in String)
              return Wide_Wide_String;
           function To_Wide_Wide_Character (Item : in Wide_Character)
              return Wide_Wide_Character;
           function To_Wide_Wide_String (Item : in Wide_String)
              return Wide_Wide_String;

           function To_Character (Item       : in Wide_Character;
                                  Substitute : in Character := ' ')
              return Character;
           function To_String (Item       : in Wide_String;
                               Substitute : in Character := ' ')
              return String;
           function To_Character (Item       : in Wide_Wide_Character;
                                  Substitute : in Character := ' ')
              return Character;
           function To_String (Item       : in Wide_Wide_String;
                               Substitute : in Character := ' ')
              return String;
           function To_Wide_Character (Item       : in Wide_Wide_Character;
                                       Substitute : in Wide_Character := ' ')
              return Wide_Character;
           function To_Wide_String (Item       : in Wide_Wide_String;
                                    Substitute : in Wide_Character := ' ')
              return Wide_String;

        end Ada.Characters.Conversions;

(The wording for the semantics of the operations declared in this package is
identical to the one currently in AI95-00285.)

!discussion

(See proposal.)

!example

!corrigendum 1.1.4(15)

@dinsb
A @i<syntactic category> is a nonterminal in the grammar defined in BNF under
"Syntax." Names of syntactic categories are set in a different font, @fa.
@dinst
When this International Standard mentions the conversion of some character or
sequence of characters to upper case, it means the character or sequence of
characters obtained by using locale-independent full case folding, as defined
by documents referenced in the note in section 1 of ISO/IEC 10646:2003.

!corrigendum 2.1(01)

@drepl
The only characters allowed outside of @fas are the @fas and @fas.
@dby
The character repertoire for the text of an Ada program consists of the
entire coding space described by the ISO/IEC 10646:2003 Universal
Multiple-Octet Coded Character Set. This coding space is organized in
@i<planes>, each plane comprising 65536 characters.
!corrigendum 2.1(03) @drepl @xcode<@fa> @dby @xindent is defined by this International Standard for each cell in the coding space described by ISO/IEC 10646:2003, regardless of whether or not ISO/IEC 10646:2003 allocates a character to that cell.> !corrigendum 2.1(04) @drepl The character repertoire for the text of an Ada program consists of the collection of characters called the Basic Multilingual Plane (BMP) of the ISO 10646 Universal Multiple-Octet Coded Character Set, plus a set of @fas and, in comments only, a set of @fas; the coded representation for these characters is implementation defined (it need not be a representation defined within ISO-10646-1). @dby The coded representation for characters is implementation defined (it need not be a representation defined within ISO/IEC 10646:2003). A character whose relative code position in its plane is 16#FFFE# or 16#FFFF# is not allowed anywhere in the text of a program. The semantics of an Ada program whose text is not in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) is implementation defined. !corrigendum 2.1(14) @drepl @xhang<@xterm<@fa> Any control function, other than a @fa, that is allowed in a comment; the set of @fas allowed in comments is implementation defined.> @dby @xhang<@xterm<@fa> Any character that is not in the categories @fa, @fa, @fa, @fa, and whose relative code position in its plane is neither 16#FFFE# nor 16#FFFF#.> !corrigendum 2.3(02) @drepl @xcode<@fa> @dby @xcode<@fa> !corrigendum 2.3(03) @drepl @xcode<@fa> @dby @xcode<@fa> @xcode<@fa> !corrigendum 2.3(04) @drepl An identifier shall not be a reserved word. @dby After eliminating the characters in category @fa, an @fa shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category. !corrigendum 2.3(05) @drepl All characters of an @fa are significant, including any underline character. 
@fas differing only in the use of corresponding upper and lower case letters are considered the same. @dby Two @fas are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order): @xbullet are eliminated.> @xbullet !corrigendum 2.3(06) @dinsb In a nonstandard mode, an implementation may support other upper/lower case equivalence rules for identifiers, to accommodate local conventions. @dinst After applying these transformations, an identifier shall not be identical to a reserved word (in upper case). !corrigendum 2.9(02) @dprepl The following are the @i (ignoring upper/lower case distinctions): @dby The following are the @i. Within a program, some or all of the letters of a reserved word may be in upper case, and one or more characters in category @fa may be inserted within or at the end of the reserved word. !corrigendum 3.5.2(03) @drepl The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO 10646 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding @fa in Wide_Character. The first 256 values of Wide_Character have the same @fa or language-defined name as defined for Character. The last 2 values of Wide_Character correspond to the nongraphic positions FFFE and FFFF of the BMP, and are assigned the language-defined names @i and @i. As with the other language-defined names for nongraphic characters, the names @i and @i are usable only with the attributes (Wide_)Image and (Wide_)Value; they are not usable as enumeration literals. All other values of Wide_Character are considered graphic characters, and have a corresponding @fa. @dby The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding @fa in Wide_Character. 
The first 256 values of Wide_Character have the same @fa or language-defined name as defined for Character. Each of the @fas has a corresponding @fa. The predefined type Wide_Wide_Character is a character type whose values correspond to the 2147483648 code positions of the ISO/IEC 10646:2003 character set. Each of the @fas has a corresponding @fa in Wide_Wide_Character. The first 65536 values of Wide_Wide_Character have the same @fa or language-defined name as defined for Wide_Character. The characters whose code position is larger than 16#FF# and which are not @fas have language-defined names which are formed by appending to the string "Hex_" the representation of their code position in hexadecimal as eight extended digits. As with other language-defined names, these names are usable only with the attributes (Wide_)Wide_Image and (Wide_)Wide_Value; they are not usable as enumeration literals. !corrigendum 6.1(10) @drepl The sequence of characters in an @fa shall correspond to an operator belonging to one of the six classes of operators defined in clause 4.5 (spaces are not allowed and the case of letters is not significant). @dby The sequence of characters in an @fa shall form a reserved word, a delimiter, or compound delimiter that corresponds to an operator belonging to one of the six categories of operators defined in clause 4.5. !corrigendum A.1(36) @drepl @xcode< --@ft<@i< The predefined operators for the type Character are the same as for>> --@ft<@i< any enumeration type.>> --@ft<@i< The declaration of type Wide_Character is based on the standard ISO 10646 BMP character set.>> --@ft<@i< The first 256 positions have the same contents as type Character. See 3.5.2.>> @b Wide_Character @b (@i, @i ... @i, @i); @b ASCII @b ... 
@b ASCII; --@ft<@i>> @dby @xcode< --@ft<@i< The predefined operators for the type Character are the same as for>> --@ft<@i< any enumeration type.>> --@ft<@i< The declaration of type Wide_Character is based on the standard ISO/IEC 10646:2003 BMP character>> --@ft<@i< set. The first 256 positions have the same contents as type Character. See 3.5.2.>> @b Wide_Character @b (@i, @i ... @i, @i); --@ft<@i< The declaration of type Wide_Wide_Character is based on the full>> --@ft<@i< ISO/IEC 10646:2003 character set. The first 65536 positions have the>> --@ft<@i< same contents as type Wide_Character. See 3.5.2.>> @b Wide_Wide_Character @b (@i, @i ... @i, @i); @b Wide_Wide_Character'Size @b 32; @b ASCII @b ... @b ASCII; --@ft<@i>> !corrigendum A.3.1(00) @drepl The Package Characters @dby The Packages Characters, Wide_Characters, and Wide_Wide_Characters !corrigendum A.3.1(02) @dinsa @xcode<@b Ada.Characters @b @b Pure(Characters); @b Ada.Characters;> @dinss The library package Wide_Characters has the following declaration: @xcode<@b Ada.Wide_Characters @b @b Pure(Wide_Characters); @b Ada.Wide_Characters;> The library package Wide_Wide_Characters has the following declaration: @xcode<@b Ada.Wide_Wide_Characters @b @b Pure(Wide_Wide_Characters); @b Ada.Wide_Wide_Characters;> @i<@s8> If an implementation chooses to provide implementation-defined operations on Wide_Character or Wide_String (such as case mapping, classification, collating and sorting, etc.) it should do so by providing child units of Wide_Characters. Similarly if it chooses to provide implementation-defined operations on Wide_Wide_Character or Wide_Wide_String it should do so by providing child units of Wide_Wide_Characters. 
!corrigendum A.3.2(02) @drepl @xcode<@b Ada.Characters.Handling @b @b Preelaborate(Handling);> @dby @xcode<@b Ada.Characters.Conversions; @b Ada.Characters.Handling @b @b Pure(Handling);> !corrigendum A.3.2(13) @drepl @xcode< --@ft<@i>> @dby @xcode< --@ft<@i> --@ft<@i>> !corrigendum A.3.2(14) @ddel @xcode< @b Is_Character (Item : @b Wide_Character) @b Boolean; @b Is_String (Item : @b Wide_String) @b Boolean;> !corrigendum A.3.2(15) @ddel @xcode< @b To_Character (Item : @b Wide_Character; Substitute : @b Character := ' ') @b Character;> !corrigendum A.3.2(16) @ddel @xcode< @b To_String (Item : @b Wide_String; Substitute : @b Character := ' ') @b String;> !corrigendum A.3.2(17) @ddel @xcode< @b To_Wide_Character (Item : @b Character) @b Wide_Character;> !corrigendum A.3.2(18) @ddel @xcode< @b To_Wide_String (Item : @b String) @b Wide_String;> !corrigendum A.3.2(42) @ddel The following set of functions test Wide_Character values for membership in Character, or convert between corresponding characters of Wide_Character and Character. !comment A.3.2(43-47) are deleted by the original AI-285, so no change is needed here. 
!corrigendum A.3.2(48) @ddel @xhang<@xterm !corrigendum A.3.4(01) @dinsc The library package Characters.Conversions has the following declaration: @xcode<@b Ada.Characters.Conversions @b @b Pure(Conversions); @b Is_Character (Item : @b Wide_Character) @b Boolean; @b Is_String (Item : @b Wide_String) @b Boolean; @b Is_Character (Item : @b Wide_Wide_Character) @b Boolean; @b Is_String (Item : @b Wide_Wide_String) @b Boolean; @b Is_Wide_Character (Item : @b Wide_Wide_Character) @b Boolean; @b Is_Wide_String (Item : @b Wide_Wide_String) @b Boolean; @b To_Wide_Character (Item : @b Character) @b Wide_Character; @b To_Wide_String (Item : @b String) @b Wide_String; @b To_Wide_Wide_Character (Item : @b Character) @b Wide_Wide_Character; @b To_Wide_Wide_String (Item : @b String) @b Wide_Wide_String; @b To_Wide_Wide_Character (Item : @b Wide_Character) @b Wide_Wide_Character; @b To_Wide_Wide_String (Item : @b Wide_String) @b Wide_Wide_String; @b To_Character (Item : @b Wide_Character; Substitute : @b Character := ' ') @b Character; @b To_String (Item : @b Wide_String; Substitute : @b Character := ' ') @b String; @b To_Character (Item : @b Wide_Wide_Character; Substitute : @b Character := ' ') @b Character; @b To_String (Item : @b Wide_Wide_String; Substitute : @b Character := ' ') @b String; @b To_Wide_Character (Item : @b Wide_Wide_Character; Substitute : @b Wide_Character := ' ') @b Wide_Character; @b To_Wide_String (Item : @b Wide_Wide_String; Substitute : @b Wide_Character := ' ') @b Wide_String; @b Ada.Characters.Conversions;> The functions in package Characters.Conversions test Wide_Wide_Character or Wide_Character values for membership in Wide_Character or Character, or convert between corresponding characters of Wide_Wide_Character, Wide_Character, and Character. 
@xcode<@b Is_Character (Item : @b Wide_Character) @b Boolean;> @xindent @xcode<@b Is_Character (Item : @b Wide_Wide_Character) @b Boolean;> @xindent @xcode<@b Is_Wide_Character (Item : @b Wide_Wide_Character) @b Boolean;> @xindent @xcode<@b Is_String (Item : @b Wide_String) @b Boolean; @b Is_String (Item : @b Wide_Wide_String) @b Boolean;> @xindent @xcode<@b Is_Wide_String (Item : @b Wide_Wide_String) @b Boolean;> @xindent @xcode<@b To_Character (Item : @b Wide_Character; Substitute : @b Character := ' ') @b Character; @b To_Character (Item : @b Wide_Wide_Character; Substitute : @b Character := ' ') @b Character;> @xindent @xcode<@b To_Wide_Character (Item : @b Character) @b Wide_Character;> @xindent @xcode<@b To_Wide_Character (Item : @b Wide_Wide_Character; Substitute : @b Wide_Character := ' ') @b Wide_Character;> @xindent @xcode<@b To_Wide_Wide_Character (Item : @b Character) @b Wide_Wide_Character;> @xindent @xcode<@b To_Wide_Wide_Character (Item : @b Wide_Character) @b Wide_Wide_Character;> @xindent @xcode<@b To_String (Item : @b Wide_String; Substitute : @b Character := ' ') @b String; @b To_String (Item : @b Wide_Wide_String; Substitute : @b Character := ' ') @b String;> @xindent @xcode<@b To_Wide_String (Item : @b String) @b Wide_String;> @xindent @xcode<@b To_Wide_String (Item : @b Wide_Wide_String; Substitute : @b Wide_Character := ' ') @b Wide_String;> @xindent @xcode<@b To_Wide_Wide_String (Item : @b String) @b Wide_Wide_String; @b To_Wide_Wide_String (Item : @b Wide_String) @b Wide_Wide_String;> @xindent !corrigendum A.4.7(46) @drepl @xcode< Character_Set : @b Wide_Maps.Wide_Character_Set; --@ft<@i>> @dby @xcode< Character_Set : @b Wide_Maps.Wide_Character_Set; --@ft<@i> --@ft<@i>> Each Wide_Character_Set constant in the package Strings.Wide_Maps.Wide_Constants contains no values outside the Character portion of Wide_Character. 
Similarly, each Wide_Character_Mapping constant in this package is the identity mapping when applied to any element outside the Character portion of Wide_Character. !corrigendum A.4.7(48) @ddel @xindent<13 Each Wide_Character_Set constant in the package Strings.Wide_Maps.Wide_Constants contains no values outside the Character portion of Wide_Character. Similarly, each Wide_Character_Mapping constant in this package is the identity mapping when applied to any element outside the Character portion of Wide_Character.> !corrigendum A.4.8(01) @dinsc Facilities for handling strings of Wide_Wide_Character elements are found in the packages Strings.Wide_Wide_Maps, Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, Strings.Wide_Wide_Unbounded, and Strings.Wide_Wide_Maps.Wide_Wide_Constants, and in the functions Strings.Wide_Wide_Hash, Strings.Wide_Wide_Fixed.Wide_Wide_Hash, Strings.Wide_Wide_Bounded.Wide_Wide_Hash, and Strings.Wide_Wide_Unbounded.Wide_Wide_Hash. They provide the same string-handling operations as the corresponding packages for strings of Character elements. @i<@s8> The library package Strings.Wide_Wide_Maps has the following declaration. 
@xcode<@b<package> Ada.Strings.Wide_Wide_Maps @b<is>
   @b<pragma> Preelaborate(Wide_Wide_Maps);

   --@ft<@i< Representation for a set of Wide_Wide_Character values:>>
   @b<type> Wide_Wide_Character_Set @b<is private>;
   @b<pragma> Preelaborable_Initialization(Wide_Wide_Character_Set);

   Null_Set : @b<constant> Wide_Wide_Character_Set;

   @b<type> Wide_Wide_Character_Range @b<is>
      @b<record>
         Low : Wide_Wide_Character;
         High : Wide_Wide_Character;
      @b<end record>;
   --@ft<@i< Represents Wide_Wide_Character range Low..High>>

   @b<type> Wide_Wide_Character_Ranges @b<is array> (Positive @b<range> <@>)
      @b<of> Wide_Wide_Character_Range;

   @b<function> To_Set (Ranges : @b<in> Wide_Wide_Character_Ranges)
      @b<return> Wide_Wide_Character_Set;

   @b<function> To_Set (Span : @b<in> Wide_Wide_Character_Range)
      @b<return> Wide_Wide_Character_Set;

   @b<function> To_Ranges (Set : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Ranges;

   @b<function> "=" (Left, Right : @b<in> Wide_Wide_Character_Set) @b<return> Boolean;

   @b<function> "@b<not>" (Right : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Set;
   @b<function> "@b<and>" (Left, Right : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Set;
   @b<function> "@b<or>" (Left, Right : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Set;
   @b<function> "@b<xor>" (Left, Right : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Set;
   @b<function> "-" (Left, Right : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Set;

   @b<function> Is_In (Element : @b<in> Wide_Wide_Character;
                   Set : @b<in> Wide_Wide_Character_Set)
      @b<return> Boolean;

   @b<function> Is_Subset (Elements : @b<in> Wide_Wide_Character_Set;
                       Set : @b<in> Wide_Wide_Character_Set)
      @b<return> Boolean;

   @b<function> "<=" (Left : @b<in> Wide_Wide_Character_Set;
                  Right : @b<in> Wide_Wide_Character_Set)
      @b<return> Boolean @b<renames> Is_Subset;

   --@ft<@i< Alternative representation for a set of Wide_Wide_Character values:>>
   @b<subtype> Wide_Wide_Character_Sequence @b<is> Wide_Wide_String;

   @b<function> To_Set (Sequence : @b<in> Wide_Wide_Character_Sequence)
      @b<return> Wide_Wide_Character_Set;

   @b<function> To_Set (Singleton : @b<in> Wide_Wide_Character)
      @b<return> Wide_Wide_Character_Set;

   @b<function> To_Sequence (Set : @b<in> Wide_Wide_Character_Set)
      @b<return> Wide_Wide_Character_Sequence;

   --@ft<@i< Representation for a Wide_Wide_Character to Wide_Wide_Character>>
   --@ft<@i< mapping:>>
   @b<type> Wide_Wide_Character_Mapping @b<is private>;
   @b<pragma> Preelaborable_Initialization(Wide_Wide_Character_Mapping);

   @b<function> Value (Map : @b<in> Wide_Wide_Character_Mapping;
                   Element : @b<in> Wide_Wide_Character)
      @b<return> Wide_Wide_Character;

   Identity : @b<constant> Wide_Wide_Character_Mapping;

   @b<function> To_Mapping (From, To : @b<in> Wide_Wide_Character_Sequence)
      @b<return> Wide_Wide_Character_Mapping;

   @b<function> To_Domain (Map : @b<in> Wide_Wide_Character_Mapping)
      @b<return> Wide_Wide_Character_Sequence;

   @b<function> To_Range (Map : @b<in> Wide_Wide_Character_Mapping)
      @b<return> Wide_Wide_Character_Sequence;

   @b<type> Wide_Wide_Character_Mapping_Function @b<is>
      @b<access function> (From : @b<in> Wide_Wide_Character)
      @b<return> Wide_Wide_Character;

@b<private>
   ... --@ft<@i< not specified by the language>>
@b<end> Ada.Strings.Wide_Wide_Maps;>

The context clause for each of the packages Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, and Strings.Wide_Wide_Unbounded identifies Strings.Wide_Wide_Maps instead of Strings.Maps.

For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and Strings.Maps.Constants, and for functions Strings.Hash, Strings.Fixed.Hash, Strings.Bounded.Hash, and Strings.Unbounded.Hash, the corresponding wide wide string package or function has the same contents except that

@xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet

The following additional declarations are present in Strings.Wide_Wide_Maps.Wide_Wide_Constants:

@xcode<   Character_Set : @b<constant> Wide_Wide_Maps.Wide_Wide_Character_Set;
   --@ft<@i< Contains each Wide_Wide_Character value WWC such that>>
   --@ft<@i< Characters.Conversions.Is_Character(WWC) is True>>
   Wide_Character_Set : @b<constant> Wide_Wide_Maps.Wide_Wide_Character_Set;
   --@ft<@i< Contains each Wide_Wide_Character value WWC such that>>
   --@ft<@i< Characters.Conversions.Is_Wide_Character(WWC) is True>>>

Each Wide_Wide_Character_Set constant in the package Strings.Wide_Wide_Maps.Wide_Wide_Constants contains no values outside the Character portion of Wide_Wide_Character. 
Similarly, each Wide_Wide_Character_Mapping constant in this package is the identity mapping when applied to any element outside the Character portion of Wide_Wide_Character. @fa<Pragma> Pure is replaced by @fa<pragma> Preelaborate in Strings.Wide_Wide_Maps.Wide_Wide_Constants. @xindent<@s9> !corrigendum J.14(00) @dinsc The following declarations exist in the declaration of package Ada.Characters.Handling:

@xcode<   @b<function> Is_Character (Item : @b<in> Wide_Character) @b<return> Boolean
      @b<renames> Conversions.Is_Character;
   @b<function> Is_String (Item : @b<in> Wide_String) @b<return> Boolean
      @b<renames> Conversions.Is_String;

   @b<function> To_Character (Item : @b<in> Wide_Character;
                          Substitute : @b<in> Character := ' ')
      @b<return> Character
      @b<renames> Conversions.To_Character;

   @b<function> To_String (Item : @b<in> Wide_String;
                       Substitute : @b<in> Character := ' ')
      @b<return> String
      @b<renames> Conversions.To_String;

   @b<function> To_Wide_Character (Item : @b<in> Character) @b<return> Wide_Character
      @b<renames> Conversions.To_Wide_Character;

   @b<function> To_Wide_String (Item : @b<in> String) @b<return> Wide_String
      @b<renames> Conversions.To_Wide_String;>

!ACATS test ACATS C-Test(s) should be created to test these rules. !appendix From: Robert Dewar Sent: Sunday, January 23, 2005 12:24 PM The grammar as it is now allows identifiers to contain the sequence underline other-format-character underline Now the normal way of handling other-format-character internally would be to simply ignore it, but then we end up internally with an identifier with two underscores in it. That's a real pain, since we assume that two underscores is reserved. I really think this is undesirable for other reasons, since other-format-character often corresponds to something not visible, such as formatting information, and you end up with an identifier that has two visible underscores in a row. I would recommend we modify the grammar in AI195 to eliminate this unpleasant possibility. Note that the current rules also allow an identifier to effectively end with an underscore (by ending with the sequence underscore other-format-character) but not to begin with an underscore. 
I know the standard is written in terms of how to compare identifiers, but in fact I think many compilers will work as GNAT does, by canonicalizing identifiers as they are scanned. P.S. for those who don't want to go rummaging in the AI, other-format characters include stuff like: invisible separator soft hyphen zero width non-joiner zero width no-break space tag space language tag **************************************************************** From: Pascal Leroy Sent: Monday, January 24, 2005 4:56 AM (I suppose you mean AI 285, not AI 195, btw.) I fully agree, I didn't realize that the syntax as written did allow for two (visibly) consecutive underscores, or for trailing underscores. It was never my intent to allow that. The other_format characters need to be integrated in the BNF for identifier so that they don't interrupt an identifier, but being typically invisible they should not be usable to circumvent the presentation rules that we know and love. It might be possible to fix the BNF to account for this rule, but I think it would be clearer to add a syntax rule in English like: "After eliminating the characters in category other_format, an identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category." **************************************************************** From: Robert Dewar Sent: Monday, January 25, 2005 7:29 AM > (I suppose you mean AI 285, not AI 195, btw.) Yes indeed, sorry about that misprint. > It might be possible to fix the BNF to account for this rule, but I think > it would be clearer to add a syntax rule in English like: > > "After eliminating the characters in category other_format, an identifier > shall not contain two consecutive characters in category > punctuation_connector, or end with a character in that category." I agree, it is a bit tricky (not impossible, but messy) to do this in BNF. 
Note that once this sentence is added, you can simplify the grammar to:

identifier ::= identifier_start {identifier_start | identifier_extend}

identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter

identifier_extend ::= mark_non_spacing | mark_spacing_combining | number_decimal_digit | punctuation_connector | other_format

which is exactly the grammar that annex 7 of UAX #15 recommends. So that's nice. We adopt exactly the Unicode recommendation, with an extra sentence giving the restriction that we decide to add. **************************************************************** From: Pascal Leroy Sent: Monday, January 24, 2005 7:45 AM Excellent point! So this is even better, we don't look like we add our own inventions on top of Unicode. **************************************************************** From: Dan Eilers Sent: Monday, January 24, 2005 1:14 PM AI 285 says: > The characters in the category other_format are effectively ignored in most > lexical elements, with the exception that they are illegal in string_literals > and character_literals. Is the intent that other-format characters will be allowed in other lexical elements, such as reserved words, numeric literals, and compound delimiters? It seems a little strange to be gumming up the works of lexical analyzers by allowing certain formatting characters inside certain lexemes. **************************************************************** From: Robert Dewar Sent: Monday, January 24, 2005 1:22 PM > Is the intent that other-format characters will be allowed in other lexical > elements, such as reserved words, numeric literals, and compound delimiters? I don't know the intent, but the rules are clear, other-format characters are allowed ONLY in identifiers. > > It seems a little strange to be gumming up the works of lexical analyzers > by allowing certain formatting characters inside certain lexemes. 
Well it's not that hard to implement, but it does seem odd. **************************************************************** From: Dan Eilers Sent: Monday, January 24, 2005 2:00 PM Then the erroneous wording in AI 285 needs to be changed from: The characters in the category other_format are effectively ignored in most lexical elements, with the exception that they are illegal in string_literals and character_literals. to: The characters in the category other_format are illegal in all lexical elements except identifiers (and maybe comments). > > It seems a little strange to be gumming up the works of lexical analyzers > > by allowing certain formatting characters inside certain lexemes. > > Well it's not that hard to implement, but it does seem odd. Our lexical analyzer processes reserved words and identifiers together, so it will have more of an impact. Are there any users chomping at the bit to put formatting characters in their identifiers? If not, it seems unwise to slow down lexical processing for everybody else on the off chance that someone might eventually find some use for this. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 2:23 PM > Then the erroneous wording in AI 285 needs to be changed from: > > The characters in the category other_format are effectively ignored in most > lexical elements, with the exception that they are illegal in string_literals > and character_literals. Well I see this wording, but I don't see anything else in the AI to back up this position. I really don't want to have to allow these junk characters in the middle of := > The characters in the category other_format are illegal in all lexical > elements except identifiers (and maybe comments) Let's add reserved words, and for sure absolutely anything should be allowed in a comment except an end of line, which terminates the comment. 
> Our lexical analyzer processes reserved words and identifiers together, > so it will have more of an impact. Ah ha, you are right, my current implementation is ignoring these other format characters in reserved words, and it would be a huge pain to fix this. It also is bizarre to allow these in identifiers and not in reserved words. I also think it would be horrible (really an extension of my double underline point) to allow identifiers that are visually identical to reserved words, differing only in invisible format characters. I can even see programmers misusing this when they really really want to use a reserved word as an identifier, UGH! > Are there any users chomping at the bit to put formatting characters > in their identifiers? If not, it seems unwise to slow down lexical > processing for everybody else on the off chance that someone might > eventually find some use for this. Well there is merit in following the recommendations of the standard. **************************************************************** From: Randy Brukardt Sent: Monday, January 24, 2005 2:22 PM > Are there any users chomping at the bit to put formatting characters > in their identifiers? If not, it seems unwise to slow down lexical > processing for everybody else on the off chance that someone might > eventually find some use for this. My understanding of the intent is that we are trying to match (within reason) the Unicode recommendations for program identifiers. The presumption is that the Unicode people know more about character sets than we ever will, so it is best to follow their lead. I personally don't think that there are many who are "chomping at the bit" to use any Unicode characters in identifiers. So it would be impossible to predict what the users that do want such identifiers will want. 
Simply allowing Unicode characters in identifiers is going to slow down the lexing (as with Dan's implementation, Janus/Ada processes ids and reserved words together), and I doubt that the particulars of the allowed characters will make much difference. **************************************************************** From: Robert Dewar Sent: Monday, January 24, 2005 3:12 PM I don't think allowing unicode characters in identifiers slows things down significantly in practice, at least not with the approach we take (which you can look at if you like :-) I do think making a distinction between identifiers and keywords is a huge menace and we should fix this. This has nothing to do with the standard really, it is perfectly appropriate to apply the unicode recommendations for identifiers to keywords, regarding the notion in unicode of identifier to be more general and subsume keywords. It's really so much easier to simply ignore the format effectors as you store the identifier in the first place. **************************************************************** From: Randy Brukardt Sent: Monday, January 24, 2005 4:08 PM > I don't think allowing unicode characters in identifiers slows things > down significantly in practice, at least not with the approach we take > (which you can look at if you like :-) I think that the table lookups (which can't be pure array indexing like it is now) will slow things down somewhat. But I don't think it will be a major issue. > I do think making a distinction between identifiers and keywords is a > huge menace and we should fix this. This has nothing to do with the > standard really, it is perfectly appropriate to apply the unicode > recommendations for identifiers to keywords, regarding the notion > in unicode of identifier to be more general and subsume keywords. I certainly agree with you here, and didn't mean to give the impression that I didn't. 
> It's really so much easier to simply ignore the format effectors as > you store the identifier in the first place. Yes, if they don't change equality, they certainly would be ignored. In which case, they need to be allowed in reserved words. Especially because we don't want the abuse someone suggested about sticking invisible characters into a keyword to make it an identifier. I'm not quite sure what wording change is needed, however. **************************************************************** From: Robert Dewar Sent: Monday, January 24, 2005 9:35 PM Randy Brukardt wrote: > I think that the table lookups (which can't be pure array indexing like it > is now) will slow things down somewhat. But I don't think it will be a major > issue. But you only look up in the tables if you have a wide character, so you can't say that this slows things down. It is true that having to check for letters etc is slower than the approach GNAT took before which was to allow any wide characters in identifiers, but neither approach slows things down for identifiers not containing wide characters. ... >>It's really so much easier to simply ignore the format effectors as >>you store the identifier in the first place. > > Yes, if they don't change equality, they certainly would be ignored. In > which case, they need to be allowed in reserved words. Especially because we > don't want the abuse someone suggested about sticking invisible characters > into a keyword to make it an identifier. I'm not quite sure what wording > change is needed, however. The main thing is to agree that there will be no ACATS tests that test for this anomaly :-) **************************************************************** From: Randy Brukardt Sent: Monday, January 24, 2005 10:25 PM You have to figure out the character class of every character somehow; that certainly includes the Latin-1 characters. 
You can test for wide characters first, then do different lookups for wide and non-wide, or you can think of the lookup as a single operation, in which case the lookup is complicated by handling wide characters. The code is probably essentially the same either way, and it's clearly slower for handling wide characters outside of literals. **************************************************************** From: Robert Dewar Sent: Monday, January 24, 2005 11:37 PM > You have to figure out the character class of every character somehow; that > certainly includes the Latin-1 characters. You can test for wide characters > first, then do different lookups for wide and non-wide, or you can think of > the lookup as a single operation, in which case the lookup is complicated by > handling wide characters. That's a really bad idea to do it as a single lookup. > The code is probably essentially the same either way, and it's clearly slower > for handling wide characters outside of literals. In GNAT, there really is zero penalty here. The way things are done is to have an identifier table of valid identifier characters. If wide characters are allowed, then depending on the encoding, all upper half characters are not in this table, triggering an exit from identifier scanning, at which point you do the appropriate tests for wide characters. But in practice in the real world, 99.9% of all identifiers are in the lower half of ASCII anyway. Programs with characters in the upper half are either UTF-8 encoded or they are not. If they are not, then the only triggering characters are ESC (e.g. for Shift-JIS) or '[' for brackets, but those are not valid identifier characters in any case. If such programs are UTF-8 coded, then you have to decode anyway. I really don't see *any* penalty *at all* here. I invite you to look at the GNAT code, and explain why there is any penalty whatever. 
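[Editor's illustration, not part of the original message. The scanning approach described above can be sketched as follows. This is a minimal sketch only, assuming a plain 8-bit source held in a String; it is not the GNAT code, and the names Scan_Demo, Char_Table, Fast_Identifier_Char, Fast_Scan and Show are hypothetical. The table marks only 7-bit letters, digits and the underline, so any upper-half character drops out of the fast loop, and only then would a real scanner decode and classify a wide character. For simplicity the table is used for every position, so it does not reject a leading digit or underline.]

```ada
with Ada.Text_IO;

procedure Scan_Demo is

   --  One Boolean per 8-bit character.  Only 7-bit letters, digits and
   --  the underline are marked, so every upper-half character ends the
   --  fast loop below.
   type Char_Table is array (Character) of Boolean;

   Fast_Identifier_Char : constant Char_Table :=
     ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' => True,
      others => False);

   --  Returns the index of the last character accepted by the table,
   --  or Source'First - 1 if the first character is not accepted.
   function Fast_Scan (Source : String) return Natural is
      Last : Natural := Source'First - 1;
   begin
      while Last < Source'Last
        and then Fast_Identifier_Char (Source (Last + 1))
      loop
         Last := Last + 1;
      end loop;
      return Last;
   end Fast_Scan;

   --  Prints the prefix accepted by the fast loop, and notes whether
   --  the loop stopped on an upper-half character.
   procedure Show (Source : String) is
      Last : constant Natural := Fast_Scan (Source);
   begin
      if Last < Source'Last
        and then Character'Pos (Source (Last + 1)) >= 16#80#
      then
         --  A real scanner would decode the wide character here and
         --  test its category before resuming the scan.
         Ada.Text_IO.Put_Line
           (Source (Source'First .. Last) & " (slow path needed)");
      else
         Ada.Text_IO.Put_Line (Source (Source'First .. Last));
      end if;
   end Show;

begin
   Show ("Count_1 := 0;");
   Show ("Caf" & Character'Val (16#E9#) & "_Total");
end Scan_Demo;
```

[In this sketch the first call stops at the blank and never looks at anything but the table; the second stops at the character at position 16#E9# and reports that the slower classification would be needed. That is the sense in which identifiers made only of 7-bit characters pay nothing for the wide-character support.]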
**************************************************************** From: Randy Brukardt Sent: Tuesday, January 25, 2005 12:15 AM Interesting. Sounds to me like you traded off maintainable code for performance (certainly a justifiable trade-off in some cases, and this quite possibly is one of them). Given the number of places that would have to do special processing (not just identifiers, but white space, literals, and comments), it seems like a nightmare. In fact, AI-285 *is* a nightmare, any way you slice it. It affects *everything*, and little of it in simple ways. Sigh. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 6:54 AM Randy Brukardt wrote: > Interesting. Sounds to me like you traded off maintainable code for > performance (certainly a justifiable trade-off in some cases, and this quite > possibly is one of them). Well I find the code very nicely maintainable, because it is mostly table driven (see csets in the GNAT sources for the tables for all the many character sets for identifiers supported by GNAT). Lexical analyzers are such a trivial part of a compiler anyway, and yes, it is very much worthwhile worrying about speed here :-) > Given the number of places that would have to do > special processing (not just identifiers, but white space, literals, and > comments), it seems like a nightmare. In fact, AI-285 *is* a nightmare, any > way you slice it. It affects *everything*, and little of it in simple ways. > Sigh. Nightmare seems a bit strong, I have taken about six days to do everything except the pretty mechanical Wide_Wide packages. True, that is longer than most other 2005 features :-) I do agree that in terms of effort-to-value ratio, this one is vanishingly small. Particularly because of so many edge cases. 
Quite a chunk of the time was taken in dealing with the very annoying case of Width/Wide_Width/Wide_Wide_Width applied to dynamic subtypes of String/Wide_String/Wide_Wide_String (nine cases now instead of only four before). **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 12:03 AM OK, here is another puzzle. What is the status of Soft Hyphen? The database entry is 00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;; Meaning that this is Other, Format, and therefore not a graphic character. So is this character *really* excluded from string literals? That seems like quite a surprising incompatibility, and indeed causes failure of an ACATS test: with Ada.Characters.Latin_1; package C250002_["C1"] is type Enum is ( Item, 'A', '["AD"]', AE_["C6"]["E6"]_ae, '["2D"]', '["FF"]' ); task type C2_["C2"] is entry C2_["C3"]; end C2_["C2"]; end C250002_["C1"]; So ???? **************************************************************** From: Pascal Leroy Sent: Tuesday, January 25, 2005 3:34 AM My reply to Dan and Robert's comments. Dan: > Is the intent that other-format characters will be allowed in > other lexical elements, such as reserved words, numeric > literals, and compound delimiters? Hmm, that's interesting. I originally wrote the AI by stating that other_format are first stripped out of the program text, and that the rest of section 2 applied to the "clean" text. However, this caused an incompatibility which was thought to be unpleasant: such a character appearing in a string literal would be OK in Ada 95, but would silently disappear in Ada 2005. So we decided to make other_format illegal in character and string literals, to detect the problem at compile time. But obviously I did a lousy job of fixing the rest of the AI. Robert: > I don't know the intent, but the rules are clear, > other-format characters are allowed ONLY in identifiers. And comments (I think that's covered by the current wording). 
This could be changed (to allow them pretty much everywhere) but I am reluctant to do this at this point, as it would seem to hair up both the RM and the implementations. Plus, this AI is approved by SC22 (yes, I mean SC22, not WG9) so we should only fix serious problems, not do cosmetic changes. Finally, the main use of other_format is to control the presentation of text, so it's only in comments and identifiers that they make any sense. Robert: > Well I see this wording, but I don't see anything else in the > AI to back up this position. I really don't want to have to > allow these junk characters in the middle of := Agreed. This is non-normative text in the AI anyway, so we can safely ignore it ;-) Note that the Unicode recommendations are not entirely clear as to what should be done with other_format in programming languages (except for identifiers, where we strictly follow the recommendations). Robert: > It also is bizarre to allow these > in identifiers and not in reserved words. I also think it > would be horrible (really an extension of my double underline > point) to allow identifiers that are visually identical to > reserved words, differing only in invisible format > characters. I can even see programmers misusing this when > they really really want to use a reserved word as an identifier, UGH! For sure we don't want this. I'll add wording to make sure that this is taken care of. Robert: > So this is annoying, we have introduced a non upwards > compatibility by forcing compilers to go to a lot of effort > to forbid a curious set of wide characters in string > literals, just to cause people trouble who run into this > silly rule in existing programs. Wake up, this AI does create a myriad little incompatibilities, just because many characters that were classified as graphic characters are not anymore. 
This was well understood by the ARG during the discussion of the AI, and we tried to make it so that incompatibilities would be detected at compile-time most of the time. Robert: > That's just drawn from the database, but I am a little bit > unsure of this table. What is the category of codes which > simply have no definition at all in the table. I assume they > are not excluded, since otherwise why are FFFE and FFFF > specially treated. Right, this was done on purpose: graphic_character is defined by exclusion (any character not in category...) so that the characters which are not classified yet (by Unicode) are considered graphic characters, and can therefore be used in string literals. On the other hand, they are neither letters nor digits, so they cannot be used in identifiers. This is essentially what Ada 95 did with respect to the wide characters. Robert: > What is the status of Soft Hyphen? > The database entry is > > 00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;; > > Meaning that this is Other, Format, and therefore not a > graphic character. So is this character *really* excluded > from string literals? That seems like quite a surprising > incompatibility... Yes, it is excluded from string and character literals, and yes, this is an incompatibility, but as I explained above, one which is detected at compile time. Again, we were aware of this, and sorry, this is not the worst incompatibility that comes with Ada 2005. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 7:13 AM Yes, but we introduce incompatibilities if there is a good reason to do so. Here there is no good reason at all that I can see except appeal to some notion of uniformity that has nothing to do with Ada. I don't find this worth implementing, so this is a place where GNAT will quite deliberately not conform. 
If anyone ever wants to validate (not clear that this will ever happen), we can put this under control of some silly pedantic switch. In fact there is an argument for putting the entire graphic-in-string stuff under such a switch. Perhaps it would be nice to have a collected list of all incompatibilities especially since some of them are considered bad by Pascal (not sure what he is referring to, since in general we have had little trouble in that area). The A5 case is the first time I have seen tests fail so far. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 7:18 AM oops, I mean AD case :-) **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 7:31 AM Let me be a little clearer on why I think it is such a bad mistake to exclude AD from string literals. In practice, Ada programs are run in many different environments where the graphics associated with the upper half have nothing to do with international standards or with anything in the Ada standard (e.g. various windows character sets). Ada programs work just fine in such environments, provided that the compiler and rules do not get in the way. Yes, 10646 thinks AD is a soft hyphen, but in my XP environment, it comes out as an upside down exclamation point. It really seems annoying to tell an Ada programmer working on XP that you can freely deal with all the upper half graphics in the range A0-FE, except for AD. I don't mind so much the changes in wide character stuff, since no one uses this anyway (we know because of bug reports that show that no one ran into things which were pretty fundamental for many years). But Ada programs working with 8-bit chars in various character sets are all over the place. Now a counter argument in the XP case is that 80-9F are also graphic characters in windows. 
True, but this is a (somewhat annoying) restriction that Ada programmers are used to and have worked around, but the AD exclusion is new and annoying, and simply makes no sense whatever in many environments. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 7:38 AM Actually, now that I think of it, once you get into the switch business, you might as well allow 80-9F in string and character literals when not in pedantic mode. This would make working under windows much easier. **************************************************************** From: Pascal Leroy Sent: Tuesday, January 25, 2005 7:39 AM > Yes, but we introduce incompatibilities if there is a good > reason to do so. Here there is no good reason at all that I > can see except... The reason is that we have a *mandate* from SC22 to support Unicode, er, I mean, ISO/IEC 10646:2003. There is no way that we could get the Amendment past SC22 without this. > Perhaps it would be nice to have a collected list of all > incompatibilities especially since some of them are > considered bad by Pascal (not sure what he is referring to, > since in general we have had little trouble in that area). > The A5 case is the first time I have seen tests fail so far. The AARM that is being prepared for Ada 2005 has a fairly extensive list of incompatibilities, much like the AARM for Ada 95. It turns out that this one is not mentioned, and I agree that it should, but I still think it's a rather unimportant incompatibility. At any rate, an implementation is free to have a nonstandard mode where it deviates from the syntax rules spelled out in the RM. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 8:17 AM It's fine to support unicode. How can a mandate to support unicode be interpreted as a mandate to NOT support something. 
We support all valid unicode stuff, where does it say in the unicode standard that we are required to reject AD in string literals, I don't see it. **************************************************************** From: Randy Brukardt Sent: Tuesday, January 25, 2005 12:32 PM > Perhaps it would be nice to have a collected list of all > incompatibilities especially since some of them are considered > bad by Pascal (not sure what he is referring to, since in general > we have had little trouble in that area). The A5 case is the first > time I have seen tests fail so far. I've tried to identify all incompatibilities and inconsistencies in the AARM, in the same way that it was done for Ada 95. It would be possible to extract those (mechanically or otherwise) to provide a short document. There is a similar list of "extensions" to Ada 95. **************************************************************** From: Randy Brukardt Sent: Tuesday, January 25, 2005 12:45 PM > It really seems annoying to tell an Ada programmer working > on XP that you can freely deal with all the upper half > graphics in the range A0-FE, except for AD. Well, I sympathize, but can't get too excited about this. But I'm more concerned about the basic idea: why can't soft hyphen be used in string literals? It's commonly used (the AARM is full of them) and it generally has a display representation (else you couldn't edit it). 
A program that generated AARM text could have many soft hyphens in strings and character literals; it seems like another case of Nanny Ada: "Wide_[AD]Wide_[AD]Text_IO" is a whole lot clearer than "Wide_" & Character'Val(16#AD#) & "Wide_" & Character'Val(16#AD#) & "Text_IO" **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 1:44 PM > "Wide_[AD]Wide_[AD]Text_IO" > > is a whole lot clearer than > > "Wide_" & Character'Val(16#AD#) & "Wide_" & Character'Val(16#AD#) & > "Text_IO" That comparison is not quite fair, it should be > "Wide_["AD"]Wide_["AD"]Text_IO" > > compared to: > > "Wide_" & SH & "Wide_" & SH & "Text_IO" And indeed I rather prefer the second one here I must say. But I think it should be the programmer's choice. The argument in favor of not allowing soft hyphens is presumably that if you type in Unicode (whatever that means), and display in unicode, then the soft hyphens will be invisible in a program listing, which seems a worry. Of course if you use brackets notation, all is well (another reason for not being so down on brackets notation :-) **************************************************************** From: Randy Brukardt Sent: Tuesday, January 25, 2005 2:51 PM Well, that assumes that you use a use clause for package Latin_1; I would not do that personally because it isn't a package I use frequently enough. And the first one is more complex than it would be in practice: you'd just insert the proper character in your editor. I wrote it with the brackets notation only so I could send it in e-mail. > But I think it should be the programmer's choice. > > The argument in favor of not allowing soft hyphens is > presumably that if you type in Unicode (whatever that > means), and display in unicode, then the soft hyphens > will be invisible in a program listing, which seems > a worry. 
Of course if you use brackets notation, all > is well (another reason for not being so down on > brackets notation :-) I know, but that seems to me to be saying that we want the language to work with the crappiest possible tools. Any Unicode programming editor that didn't provide a way to show "hidden" characters would be pretty worthless. (Word does that - which is hardly a programming editor - and I generally leave the hidden characters displayed there.) These rules made sense in 1980, when everything was in 7-bit ASCII (if you were lucky); it's 25 years later now, and everything is done graphically with rich fonts. Prohibiting tabs and soft hyphens simply because some ancient editors can't display them is silly. **************************************************************** From: Robert Dewar Sent: Tuesday, January 25, 2005 3:56 PM I couldn't agree more! For my taste, I would allow AD in string literals and characters, and also allow all wide characters in literals. It seems to me that 10646 is about supporting use of wide characters, not making it hard by introducing unnecessary restrictions. If there are good and sufficient reasons to avoid some character in some particular environments, then please let's allow the programmer to make this decision and not try to second guess requirements. **************************************************************** From: Robert A. Duff Sent: Tuesday, January 25, 2005 3:40 PM > Prohibiting tabs and soft hyphens simply because some > ancient editors can't display them is silly. Well, I think tabs are an abomination that should never have been invented. I don't even think they should be allowed in *whitespace* in Ada programs, much less string literals! But... But I tend to agree with Randy's sentiment, here. If some character has a reasonable use, as suggested by Randy at least for soft hyphens, it seems like a shame to forbid it in the language definition. 
If you don't like tabs or soft hyphens or whatever, make it a project-wide coding convention, and enforce it using a script as part of your CM system or something like that.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005 5:30 PM

I agree with the Robert Duff who wrote the third, permissive, paragraph, and I disagree with the Robert Duff who wrote the second, non-permissive para :-)

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 25, 2005 5:47 PM

> Well, I think tabs are an abomination that should never have been
> invented. I don't even think they should be allowed in *whitespace*
> in Ada programs, much less string literals! But...

The people who designed HTML agreed with you; they left out tabs. Now, try to get free-form text to line up properly (Ada syntax productions come to mind). Luckily for us, we make printed copies of the AARM from PDF derived from RTF, which has no such restrictions. Programming certainly needs tabs (especially when it is using a readable, non-fixed width font). Now the implementation of tabs often sucks...

Anyway, back to your regularly scheduled language feature debate, already in progress. :-)

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, January 26, 2005 2:32 AM

> If there are good and sufficient reasons to avoid some character in
> some particular environments, then please let's allow the programmer
> to make this decision and not try to second guess requirements.
>

Which seems to beg for a configuration pragma Restriction (Basic_Character_Set_Only) ...

****************************************************************

From: Pascal Leroy
Sent: Wednesday, January 26, 2005 4:30 AM

In reply to Bob, Randy and Robert:

First, a political comment. Irrespective of the technical issues, the topic of character set is a very delicate one politically.
Jim and I were very concerned that it could cause a catfight at the SC22 level that would derail the Amendment process, with potentially devastating consequences. So at the Palma WG9 meeting we decided to send to SC22 a summary of AI 285 to get a stamp of approval well in advance of the vote on the entire Amendment, so as to avoid ending up in a quagmire. Thanks to the support of Kiyoshi and Steve M., our proposal was approved by SC22, so we are on pretty firm ground now. However, by following this process, we have pretty much committed to not making substantial changes to AI 285. Otherwise there will be someone in SC22 who will think that we are cheating, and the gates of hell will open.

I wished Bob, Randy, Robert and others had read the AI at that time, because we should really have had this discussion before sending the AI to SC22. Note that I am *not* trying to use this argument to quench the discussion, but I think we should only be doing minimal changes to the AI at this point. It's OK to say "we discovered an unintended consequence of the write-up, we are fixing it"; it's another kettle of fish to say "well, we really changed our mind on this entire business".

Now for the technical discussion. I do not feel very strongly about other_format in literals, but my intuition is to be conservative because we have so little experience with programming in Unicode. On the other hand, I am noticing that Java and C# allow anything (including control characters) in string literals. On the third hand I'm not sure these languages are models that we want to follow. Specific comments below.

Robert:
> That comparison is not quite fair, it should be
>
> "Wide_["AD"]Wide_["AD"]Text_IO"
>
> compared to:
>
> "Wide_" & SH & "Wide_" & SH & "Text_IO"
>
> And indeed I rather prefer the second one here I must say.
> > The argument in favor of not allowing soft hyphens is > presumably that if you type in Unicode (whatever that means), > and display in unicode, then the soft hyphens will be > invisible in a program listing, which seems a worry. Well surely the second idiom is preferable because the bracket notation is not defined by the language, and is therefore not a portable syntax ;-) The first piece of program text doesn't even parse with a compiler (like ours) that doesn't support the bracket notation, regardless of the soft hyphen issue. And yes, the rationale for not allowing other_format characters in literals is that they may or may not print (or be displayed), and they may alter the presentation of the literals in surprising ways (details below). Randy: > I know, but that seems to me to be saying that we want the > language to work with the crappiest possible tools. Any > Unicode programming editor that didn't provide a way to show > "hidden" characters would be pretty worthless. (Word does > that - which is hardly a programming editor - and I generally > leave the hidden characters displayed there.) > > These rules made sense in 1980, when everything was in 7-bit > ASCII (if you were lucky); it's 25 years later now, and > everything is done graphically with rich fonts. Prohibiting > tabs and soft hyphens simply because some ancient editors > can't display them is silly. You cannot have it both ways, Randy. A few days ago you agreed with Robert that we should disallow other_format characters that would be used to write an identifier that looks like a reserved word (e.g., pro-tected where the hyphen is a soft hyphen). Now you say that anything goes because surely people will be able to display all these funny characters. But then you should not be bothered by pro-tected. None of us has much experience with Unicode editors (whatever that means), so I think we should err on the side of caution. It is not a simple matter of displaying the characters, by the way. 
These characters typically have some semantics for displaying text. For instance a soft-hyphen indicates a place where the editor can fold the line. I for one would be annoyed if my editor folded the line in the middle of a string literal.

Another case that gave me headaches is this: among the other_format characters are some that change the display direction. Even if you have an editor that displays these characters, it's unclear how you would interpret what you see on the glass. Compare for instance:

    "a" & Right2Left & "bc" & Left2Right & "d" -- unambiguous, good old Ada
    "a[Right2Left]bc[Left2Right]d" -- Is it bc or cb in the string?

Depends on whether the formatting characters are interpreted by the editor.

I have read enough of the Unicode standard to realize that this is an extremely complicated area, and again, I'd rather be conservative. (Just out of curiosity, has anyone other than Robert looked at Unicode?)

Bob:
> But I tend to agree with Randy's sentiment, here. If some
> character has a reasonable use, as suggested by Randy at
> least for soft hyphens, it seems like a shame to forbid it in
> the language definition. If you don't like tabs or soft
> hyphens or whatever, make it a project-wide coding
> convention, and enforce it using a script as part of your CM
> system or something like that.

This is the Bob I totally disagree with. Let's make the language very lax, and people will implement coding conventions on top of it if they like.

This is really a flexibility vs. safety tradeoff. However, I for one have a hard time evaluating the safety impact, i.e., the confusion that may stem from having literals that are not wysiwyg. Perhaps I am overstating the problem. But I don't think you can just ignore it.

****************************************************************

From: Robert A. Duff
Sent: Wednesday, January 26, 2005 1:10 PM

In reply to Pascal:

> In reply to Bob, Randy and Robert:
>
> First, a political comment.
Irrespective of the technical issues, the > topic of character set is a very delicate one politically. I'll defer to you on the political issues. If you say we should leave it as is to avoid rocking the boat, that's fine with me. You ought to think about whether people motivated by political concerns can discern the difference between minor and major changes. For all I know, no change at all is politically acceptable. > I wished Bob, Randy, Robert and others had read the AI at that time, > because we should really have had this discussion before sending the AI to > SC22. Well, I did look at it at the time, but my eyes glazed over, and my review was therefore useless. ;-) I'm only picking up on Robert's comments, and Robert apparently didn't notice these issues until he started to implement the thing. >...(Just > out of curiosity, has anyone other than Robert looked at Unicode?) A little bit, but again, my eyes glaze over. > Bob: > > But I tend to agree with Randy's sentiment, here. If some > > character has a reasonable use, as suggested by Randy at > > least for soft hyphens, it seems like a shame to forbid it in > > the language definition. If you don't like tabs or soft > > hyphens or whatever, make it a project-wide coding > > convention, and enforce it using a script as part of your CM > > system or something like that. > > This is the Bob I totally disagree with. Let's make the language very > lax, and people will implement coding conventions on top of it if they > like. I didn't state that as a general principle -- it just seems reasonable in this case. But I'd be happy either way (I mainly stick to 7-bit ASCII for my own code!). **************************************************************** From: Randy Brukardt Sent: Wednesday, January 26, 2005 3:40 PM > I wished Bob, Randy, Robert and others had read the AI at that time, > because we should really have had this discussion before sending the AI to > SC22. 
> Note that I am *not* trying to use this argument to quench the
> discussion, but I think we should only be doing minimal changes to the AI
> at this point. It's OK to say "we discovered an unintended consequence of
> the write-up, we are fixing it"; it's another kettle of fish to say "well,
> we really changed our mind on this entire business".

I doubt very much that SC22 cares what characters are allowed vs. not allowed in string literals. That was never the point of the political discussion.

In any case, the AI did not point out the incompatibility, and I for one didn't think of it (as you point out, it's not documented in the AARM).

We should always look at incompatibilities carefully to see if they are justified. This one, IMHO, does not seem to be.

...

> Randy:
> > I know, but that seems to me to be saying that we want the
> > language to work with the crappiest possible tools. Any
> > Unicode programming editor that didn't provide a way to show
> > "hidden" characters would be pretty worthless. (Word does
> > that - which is hardly a programming editor - and I generally
> > leave the hidden characters displayed there.)
> >
> > These rules made sense in 1980, when everything was in 7-bit
> > ASCII (if you were lucky); it's 25 years later now, and
> > everything is done graphically with rich fonts. Prohibiting
> > tabs and soft hyphens simply because some ancient editors
> > can't display them is silly.
>
> You cannot have it both ways, Randy. A few days ago you agreed with
> Robert that we should disallow other_format characters that would be used
> to write an identifier that looks like a reserved word (e.g., pro-tected
> where the hyphen is a soft hyphen).

I don't remember ever agreeing with any such thing. My understanding of Robert's position is that we have to allow other-format in reserved words, because otherwise the processing of identifiers (of which reserved words are a subset) is substantially complicated.
That's what I agreed with; you seem to be taking the opposite approach.

> Now you say that anything goes
> because surely people will be able to display all these funny characters.
> But then you should not be bothered by pro-tected.

I'm not, and never was.

> None of us has much experience with Unicode editors (whatever that means),
> so I think we should err on the side of caution. It is not a simple matter
> of displaying the characters, by the way. These characters typically have
> some semantics for displaying text. For instance a soft-hyphen indicates
> a place where the editor can fold the line. I for one would be annoyed if
> my editor folded the line in the middle of a string literal.

A Unicode programming editor clearly will not do such things in hidden-text mode. A general purpose word processor is not appropriate for editing programs now, and I very much doubt that will change.

One of my objections to AI-388 is that it essentially forces implementations to create Unicode programming editors (since an editor that can't display the predefined packages is junk), and that is certainly a non-trivial task -- and one for which off-the-shelf support is quite scanty. Probably most would simply support a subset of Unicode (graphic characters and a few well-used other-formats, and little else).

...

> I have read enough of the Unicode standard to realize that this is an
> extremely complicated area, and again, I'd rather be conservative. (Just
> out of curiosity, has anyone other than Robert looked at Unicode?)

I don't doubt it. I'd be happy to simply except Soft-Hyphen and Tab from the existing rules, and stop there. I'm less concerned about other ones.

Another option would be to allow them, and give an implementation-permission to not allow problematic other-formats, private-use, etc. in strings. We generally don't talk about source code formats, and it is there that there is a problem, not in the language definition.
Worrying about what editors might or might not do is purely a function of the source formats and the tools, and that is way out of bounds for the language. Having restrictions on strings because some editor somewhere might not work right is pretty silly. Even if we went all the way and specified that canonical Ada source be given in UTF-8, we could hardly force editors and tools to be able to handle all possible source.

So the problem is the assumption that every tool can handle every possible Ada program. Once you realize that is impractical in a Unicode world, there really remains no important reason for restrictions *in the language*. The restrictions (if they need to exist) are *in the tools*. The standard needs to recognize that there must be the possibility of character restrictions in the tools; once it does so, there is no need to restrict character or string literals or comments *in the standard*. (Identifiers are a whole different kettle of fish, of course, and that was the area that was so contentious in SC22.)

****************************************************************

From: Pascal Leroy
Sent: Thursday, January 27, 2005 3:02 AM

> We should always look at incompatibilities carefully to see
> if they are justified. This one, IMHO, does not seem to be.

Fine. At any rate I'll write an AI to discuss this in Paris.

> I don't remember ever agreeing with any such thing. My
> understanding of Robert's position is that we have to allow
> other-format in reserved words, because otherwise the
> processing of identifiers (of which reserved words are a
> subset) is substantially complicated. That's what I agreed
> with; you seem to be taking the opposite approach.

I think we are actually in agreement. My view is that when the tokenizer reads a "word" it first removes all the other_format characters, and then checks to see if it's a reserved word (after conversion to upper case) or an identifier (in which case it checks for double underscores and the like).
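
The tokenizer rule in the preceding sentence can be sketched roughly as follows. This is only an illustration under stated assumptions: it needs an Ada 2012 (or later) compiler, because Ada.Wide_Wide_Characters.Handling and its Is_Other_Format function are not part of the draft being discussed in this thread; the names Classify, Strip_Other_Format and Is_Reserved are hypothetical; the reserved word list is cut down to three entries; and, for simplicity, it folds to lower case rather than upper case before the comparison.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;

procedure Classify_Word is
   use Ada.Wide_Wide_Characters.Handling;

   type Word_Kind is (Reserved_Word, Identifier, Illegal);

   Soft_Hyphen : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#AD#);

   --  Step 1: drop every other_format character from the word.
   function Strip_Other_Format
     (S : Wide_Wide_String) return Wide_Wide_String
   is
      Result : Wide_Wide_String (1 .. S'Length);
      Last   : Natural := 0;
   begin
      for C of S loop
         if not Is_Other_Format (C) then
            Last := Last + 1;
            Result (Last) := C;
         end if;
      end loop;
      return Result (1 .. Last);
   end Strip_Other_Format;

   --  Hypothetical three-entry subset of the reserved word list.
   function Is_Reserved (S : Wide_Wide_String) return Boolean is
      L : constant Wide_Wide_String := To_Lower (S);
   begin
      return L = "protected" or else L = "access" or else L = "if";
   end Is_Reserved;

   --  Step 2: reserved word check first, then the underline rules,
   --  both applied to the stripped word.
   function Classify (Raw : Wide_Wide_String) return Word_Kind is
      W : constant Wide_Wide_String := Strip_Other_Format (Raw);
   begin
      if W'Length = 0 then
         return Illegal;
      elsif Is_Reserved (W) then
         return Reserved_Word;
      elsif W (W'First) = '_' or else W (W'Last) = '_' then
         return Illegal;
      end if;
      for I in W'First .. W'Last - 1 loop
         if W (I) = '_' and then W (I + 1) = '_' then
            return Illegal;
         end if;
      end loop;
      return Identifier;
   end Classify;
begin
   Ada.Text_IO.Put_Line
     (Word_Kind'Image (Classify ("pro" & Soft_Hyphen & "tected")));
   Ada.Text_IO.Put_Line
     (Word_Kind'Image (Classify ("a_" & Soft_Hyphen & "_b")));
   Ada.Text_IO.Put_Line
     (Word_Kind'Image (Classify ("Wide_Text")));
end Classify_Word;
```

Under those assumptions the three calls should report RESERVED_WORD, ILLEGAL and IDENTIFIER respectively: the soft hyphen disappears before either check is made, so it can neither hide a reserved word nor separate two underlines.
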
So pro-tected would be a reserved word, not an identifier.

I feel quite strongly about this, btw, because this seems to align with the Unicode recommendations, and to match what other languages are doing.

> I don't doubt it. I'd be happy to simply except Soft-Hyphen
> and Tab from the existing rules, and stop there. I'm less
> concerned about other ones.

I would be very much opposed to picking characters one by one. We should take entire categories of characters, as defined by Unicode, if only because we don't have a good understanding of the purpose of all these weird 16- and 32-bit characters. We trust that the Unicode folks got the categorization right.

I could live with other_format in literals. I'd rather not include tabs (which would effectively mean allow everything except format_effectors) because we have all been bitten by tabs at one point or another in our life. Remember, other_format is the only category that creates an incompatibility.

****************************************************************

From: Robert Dewar
Sent: Thursday, January 27, 2005 3:58 PM

Robert A Duff wrote:

> Well, I did look at it at the time, but my eyes glazed over, and my
> review was therefore useless. ;-) I'm only picking up on Robert's
> comments, and Robert apparently didn't notice these issues until he
> started to implement the thing.

Exactly, you don't really dig into the details till you look at them.

I think it is essential that we fix things to have the same basic syntax for keywords and identifiers.

I think it would be nice to be more permissive in string and character literals given that

a) this makes the language far more convenient to use

b) other languages faced with the same decision have gone in that direction

c) it avoids a completely unnecessary non-upwards compatibility.

If I had my way, I would also do in the case equivalence (I have fully implemented it, so this is not to ease the implementation burden in GNAT :-).
The reason is that proper case equivalence processing is unavoidably locale dependent. It is simply too peculiar that a Turkish Ada programmer finds that dotted i is folded incorrectly to capital I without a dot. This means that of the identifiers

    Capital I with dot
    Lower case I with dot
    Lower case I without dot
    Capital I without dot

the first is distinct from the last three, which is just weird. I am sure that there are other locale dependent weirdnesses like this.

But we can live with this if necessary. Good Ada style is never to take advantage of the case equivalence in any case.

P.S. Pascal's tables in the AI for letters and numbers are significantly wrong. If people are interested, I am happy to post GNAT's understanding of the unicode categorizations :-)

****************************************************************

From: Robert Dewar
Sent: Thursday, January 27, 2005 4:15 PM

>>You cannot have it both ways, Randy. A few days ago you agreed with
>>Robert that we should disallow other_format characters that would be used
>>to write an identifier that looks like a reserved word (e.g., pro-tected
>>where the hyphen is a soft hyphen).

I never said anything of the kind. It is fine to allow other other format stuff in identifiers following the recommendations. What is not OK is allowing

    underline (some soft junk) underline

I think the prohibition against two underlines (more generally against two punctuation,connector class characters) should apply AFTER soft junk is stripped, not before.

****************************************************************

From: Robert Dewar
Sent: Thursday, January 27, 2005 4:20 PM

> I think we are actually in agreement. My view is that when the tokenizer
> reads a "word" it first removes all the other_format characters, and then
> checks to see if it's a reserved word (after conversion to upper case) or
> an identifier (in which case it checks for double underscores and the
> like).
> So pro-tected would be a reserved word, not an identifier.

Yes, that's right, I agree also

>
> I feel quite strongly about this, btw, because this seems to align with
> the Unicode recommendations, and to match what other languages are doing.

Not sure about the strongly, since I think this is non-critical, but certainly I agree.

>>I don't doubt it. I'd be happy to simply except Soft-Hyphen
>>and Tab from the existing rules, and stop there. I'm less
>>concerned about other ones.

That's because you have not looked through them, and/or you simply don't know what they are. Here is the list:

   UTF_32_Other_Format : constant UTF_32_Ranges := (
     (16#000AD#, 16#000AD#),  -- SOFT HYPHEN .. SOFT HYPHEN
     (16#00600#, 16#00603#),  -- ARABIC NUMBER SIGN .. ARABIC SIGN SAFHA
     (16#006DD#, 16#006DD#),  -- ARABIC END OF AYAH .. ARABIC END OF AYAH
     (16#0070F#, 16#0070F#),  -- SYRIAC ABBREVIATION MARK .. SYRIAC ABBREVIATION MARK
     (16#017B4#, 16#017B5#),  -- KHMER VOWEL INHERENT AQ .. KHMER VOWEL INHERENT AA
     (16#0200C#, 16#0200F#),  -- ZERO WIDTH NON-JOINER .. RIGHT-TO-LEFT MARK
     (16#0202A#, 16#0202E#),  -- LEFT-TO-RIGHT EMBEDDING .. RIGHT-TO-LEFT OVERRIDE
     (16#02060#, 16#02063#),  -- WORD JOINER .. INVISIBLE SEPARATOR
     (16#0206A#, 16#0206F#),  -- INHIBIT SYMMETRIC SWAPPING .. NOMINAL DIGIT SHAPES
     (16#0FEFF#, 16#0FEFF#),  -- ZERO WIDTH NO-BREAK SPACE .. ZERO WIDTH NO-BREAK SPACE
     (16#0FFF9#, 16#0FFFB#),  -- INTERLINEAR ANNOTATION ANCHOR .. INTERLINEAR ANNOTATION TERMINATOR
     (16#1D173#, 16#1D17A#),  -- MUSICAL SYMBOL BEGIN BEAM .. MUSICAL SYMBOL END PHRASE
     (16#E0001#, 16#E0001#),  -- LANGUAGE TAG .. LANGUAGE TAG
     (16#E0020#, 16#E007F#)); -- TAG SPACE .. CANCEL TAG

Why on earth would you suggest treating zero width no-break space or invisible separator in a manner different from Soft Hyphen (not sure what tab has to do with this, it is not an other format character).

> I would be very much opposed to picking characters one by one.
> We should take entire categories of characters, as defined by Unicode, if only
> because we don't have a good understanding of the purpose of all these
> weird 16- and 32-bit characters. We trust that the Unicode folks got the
> categorization right.

Exactly, I agree

> I could live with other_format in literals. I'd rather not include tabs
> (which would effectively mean allow everything except format_effectors)
> because we have all been bitten by tabs at one point or another in our
> life. Remember, other_format is the only category that creates an
> incompatibility.

Right, I would allow everything except format effectors, that's an old Ada tradition, and I think it is fine to extend it to separator,line and separator,para. Think for a moment of an environment where Line Separator is used routinely to end lines, you really do NOT want to allow this in string literals as well.

****************************************************************

From: Pascal Leroy
Sent: Friday, January 28, 2005 3:29 AM

> But we can live with this if necessary. Good Ada style is
> never to take advantage of the case equivalence in any case.

I noticed that issue while reviewing the AARM, and added an IA to say roughly "if you target a culture where some locale-dependent case folding rule is more appropriate, by all means, provide a nonstandard mode that supports this case folding". Being an IA, it doesn't impose anything on implementations, but it draws their attention to the fact that other case folding rules exist, and it also tells users that it's a legitimate request that they may want to bring up with their vendor (provided that they are willing to put a laaarge number of Turkish Lira on the table ;-)

As far as I know only Turkish and Lithuanian have that problem (well, ancient Greek too, but I don't expect many people to program in ancient Greek).

> P.S. Pascal's tables in the AI for letters and numbers are
> significantly wrong.
> If people are interested, I am happy to
> post GNAT's understanding of the unicode categorizations :-)

I know, I know, but please don't post your tables. If I read any piece of code from GNAT my brain gets polluted by public domain software, and IBM won't let me work on Apex anymore (it does sound silly, but it's absolutely true).

****************************************************************

From: Robert Dewar
Sent: Friday, January 28, 2005 4:21 AM

> know, but please don't post your tables. If I read any piece of code from
> GNAT my brain gets polluted by public domain software, and IBM won't let
> me work on Apex anymore (it does sound silly, but it's absolutely true).

That is indeed complete nonsense, given this is under the GMGPL, and you can perfectly well incorporate it into Apex with no legal problems whatever.

I really think it would be better to have an agreed on set of tables that we all use. Uniformity of implementations is more important than the junk rules we have agreed to implement!

****************************************************************

From: Robert Dewar
Sent: Friday, January 28, 2005 4:23 AM

By the way, NO PART of gnat is public domain, please do not spread this seriously incorrect misconception. All our software is copyrighted and we object to people trying to dilute our copyright by claiming that our software is in the public domain. Thanks for being careful on this point in future.

For sure, you have to check licensing conditions. In this case, the GNAT code is under the GMGPL precisely so that other proprietary implementations can share the tables if they wish.

****************************************************************

From: Pascal Leroy
Sent: Friday, January 28, 2005 5:12 AM

I realize that, and "public domain" was just a convenient shorthand, although I should have been more precise/careful. Sorry about that.

> For sure, you have to check licensing conditions.
> In this case, the GNAT code is under the GMGPL precisely so that
> other proprietary implementations can share the tables if they wish.

I understand that, but I was not actually kidding: we had to remove libraries covered by the LGPL from some of our products before the lawyers objected to it.

Anyway, this is getting totally off-topic...

****************************************************************

From: Robert Dewar
Sent: Saturday, January 29, 2005 10:03 AM

> These tables were posted when Rational was still an independent company,
> and Rational had a very lax policy with respect to IP, so it would have
> been fine at the time to incorporate them in a product. They didn't have
> a copyright notice, and that was on purpose.

a) copyright notices have zero legal significance (true since 1986)

b) tables like this are not copyrightable anyway

>
> This is an irrelevant question. The only document you need to implement
> Ada-related products is the AARM for Ada 2005, and certainly the copyright
> situation for the AARM is very clear.

That surely is NOT true here, the necessary information for implementing the unicode stuff is not incorporated in the AARM (should it be? I would think the answer should be yes, we should include the tables in the AARM).

Is the copyright situation for the AARM clear? It is owned by everyone who has contributed text. Have they all signed waivers or assignments?

****************************************************************

From: Robert Dewar
Sent: Saturday, January 29, 2005 1:43 PM

As it turns out, the tables are too long to post to the list anyway. If anyone wants a copy of the tables, send me some email and I will attach them to the reply. Note that the tables themselves are not copyrightable elements, since they are derived from external requirements (so there is no creative element, see Altai vs Computer Associates for a worthwhile analysis of what is and what is not copyrightable when it comes to software).
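
For readers wondering how range tables of this kind are consulted, here is a minimal Ada sketch of a lookup. Everything in it is an assumption made for illustration: the type declarations are only modeled on the shape of the other_format list quoted earlier in this thread, just five of that list's entries are copied, and In_Table is a hypothetical name. It should not be read as GNAT's actual code.

```ada
with Ada.Text_IO;

procedure Range_Lookup is
   --  Assumed declarations, modeled on the list quoted in the thread.
   type UTF_32 is range 0 .. 16#7FFF_FFFF#;

   type UTF_32_Range is record
      Lo, Hi : UTF_32;
   end record;

   type UTF_32_Ranges is array (Positive range <>) of UTF_32_Range;

   --  A few entries of the other_format list; the table must stay
   --  sorted by code point for the binary search below to work.
   Other_Format : constant UTF_32_Ranges :=
     ((16#000AD#, 16#000AD#),   --  SOFT HYPHEN
      (16#00600#, 16#00603#),   --  ARABIC NUMBER SIGN .. ARABIC SIGN SAFHA
      (16#0200C#, 16#0200F#),   --  ZERO WIDTH NON-JOINER .. RIGHT-TO-LEFT MARK
      (16#0202A#, 16#0202E#),   --  LEFT-TO-RIGHT EMBEDDING .. RIGHT-TO-LEFT OVERRIDE
      (16#0FEFF#, 16#0FEFF#));  --  ZERO WIDTH NO-BREAK SPACE

   --  Binary search over sorted, non-overlapping ranges.
   function In_Table (C : UTF_32; T : UTF_32_Ranges) return Boolean is
      Lo  : Integer := T'First;
      Hi  : Integer := T'Last;
      Mid : Integer;
   begin
      while Lo <= Hi loop
         Mid := Lo + (Hi - Lo) / 2;
         if C < T (Mid).Lo then
            Hi := Mid - 1;
         elsif C > T (Mid).Hi then
            Lo := Mid + 1;
         else
            return True;
         end if;
      end loop;
      return False;
   end In_Table;
begin
   --  SOFT HYPHEN and LEFT-TO-RIGHT MARK are in the table; TAB is not.
   Ada.Text_IO.Put_Line (Boolean'Image (In_Table (16#AD#, Other_Format)));
   Ada.Text_IO.Put_Line (Boolean'Image (In_Table (16#200E#, Other_Format)));
   Ada.Text_IO.Put_Line (Boolean'Image (In_Table (16#09#, Other_Format)));
end Range_Lookup;
```

Storing ranges rather than individual code points keeps the table short, and the sorted order is what allows a logarithmic search; the last call illustrates the remark made earlier in the thread that tab is not an other_format character.
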
**************************************************************** From: Robert Dewar Sent: Saturday, January 29, 2005 1:48 PM Pascal Leroy wrote: > I understand that, but I was not actually kidding: we had to remove > libraries covered by the LGPL from some of our products before the lawyers > objected to it. > > Anyway, this is getting totally off-topic... I know, but it is worth while making one more point. Indeed your lawyers understandably disliked the LGPL, since it places pretty severe restrictions on the distribution, namely that things must be distributed in object form in a manner that makes it possible to relink with modified versions of the LGPL'ed units, and if you change any of these units, you have to distribute the sources. The GMGPL (a similar license is used for the gnu c and g++ libraries) is quite different and much more permissive, it places no restriction whatever on the distribution of executable programs containing GMGPL'ed code. The whole point of this license is to remove any impediments to use of the related components in proprietary or classified code. **************************************************************** From: Dan Eilers Sent: Saturday, January 29, 2005 2:31 PM > IBM of course has rather stricter policies, so I suppose that today I > wouldn't post the tables anymore, at least not without a copyright notice > and/or approval by our legal department. This of course doesn't answer the question. Does IBM or any other company have any known IP claims on the contents of any AI's? 
I would not be the least bit surprised if IBM has current or pending patents or other IP claims on Unicode, and if so, the licensing rights should be clearly stated before we go adding Unicode tables and algorithms to our compilers, and especially before we go adding a gratuitous Unicode character to the spec of a predefined Ada package, which, as noted in the Atlanta ARG minutes, is intended to force Unicode algorithms not only on Ada compilers but also on Ada toolsets. I would also not be the least bit surprised if some of the AI's, especially those that standardize features from existing implementations, such as pragma assert, pragma no_return, etc., are subject to current or pending IP claims. > > If so, please let us know which AI's these are... > > This is an irrelevant question. I think it is highly relevant. There have been some very high visibility recent cases of companies implementing a standard getting sued because the standard is claimed to infringe the IP of one of the members of the standardization committee, witness Rambus, for example. There is certainly precedent for patenting Ada compiler implementation techniques, witness the DEC Generics patent. There is certainly precedent for patenting compiler implementation techniques in general, witness the patents on implementing multiple inheritance, and SSA, for example. I am not aware of any effort within the ARG or WG9 to head off such difficulties. Quite to the contrary, ARG members are claiming to be afraid to read public domain comments from other vendors, for fear of brain pollution. > The only document you need to implement > Ada-related products is the AARM for Ada 2005, ... Certainly you are not claiming that all the implementation information from the AI's is in the AARM? Certainly you are not claiming that someone implementing the AARM will be automatically free of IP claims? > and certainly the copyright > situation for the AARM is very clear. What is the copyright situation for the AARM??? 
Is the AARM being produced by the ARG? By Ada Europe? By AXE Consultants? By Mitre? By Springer Verlag? My understanding is that the ARG is limiting itself to producing only an Amendment document, and not a new RM or annotated RM. **************************************************************** From: Randy Brukardt Sent: Saturday, January 29, 2005 10:21 PM > What is the copyright situation for the AARM??? Is the AARM being > produced by the ARG? By Ada Europe? By AXE Consultants? By Mitre? > By Springer Verlag? It's quite clear from the copyright pages of the AARM; they were updated first, before any copies went out. Same with the Amendment document. I can't say if anyone could claim rights to some part of these documents. My guess is, sure someone *could* - even to parts of Ada 83. But all of those contributions were made to public mailing lists, where there is an implied intent to allow others to use the contributions unencumbered. (It probably would be prudent to post such a statement for both ARG and Ada-Comment. But that would be insufficient to avoid any *possibility* of problems. I doubt that there is *anything* that could be done that would avoid any *possibility* of problems, short of the Shakespeare solution.) Ada could use some publicity anyway. :-) **************************************************************** From: Robert Dewar Sent: Saturday, January 29, 2005 10:28 PM > This of course doesn't answer the question. Does IBM or any other > company have any known IP claims on the contents of any AI's? No one knows the status of comments made to public newsgroups. Obviously individual comments are copyrighted by their author unless copyright is specifically disclaimed. On the other hand, everyone assumes that quoting such comments (as I am doing here) is fair use. I think a jury would agree, but no one knows what a jury might or might not do, and you can't live your life worrying about jury opinions that have not been stated and might never be. 
> I would not be the least bit surprised if IBM has current or pending > patents or other IP claims on Unicode, and if so, the licensing rights > should be clearly stated before we go adding Unicode tables and algorithms > to our compilers, and especially before we go adding a gratuitous > Unicode character to the spec of a predefined Ada package, which, > as noted in the Atlanta ARG minutes, is intended to force Unicode > algorithms not only on Ada compilers but also on Ada toolsets. You can perfectly well look up the patents if you want. To say you think there may be such without bothering to find out definitely seems mere FUD to me. > I would also not be the least bit surprised if some of the AI's, > especially those that standardize features from existing implementations, > such as pragma assert, pragma no_return, etc., are subject to current > or pending IP claims. Well pragma assert and pragma no_return come from GNAT, where there are definitely no such claims. I don't see the point in executing some document to that effect in this case, but it could be done. In any case, these are language features, and the issue is being compatible with other implementations of other languages. Dan, reread the Lotus-Borland case, and it should put your mind at rest that no such IP's can be asserted. > I think it is highly relevant. There have been some very high visibility > recent cases of companies implementing a standard getting sued because > the standard is claimed to infringe the IP of one of the members of the > standardization committee, witness Rambus, for example. True, but I don't see Unicode as a concern here, and would recommend (not as an attorney, but also not as a complete neophyte, I have been certified as an expert in copyright matters several times in federal court) that we not waste time on this issue. I do not regard it as a legitimate argument against pi, though I agree that this is a gratuitous feature.
I don't think it's so bad in practice, since I think Ada programmers have better taste than the ARG and will not use it :-) I would recommend that tools not bother with it. Either you make a tool completely wide wide aware (which is likely to be a huge job), or you just say that your tool only handles non wide stuff, and the (mis)use of pi is just a special case of programs that are not handled. > There is certainly precedent for patenting Ada compiler implementation > techniques, witness the DEC Generics patent. There is certainly > precedent for patenting compiler implementation techniques in general, > witness the patents on implementing multiple inheritance, and SSA, for > example. Yes, but that's really not an issue here. I think this is just FUD that can safely be ignored. Certainly making a judgment as the CEO of AdaCore (I have to make judgments like this all the time > I am not aware of any effort within the ARG or WG9 to head off such > difficulties. Quite to the contrary, ARG members are claiming to be > afraid to read public domain comments from other vendors, for fear of > brain pollution. Well the brain pollution problem is Pascal's own, and I don't think it should legitimately stop anyone from posting anything in this mailing list. Indeed companies who are REALLY serious about the brain pollution issue would not dream of letting an employee for whom this is a concern serve on a standards committee (I know of several such cases). I still don't think the ARG need spend any time worrying about this issue. If there is any action required it is at a higher level, and I would encourage Dan to make waves at those higher levels to see if he can attract any concern (I would expect not). > Certainly you are not claiming that someone implementing the AARM > will be automatically free of IP claims? No one is ever free of such claims. Anyone can sue anyone, anytime, about anything. 
If you write any code, you can be sued for patent violations concerning patents you do not know about, and could not know about. I would worry far more about that problem than the issue here with Ada language design and features. **************************************************************** From: Robert Dewar Sent: Saturday, January 29, 2005 11:15 PM The unicode patent policy is stated in http://www.unicode.org/policies/patent_policy.html I have read this document in its entirety, and my judgment is that this completely lays to rest any concerns that Dan Eilers has expressed with regard to patents and unicode, but I encourage others to read this document if they have any concerns. **************************************************************** From: Dan Eilers Sent: Saturday, January 29, 2005 9:34 PM > You can perfectly well look up the patents if you want. To > say you think there may be such without bothering to find > out definitely seems mere FUD to me. I was hoping I wouldn't have to, since rummaging through patent databases is nontrivial. The implications of a patent are not at all clear from the title, and the legalese used in the descriptions often are designed more for obfuscation than for clarity, in an attempt to be as widely applicable as possible. And of course a patent database doesn't help at all with patents that haven't yet been published, which are the ones that seem to have caused the most trouble in standardization efforts. Well, rummaging through the patent database at http://www.freepatentsonline.com turns up 603 patents mentioning the word Unicode, some of which are assigned to IBM. I don't have any idea how serious this is, without taking a lot of time to examine each of these. ...[Editor's note: Part of this message is filed in AI-388, along with appropriate responses to it.] > The unicode patent policy is stated in > > http://www.unicode.org/policies/patent_policy.html This requires non-discriminatory licensing terms. 
It doesn't say what those terms are, or which companies have IP claims related to Unicode. **************************************************************** From: Robert Dewar Sent: Sunday, January 30, 2005 2:31 AM Can I suggest that any further messages about patents or copyright issues with unicode identify themselves with something like the above, so that the thread is identified, and easily ignored for those who do not want to follow it further (which certainly includes me!) **************************************************************** From: Pascal Leroy Sent: Sunday, January 30, 2005 5:59 AM > This of course doesn't answer the question. Does IBM or any > other company have any known IP claims on the contents of any AI's? It surely doesn't answer the question, because I am neither competent nor allowed to make legal statements for IBM. If you are really worried about this, I suggest that your attorneys contact the IBM General Legal and Intellectual Property Law Department, in Armonk, NY, USA. **************************************************************** From: Robert Dewar Sent: Sunday, January 30, 2005 12:01 PM chuckle :-) If he follows your advice, the IBM lawyers can probably keep Dan occupied for months :-) I must say, Dan is an expert at undermining his own arguments by moving into unreasonable arguments. P.S. Sorry I will miss ARG meeting again, but Gary will be there. **************************************************************** From: Robert Dewar Sent: Sunday, January 30, 2005 12:11 PM Oops this was really not supposed to go to the list (the ARG list is one of these nasty lists where a specific reply goes to the list and not to the sender). Never mind, I think Dan knows my point of view at this stage! I actually think that Dan's original argument against AI-388 had quite a bit of merit. This really was a gratuitous change, not requested by any real Ada user as far as I know. Never mind, there are worse problems. 
Actually these days, IBM is not the main target to worry about if you are worried about patent issues in contexts like this (I assume everyone is aware of IBM's recent welcome actions in this area). **************************************************************** From: Dan Eilers Sent: Monday, January 31, 2005 4:21 AM Anytime someone proposes including in an international standard some feature that their company has patented or has patents pending, I think it is reasonable to expect that such a proposal (and if the proposal is accepted, the new standard document itself) be accompanied by a statement clarifying the intended licensing conditions for such patents. **************************************************************** From: Robert Dewar Sent: Sunday, January 30, 2005 8:34 AM [The comment below is posted in AI-388.] > There is obviously no pressing issue in the RM or in other AIs for > a better looking Pi symbol. I suppose it could be argued that SC22's > resolution 02-24 for "appropriate support of Unicode" constitutes > as a pressing need for certain Unicode characters _other_ than Pi. > For example, Unicode recommends that paired quotation marks use > U+201C and U+201D instead of the ASCII double-quote mark that > Ada currently uses. I am not suggesting following this recommendation, > but if there is a pressing need to follow Unicode recommendations as > closely as possible, then Pi is the wrong symbol to start with. Actually I think that supporting U+201C and U+201D for strings would make perfectly good sense in the context of AI285, and I would recommend adding it, since it is a negligible extra burden in the context of AI-285. **************************************************************** From: Dan Eilers Sent: Monday, January 31, 2005 3:21 AM Just to be clear. 
Some of the reasons that I am not suggesting following this recommendation are: 1) Pascal has made it very clear that the deadline for submitting new proposals was more than a year ago, and for a new AI to stand a chance, it must address a pressing issue in the existing RM or in other AIs, and this does not come anywhere close to meeting that standard. 2) Even if the deadline had not passed, the language is insufficiently broke. No users are clamoring for this, and there are a host of features and libraries available in competing languages that users are clamoring for. 3) Consistency of language design is important. If Unicode directional quotes were allowed, there would soon be a call for Unicode "/=", and if that were allowed, there would soon be a call for Unicode "<=" and ">=". And if those were allowed, there would soon be a call for multiply and divide symbols, etc. The ARG would end up spending all its time debating each possible Unicode symbol, with no guiding principle. 4) Code portability is more important than the appearance of mathematical symbols. Users in a Unicode environment would be tempted to use these symbols, making their code non-portable to non-Unicode environments. **************************************************************** From: Pascal Leroy Sent: Monday, January 31, 2005 5:01 AM > Actually I think that supporting U+201C and U+201D for > strings would make perfectly good sense in the context of > AI285, and I would recommend adding it, since it is a > negligible extra burden in the context of AI-285. Don't go there! There was a clear opposition from some countries to allowing extended character sets in contexts other than identifiers (and string literals and comments, but that was already present in Ada 95). I originally had proposed to allow extended characters in numeric literals, and this was rejected. Furthermore, as explained by the Unicode standard, use of the quotation marks is extremely culture-dependent. 
The above characters would be appropriate for English, but not for French, and U+201C which is opening in English is closing in German. In fact I cannot find a recommendation to allow U+201C and U+201D to delimit string literals in the Unicode standard, the only thing I can find is an indication that these are the preferred characters to use in English text. Finally, ECMA C#, which is a good model from this perspective, because it's one of the few ISO languages that support Unicode, only allows U+0022 (the good ol' double quote) to delimit string literals. **************************************************************** From: Robert A. Duff Sent: Monday, January 31, 2005 7:14 AM > Don't go there! I wouldn't dare go there. ;-) But I'm curious: >... There was a clear opposition from some countries to > allowing extended character sets in contexts other than identifiers (and > string literals and comments, but that was already present in Ada 95). What was the basis of this opposition? Were they generally opposed, or was it particular characters or lexical elements, such as num lits mentioned below? >...I > originally had proposed to allow extended characters in numeric literals, > and this was rejected. ... > The above characters would be > appropriate for English, but not for French, ... We use French-style quotes for labels, right? **************************************************************** From: Pascal Leroy Sent: Monday, January 31, 2005 8:07 AM > What was the basis of this opposition? Were they generally > opposed, or was it particular characters or lexical elements, > such as num lits mentioned below? Numeric literals were discussed in great detail, and there was a feeling that the "usual" digits (I first wrote "arabic" but of course although Europe got these digits from the Arabs, they use distinct forms these days) are understood by programmers, and that allowing other digits in literals would lead to confusion. 
For instance, Unicode has special characters for the Roman digits, which look furiously like X, V, I, etc. Consider the assignment: A := VIII; Is the right-hand side an identifier, or a literal? You need to have a good eyesight to distinguish the letter V from the Roman digit V. Furthermore, is the value of the literal 8 or 5111? Hmm. In general, you want to refrain from introducing locale-dependent issues in the lexical elements (as far as I can tell, the only locale-dependent issue we have at this point is the fact that the case folding rule is not ideal for Turkish and Lithuanian). Identifiers are much less problematic, and they do bring more value, as it makes sense for programmers, say, in Russia, to be able to use their native alphabet and not rely on some phony transliteration. > We use French-style quotes for labels, right? We use Jean's approximation of French quotes, built from ASCII characters. Of course, Unicode now has specific characters for the French quotes. **************************************************************** From: Robert Dewar Sent: Monday, January 31, 2005 11:20 AM >...I >originally had proposed to allow extended characters in numeric literals, >and this was rejected. I would hope so! I can't believe anyone would even suggest this :-) **************************************************************** From: Robert Dewar Sent: Monday, January 31, 2005 11:26 AM > Numeric literals were discussed in great detail, and there was a feeling > that the "usual" digits (I first wrote "arabic" but of course although > Europe got these digits from the Arabs, they use distinct forms these > days) Arabic digits are still used in West Arabic writing (e.g. Morocco). It is the East Arabic writing that uses the other set of digits. > are understood by programmers, and that allowing other digits in > literals would lead to confusion. For instance, Unicode has special > characters for the Roman digits, which look furiously like X, V, I, etc.
> Consider the assignment: > > A := VIII; > > Is the right-hand side an identifier, or a literal? You need to have a > good eyesight to distinguish the letter V from the Roman digit V. > Furthermore, is the value of the literal 8 or 5111? Hmm. The fact that this kind of nonsense was even discussed gives a lot more insight into how things went a bit far. Are you seriously saying that someone suggested that we support Roman numerals for literals? Please tell me this is a joke :-) > > In general, you want to refrain from introducing locale-dependent issues > in the lexical elements (as far as I can tell, the only locale-dependent > issue we have at this point is the fact that the case folding rule is not > ideal for Turkish and Lithuanian). not ideal = plain wrong > Identifiers are much less problematic, and they do bring more value, as it > makes sense for programmers, say, in Russia, to be able to use their > native alphabet and not rely on some phony transliteration. Actually most programmers prefer to maintain the portability of sticking to lower half ASCII. GNAT has allowed wide characters in identifiers for ever, but we have never seen any serious use of it, just a bit of academic experimentation. I doubt Ada 2005 will be different. **************************************************************** From: Robert A. Duff Sent: Monday, January 31, 2005 5:40 PM Hmm. Maybe by 2015 Unicode will be ubiquitous. Remember that Ada 83 was designed with the fact in mind that lowercase letters might not be available. Colonel Whittaker used to send us (Intermetrics) bug reports in ALL CAPS. In my own code, I'm pretty happy with 7-bit ascii for identifiers, but I can understand why other folks might want more. But I've always thought things like <= are rather uncivilized compared to the notation I learned in grade school.
There's a program called a2ps that knows how to print various programming languages; for Ada, it boldfaces keywords, and prints a proper less-than-or-equal symbol, and so forth. I was once hired by a law firm on an intellectual property rights case. The code was not Ada. Most of the comments were in Kanji. Most of the identifiers were in English. I wonder what this Roman Numeral might mean: X := 16#VIII#; -- ;-) **************************************************************** From: Robert Dewar Sent: Saturday, January 29, 2005 1:53 PM In Robert's view, these tables which GNAT currently uses are not copyrightable elements, since they are derived in an obvious and mechanical way from the external requirements. All the tables except the case conversion one have been recomputed using a program I wrote from what I understand to be the latest unicode categorization tables. Still no guarantees :-) The case conversion table is from the AI, and may thus be out of date. Frankly I have a hard time getting up the energy to check it since as I wrote in a previous note, it seems to me to be a piece of junk. Still if some customer finds a bug, or the ACATS tests do, then we will fix it at that point :-) I have not included the (fairly trivial) code that goes with these tables that provides convenient interfaces for use by a client, including an efficient binary search through the tables. If anyone is interested let me know, and I will email you privately the entire GMGPL'ed unit from the GNAT sources. (comments welcome)

-----------------------------------------------
-- Tables for UTF_32 Categorization Routines --
-----------------------------------------------

-- Note these tables are derived from those given in AI-285. For details
-- see //www.ada-auth.org/cgi-bin/cvsweb.cgi/AIs/AI-00285.TXT?rev=1.22.

type UTF_32_Range is record
   Lo : Char_Code;
   Hi : Char_Code;
end record;

type UTF_32_Ranges is array (Positive range <>) of UTF_32_Range;

-- The following array includes all characters considered digits, i.e.
-- all characters from the Unicode table with categories:

--   Number, Decimal Digit (Nd)

UTF_32_Digits : constant UTF_32_Ranges := (
  (16#00030#, 16#00039#), -- DIGIT ZERO .. DIGIT NINE
  (16#00660#, 16#00669#), -- ARABIC-INDIC DIGIT ZERO .. ARABIC-INDIC DIGIT NINE
  (16#006F0#, 16#006F9#), -- EXTENDED ARABIC-INDIC DIGIT ZERO .. EXTENDED ARABIC-INDIC DIGIT NINE
  (16#00966#, 16#0096F#), -- DEVANAGARI DIGIT ZERO .. DEVANAGARI DIGIT NINE
  (16#009E6#, 16#009EF#), -- BENGALI DIGIT ZERO .. BENGALI DIGIT NINE
  (16#00A66#, 16#00A6F#), -- GURMUKHI DIGIT ZERO .. GURMUKHI DIGIT NINE
  (16#00AE6#, 16#00AEF#), -- GUJARATI DIGIT ZERO .. GUJARATI DIGIT NINE
  (16#00B66#, 16#00B6F#), -- ORIYA DIGIT ZERO .. ORIYA DIGIT NINE
  (16#00BE7#, 16#00BEF#), -- TAMIL DIGIT ONE .. TAMIL DIGIT NINE
  (16#00C66#, 16#00C6F#), -- TELUGU DIGIT ZERO .. TELUGU DIGIT NINE
  (16#00CE6#, 16#00CEF#), -- KANNADA DIGIT ZERO .. KANNADA DIGIT NINE
  (16#00D66#, 16#00D6F#), -- MALAYALAM DIGIT ZERO .. MALAYALAM DIGIT NINE
  (16#00E50#, 16#00E59#), -- THAI DIGIT ZERO .. THAI DIGIT NINE
  (16#00ED0#, 16#00ED9#), -- LAO DIGIT ZERO .. LAO DIGIT NINE
  (16#00F20#, 16#00F29#), -- TIBETAN DIGIT ZERO .. TIBETAN DIGIT NINE
  (16#01040#, 16#01049#), -- MYANMAR DIGIT ZERO .. MYANMAR DIGIT NINE
  (16#01369#, 16#01371#), -- ETHIOPIC DIGIT ONE .. ETHIOPIC DIGIT NINE
  (16#017E0#, 16#017E9#), -- KHMER DIGIT ZERO .. KHMER DIGIT NINE
  (16#01810#, 16#01819#), -- MONGOLIAN DIGIT ZERO .. MONGOLIAN DIGIT NINE
  (16#01946#, 16#0194F#), -- LIMBU DIGIT ZERO .. LIMBU DIGIT NINE
  (16#0FF10#, 16#0FF19#), -- FULLWIDTH DIGIT ZERO .. FULLWIDTH DIGIT NINE
  (16#104A0#, 16#104A9#), -- OSMANYA DIGIT ZERO .. OSMANYA DIGIT NINE
  (16#1D7CE#, 16#1D7FF#)); -- MATHEMATICAL BOLD DIGIT ZERO ..
                           -- MATHEMATICAL MONOSPACE DIGIT NINE

-- The following table includes all characters considered letters, i.e.
-- all characters from the Unicode table with categories:

--   Letter, Uppercase (Lu)
--   Letter, Lowercase (Ll)
--   Letter, Titlecase (Lt)
--   Letter, Modifier (Lm)
--   Letter, Other (Lo)
--   Number, Letter (Nl)

UTF_32_Letters : constant UTF_32_Ranges := (
  (16#00041#, 16#0005A#), -- LATIN CAPITAL LETTER A .. LATIN CAPITAL LETTER Z
  (16#00061#, 16#0007A#), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
  (16#000AA#, 16#000AA#), -- FEMININE ORDINAL INDICATOR .. FEMININE ORDINAL INDICATOR
  (16#000B5#, 16#000B5#), -- MICRO SIGN .. MICRO SIGN
  (16#000BA#, 16#000BA#), -- MASCULINE ORDINAL INDICATOR .. MASCULINE ORDINAL INDICATOR
  (16#000C0#, 16#000D6#), -- LATIN CAPITAL LETTER A WITH GRAVE .. LATIN CAPITAL LETTER O WITH DIAERESIS
  (16#000D8#, 16#000F6#), -- LATIN CAPITAL LETTER O WITH STROKE .. LATIN SMALL LETTER O WITH DIAERESIS
  (16#000F8#, 16#00236#), -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER T WITH CURL
  (16#00250#, 16#002C1#), -- LATIN SMALL LETTER TURNED A .. MODIFIER LETTER REVERSED GLOTTAL STOP
  (16#002C6#, 16#002D1#), -- MODIFIER LETTER CIRCUMFLEX ACCENT .. MODIFIER LETTER HALF TRIANGULAR COLON
  (16#002E0#, 16#002E4#), -- MODIFIER LETTER SMALL GAMMA .. MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
  (16#002EE#, 16#002EE#), -- MODIFIER LETTER DOUBLE APOSTROPHE .. MODIFIER LETTER DOUBLE APOSTROPHE
  (16#0037A#, 16#0037A#), -- GREEK YPOGEGRAMMENI .. GREEK YPOGEGRAMMENI
  (16#00386#, 16#00386#), -- GREEK CAPITAL LETTER ALPHA WITH TONOS .. GREEK CAPITAL LETTER ALPHA WITH TONOS
  (16#00388#, 16#0038A#), -- GREEK CAPITAL LETTER EPSILON WITH TONOS .. GREEK CAPITAL LETTER IOTA WITH TONOS
  (16#0038C#, 16#0038C#), -- GREEK CAPITAL LETTER OMICRON WITH TONOS .. GREEK CAPITAL LETTER OMICRON WITH TONOS
  (16#0038E#, 16#003A1#), -- GREEK CAPITAL LETTER UPSILON WITH TONOS .. GREEK CAPITAL LETTER RHO
  (16#003A3#, 16#003CE#), -- GREEK CAPITAL LETTER SIGMA ..
                          -- GREEK SMALL LETTER OMEGA WITH TONOS
  (16#003D0#, 16#003F5#), -- GREEK BETA SYMBOL .. GREEK LUNATE EPSILON SYMBOL
  (16#003F7#, 16#003FB#), -- GREEK CAPITAL LETTER SHO .. GREEK SMALL LETTER SAN
  (16#00400#, 16#00481#), -- CYRILLIC CAPITAL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER KOPPA
  (16#0048A#, 16#004CE#), -- CYRILLIC CAPITAL LETTER SHORT I WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL
  (16#004D0#, 16#004F5#), -- CYRILLIC CAPITAL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS
  (16#004F8#, 16#004F9#), -- CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS
  (16#00500#, 16#0050F#), -- CYRILLIC CAPITAL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI TJE
  (16#00531#, 16#00556#), -- ARMENIAN CAPITAL LETTER AYB .. ARMENIAN CAPITAL LETTER FEH
  (16#00559#, 16#00559#), -- ARMENIAN MODIFIER LETTER LEFT HALF RING .. ARMENIAN MODIFIER LETTER LEFT HALF RING
  (16#00561#, 16#00587#), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LIGATURE ECH YIWN
  (16#005D0#, 16#005EA#), -- HEBREW LETTER ALEF .. HEBREW LETTER TAV
  (16#005F0#, 16#005F2#), -- HEBREW LIGATURE YIDDISH DOUBLE VAV .. HEBREW LIGATURE YIDDISH DOUBLE YOD
  (16#00621#, 16#0063A#), -- ARABIC LETTER HAMZA .. ARABIC LETTER GHAIN
  (16#00640#, 16#0064A#), -- ARABIC TATWEEL .. ARABIC LETTER YEH
  (16#0066E#, 16#0066F#), -- ARABIC LETTER DOTLESS BEH .. ARABIC LETTER DOTLESS QAF
  (16#00671#, 16#006D3#), -- ARABIC LETTER ALEF WASLA .. ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
  (16#006D5#, 16#006D5#), -- ARABIC LETTER AE .. ARABIC LETTER AE
  (16#006E5#, 16#006E6#), -- ARABIC SMALL WAW .. ARABIC SMALL YEH
  (16#006EE#, 16#006EF#), -- ARABIC LETTER DAL WITH INVERTED V .. ARABIC LETTER REH WITH INVERTED V
  (16#006FA#, 16#006FC#), -- ARABIC LETTER SHEEN WITH DOT BELOW .. ARABIC LETTER GHAIN WITH DOT BELOW
  (16#006FF#, 16#006FF#), -- ARABIC LETTER HEH WITH INVERTED V .. ARABIC LETTER HEH WITH INVERTED V
  (16#00710#, 16#00710#), -- SYRIAC LETTER ALAPH ..
                          -- SYRIAC LETTER ALAPH
  (16#00712#, 16#0072F#), -- SYRIAC LETTER BETH .. SYRIAC LETTER PERSIAN DHALATH
  (16#0074D#, 16#0074F#), -- SYRIAC LETTER SOGDIAN ZHAIN .. SYRIAC LETTER SOGDIAN FE
  (16#00780#, 16#007A5#), -- THAANA LETTER HAA .. THAANA LETTER WAAVU
  (16#007B1#, 16#007B1#), -- THAANA LETTER NAA .. THAANA LETTER NAA
  (16#00904#, 16#00939#), -- DEVANAGARI LETTER SHORT A .. DEVANAGARI LETTER HA
  (16#0093D#, 16#0093D#), -- DEVANAGARI SIGN AVAGRAHA .. DEVANAGARI SIGN AVAGRAHA
  (16#00950#, 16#00950#), -- DEVANAGARI OM .. DEVANAGARI OM
  (16#00958#, 16#00961#), -- DEVANAGARI LETTER QA .. DEVANAGARI LETTER VOCALIC LL
  (16#00985#, 16#0098C#), -- BENGALI LETTER A .. BENGALI LETTER VOCALIC L
  (16#0098F#, 16#00990#), -- BENGALI LETTER E .. BENGALI LETTER AI
  (16#00993#, 16#009A8#), -- BENGALI LETTER O .. BENGALI LETTER NA
  (16#009AA#, 16#009B0#), -- BENGALI LETTER PA .. BENGALI LETTER RA
  (16#009B2#, 16#009B2#), -- BENGALI LETTER LA .. BENGALI LETTER LA
  (16#009B6#, 16#009B9#), -- BENGALI LETTER SHA .. BENGALI LETTER HA
  (16#009BD#, 16#009BD#), -- BENGALI SIGN AVAGRAHA .. BENGALI SIGN AVAGRAHA
  (16#009DC#, 16#009DD#), -- BENGALI LETTER RRA .. BENGALI LETTER RHA
  (16#009DF#, 16#009E1#), -- BENGALI LETTER YYA .. BENGALI LETTER VOCALIC LL
  (16#009F0#, 16#009F1#), -- BENGALI LETTER RA WITH MIDDLE DIAGONAL .. BENGALI LETTER RA WITH LOWER DIAGONAL
  (16#00A05#, 16#00A0A#), -- GURMUKHI LETTER A .. GURMUKHI LETTER UU
  (16#00A0F#, 16#00A10#), -- GURMUKHI LETTER EE .. GURMUKHI LETTER AI
  (16#00A13#, 16#00A28#), -- GURMUKHI LETTER OO .. GURMUKHI LETTER NA
  (16#00A2A#, 16#00A30#), -- GURMUKHI LETTER PA .. GURMUKHI LETTER RA
  (16#00A32#, 16#00A33#), -- GURMUKHI LETTER LA .. GURMUKHI LETTER LLA
  (16#00A35#, 16#00A36#), -- GURMUKHI LETTER VA .. GURMUKHI LETTER SHA
  (16#00A38#, 16#00A39#), -- GURMUKHI LETTER SA .. GURMUKHI LETTER HA
  (16#00A59#, 16#00A5C#), -- GURMUKHI LETTER KHHA .. GURMUKHI LETTER RRA
  (16#00A5E#, 16#00A5E#), -- GURMUKHI LETTER FA .. GURMUKHI LETTER FA
  (16#00A72#, 16#00A74#), -- GURMUKHI IRI ..
                          -- GURMUKHI EK ONKAR
  (16#00A85#, 16#00A8D#), -- GUJARATI LETTER A .. GUJARATI VOWEL CANDRA E
  (16#00A8F#, 16#00A91#), -- GUJARATI LETTER E .. GUJARATI VOWEL CANDRA O
  (16#00A93#, 16#00AA8#), -- GUJARATI LETTER O .. GUJARATI LETTER NA
  (16#00AAA#, 16#00AB0#), -- GUJARATI LETTER PA .. GUJARATI LETTER RA
  (16#00AB2#, 16#00AB3#), -- GUJARATI LETTER LA .. GUJARATI LETTER LLA
  (16#00AB5#, 16#00AB9#), -- GUJARATI LETTER VA .. GUJARATI LETTER HA
  (16#00ABD#, 16#00ABD#), -- GUJARATI SIGN AVAGRAHA .. GUJARATI SIGN AVAGRAHA
  (16#00AD0#, 16#00AD0#), -- GUJARATI OM .. GUJARATI OM
  (16#00AE0#, 16#00AE1#), -- GUJARATI LETTER VOCALIC RR .. GUJARATI LETTER VOCALIC LL
  (16#00B05#, 16#00B0C#), -- ORIYA LETTER A .. ORIYA LETTER VOCALIC L
  (16#00B0F#, 16#00B10#), -- ORIYA LETTER E .. ORIYA LETTER AI
  (16#00B13#, 16#00B28#), -- ORIYA LETTER O .. ORIYA LETTER NA
  (16#00B2A#, 16#00B30#), -- ORIYA LETTER PA .. ORIYA LETTER RA
  (16#00B32#, 16#00B33#), -- ORIYA LETTER LA .. ORIYA LETTER LLA
  (16#00B35#, 16#00B39#), -- ORIYA LETTER VA .. ORIYA LETTER HA
  (16#00B3D#, 16#00B3D#), -- ORIYA SIGN AVAGRAHA .. ORIYA SIGN AVAGRAHA
  (16#00B5C#, 16#00B5D#), -- ORIYA LETTER RRA .. ORIYA LETTER RHA
  (16#00B5F#, 16#00B61#), -- ORIYA LETTER YYA .. ORIYA LETTER VOCALIC LL
  (16#00B71#, 16#00B71#), -- ORIYA LETTER WA .. ORIYA LETTER WA
  (16#00B83#, 16#00B83#), -- TAMIL SIGN VISARGA .. TAMIL SIGN VISARGA
  (16#00B85#, 16#00B8A#), -- TAMIL LETTER A .. TAMIL LETTER UU
  (16#00B8E#, 16#00B90#), -- TAMIL LETTER E .. TAMIL LETTER AI
  (16#00B92#, 16#00B95#), -- TAMIL LETTER O .. TAMIL LETTER KA
  (16#00B99#, 16#00B9A#), -- TAMIL LETTER NGA .. TAMIL LETTER CA
  (16#00B9C#, 16#00B9C#), -- TAMIL LETTER JA .. TAMIL LETTER JA
  (16#00B9E#, 16#00B9F#), -- TAMIL LETTER NYA .. TAMIL LETTER TTA
  (16#00BA3#, 16#00BA4#), -- TAMIL LETTER NNA .. TAMIL LETTER TA
  (16#00BA8#, 16#00BAA#), -- TAMIL LETTER NA .. TAMIL LETTER PA
  (16#00BAE#, 16#00BB5#), -- TAMIL LETTER MA .. TAMIL LETTER VA
  (16#00BB7#, 16#00BB9#), -- TAMIL LETTER SSA ..
                          -- TAMIL LETTER HA
  (16#00C05#, 16#00C0C#), -- TELUGU LETTER A .. TELUGU LETTER VOCALIC L
  (16#00C0E#, 16#00C10#), -- TELUGU LETTER E .. TELUGU LETTER AI
  (16#00C12#, 16#00C28#), -- TELUGU LETTER O .. TELUGU LETTER NA
  (16#00C2A#, 16#00C33#), -- TELUGU LETTER PA .. TELUGU LETTER LLA
  (16#00C35#, 16#00C39#), -- TELUGU LETTER VA .. TELUGU LETTER HA
  (16#00C60#, 16#00C61#), -- TELUGU LETTER VOCALIC RR .. TELUGU LETTER VOCALIC LL
  (16#00C85#, 16#00C8C#), -- KANNADA LETTER A .. KANNADA LETTER VOCALIC L
  (16#00C8E#, 16#00C90#), -- KANNADA LETTER E .. KANNADA LETTER AI
  (16#00C92#, 16#00CA8#), -- KANNADA LETTER O .. KANNADA LETTER NA
  (16#00CAA#, 16#00CB3#), -- KANNADA LETTER PA .. KANNADA LETTER LLA
  (16#00CB5#, 16#00CB9#), -- KANNADA LETTER VA .. KANNADA LETTER HA
  (16#00CBD#, 16#00CBD#), -- KANNADA SIGN AVAGRAHA .. KANNADA SIGN AVAGRAHA
  (16#00CDE#, 16#00CDE#), -- KANNADA LETTER FA .. KANNADA LETTER FA
  (16#00CE0#, 16#00CE1#), -- KANNADA LETTER VOCALIC RR .. KANNADA LETTER VOCALIC LL
  (16#00D05#, 16#00D0C#), -- MALAYALAM LETTER A .. MALAYALAM LETTER VOCALIC L
  (16#00D0E#, 16#00D10#), -- MALAYALAM LETTER E .. MALAYALAM LETTER AI
  (16#00D12#, 16#00D28#), -- MALAYALAM LETTER O .. MALAYALAM LETTER NA
  (16#00D2A#, 16#00D39#), -- MALAYALAM LETTER PA .. MALAYALAM LETTER HA
  (16#00D60#, 16#00D61#), -- MALAYALAM LETTER VOCALIC RR .. MALAYALAM LETTER VOCALIC LL
  (16#00D85#, 16#00D96#), -- SINHALA LETTER AYANNA .. SINHALA LETTER AUYANNA
  (16#00D9A#, 16#00DB1#), -- SINHALA LETTER ALPAPRAANA KAYANNA .. SINHALA LETTER DANTAJA NAYANNA
  (16#00DB3#, 16#00DBB#), -- SINHALA LETTER SANYAKA DAYANNA .. SINHALA LETTER RAYANNA
  (16#00DBD#, 16#00DBD#), -- SINHALA LETTER DANTAJA LAYANNA .. SINHALA LETTER DANTAJA LAYANNA
  (16#00DC0#, 16#00DC6#), -- SINHALA LETTER VAYANNA .. SINHALA LETTER FAYANNA
  (16#00E01#, 16#00E30#), -- THAI CHARACTER KO KAI .. THAI CHARACTER SARA A
  (16#00E32#, 16#00E33#), -- THAI CHARACTER SARA AA .. THAI CHARACTER SARA AM
  (16#00E40#, 16#00E46#), -- THAI CHARACTER SARA E ..
                          -- THAI CHARACTER MAIYAMOK
  (16#00E81#, 16#00E82#), -- LAO LETTER KO .. LAO LETTER KHO SUNG
  (16#00E84#, 16#00E84#), -- LAO LETTER KHO TAM .. LAO LETTER KHO TAM
  (16#00E87#, 16#00E88#), -- LAO LETTER NGO .. LAO LETTER CO
  (16#00E8A#, 16#00E8A#), -- LAO LETTER SO TAM .. LAO LETTER SO TAM
  (16#00E8D#, 16#00E8D#), -- LAO LETTER NYO .. LAO LETTER NYO
  (16#00E94#, 16#00E97#), -- LAO LETTER DO .. LAO LETTER THO TAM
  (16#00E99#, 16#00E9F#), -- LAO LETTER NO .. LAO LETTER FO SUNG
  (16#00EA1#, 16#00EA3#), -- LAO LETTER MO .. LAO LETTER LO LING
  (16#00EA5#, 16#00EA5#), -- LAO LETTER LO LOOT .. LAO LETTER LO LOOT
  (16#00EA7#, 16#00EA7#), -- LAO LETTER WO .. LAO LETTER WO
  (16#00EAA#, 16#00EAB#), -- LAO LETTER SO SUNG .. LAO LETTER HO SUNG
  (16#00EAD#, 16#00EB0#), -- LAO LETTER O .. LAO VOWEL SIGN A
  (16#00EB2#, 16#00EB3#), -- LAO VOWEL SIGN AA .. LAO VOWEL SIGN AM
  (16#00EBD#, 16#00EBD#), -- LAO SEMIVOWEL SIGN NYO .. LAO SEMIVOWEL SIGN NYO
  (16#00EC0#, 16#00EC4#), -- LAO VOWEL SIGN E .. LAO VOWEL SIGN AI
  (16#00EC6#, 16#00EC6#), -- LAO KO LA .. LAO KO LA
  (16#00EDC#, 16#00EDD#), -- LAO HO NO .. LAO HO MO
  (16#00F00#, 16#00F00#), -- TIBETAN SYLLABLE OM .. TIBETAN SYLLABLE OM
  (16#00F40#, 16#00F47#), -- TIBETAN LETTER KA .. TIBETAN LETTER JA
  (16#00F49#, 16#00F6A#), -- TIBETAN LETTER NYA .. TIBETAN LETTER FIXED-FORM RA
  (16#00F88#, 16#00F8B#), -- TIBETAN SIGN LCE TSA CAN .. TIBETAN SIGN GRU MED RGYINGS
  (16#01000#, 16#01021#), -- MYANMAR LETTER KA .. MYANMAR LETTER A
  (16#01023#, 16#01027#), -- MYANMAR LETTER I .. MYANMAR LETTER E
  (16#01029#, 16#0102A#), -- MYANMAR LETTER O .. MYANMAR LETTER AU
  (16#01050#, 16#01055#), -- MYANMAR LETTER SHA .. MYANMAR LETTER VOCALIC LL
  (16#010A0#, 16#010C5#), -- GEORGIAN CAPITAL LETTER AN .. GEORGIAN CAPITAL LETTER HOE
  (16#010D0#, 16#010F8#), -- GEORGIAN LETTER AN .. GEORGIAN LETTER ELIFI
  (16#01100#, 16#01159#), -- HANGUL CHOSEONG KIYEOK .. HANGUL CHOSEONG YEORINHIEUH
  (16#0115F#, 16#011A2#), -- HANGUL CHOSEONG FILLER ..
HANGUL JUNGSEONG SSANGARAEA (16#011A8#, 16#011F9#), -- HANGUL JONGSEONG KIYEOK .. HANGUL JONGSEONG YEORINHIEUH (16#01200#, 16#01206#), -- ETHIOPIC SYLLABLE HA .. ETHIOPIC SYLLABLE HO (16#01208#, 16#01246#), -- ETHIOPIC SYLLABLE LA .. ETHIOPIC SYLLABLE QO (16#01248#, 16#01248#), -- ETHIOPIC SYLLABLE QWA .. ETHIOPIC SYLLABLE QWA (16#0124A#, 16#0124D#), -- ETHIOPIC SYLLABLE QWI .. ETHIOPIC SYLLABLE QWE (16#01250#, 16#01256#), -- ETHIOPIC SYLLABLE QHA .. ETHIOPIC SYLLABLE QHO (16#01258#, 16#01258#), -- ETHIOPIC SYLLABLE QHWA .. ETHIOPIC SYLLABLE QHWA (16#0125A#, 16#0125D#), -- ETHIOPIC SYLLABLE QHWI .. ETHIOPIC SYLLABLE QHWE (16#01260#, 16#01286#), -- ETHIOPIC SYLLABLE BA .. ETHIOPIC SYLLABLE XO (16#01288#, 16#01288#), -- ETHIOPIC SYLLABLE XWA .. ETHIOPIC SYLLABLE XWA (16#0128A#, 16#0128D#), -- ETHIOPIC SYLLABLE XWI .. ETHIOPIC SYLLABLE XWE (16#01290#, 16#012AE#), -- ETHIOPIC SYLLABLE NA .. ETHIOPIC SYLLABLE KO (16#012B0#, 16#012B0#), -- ETHIOPIC SYLLABLE KWA .. ETHIOPIC SYLLABLE KWA (16#012B2#, 16#012B5#), -- ETHIOPIC SYLLABLE KWI .. ETHIOPIC SYLLABLE KWE (16#012B8#, 16#012BE#), -- ETHIOPIC SYLLABLE KXA .. ETHIOPIC SYLLABLE KXO (16#012C0#, 16#012C0#), -- ETHIOPIC SYLLABLE KXWA .. ETHIOPIC SYLLABLE KXWA (16#012C2#, 16#012C5#), -- ETHIOPIC SYLLABLE KXWI .. ETHIOPIC SYLLABLE KXWE (16#012C8#, 16#012CE#), -- ETHIOPIC SYLLABLE WA .. ETHIOPIC SYLLABLE WO (16#012D0#, 16#012D6#), -- ETHIOPIC SYLLABLE PHARYNGEAL A .. ETHIOPIC SYLLABLE PHARYNGEAL O (16#012D8#, 16#012EE#), -- ETHIOPIC SYLLABLE ZA .. ETHIOPIC SYLLABLE YO (16#012F0#, 16#0130E#), -- ETHIOPIC SYLLABLE DA .. ETHIOPIC SYLLABLE GO (16#01310#, 16#01310#), -- ETHIOPIC SYLLABLE GWA .. ETHIOPIC SYLLABLE GWA (16#01312#, 16#01315#), -- ETHIOPIC SYLLABLE GWI .. ETHIOPIC SYLLABLE GWE (16#01318#, 16#0131E#), -- ETHIOPIC SYLLABLE GGA .. ETHIOPIC SYLLABLE GGO (16#01320#, 16#01346#), -- ETHIOPIC SYLLABLE THA .. ETHIOPIC SYLLABLE TZO (16#01348#, 16#0135A#), -- ETHIOPIC SYLLABLE FA .. 
ETHIOPIC SYLLABLE FYA (16#013A0#, 16#013F4#), -- CHEROKEE LETTER A .. CHEROKEE LETTER YV (16#01401#, 16#0166C#), -- CANADIAN SYLLABICS E .. CANADIAN SYLLABICS CARRIER TTSA (16#0166F#, 16#01676#), -- CANADIAN SYLLABICS QAI .. CANADIAN SYLLABICS NNGAA (16#01681#, 16#0169A#), -- OGHAM LETTER BEITH .. OGHAM LETTER PEITH (16#016A0#, 16#016EA#), -- RUNIC LETTER FEHU FEOH FE F .. RUNIC LETTER X (16#016EE#, 16#016F0#), -- RUNIC ARLAUG SYMBOL .. RUNIC BELGTHOR SYMBOL (16#01700#, 16#0170C#), -- TAGALOG LETTER A .. TAGALOG LETTER YA (16#0170E#, 16#01711#), -- TAGALOG LETTER LA .. TAGALOG LETTER HA (16#01720#, 16#01731#), -- HANUNOO LETTER A .. HANUNOO LETTER HA (16#01740#, 16#01751#), -- BUHID LETTER A .. BUHID LETTER HA (16#01760#, 16#0176C#), -- TAGBANWA LETTER A .. TAGBANWA LETTER YA (16#0176E#, 16#01770#), -- TAGBANWA LETTER LA .. TAGBANWA LETTER SA (16#01780#, 16#017B3#), -- KHMER LETTER KA .. KHMER INDEPENDENT VOWEL QAU (16#017D7#, 16#017D7#), -- KHMER SIGN LEK TOO .. KHMER SIGN LEK TOO (16#017DC#, 16#017DC#), -- KHMER SIGN AVAKRAHASANYA .. KHMER SIGN AVAKRAHASANYA (16#01820#, 16#01877#), -- MONGOLIAN LETTER A .. MONGOLIAN LETTER MANCHU ZHA (16#01880#, 16#018A8#), -- MONGOLIAN LETTER ALI GALI ANUSVARA ONE .. MONGOLIAN LETTER MANCHU ALI GALI BHA (16#01900#, 16#0191C#), -- LIMBU VOWEL-CARRIER LETTER .. LIMBU LETTER HA (16#01950#, 16#0196D#), -- TAI LE LETTER KA .. TAI LE LETTER AI (16#01970#, 16#01974#), -- TAI LE LETTER TONE-2 .. TAI LE LETTER TONE-6 (16#01D00#, 16#01D6B#), -- LATIN LETTER SMALL CAPITAL A .. LATIN SMALL LETTER UE (16#01E00#, 16#01E9B#), -- LATIN CAPITAL LETTER A WITH RING BELOW .. LATIN SMALL LETTER LONG S WITH DOT ABOVE (16#01EA0#, 16#01EF9#), -- LATIN CAPITAL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER Y WITH TILDE (16#01F00#, 16#01F15#), -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA (16#01F18#, 16#01F1D#), -- GREEK CAPITAL LETTER EPSILON WITH PSILI .. 
GREEK CAPITAL LETTER EPSILON WITH DASIA AND OXIA (16#01F20#, 16#01F45#), -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA (16#01F48#, 16#01F4D#), -- GREEK CAPITAL LETTER OMICRON WITH PSILI .. GREEK CAPITAL LETTER OMICRON WITH DASIA AND OXIA (16#01F50#, 16#01F57#), -- GREEK SMALL LETTER UPSILON WITH PSILI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI (16#01F59#, 16#01F59#), -- GREEK CAPITAL LETTER UPSILON WITH DASIA .. GREEK CAPITAL LETTER UPSILON WITH DASIA (16#01F5B#, 16#01F5B#), -- GREEK CAPITAL LETTER UPSILON WITH DASIA AND VARIA .. GREEK CAPITAL LETTER UPSILON WITH DASIA AND VARIA (16#01F5D#, 16#01F5D#), -- GREEK CAPITAL LETTER UPSILON WITH DASIA AND OXIA .. GREEK CAPITAL LETTER UPSILON WITH DASIA AND OXIA (16#01F5F#, 16#01F7D#), -- GREEK CAPITAL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER OMEGA WITH OXIA (16#01F80#, 16#01FB4#), -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI (16#01FB6#, 16#01FBC#), -- GREEK SMALL LETTER ALPHA WITH PERISPOMENI .. GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI (16#01FBE#, 16#01FBE#), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI (16#01FC2#, 16#01FC4#), -- GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI (16#01FC6#, 16#01FCC#), -- GREEK SMALL LETTER ETA WITH PERISPOMENI .. GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI (16#01FD0#, 16#01FD3#), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA (16#01FD6#, 16#01FDB#), -- GREEK SMALL LETTER IOTA WITH PERISPOMENI .. GREEK CAPITAL LETTER IOTA WITH OXIA (16#01FE0#, 16#01FEC#), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK CAPITAL LETTER RHO WITH DASIA (16#01FF2#, 16#01FF4#), -- GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI .. 
GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI (16#01FF6#, 16#01FFC#), -- GREEK SMALL LETTER OMEGA WITH PERISPOMENI .. GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI (16#02071#, 16#02071#), -- SUPERSCRIPT LATIN SMALL LETTER I .. SUPERSCRIPT LATIN SMALL LETTER I (16#0207F#, 16#0207F#), -- SUPERSCRIPT LATIN SMALL LETTER N .. SUPERSCRIPT LATIN SMALL LETTER N (16#02102#, 16#02102#), -- DOUBLE-STRUCK CAPITAL C .. DOUBLE-STRUCK CAPITAL C (16#02107#, 16#02107#), -- EULER CONSTANT .. EULER CONSTANT (16#0210A#, 16#02113#), -- SCRIPT SMALL G .. SCRIPT SMALL L (16#02115#, 16#02115#), -- DOUBLE-STRUCK CAPITAL N .. DOUBLE-STRUCK CAPITAL N (16#02119#, 16#0211D#), -- DOUBLE-STRUCK CAPITAL P .. DOUBLE-STRUCK CAPITAL R (16#02124#, 16#02124#), -- DOUBLE-STRUCK CAPITAL Z .. DOUBLE-STRUCK CAPITAL Z (16#02126#, 16#02126#), -- OHM SIGN .. OHM SIGN (16#02128#, 16#02128#), -- BLACK-LETTER CAPITAL Z .. BLACK-LETTER CAPITAL Z (16#0212A#, 16#0212D#), -- KELVIN SIGN .. BLACK-LETTER CAPITAL C (16#0212F#, 16#02131#), -- SCRIPT SMALL E .. SCRIPT CAPITAL F (16#02133#, 16#02139#), -- SCRIPT CAPITAL M .. INFORMATION SOURCE (16#0213D#, 16#0213F#), -- DOUBLE-STRUCK SMALL GAMMA .. DOUBLE-STRUCK CAPITAL PI (16#02145#, 16#02149#), -- DOUBLE-STRUCK ITALIC CAPITAL D .. DOUBLE-STRUCK ITALIC SMALL J (16#02160#, 16#02183#), -- ROMAN NUMERAL ONE .. ROMAN NUMERAL REVERSED ONE HUNDRED (16#03005#, 16#03007#), -- IDEOGRAPHIC ITERATION MARK .. IDEOGRAPHIC NUMBER ZERO (16#03021#, 16#03029#), -- HANGZHOU NUMERAL ONE .. HANGZHOU NUMERAL NINE (16#03031#, 16#03035#), -- VERTICAL KANA REPEAT MARK .. VERTICAL KANA REPEAT MARK LOWER HALF (16#03038#, 16#0303C#), -- HANGZHOU NUMERAL TEN .. MASU MARK (16#03041#, 16#03096#), -- HIRAGANA LETTER SMALL A .. HIRAGANA LETTER SMALL KE (16#0309D#, 16#0309F#), -- HIRAGANA ITERATION MARK .. HIRAGANA DIGRAPH YORI (16#030A1#, 16#030FA#), -- KATAKANA LETTER SMALL A .. KATAKANA LETTER VO (16#030FC#, 16#030FF#), -- KATAKANA-HIRAGANA PROLONGED SOUND MARK .. 
KATAKANA DIGRAPH KOTO (16#03105#, 16#0312C#), -- BOPOMOFO LETTER B .. BOPOMOFO LETTER GN (16#03131#, 16#0318E#), -- HANGUL LETTER KIYEOK .. HANGUL LETTER ARAEAE (16#031A0#, 16#031B7#), -- BOPOMOFO LETTER BU .. BOPOMOFO FINAL LETTER H (16#031F0#, 16#031FF#), -- KATAKANA LETTER SMALL KU .. KATAKANA LETTER SMALL RO (16#03400#, 16#03400#), -- <CJK Ideograph Extension A, First> .. <CJK Ideograph Extension A, First> (16#04DB5#, 16#04DB5#), -- <CJK Ideograph Extension A, Last> .. <CJK Ideograph Extension A, Last> (16#04E00#, 16#04E00#), -- <CJK Ideograph, First> .. <CJK Ideograph, First> (16#09FA5#, 16#09FA5#), -- <CJK Ideograph, Last> .. <CJK Ideograph, Last> (16#0A000#, 16#0A48C#), -- YI SYLLABLE IT .. YI SYLLABLE YYR (16#0AC00#, 16#0AC00#), -- <Hangul Syllable, First> .. <Hangul Syllable, First> (16#0D7A3#, 16#0D7A3#), -- <Hangul Syllable, Last> .. <Hangul Syllable, Last> (16#0F900#, 16#0FA2D#), -- CJK COMPATIBILITY IDEOGRAPH-F900 .. CJK COMPATIBILITY IDEOGRAPH-FA2D (16#0FA30#, 16#0FA6A#), -- CJK COMPATIBILITY IDEOGRAPH-FA30 .. CJK COMPATIBILITY IDEOGRAPH-FA6A (16#0FB00#, 16#0FB06#), -- LATIN SMALL LIGATURE FF .. LATIN SMALL LIGATURE ST (16#0FB13#, 16#0FB17#), -- ARMENIAN SMALL LIGATURE MEN NOW .. ARMENIAN SMALL LIGATURE MEN XEH (16#0FB1D#, 16#0FB1D#), -- HEBREW LETTER YOD WITH HIRIQ .. HEBREW LETTER YOD WITH HIRIQ (16#0FB1F#, 16#0FB28#), -- HEBREW LIGATURE YIDDISH YOD YOD PATAH .. HEBREW LETTER WIDE TAV (16#0FB2A#, 16#0FB36#), -- HEBREW LETTER SHIN WITH SHIN DOT .. HEBREW LETTER ZAYIN WITH DAGESH (16#0FB38#, 16#0FB3C#), -- HEBREW LETTER TET WITH DAGESH .. HEBREW LETTER LAMED WITH DAGESH (16#0FB3E#, 16#0FB3E#), -- HEBREW LETTER MEM WITH DAGESH .. HEBREW LETTER MEM WITH DAGESH (16#0FB40#, 16#0FB41#), -- HEBREW LETTER NUN WITH DAGESH .. HEBREW LETTER SAMEKH WITH DAGESH (16#0FB43#, 16#0FB44#), -- HEBREW LETTER FINAL PE WITH DAGESH .. HEBREW LETTER PE WITH DAGESH (16#0FB46#, 16#0FBB1#), -- HEBREW LETTER TSADI WITH DAGESH .. ARABIC LETTER YEH BARREE WITH HAMZA ABOVE FINAL FORM (16#0FBD3#, 16#0FD3D#), -- ARABIC LETTER NG ISOLATED FORM .. ARABIC LIGATURE ALEF WITH FATHATAN ISOLATED FORM (16#0FD50#, 16#0FD8F#), -- ARABIC LIGATURE TEH WITH JEEM WITH MEEM INITIAL FORM .. 
ARABIC LIGATURE MEEM WITH KHAH WITH MEEM INITIAL FORM (16#0FD92#, 16#0FDC7#), -- ARABIC LIGATURE MEEM WITH JEEM WITH KHAH INITIAL FORM .. ARABIC LIGATURE NOON WITH JEEM WITH YEH FINAL FORM (16#0FDF0#, 16#0FDFB#), -- ARABIC LIGATURE SALLA USED AS KORANIC STOP SIGN ISOLATED FORM .. ARABIC LIGATURE JALLAJALALOUHOU (16#0FE70#, 16#0FE74#), -- ARABIC FATHATAN ISOLATED FORM .. ARABIC KASRATAN ISOLATED FORM (16#0FE76#, 16#0FEFC#), -- ARABIC FATHA ISOLATED FORM .. ARABIC LIGATURE LAM WITH ALEF FINAL FORM (16#0FF21#, 16#0FF3A#), -- FULLWIDTH LATIN CAPITAL LETTER A .. FULLWIDTH LATIN CAPITAL LETTER Z (16#0FF41#, 16#0FF5A#), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z (16#0FF66#, 16#0FFBE#), -- HALFWIDTH KATAKANA LETTER WO .. HALFWIDTH HANGUL LETTER HIEUH (16#0FFC2#, 16#0FFC7#), -- HALFWIDTH HANGUL LETTER A .. HALFWIDTH HANGUL LETTER E (16#0FFCA#, 16#0FFCF#), -- HALFWIDTH HANGUL LETTER YEO .. HALFWIDTH HANGUL LETTER OE (16#0FFD2#, 16#0FFD7#), -- HALFWIDTH HANGUL LETTER YO .. HALFWIDTH HANGUL LETTER YU (16#0FFDA#, 16#0FFDC#), -- HALFWIDTH HANGUL LETTER EU .. HALFWIDTH HANGUL LETTER I (16#10000#, 16#1000B#), -- LINEAR B SYLLABLE B008 A .. LINEAR B SYLLABLE B046 JE (16#1000D#, 16#10026#), -- LINEAR B SYLLABLE B036 JO .. LINEAR B SYLLABLE B032 QO (16#10028#, 16#1003A#), -- LINEAR B SYLLABLE B060 RA .. LINEAR B SYLLABLE B042 WO (16#1003C#, 16#1003D#), -- LINEAR B SYLLABLE B017 ZA .. LINEAR B SYLLABLE B074 ZE (16#1003F#, 16#1004D#), -- LINEAR B SYLLABLE B020 ZO .. LINEAR B SYLLABLE B091 TWO (16#10050#, 16#1005D#), -- LINEAR B SYMBOL B018 .. LINEAR B SYMBOL B089 (16#10080#, 16#100FA#), -- LINEAR B IDEOGRAM B100 MAN .. LINEAR B IDEOGRAM VESSEL B305 (16#10300#, 16#1031E#), -- OLD ITALIC LETTER A .. OLD ITALIC LETTER UU (16#10330#, 16#1034A#), -- GOTHIC LETTER AHSA .. GOTHIC LETTER NINE HUNDRED (16#10380#, 16#1039D#), -- UGARITIC LETTER ALPA .. UGARITIC LETTER SSU (16#10400#, 16#1049D#), -- DESERET CAPITAL LETTER LONG I .. 
OSMANYA LETTER OO (16#10800#, 16#10805#), -- CYPRIOT SYLLABLE A .. CYPRIOT SYLLABLE JA (16#10808#, 16#10808#), -- CYPRIOT SYLLABLE JO .. CYPRIOT SYLLABLE JO (16#1080A#, 16#10835#), -- CYPRIOT SYLLABLE KA .. CYPRIOT SYLLABLE WO (16#10837#, 16#10838#), -- CYPRIOT SYLLABLE XA .. CYPRIOT SYLLABLE XE (16#1083C#, 16#1083C#), -- CYPRIOT SYLLABLE ZA .. CYPRIOT SYLLABLE ZA (16#1083F#, 16#1083F#), -- CYPRIOT SYLLABLE ZO .. CYPRIOT SYLLABLE ZO (16#1D400#, 16#1D454#), -- MATHEMATICAL BOLD CAPITAL A .. MATHEMATICAL ITALIC SMALL G (16#1D456#, 16#1D49C#), -- MATHEMATICAL ITALIC SMALL I .. MATHEMATICAL SCRIPT CAPITAL A (16#1D49E#, 16#1D49F#), -- MATHEMATICAL SCRIPT CAPITAL C .. MATHEMATICAL SCRIPT CAPITAL D (16#1D4A2#, 16#1D4A2#), -- MATHEMATICAL SCRIPT CAPITAL G .. MATHEMATICAL SCRIPT CAPITAL G (16#1D4A5#, 16#1D4A6#), -- MATHEMATICAL SCRIPT CAPITAL J .. MATHEMATICAL SCRIPT CAPITAL K (16#1D4A9#, 16#1D4AC#), -- MATHEMATICAL SCRIPT CAPITAL N .. MATHEMATICAL SCRIPT CAPITAL Q (16#1D4AE#, 16#1D4B9#), -- MATHEMATICAL SCRIPT CAPITAL S .. MATHEMATICAL SCRIPT SMALL D (16#1D4BB#, 16#1D4BB#), -- MATHEMATICAL SCRIPT SMALL F .. MATHEMATICAL SCRIPT SMALL F (16#1D4BD#, 16#1D4C3#), -- MATHEMATICAL SCRIPT SMALL H .. MATHEMATICAL SCRIPT SMALL N (16#1D4C5#, 16#1D505#), -- MATHEMATICAL SCRIPT SMALL P .. MATHEMATICAL FRAKTUR CAPITAL B (16#1D507#, 16#1D50A#), -- MATHEMATICAL FRAKTUR CAPITAL D .. MATHEMATICAL FRAKTUR CAPITAL G (16#1D50D#, 16#1D514#), -- MATHEMATICAL FRAKTUR CAPITAL J .. MATHEMATICAL FRAKTUR CAPITAL Q (16#1D516#, 16#1D51C#), -- MATHEMATICAL FRAKTUR CAPITAL S .. MATHEMATICAL FRAKTUR CAPITAL Y (16#1D51E#, 16#1D539#), -- MATHEMATICAL FRAKTUR SMALL A .. MATHEMATICAL DOUBLE-STRUCK CAPITAL B (16#1D53B#, 16#1D53E#), -- MATHEMATICAL DOUBLE-STRUCK CAPITAL D .. MATHEMATICAL DOUBLE-STRUCK CAPITAL G (16#1D540#, 16#1D544#), -- MATHEMATICAL DOUBLE-STRUCK CAPITAL I .. MATHEMATICAL DOUBLE-STRUCK CAPITAL M (16#1D546#, 16#1D546#), -- MATHEMATICAL DOUBLE-STRUCK CAPITAL O .. 
MATHEMATICAL DOUBLE-STRUCK CAPITAL O (16#1D54A#, 16#1D550#), -- MATHEMATICAL DOUBLE-STRUCK CAPITAL S .. MATHEMATICAL DOUBLE-STRUCK CAPITAL Y (16#1D552#, 16#1D6A3#), -- MATHEMATICAL DOUBLE-STRUCK SMALL A .. MATHEMATICAL MONOSPACE SMALL Z (16#1D6A8#, 16#1D6C0#), -- MATHEMATICAL BOLD CAPITAL ALPHA .. MATHEMATICAL BOLD CAPITAL OMEGA (16#1D6C2#, 16#1D6DA#), -- MATHEMATICAL BOLD SMALL ALPHA .. MATHEMATICAL BOLD SMALL OMEGA (16#1D6DC#, 16#1D6FA#), -- MATHEMATICAL BOLD EPSILON SYMBOL .. MATHEMATICAL ITALIC CAPITAL OMEGA (16#1D6FC#, 16#1D714#), -- MATHEMATICAL ITALIC SMALL ALPHA .. MATHEMATICAL ITALIC SMALL OMEGA (16#1D716#, 16#1D734#), -- MATHEMATICAL ITALIC EPSILON SYMBOL .. MATHEMATICAL BOLD ITALIC CAPITAL OMEGA (16#1D736#, 16#1D74E#), -- MATHEMATICAL BOLD ITALIC SMALL ALPHA .. MATHEMATICAL BOLD ITALIC SMALL OMEGA (16#1D750#, 16#1D76E#), -- MATHEMATICAL BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA (16#1D770#, 16#1D788#), -- MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD SMALL OMEGA (16#1D78A#, 16#1D7A8#), -- MATHEMATICAL SANS-SERIF BOLD EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA (16#1D7AA#, 16#1D7C2#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL OMEGA (16#1D7C4#, 16#1D7C9#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL (16#20000#, 16#20000#), -- <CJK Ideograph Extension B, First> .. <CJK Ideograph Extension B, First> (16#2A6D6#, 16#2A6D6#), -- <CJK Ideograph Extension B, Last> .. <CJK Ideograph Extension B, Last> (16#2F800#, 16#2FA1D#)); -- CJK COMPATIBILITY IDEOGRAPH-2F800 .. CJK COMPATIBILITY IDEOGRAPH-2FA1D -- The following table includes all characters considered spaces, i.e. -- all characters from the Unicode table with categories: -- Separator, Space (Zs) UTF_32_Spaces : constant UTF_32_Ranges := ( (16#00020#, 16#00020#), -- SPACE .. SPACE (16#000A0#, 16#000A0#), -- NO-BREAK SPACE .. NO-BREAK SPACE (16#01680#, 16#01680#), -- OGHAM SPACE MARK .. OGHAM SPACE MARK (16#02000#, 16#0200B#), -- EN QUAD .. 
ZERO WIDTH SPACE (16#0202F#, 16#0202F#), -- NARROW NO-BREAK SPACE .. NARROW NO-BREAK SPACE (16#0205F#, 16#0205F#), -- MEDIUM MATHEMATICAL SPACE .. MEDIUM MATHEMATICAL SPACE (16#03000#, 16#03000#)); -- IDEOGRAPHIC SPACE .. IDEOGRAPHIC SPACE -- The following table includes all characters considered punctuation, -- i.e. all characters from the Unicode table with categories: -- Punctuation, Connector (Pc) UTF_32_Punctuation : constant UTF_32_Ranges := ( (16#0005F#, 16#0005F#), -- LOW LINE .. LOW LINE (16#0203F#, 16#02040#), -- UNDERTIE .. CHARACTER TIE (16#02054#, 16#02054#), -- INVERTED UNDERTIE .. INVERTED UNDERTIE (16#030FB#, 16#030FB#), -- KATAKANA MIDDLE DOT .. KATAKANA MIDDLE DOT (16#0FE33#, 16#0FE34#), -- PRESENTATION FORM FOR VERTICAL LOW LINE .. PRESENTATION FORM FOR VERTICAL WAVY LOW LINE (16#0FE4D#, 16#0FE4F#), -- DASHED LOW LINE .. WAVY LOW LINE (16#0FF3F#, 16#0FF3F#), -- FULLWIDTH LOW LINE .. FULLWIDTH LOW LINE (16#0FF65#, 16#0FF65#)); -- HALFWIDTH KATAKANA MIDDLE DOT .. HALFWIDTH KATAKANA MIDDLE DOT -- The following table includes all characters considered as other format, -- i.e. all characters from the Unicode table with categories: -- Other, Format (Cf) UTF_32_Other_Format : constant UTF_32_Ranges := ( (16#000AD#, 16#000AD#), -- SOFT HYPHEN .. SOFT HYPHEN (16#00600#, 16#00603#), -- ARABIC NUMBER SIGN .. ARABIC SIGN SAFHA (16#006DD#, 16#006DD#), -- ARABIC END OF AYAH .. ARABIC END OF AYAH (16#0070F#, 16#0070F#), -- SYRIAC ABBREVIATION MARK .. SYRIAC ABBREVIATION MARK (16#017B4#, 16#017B5#), -- KHMER VOWEL INHERENT AQ .. KHMER VOWEL INHERENT AA (16#0200C#, 16#0200F#), -- ZERO WIDTH NON-JOINER .. RIGHT-TO-LEFT MARK (16#0202A#, 16#0202E#), -- LEFT-TO-RIGHT EMBEDDING .. RIGHT-TO-LEFT OVERRIDE (16#02060#, 16#02063#), -- WORD JOINER .. INVISIBLE SEPARATOR (16#0206A#, 16#0206F#), -- INHIBIT SYMMETRIC SWAPPING .. NOMINAL DIGIT SHAPES (16#0FEFF#, 16#0FEFF#), -- ZERO WIDTH NO-BREAK SPACE .. 
ZERO WIDTH NO-BREAK SPACE (16#0FFF9#, 16#0FFFB#), -- INTERLINEAR ANNOTATION ANCHOR .. INTERLINEAR ANNOTATION TERMINATOR (16#1D173#, 16#1D17A#), -- MUSICAL SYMBOL BEGIN BEAM .. MUSICAL SYMBOL END PHRASE (16#E0001#, 16#E0001#), -- LANGUAGE TAG .. LANGUAGE TAG (16#E0020#, 16#E007F#)); -- TAG SPACE .. CANCEL TAG -- The following table includes all characters considered marks, i.e. -- all characters from the Unicode table with categories: -- Mark, Nonspacing (Mn) -- Mark, Spacing Combining (Mc) UTF_32_Marks : constant UTF_32_Ranges := ( (16#00300#, 16#00357#), -- COMBINING GRAVE ACCENT .. COMBINING RIGHT HALF RING ABOVE (16#0035D#, 16#0036F#), -- COMBINING DOUBLE BREVE .. COMBINING LATIN SMALL LETTER X (16#00483#, 16#00486#), -- COMBINING CYRILLIC TITLO .. COMBINING CYRILLIC PSILI PNEUMATA (16#00591#, 16#005A1#), -- HEBREW ACCENT ETNAHTA .. HEBREW ACCENT PAZER (16#005A3#, 16#005B9#), -- HEBREW ACCENT MUNAH .. HEBREW POINT HOLAM (16#005BB#, 16#005BD#), -- HEBREW POINT QUBUTS .. HEBREW POINT METEG (16#005BF#, 16#005BF#), -- HEBREW POINT RAFE .. HEBREW POINT RAFE (16#005C1#, 16#005C2#), -- HEBREW POINT SHIN DOT .. HEBREW POINT SIN DOT (16#005C4#, 16#005C4#), -- HEBREW MARK UPPER DOT .. HEBREW MARK UPPER DOT (16#00610#, 16#00615#), -- ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM .. ARABIC SMALL HIGH TAH (16#0064B#, 16#00658#), -- ARABIC FATHATAN .. ARABIC MARK NOON GHUNNA (16#00670#, 16#00670#), -- ARABIC LETTER SUPERSCRIPT ALEF .. ARABIC LETTER SUPERSCRIPT ALEF (16#006D6#, 16#006DC#), -- ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA .. ARABIC SMALL HIGH SEEN (16#006DF#, 16#006E4#), -- ARABIC SMALL HIGH ROUNDED ZERO .. ARABIC SMALL HIGH MADDA (16#006E7#, 16#006E8#), -- ARABIC SMALL HIGH YEH .. ARABIC SMALL HIGH NOON (16#006EA#, 16#006ED#), -- ARABIC EMPTY CENTRE LOW STOP .. ARABIC SMALL LOW MEEM (16#00711#, 16#00711#), -- SYRIAC LETTER SUPERSCRIPT ALAPH .. SYRIAC LETTER SUPERSCRIPT ALAPH (16#00730#, 16#0074A#), -- SYRIAC PTHAHA ABOVE .. 
SYRIAC BARREKH (16#007A6#, 16#007B0#), -- THAANA ABAFILI .. THAANA SUKUN (16#00901#, 16#00903#), -- DEVANAGARI SIGN CANDRABINDU .. DEVANAGARI SIGN VISARGA (16#0093C#, 16#0093C#), -- DEVANAGARI SIGN NUKTA .. DEVANAGARI SIGN NUKTA (16#0093E#, 16#0094D#), -- DEVANAGARI VOWEL SIGN AA .. DEVANAGARI SIGN VIRAMA (16#00951#, 16#00954#), -- DEVANAGARI STRESS SIGN UDATTA .. DEVANAGARI ACUTE ACCENT (16#00962#, 16#00963#), -- DEVANAGARI VOWEL SIGN VOCALIC L .. DEVANAGARI VOWEL SIGN VOCALIC LL (16#00981#, 16#00983#), -- BENGALI SIGN CANDRABINDU .. BENGALI SIGN VISARGA (16#009BC#, 16#009BC#), -- BENGALI SIGN NUKTA .. BENGALI SIGN NUKTA (16#009BE#, 16#009C4#), -- BENGALI VOWEL SIGN AA .. BENGALI VOWEL SIGN VOCALIC RR (16#009C7#, 16#009C8#), -- BENGALI VOWEL SIGN E .. BENGALI VOWEL SIGN AI (16#009CB#, 16#009CD#), -- BENGALI VOWEL SIGN O .. BENGALI SIGN VIRAMA (16#009D7#, 16#009D7#), -- BENGALI AU LENGTH MARK .. BENGALI AU LENGTH MARK (16#009E2#, 16#009E3#), -- BENGALI VOWEL SIGN VOCALIC L .. BENGALI VOWEL SIGN VOCALIC LL (16#00A01#, 16#00A03#), -- GURMUKHI SIGN ADAK BINDI .. GURMUKHI SIGN VISARGA (16#00A3C#, 16#00A3C#), -- GURMUKHI SIGN NUKTA .. GURMUKHI SIGN NUKTA (16#00A3E#, 16#00A42#), -- GURMUKHI VOWEL SIGN AA .. GURMUKHI VOWEL SIGN UU (16#00A47#, 16#00A48#), -- GURMUKHI VOWEL SIGN EE .. GURMUKHI VOWEL SIGN AI (16#00A4B#, 16#00A4D#), -- GURMUKHI VOWEL SIGN OO .. GURMUKHI SIGN VIRAMA (16#00A70#, 16#00A71#), -- GURMUKHI TIPPI .. GURMUKHI ADDAK (16#00A81#, 16#00A83#), -- GUJARATI SIGN CANDRABINDU .. GUJARATI SIGN VISARGA (16#00ABC#, 16#00ABC#), -- GUJARATI SIGN NUKTA .. GUJARATI SIGN NUKTA (16#00ABE#, 16#00AC5#), -- GUJARATI VOWEL SIGN AA .. GUJARATI VOWEL SIGN CANDRA E (16#00AC7#, 16#00AC9#), -- GUJARATI VOWEL SIGN E .. GUJARATI VOWEL SIGN CANDRA O (16#00ACB#, 16#00ACD#), -- GUJARATI VOWEL SIGN O .. GUJARATI SIGN VIRAMA (16#00AE2#, 16#00AE3#), -- GUJARATI VOWEL SIGN VOCALIC L .. GUJARATI VOWEL SIGN VOCALIC LL (16#00B01#, 16#00B03#), -- ORIYA SIGN CANDRABINDU .. 
ORIYA SIGN VISARGA (16#00B3C#, 16#00B3C#), -- ORIYA SIGN NUKTA .. ORIYA SIGN NUKTA (16#00B3E#, 16#00B43#), -- ORIYA VOWEL SIGN AA .. ORIYA VOWEL SIGN VOCALIC R (16#00B47#, 16#00B48#), -- ORIYA VOWEL SIGN E .. ORIYA VOWEL SIGN AI (16#00B4B#, 16#00B4D#), -- ORIYA VOWEL SIGN O .. ORIYA SIGN VIRAMA (16#00B56#, 16#00B57#), -- ORIYA AI LENGTH MARK .. ORIYA AU LENGTH MARK (16#00B82#, 16#00B82#), -- TAMIL SIGN ANUSVARA .. TAMIL SIGN ANUSVARA (16#00BBE#, 16#00BC2#), -- TAMIL VOWEL SIGN AA .. TAMIL VOWEL SIGN UU (16#00BC6#, 16#00BC8#), -- TAMIL VOWEL SIGN E .. TAMIL VOWEL SIGN AI (16#00BCA#, 16#00BCD#), -- TAMIL VOWEL SIGN O .. TAMIL SIGN VIRAMA (16#00BD7#, 16#00BD7#), -- TAMIL AU LENGTH MARK .. TAMIL AU LENGTH MARK (16#00C01#, 16#00C03#), -- TELUGU SIGN CANDRABINDU .. TELUGU SIGN VISARGA (16#00C3E#, 16#00C44#), -- TELUGU VOWEL SIGN AA .. TELUGU VOWEL SIGN VOCALIC RR (16#00C46#, 16#00C48#), -- TELUGU VOWEL SIGN E .. TELUGU VOWEL SIGN AI (16#00C4A#, 16#00C4D#), -- TELUGU VOWEL SIGN O .. TELUGU SIGN VIRAMA (16#00C55#, 16#00C56#), -- TELUGU LENGTH MARK .. TELUGU AI LENGTH MARK (16#00C82#, 16#00C83#), -- KANNADA SIGN ANUSVARA .. KANNADA SIGN VISARGA (16#00CBC#, 16#00CBC#), -- KANNADA SIGN NUKTA .. KANNADA SIGN NUKTA (16#00CBE#, 16#00CC4#), -- KANNADA VOWEL SIGN AA .. KANNADA VOWEL SIGN VOCALIC RR (16#00CC6#, 16#00CC8#), -- KANNADA VOWEL SIGN E .. KANNADA VOWEL SIGN AI (16#00CCA#, 16#00CCD#), -- KANNADA VOWEL SIGN O .. KANNADA SIGN VIRAMA (16#00CD5#, 16#00CD6#), -- KANNADA LENGTH MARK .. KANNADA AI LENGTH MARK (16#00D02#, 16#00D03#), -- MALAYALAM SIGN ANUSVARA .. MALAYALAM SIGN VISARGA (16#00D3E#, 16#00D43#), -- MALAYALAM VOWEL SIGN AA .. MALAYALAM VOWEL SIGN VOCALIC R (16#00D46#, 16#00D48#), -- MALAYALAM VOWEL SIGN E .. MALAYALAM VOWEL SIGN AI (16#00D4A#, 16#00D4D#), -- MALAYALAM VOWEL SIGN O .. MALAYALAM SIGN VIRAMA (16#00D57#, 16#00D57#), -- MALAYALAM AU LENGTH MARK .. MALAYALAM AU LENGTH MARK (16#00D82#, 16#00D83#), -- SINHALA SIGN ANUSVARAYA .. 
SINHALA SIGN VISARGAYA (16#00DCA#, 16#00DCA#), -- SINHALA SIGN AL-LAKUNA .. SINHALA SIGN AL-LAKUNA (16#00DCF#, 16#00DD4#), -- SINHALA VOWEL SIGN AELA-PILLA .. SINHALA VOWEL SIGN KETTI PAA-PILLA (16#00DD6#, 16#00DD6#), -- SINHALA VOWEL SIGN DIGA PAA-PILLA .. SINHALA VOWEL SIGN DIGA PAA-PILLA (16#00DD8#, 16#00DDF#), -- SINHALA VOWEL SIGN GAETTA-PILLA .. SINHALA VOWEL SIGN GAYANUKITTA (16#00DF2#, 16#00DF3#), -- SINHALA VOWEL SIGN DIGA GAETTA-PILLA .. SINHALA VOWEL SIGN DIGA GAYANUKITTA (16#00E31#, 16#00E31#), -- THAI CHARACTER MAI HAN-AKAT .. THAI CHARACTER MAI HAN-AKAT (16#00E34#, 16#00E3A#), -- THAI CHARACTER SARA I .. THAI CHARACTER PHINTHU (16#00E47#, 16#00E4E#), -- THAI CHARACTER MAITAIKHU .. THAI CHARACTER YAMAKKAN (16#00EB1#, 16#00EB1#), -- LAO VOWEL SIGN MAI KAN .. LAO VOWEL SIGN MAI KAN (16#00EB4#, 16#00EB9#), -- LAO VOWEL SIGN I .. LAO VOWEL SIGN UU (16#00EBB#, 16#00EBC#), -- LAO VOWEL SIGN MAI KON .. LAO SEMIVOWEL SIGN LO (16#00EC8#, 16#00ECD#), -- LAO TONE MAI EK .. LAO NIGGAHITA (16#00F18#, 16#00F19#), -- TIBETAN ASTROLOGICAL SIGN -KHYUD PA .. TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS (16#00F35#, 16#00F35#), -- TIBETAN MARK NGAS BZUNG NYI ZLA .. TIBETAN MARK NGAS BZUNG NYI ZLA (16#00F37#, 16#00F37#), -- TIBETAN MARK NGAS BZUNG SGOR RTAGS .. TIBETAN MARK NGAS BZUNG SGOR RTAGS (16#00F39#, 16#00F39#), -- TIBETAN MARK TSA -PHRU .. TIBETAN MARK TSA -PHRU (16#00F3E#, 16#00F3F#), -- TIBETAN SIGN YAR TSHES .. TIBETAN SIGN MAR TSHES (16#00F71#, 16#00F84#), -- TIBETAN VOWEL SIGN AA .. TIBETAN MARK HALANTA (16#00F86#, 16#00F87#), -- TIBETAN SIGN LCI RTAGS .. TIBETAN SIGN YANG RTAGS (16#00F90#, 16#00F97#), -- TIBETAN SUBJOINED LETTER KA .. TIBETAN SUBJOINED LETTER JA (16#00F99#, 16#00FBC#), -- TIBETAN SUBJOINED LETTER NYA .. TIBETAN SUBJOINED LETTER FIXED-FORM RA (16#00FC6#, 16#00FC6#), -- TIBETAN SYMBOL PADMA GDAN .. TIBETAN SYMBOL PADMA GDAN (16#0102C#, 16#01032#), -- MYANMAR VOWEL SIGN AA .. MYANMAR VOWEL SIGN AI (16#01036#, 16#01039#), -- MYANMAR SIGN ANUSVARA .. 
MYANMAR SIGN VIRAMA (16#01056#, 16#01059#), -- MYANMAR VOWEL SIGN VOCALIC R .. MYANMAR VOWEL SIGN VOCALIC LL (16#01712#, 16#01714#), -- TAGALOG VOWEL SIGN I .. TAGALOG SIGN VIRAMA (16#01732#, 16#01734#), -- HANUNOO VOWEL SIGN I .. HANUNOO SIGN PAMUDPOD (16#01752#, 16#01753#), -- BUHID VOWEL SIGN I .. BUHID VOWEL SIGN U (16#01772#, 16#01773#), -- TAGBANWA VOWEL SIGN I .. TAGBANWA VOWEL SIGN U (16#017B6#, 16#017D3#), -- KHMER VOWEL SIGN AA .. KHMER SIGN BATHAMASAT (16#017DD#, 16#017DD#), -- KHMER SIGN ATTHACAN .. KHMER SIGN ATTHACAN (16#0180B#, 16#0180D#), -- MONGOLIAN FREE VARIATION SELECTOR ONE .. MONGOLIAN FREE VARIATION SELECTOR THREE (16#018A9#, 16#018A9#), -- MONGOLIAN LETTER ALI GALI DAGALGA .. MONGOLIAN LETTER ALI GALI DAGALGA (16#01920#, 16#0192B#), -- LIMBU VOWEL SIGN A .. LIMBU SUBJOINED LETTER WA (16#01930#, 16#0193B#), -- LIMBU SMALL LETTER KA .. LIMBU SIGN SA-I (16#020D0#, 16#020DC#), -- COMBINING LEFT HARPOON ABOVE .. COMBINING FOUR DOTS ABOVE (16#020E1#, 16#020E1#), -- COMBINING LEFT RIGHT ARROW ABOVE .. COMBINING LEFT RIGHT ARROW ABOVE (16#020E5#, 16#020EA#), -- COMBINING REVERSE SOLIDUS OVERLAY .. COMBINING LEFTWARDS ARROW OVERLAY (16#0302A#, 16#0302F#), -- IDEOGRAPHIC LEVEL TONE MARK .. HANGUL DOUBLE DOT TONE MARK (16#03099#, 16#0309A#), -- COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK .. COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK (16#0FB1E#, 16#0FB1E#), -- HEBREW POINT JUDEO-SPANISH VARIKA .. HEBREW POINT JUDEO-SPANISH VARIKA (16#0FE00#, 16#0FE0F#), -- VARIATION SELECTOR-1 .. VARIATION SELECTOR-16 (16#0FE20#, 16#0FE23#), -- COMBINING LIGATURE LEFT HALF .. COMBINING DOUBLE TILDE RIGHT HALF (16#1D165#, 16#1D169#), -- MUSICAL SYMBOL COMBINING STEM .. MUSICAL SYMBOL COMBINING TREMOLO-3 (16#1D16D#, 16#1D172#), -- MUSICAL SYMBOL COMBINING AUGMENTATION DOT .. MUSICAL SYMBOL COMBINING FLAG-5 (16#1D17B#, 16#1D182#), -- MUSICAL SYMBOL COMBINING ACCENT .. MUSICAL SYMBOL COMBINING LOURE (16#1D185#, 16#1D18B#), -- MUSICAL SYMBOL COMBINING DOIT .. 
MUSICAL SYMBOL COMBINING TRIPLE TONGUE (16#1D1AA#, 16#1D1AD#), -- MUSICAL SYMBOL COMBINING DOWN BOW .. MUSICAL SYMBOL COMBINING SNAP PIZZICATO (16#E0100#, 16#E01EF#)); -- VARIATION SELECTOR-17 .. VARIATION SELECTOR-256 -- The following table includes all characters considered non-graphic, -- i.e. all characters from the Unicode table with categories: -- Other, Control (Cc) -- Other, Private Use (Co) -- Other, Surrogate (Cs) -- Other, Format (Cf) -- Separator, Line (Zl) -- Separator, Paragraph (Zp) -- In addition, the excluded code positions FFFE and FFFF are listed in -- this table as non-graphic. Note that the -- defined Ada category of format effector is subsumed by the above set -- of Unicode categories. UTF_32_Non_Graphic : constant UTF_32_Ranges := ( (16#00000#, 16#0001F#), -- <control> .. <control> (16#0007F#, 16#0009F#), -- <control> .. <control> (16#000AD#, 16#000AD#), -- SOFT HYPHEN .. SOFT HYPHEN (16#00600#, 16#00603#), -- ARABIC NUMBER SIGN .. ARABIC SIGN SAFHA (16#006DD#, 16#006DD#), -- ARABIC END OF AYAH .. ARABIC END OF AYAH (16#0070F#, 16#0070F#), -- SYRIAC ABBREVIATION MARK .. SYRIAC ABBREVIATION MARK (16#017B4#, 16#017B5#), -- KHMER VOWEL INHERENT AQ .. KHMER VOWEL INHERENT AA (16#0200C#, 16#0200F#), -- ZERO WIDTH NON-JOINER .. RIGHT-TO-LEFT MARK (16#02028#, 16#0202E#), -- LINE SEPARATOR .. RIGHT-TO-LEFT OVERRIDE (16#02060#, 16#02063#), -- WORD JOINER .. INVISIBLE SEPARATOR (16#0206A#, 16#0206F#), -- INHIBIT SYMMETRIC SWAPPING .. NOMINAL DIGIT SHAPES (16#0D800#, 16#0D800#), -- <Non Private Use High Surrogate, First> .. <Non Private Use High Surrogate, First> (16#0DB7F#, 16#0DB80#), -- <Non Private Use High Surrogate, Last> .. <Private Use High Surrogate, First> (16#0DBFF#, 16#0DC00#), -- <Private Use High Surrogate, Last> .. <Low Surrogate, First> (16#0DFFF#, 16#0E000#), -- <Low Surrogate, Last> .. <Private Use, First> (16#0F8FF#, 16#0F8FF#), -- <Private Use, Last> .. <Private Use, Last> (16#0FEFF#, 16#0FEFF#), -- ZERO WIDTH NO-BREAK SPACE .. ZERO WIDTH NO-BREAK SPACE (16#0FFF9#, 16#0FFFB#), -- INTERLINEAR ANNOTATION ANCHOR .. INTERLINEAR ANNOTATION TERMINATOR (16#0FFFE#, 16#0FFFF#), -- excluded code positions (16#1D173#, 16#1D17A#), -- MUSICAL SYMBOL BEGIN BEAM .. MUSICAL SYMBOL END PHRASE (16#E0001#, 16#E0001#), -- LANGUAGE TAG .. LANGUAGE TAG (16#E0020#, 16#E007F#), -- TAG SPACE .. 
CANCEL TAG (16#F0000#, 16#FFFFD#), -- <Plane 15 Private Use, First> .. <Plane 15 Private Use, Last> (16#100000#, 16#10FFFD#)); -- <Plane 16 Private Use, First> .. <Plane 16 Private Use, Last> -- The following two tables define the mapping to upper case. The first -- table gives the ranges of lower case letters. The corresponding entry -- in Uppercase_Adjust shows the amount to be added to (or subtracted from) -- the code value to get the corresponding upper case letter. -- Note that this folding is not reversible: for example, lower case -- dotless i folds to normal upper case I, and that cannot be reversed. Lower_Case_Letters : constant UTF_32_Ranges := ( (16#00061#, 16#0007A#), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z (16#000B5#, 16#000B5#), -- MICRO SIGN .. MICRO SIGN (16#000E0#, 16#000F6#), -- LATIN SMALL LETTER A WITH GRAVE .. LATIN SMALL LETTER O WITH DIAERESIS (16#000F8#, 16#000FE#), -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER THORN (16#000FF#, 16#000FF#), -- LATIN SMALL LETTER Y WITH DIAERESIS .. LATIN SMALL LETTER Y WITH DIAERESIS (16#00101#, 16#00101#), -- LATIN SMALL LETTER A WITH MACRON .. LATIN SMALL LETTER A WITH MACRON (16#00103#, 16#00103#), -- LATIN SMALL LETTER A WITH BREVE .. LATIN SMALL LETTER A WITH BREVE (16#00105#, 16#00105#), -- LATIN SMALL LETTER A WITH OGONEK .. LATIN SMALL LETTER A WITH OGONEK (16#00107#, 16#00107#), -- LATIN SMALL LETTER C WITH ACUTE .. LATIN SMALL LETTER C WITH ACUTE (16#00109#, 16#00109#), -- LATIN SMALL LETTER C WITH CIRCUMFLEX .. LATIN SMALL LETTER C WITH CIRCUMFLEX (16#0010B#, 16#0010B#), -- LATIN SMALL LETTER C WITH DOT ABOVE .. LATIN SMALL LETTER C WITH DOT ABOVE (16#0010D#, 16#0010D#), -- LATIN SMALL LETTER C WITH CARON .. LATIN SMALL LETTER C WITH CARON (16#0010F#, 16#0010F#), -- LATIN SMALL LETTER D WITH CARON .. LATIN SMALL LETTER D WITH CARON (16#00111#, 16#00111#), -- LATIN SMALL LETTER D WITH STROKE .. LATIN SMALL LETTER D WITH STROKE (16#00113#, 16#00113#), -- LATIN SMALL LETTER E WITH MACRON .. LATIN SMALL LETTER E WITH MACRON (16#00115#, 16#00115#), -- LATIN SMALL LETTER E WITH BREVE .. 
LATIN SMALL LETTER E WITH BREVE (16#00117#, 16#00117#), -- LATIN SMALL LETTER E WITH DOT ABOVE .. LATIN SMALL LETTER E WITH DOT ABOVE (16#00119#, 16#00119#), -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK (16#0011B#, 16#0011B#), -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON (16#0011D#, 16#0011D#), -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX (16#0011F#, 16#0011F#), -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE (16#00121#, 16#00121#), -- LATIN SMALL LETTER G WITH DOT ABOVE .. LATIN SMALL LETTER G WITH DOT ABOVE (16#00123#, 16#00123#), -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA (16#00125#, 16#00125#), -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX (16#00127#, 16#00127#), -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE (16#00129#, 16#00129#), -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE (16#0012B#, 16#0012B#), -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON (16#0012D#, 16#0012D#), -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE (16#0012F#, 16#0012F#), -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK (16#00131#, 16#00131#), -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I (16#00133#, 16#00133#), -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ (16#00135#, 16#00135#), -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX (16#00137#, 16#00137#), -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA (16#0013A#, 16#0013A#), -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE (16#0013C#, 16#0013C#), -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA (16#0013E#, 16#0013E#), -- LATIN SMALL LETTER L WITH CARON .. 
LATIN SMALL LETTER L WITH CARON (16#00140#, 16#00140#), -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT (16#00142#, 16#00142#), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE (16#00144#, 16#00144#), -- LATIN SMALL LETTER N WITH ACUTE .. LATIN SMALL LETTER N WITH ACUTE (16#00146#, 16#00146#), -- LATIN SMALL LETTER N WITH CEDILLA .. LATIN SMALL LETTER N WITH CEDILLA (16#00148#, 16#00148#), -- LATIN SMALL LETTER N WITH CARON .. LATIN SMALL LETTER N WITH CARON (16#0014B#, 16#0014B#), -- LATIN SMALL LETTER ENG .. LATIN SMALL LETTER ENG (16#0014D#, 16#0014D#), -- LATIN SMALL LETTER O WITH MACRON .. LATIN SMALL LETTER O WITH MACRON (16#0014F#, 16#0014F#), -- LATIN SMALL LETTER O WITH BREVE .. LATIN SMALL LETTER O WITH BREVE (16#00151#, 16#00151#), -- LATIN SMALL LETTER O WITH DOUBLE ACUTE .. LATIN SMALL LETTER O WITH DOUBLE ACUTE (16#00153#, 16#00153#), -- LATIN SMALL LIGATURE OE .. LATIN SMALL LIGATURE OE (16#00155#, 16#00155#), -- LATIN SMALL LETTER R WITH ACUTE .. LATIN SMALL LETTER R WITH ACUTE (16#00157#, 16#00157#), -- LATIN SMALL LETTER R WITH CEDILLA .. LATIN SMALL LETTER R WITH CEDILLA (16#00159#, 16#00159#), -- LATIN SMALL LETTER R WITH CARON .. LATIN SMALL LETTER R WITH CARON (16#0015B#, 16#0015B#), -- LATIN SMALL LETTER S WITH ACUTE .. LATIN SMALL LETTER S WITH ACUTE (16#0015D#, 16#0015D#), -- LATIN SMALL LETTER S WITH CIRCUMFLEX .. LATIN SMALL LETTER S WITH CIRCUMFLEX (16#0015F#, 16#0015F#), -- LATIN SMALL LETTER S WITH CEDILLA .. LATIN SMALL LETTER S WITH CEDILLA (16#00161#, 16#00161#), -- LATIN SMALL LETTER S WITH CARON .. LATIN SMALL LETTER S WITH CARON (16#00163#, 16#00163#), -- LATIN SMALL LETTER T WITH CEDILLA .. LATIN SMALL LETTER T WITH CEDILLA (16#00165#, 16#00165#), -- LATIN SMALL LETTER T WITH CARON .. LATIN SMALL LETTER T WITH CARON (16#00167#, 16#00167#), -- LATIN SMALL LETTER T WITH STROKE .. LATIN SMALL LETTER T WITH STROKE (16#00169#, 16#00169#), -- LATIN SMALL LETTER U WITH TILDE .. 
LATIN SMALL LETTER U WITH TILDE (16#0016B#, 16#0016B#), -- LATIN SMALL LETTER U WITH MACRON .. LATIN SMALL LETTER U WITH MACRON (16#0016D#, 16#0016D#), -- LATIN SMALL LETTER U WITH BREVE .. LATIN SMALL LETTER U WITH BREVE (16#0016F#, 16#0016F#), -- LATIN SMALL LETTER U WITH RING ABOVE .. LATIN SMALL LETTER U WITH RING ABOVE (16#00171#, 16#00171#), -- LATIN SMALL LETTER U WITH DOUBLE ACUTE .. LATIN SMALL LETTER U WITH DOUBLE ACUTE (16#00173#, 16#00173#), -- LATIN SMALL LETTER U WITH OGONEK .. LATIN SMALL LETTER U WITH OGONEK (16#00175#, 16#00175#), -- LATIN SMALL LETTER W WITH CIRCUMFLEX .. LATIN SMALL LETTER W WITH CIRCUMFLEX (16#00177#, 16#00177#), -- LATIN SMALL LETTER Y WITH CIRCUMFLEX .. LATIN SMALL LETTER Y WITH CIRCUMFLEX (16#0017A#, 16#0017A#), -- LATIN SMALL LETTER Z WITH ACUTE .. LATIN SMALL LETTER Z WITH ACUTE (16#0017C#, 16#0017C#), -- LATIN SMALL LETTER Z WITH DOT ABOVE .. LATIN SMALL LETTER Z WITH DOT ABOVE (16#0017E#, 16#0017E#), -- LATIN SMALL LETTER Z WITH CARON .. LATIN SMALL LETTER Z WITH CARON (16#0017F#, 16#0017F#), -- LATIN SMALL LETTER LONG S .. LATIN SMALL LETTER LONG S (16#00183#, 16#00183#), -- LATIN SMALL LETTER B WITH TOPBAR .. LATIN SMALL LETTER B WITH TOPBAR (16#00185#, 16#00185#), -- LATIN SMALL LETTER TONE SIX .. LATIN SMALL LETTER TONE SIX (16#00188#, 16#00188#), -- LATIN SMALL LETTER C WITH HOOK .. LATIN SMALL LETTER C WITH HOOK (16#0018C#, 16#0018C#), -- LATIN SMALL LETTER D WITH TOPBAR .. LATIN SMALL LETTER D WITH TOPBAR (16#00192#, 16#00192#), -- LATIN SMALL LETTER F WITH HOOK .. LATIN SMALL LETTER F WITH HOOK (16#00195#, 16#00195#), -- LATIN SMALL LETTER HV .. LATIN SMALL LETTER HV (16#00199#, 16#00199#), -- LATIN SMALL LETTER K WITH HOOK .. LATIN SMALL LETTER K WITH HOOK (16#0019E#, 16#0019E#), -- LATIN SMALL LETTER N WITH LONG RIGHT LEG .. LATIN SMALL LETTER N WITH LONG RIGHT LEG (16#001A1#, 16#001A1#), -- LATIN SMALL LETTER O WITH HORN .. LATIN SMALL LETTER O WITH HORN (16#001A3#, 16#001A3#), -- LATIN SMALL LETTER OI .. 
LATIN SMALL LETTER OI (16#001A5#, 16#001A5#), -- LATIN SMALL LETTER P WITH HOOK .. LATIN SMALL LETTER P WITH HOOK (16#001A8#, 16#001A8#), -- LATIN SMALL LETTER TONE TWO .. LATIN SMALL LETTER TONE TWO (16#001AD#, 16#001AD#), -- LATIN SMALL LETTER T WITH HOOK .. LATIN SMALL LETTER T WITH HOOK (16#001B0#, 16#001B0#), -- LATIN SMALL LETTER U WITH HORN .. LATIN SMALL LETTER U WITH HORN (16#001B4#, 16#001B4#), -- LATIN SMALL LETTER Y WITH HOOK .. LATIN SMALL LETTER Y WITH HOOK (16#001B6#, 16#001B6#), -- LATIN SMALL LETTER Z WITH STROKE .. LATIN SMALL LETTER Z WITH STROKE (16#001B9#, 16#001B9#), -- LATIN SMALL LETTER EZH REVERSED .. LATIN SMALL LETTER EZH REVERSED (16#001BD#, 16#001BD#), -- LATIN SMALL LETTER TONE FIVE .. LATIN SMALL LETTER TONE FIVE (16#001BF#, 16#001BF#), -- LATIN LETTER WYNN .. LATIN LETTER WYNN (16#001C5#, 16#001C5#), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (16#001C6#, 16#001C6#), -- LATIN SMALL LETTER DZ WITH CARON .. LATIN SMALL LETTER DZ WITH CARON (16#001C8#, 16#001C8#), -- LATIN CAPITAL LETTER L WITH SMALL LETTER J .. LATIN CAPITAL LETTER L WITH SMALL LETTER J (16#001C9#, 16#001C9#), -- LATIN SMALL LETTER LJ .. LATIN SMALL LETTER LJ (16#001CB#, 16#001CB#), -- LATIN CAPITAL LETTER N WITH SMALL LETTER J .. LATIN CAPITAL LETTER N WITH SMALL LETTER J (16#001CC#, 16#001CC#), -- LATIN SMALL LETTER NJ .. LATIN SMALL LETTER NJ (16#001CE#, 16#001CE#), -- LATIN SMALL LETTER A WITH CARON .. LATIN SMALL LETTER A WITH CARON (16#001D0#, 16#001D0#), -- LATIN SMALL LETTER I WITH CARON .. LATIN SMALL LETTER I WITH CARON (16#001D2#, 16#001D2#), -- LATIN SMALL LETTER O WITH CARON .. LATIN SMALL LETTER O WITH CARON (16#001D4#, 16#001D4#), -- LATIN SMALL LETTER U WITH CARON .. LATIN SMALL LETTER U WITH CARON (16#001D6#, 16#001D6#), -- LATIN SMALL LETTER U WITH DIAERESIS AND MACRON .. 
LATIN SMALL LETTER U WITH DIAERESIS AND MACRON (16#001D8#, 16#001D8#), -- LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE (16#001DA#, 16#001DA#), -- LATIN SMALL LETTER U WITH DIAERESIS AND CARON .. LATIN SMALL LETTER U WITH DIAERESIS AND CARON (16#001DC#, 16#001DC#), -- LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE .. LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE (16#001DD#, 16#001DD#), -- LATIN SMALL LETTER TURNED E .. LATIN SMALL LETTER TURNED E (16#001DF#, 16#001DF#), -- LATIN SMALL LETTER A WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER A WITH DIAERESIS AND MACRON (16#001E1#, 16#001E1#), -- LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON (16#001E3#, 16#001E3#), -- LATIN SMALL LETTER AE WITH MACRON .. LATIN SMALL LETTER AE WITH MACRON (16#001E5#, 16#001E5#), -- LATIN SMALL LETTER G WITH STROKE .. LATIN SMALL LETTER G WITH STROKE (16#001E7#, 16#001E7#), -- LATIN SMALL LETTER G WITH CARON .. LATIN SMALL LETTER G WITH CARON (16#001E9#, 16#001E9#), -- LATIN SMALL LETTER K WITH CARON .. LATIN SMALL LETTER K WITH CARON (16#001EB#, 16#001EB#), -- LATIN SMALL LETTER O WITH OGONEK .. LATIN SMALL LETTER O WITH OGONEK (16#001ED#, 16#001ED#), -- LATIN SMALL LETTER O WITH OGONEK AND MACRON .. LATIN SMALL LETTER O WITH OGONEK AND MACRON (16#001EF#, 16#001EF#), -- LATIN SMALL LETTER EZH WITH CARON .. LATIN SMALL LETTER EZH WITH CARON (16#001F2#, 16#001F2#), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z (16#001F3#, 16#001F3#), -- LATIN SMALL LETTER DZ .. LATIN SMALL LETTER DZ (16#001F5#, 16#001F5#), -- LATIN SMALL LETTER G WITH ACUTE .. LATIN SMALL LETTER G WITH ACUTE (16#001F9#, 16#001F9#), -- LATIN SMALL LETTER N WITH GRAVE .. LATIN SMALL LETTER N WITH GRAVE (16#001FB#, 16#001FB#), -- LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE .. 
LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE (16#001FD#, 16#001FD#), -- LATIN SMALL LETTER AE WITH ACUTE .. LATIN SMALL LETTER AE WITH ACUTE (16#001FF#, 16#001FF#), -- LATIN SMALL LETTER O WITH STROKE AND ACUTE .. LATIN SMALL LETTER O WITH STROKE AND ACUTE (16#00201#, 16#00201#), -- LATIN SMALL LETTER A WITH DOUBLE GRAVE .. LATIN SMALL LETTER A WITH DOUBLE GRAVE (16#00203#, 16#00203#), -- LATIN SMALL LETTER A WITH INVERTED BREVE .. LATIN SMALL LETTER A WITH INVERTED BREVE (16#00205#, 16#00205#), -- LATIN SMALL LETTER E WITH DOUBLE GRAVE .. LATIN SMALL LETTER E WITH DOUBLE GRAVE (16#00207#, 16#00207#), -- LATIN SMALL LETTER E WITH INVERTED BREVE .. LATIN SMALL LETTER E WITH INVERTED BREVE (16#00209#, 16#00209#), -- LATIN SMALL LETTER I WITH DOUBLE GRAVE .. LATIN SMALL LETTER I WITH DOUBLE GRAVE (16#0020B#, 16#0020B#), -- LATIN SMALL LETTER I WITH INVERTED BREVE .. LATIN SMALL LETTER I WITH INVERTED BREVE (16#0020D#, 16#0020D#), -- LATIN SMALL LETTER O WITH DOUBLE GRAVE .. LATIN SMALL LETTER O WITH DOUBLE GRAVE (16#0020F#, 16#0020F#), -- LATIN SMALL LETTER O WITH INVERTED BREVE .. LATIN SMALL LETTER O WITH INVERTED BREVE (16#00211#, 16#00211#), -- LATIN SMALL LETTER R WITH DOUBLE GRAVE .. LATIN SMALL LETTER R WITH DOUBLE GRAVE (16#00213#, 16#00213#), -- LATIN SMALL LETTER R WITH INVERTED BREVE .. LATIN SMALL LETTER R WITH INVERTED BREVE (16#00215#, 16#00215#), -- LATIN SMALL LETTER U WITH DOUBLE GRAVE .. LATIN SMALL LETTER U WITH DOUBLE GRAVE (16#00217#, 16#00217#), -- LATIN SMALL LETTER U WITH INVERTED BREVE .. LATIN SMALL LETTER U WITH INVERTED BREVE (16#00219#, 16#00219#), -- LATIN SMALL LETTER S WITH COMMA BELOW .. LATIN SMALL LETTER S WITH COMMA BELOW (16#0021B#, 16#0021B#), -- LATIN SMALL LETTER T WITH COMMA BELOW .. LATIN SMALL LETTER T WITH COMMA BELOW (16#0021D#, 16#0021D#), -- LATIN SMALL LETTER YOGH .. LATIN SMALL LETTER YOGH (16#0021F#, 16#0021F#), -- LATIN SMALL LETTER H WITH CARON .. 
LATIN SMALL LETTER H WITH CARON (16#00223#, 16#00223#), -- LATIN SMALL LETTER OU .. LATIN SMALL LETTER OU (16#00225#, 16#00225#), -- LATIN SMALL LETTER Z WITH HOOK .. LATIN SMALL LETTER Z WITH HOOK (16#00227#, 16#00227#), -- LATIN SMALL LETTER A WITH DOT ABOVE .. LATIN SMALL LETTER A WITH DOT ABOVE (16#00229#, 16#00229#), -- LATIN SMALL LETTER E WITH CEDILLA .. LATIN SMALL LETTER E WITH CEDILLA (16#0022B#, 16#0022B#), -- LATIN SMALL LETTER O WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER O WITH DIAERESIS AND MACRON (16#0022D#, 16#0022D#), -- LATIN SMALL LETTER O WITH TILDE AND MACRON .. LATIN SMALL LETTER O WITH TILDE AND MACRON (16#0022F#, 16#0022F#), -- LATIN SMALL LETTER O WITH DOT ABOVE .. LATIN SMALL LETTER O WITH DOT ABOVE (16#00231#, 16#00231#), -- LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON (16#00233#, 16#00233#), -- LATIN SMALL LETTER Y WITH MACRON .. LATIN SMALL LETTER Y WITH MACRON (16#00253#, 16#00253#), -- LATIN SMALL LETTER B WITH HOOK .. LATIN SMALL LETTER B WITH HOOK (16#00254#, 16#00254#), -- LATIN SMALL LETTER OPEN O .. LATIN SMALL LETTER OPEN O (16#00256#, 16#00257#), -- LATIN SMALL LETTER D WITH TAIL .. LATIN SMALL LETTER D WITH HOOK (16#00259#, 16#00259#), -- LATIN SMALL LETTER SCHWA .. LATIN SMALL LETTER SCHWA (16#0025B#, 16#0025B#), -- LATIN SMALL LETTER OPEN E .. LATIN SMALL LETTER OPEN E (16#00260#, 16#00260#), -- LATIN SMALL LETTER G WITH HOOK .. LATIN SMALL LETTER G WITH HOOK (16#00263#, 16#00263#), -- LATIN SMALL LETTER GAMMA .. LATIN SMALL LETTER GAMMA (16#00268#, 16#00268#), -- LATIN SMALL LETTER I WITH STROKE .. LATIN SMALL LETTER I WITH STROKE (16#00269#, 16#00269#), -- LATIN SMALL LETTER IOTA .. LATIN SMALL LETTER IOTA (16#0026F#, 16#0026F#), -- LATIN SMALL LETTER TURNED M .. LATIN SMALL LETTER TURNED M (16#00272#, 16#00272#), -- LATIN SMALL LETTER N WITH LEFT HOOK .. LATIN SMALL LETTER N WITH LEFT HOOK (16#00275#, 16#00275#), -- LATIN SMALL LETTER BARRED O .. 
LATIN SMALL LETTER BARRED O (16#00280#, 16#00280#), -- LATIN LETTER SMALL CAPITAL R .. LATIN LETTER SMALL CAPITAL R (16#00283#, 16#00283#), -- LATIN SMALL LETTER ESH .. LATIN SMALL LETTER ESH (16#00288#, 16#00288#), -- LATIN SMALL LETTER T WITH RETROFLEX HOOK .. LATIN SMALL LETTER T WITH RETROFLEX HOOK (16#0028A#, 16#0028B#), -- LATIN SMALL LETTER UPSILON .. LATIN SMALL LETTER V WITH HOOK (16#00292#, 16#00292#), -- LATIN SMALL LETTER EZH .. LATIN SMALL LETTER EZH (16#003AC#, 16#003AC#), -- GREEK SMALL LETTER ALPHA WITH TONOS .. GREEK SMALL LETTER ALPHA WITH TONOS (16#003AD#, 16#003AF#), -- GREEK SMALL LETTER EPSILON WITH TONOS .. GREEK SMALL LETTER IOTA WITH TONOS (16#003B1#, 16#003C1#), -- GREEK SMALL LETTER ALPHA .. GREEK SMALL LETTER RHO (16#003C2#, 16#003C2#), -- GREEK SMALL LETTER FINAL SIGMA .. GREEK SMALL LETTER FINAL SIGMA (16#003C3#, 16#003CB#), -- GREEK SMALL LETTER SIGMA .. GREEK SMALL LETTER UPSILON WITH DIALYTIKA (16#003CC#, 16#003CC#), -- GREEK SMALL LETTER OMICRON WITH TONOS .. GREEK SMALL LETTER OMICRON WITH TONOS (16#003CD#, 16#003CE#), -- GREEK SMALL LETTER UPSILON WITH TONOS .. GREEK SMALL LETTER OMEGA WITH TONOS (16#003D0#, 16#003D0#), -- GREEK BETA SYMBOL .. GREEK BETA SYMBOL (16#003D1#, 16#003D1#), -- GREEK THETA SYMBOL .. GREEK THETA SYMBOL (16#003D5#, 16#003D5#), -- GREEK PHI SYMBOL .. GREEK PHI SYMBOL (16#003D6#, 16#003D6#), -- GREEK PI SYMBOL .. GREEK PI SYMBOL (16#003D9#, 16#003D9#), -- GREEK SMALL LETTER ARCHAIC KOPPA .. GREEK SMALL LETTER ARCHAIC KOPPA (16#003DB#, 16#003DB#), -- GREEK SMALL LETTER STIGMA .. GREEK SMALL LETTER STIGMA (16#003DD#, 16#003DD#), -- GREEK SMALL LETTER DIGAMMA .. GREEK SMALL LETTER DIGAMMA (16#003DF#, 16#003DF#), -- GREEK SMALL LETTER KOPPA .. GREEK SMALL LETTER KOPPA (16#003E1#, 16#003E1#), -- GREEK SMALL LETTER SAMPI .. GREEK SMALL LETTER SAMPI (16#003E3#, 16#003E3#), -- COPTIC SMALL LETTER SHEI .. COPTIC SMALL LETTER SHEI (16#003E5#, 16#003E5#), -- COPTIC SMALL LETTER FEI .. 
COPTIC SMALL LETTER FEI (16#003E7#, 16#003E7#), -- COPTIC SMALL LETTER KHEI .. COPTIC SMALL LETTER KHEI (16#003E9#, 16#003E9#), -- COPTIC SMALL LETTER HORI .. COPTIC SMALL LETTER HORI (16#003EB#, 16#003EB#), -- COPTIC SMALL LETTER GANGIA .. COPTIC SMALL LETTER GANGIA (16#003ED#, 16#003ED#), -- COPTIC SMALL LETTER SHIMA .. COPTIC SMALL LETTER SHIMA (16#003EF#, 16#003EF#), -- COPTIC SMALL LETTER DEI .. COPTIC SMALL LETTER DEI (16#003F0#, 16#003F0#), -- GREEK KAPPA SYMBOL .. GREEK KAPPA SYMBOL (16#003F1#, 16#003F1#), -- GREEK RHO SYMBOL .. GREEK RHO SYMBOL (16#003F2#, 16#003F2#), -- GREEK LUNATE SIGMA SYMBOL .. GREEK LUNATE SIGMA SYMBOL (16#003F5#, 16#003F5#), -- GREEK LUNATE EPSILON SYMBOL .. GREEK LUNATE EPSILON SYMBOL (16#00430#, 16#0044F#), -- CYRILLIC SMALL LETTER A .. CYRILLIC SMALL LETTER YA (16#00450#, 16#0045F#), -- CYRILLIC SMALL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER DZHE (16#00461#, 16#00461#), -- CYRILLIC SMALL LETTER OMEGA .. CYRILLIC SMALL LETTER OMEGA (16#00463#, 16#00463#), -- CYRILLIC SMALL LETTER YAT .. CYRILLIC SMALL LETTER YAT (16#00465#, 16#00465#), -- CYRILLIC SMALL LETTER IOTIFIED E .. CYRILLIC SMALL LETTER IOTIFIED E (16#00467#, 16#00467#), -- CYRILLIC SMALL LETTER LITTLE YUS .. CYRILLIC SMALL LETTER LITTLE YUS (16#00469#, 16#00469#), -- CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS .. CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS (16#0046B#, 16#0046B#), -- CYRILLIC SMALL LETTER BIG YUS .. CYRILLIC SMALL LETTER BIG YUS (16#0046D#, 16#0046D#), -- CYRILLIC SMALL LETTER IOTIFIED BIG YUS .. CYRILLIC SMALL LETTER IOTIFIED BIG YUS (16#0046F#, 16#0046F#), -- CYRILLIC SMALL LETTER KSI .. CYRILLIC SMALL LETTER KSI (16#00471#, 16#00471#), -- CYRILLIC SMALL LETTER PSI .. CYRILLIC SMALL LETTER PSI (16#00473#, 16#00473#), -- CYRILLIC SMALL LETTER FITA .. CYRILLIC SMALL LETTER FITA (16#00475#, 16#00475#), -- CYRILLIC SMALL LETTER IZHITSA .. CYRILLIC SMALL LETTER IZHITSA (16#00477#, 16#00477#), -- CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT .. 
CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT (16#00479#, 16#00479#), -- CYRILLIC SMALL LETTER UK .. CYRILLIC SMALL LETTER UK (16#0047B#, 16#0047B#), -- CYRILLIC SMALL LETTER ROUND OMEGA .. CYRILLIC SMALL LETTER ROUND OMEGA (16#0047D#, 16#0047D#), -- CYRILLIC SMALL LETTER OMEGA WITH TITLO .. CYRILLIC SMALL LETTER OMEGA WITH TITLO (16#0047F#, 16#0047F#), -- CYRILLIC SMALL LETTER OT .. CYRILLIC SMALL LETTER OT (16#00481#, 16#00481#), -- CYRILLIC SMALL LETTER KOPPA .. CYRILLIC SMALL LETTER KOPPA (16#0048B#, 16#0048B#), -- CYRILLIC SMALL LETTER SHORT I WITH TAIL .. CYRILLIC SMALL LETTER SHORT I WITH TAIL (16#0048D#, 16#0048D#), -- CYRILLIC SMALL LETTER SEMISOFT SIGN .. CYRILLIC SMALL LETTER SEMISOFT SIGN (16#0048F#, 16#0048F#), -- CYRILLIC SMALL LETTER ER WITH TICK .. CYRILLIC SMALL LETTER ER WITH TICK (16#00491#, 16#00491#), -- CYRILLIC SMALL LETTER GHE WITH UPTURN .. CYRILLIC SMALL LETTER GHE WITH UPTURN (16#00493#, 16#00493#), -- CYRILLIC SMALL LETTER GHE WITH STROKE .. CYRILLIC SMALL LETTER GHE WITH STROKE (16#00495#, 16#00495#), -- CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK (16#00497#, 16#00497#), -- CYRILLIC SMALL LETTER ZHE WITH DESCENDER .. CYRILLIC SMALL LETTER ZHE WITH DESCENDER (16#00499#, 16#00499#), -- CYRILLIC SMALL LETTER ZE WITH DESCENDER .. CYRILLIC SMALL LETTER ZE WITH DESCENDER (16#0049B#, 16#0049B#), -- CYRILLIC SMALL LETTER KA WITH DESCENDER .. CYRILLIC SMALL LETTER KA WITH DESCENDER (16#0049D#, 16#0049D#), -- CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE (16#0049F#, 16#0049F#), -- CYRILLIC SMALL LETTER KA WITH STROKE .. CYRILLIC SMALL LETTER KA WITH STROKE (16#004A1#, 16#004A1#), -- CYRILLIC SMALL LETTER BASHKIR KA .. CYRILLIC SMALL LETTER BASHKIR KA (16#004A3#, 16#004A3#), -- CYRILLIC SMALL LETTER EN WITH DESCENDER .. CYRILLIC SMALL LETTER EN WITH DESCENDER (16#004A5#, 16#004A5#), -- CYRILLIC SMALL LIGATURE EN GHE .. 
CYRILLIC SMALL LIGATURE EN GHE (16#004A7#, 16#004A7#), -- CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK (16#004A9#, 16#004A9#), -- CYRILLIC SMALL LETTER ABKHASIAN HA .. CYRILLIC SMALL LETTER ABKHASIAN HA (16#004AB#, 16#004AB#), -- CYRILLIC SMALL LETTER ES WITH DESCENDER .. CYRILLIC SMALL LETTER ES WITH DESCENDER (16#004AD#, 16#004AD#), -- CYRILLIC SMALL LETTER TE WITH DESCENDER .. CYRILLIC SMALL LETTER TE WITH DESCENDER (16#004AF#, 16#004AF#), -- CYRILLIC SMALL LETTER STRAIGHT U .. CYRILLIC SMALL LETTER STRAIGHT U (16#004B1#, 16#004B1#), -- CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE .. CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE (16#004B3#, 16#004B3#), -- CYRILLIC SMALL LETTER HA WITH DESCENDER .. CYRILLIC SMALL LETTER HA WITH DESCENDER (16#004B5#, 16#004B5#), -- CYRILLIC SMALL LIGATURE TE TSE .. CYRILLIC SMALL LIGATURE TE TSE (16#004B7#, 16#004B7#), -- CYRILLIC SMALL LETTER CHE WITH DESCENDER .. CYRILLIC SMALL LETTER CHE WITH DESCENDER (16#004B9#, 16#004B9#), -- CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE (16#004BB#, 16#004BB#), -- CYRILLIC SMALL LETTER SHHA .. CYRILLIC SMALL LETTER SHHA (16#004BD#, 16#004BD#), -- CYRILLIC SMALL LETTER ABKHASIAN CHE .. CYRILLIC SMALL LETTER ABKHASIAN CHE (16#004BF#, 16#004BF#), -- CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER .. CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER (16#004C2#, 16#004C2#), -- CYRILLIC SMALL LETTER ZHE WITH BREVE .. CYRILLIC SMALL LETTER ZHE WITH BREVE (16#004C4#, 16#004C4#), -- CYRILLIC SMALL LETTER KA WITH HOOK .. CYRILLIC SMALL LETTER KA WITH HOOK (16#004C6#, 16#004C6#), -- CYRILLIC SMALL LETTER EL WITH TAIL .. CYRILLIC SMALL LETTER EL WITH TAIL (16#004C8#, 16#004C8#), -- CYRILLIC SMALL LETTER EN WITH HOOK .. CYRILLIC SMALL LETTER EN WITH HOOK (16#004CA#, 16#004CA#), -- CYRILLIC SMALL LETTER EN WITH TAIL .. 
CYRILLIC SMALL LETTER EN WITH TAIL (16#004CC#, 16#004CC#), -- CYRILLIC SMALL LETTER KHAKASSIAN CHE .. CYRILLIC SMALL LETTER KHAKASSIAN CHE (16#004CE#, 16#004CE#), -- CYRILLIC SMALL LETTER EM WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL (16#004D1#, 16#004D1#), -- CYRILLIC SMALL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER A WITH BREVE (16#004D3#, 16#004D3#), -- CYRILLIC SMALL LETTER A WITH DIAERESIS .. CYRILLIC SMALL LETTER A WITH DIAERESIS (16#004D5#, 16#004D5#), -- CYRILLIC SMALL LIGATURE A IE .. CYRILLIC SMALL LIGATURE A IE (16#004D7#, 16#004D7#), -- CYRILLIC SMALL LETTER IE WITH BREVE .. CYRILLIC SMALL LETTER IE WITH BREVE (16#004D9#, 16#004D9#), -- CYRILLIC SMALL LETTER SCHWA .. CYRILLIC SMALL LETTER SCHWA (16#004DB#, 16#004DB#), -- CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS .. CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS (16#004DD#, 16#004DD#), -- CYRILLIC SMALL LETTER ZHE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZHE WITH DIAERESIS (16#004DF#, 16#004DF#), -- CYRILLIC SMALL LETTER ZE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZE WITH DIAERESIS (16#004E1#, 16#004E1#), -- CYRILLIC SMALL LETTER ABKHASIAN DZE .. CYRILLIC SMALL LETTER ABKHASIAN DZE (16#004E3#, 16#004E3#), -- CYRILLIC SMALL LETTER I WITH MACRON .. CYRILLIC SMALL LETTER I WITH MACRON (16#004E5#, 16#004E5#), -- CYRILLIC SMALL LETTER I WITH DIAERESIS .. CYRILLIC SMALL LETTER I WITH DIAERESIS (16#004E7#, 16#004E7#), -- CYRILLIC SMALL LETTER O WITH DIAERESIS .. CYRILLIC SMALL LETTER O WITH DIAERESIS (16#004E9#, 16#004E9#), -- CYRILLIC SMALL LETTER BARRED O .. CYRILLIC SMALL LETTER BARRED O (16#004EB#, 16#004EB#), -- CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS .. CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS (16#004ED#, 16#004ED#), -- CYRILLIC SMALL LETTER E WITH DIAERESIS .. CYRILLIC SMALL LETTER E WITH DIAERESIS (16#004EF#, 16#004EF#), -- CYRILLIC SMALL LETTER U WITH MACRON .. CYRILLIC SMALL LETTER U WITH MACRON (16#004F1#, 16#004F1#), -- CYRILLIC SMALL LETTER U WITH DIAERESIS .. 
CYRILLIC SMALL LETTER U WITH DIAERESIS (16#004F3#, 16#004F3#), -- CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE .. CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE (16#004F5#, 16#004F5#), -- CYRILLIC SMALL LETTER CHE WITH DIAERESIS .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS (16#004F9#, 16#004F9#), -- CYRILLIC SMALL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS (16#00501#, 16#00501#), -- CYRILLIC SMALL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI DE (16#00503#, 16#00503#), -- CYRILLIC SMALL LETTER KOMI DJE .. CYRILLIC SMALL LETTER KOMI DJE (16#00505#, 16#00505#), -- CYRILLIC SMALL LETTER KOMI ZJE .. CYRILLIC SMALL LETTER KOMI ZJE (16#00507#, 16#00507#), -- CYRILLIC SMALL LETTER KOMI DZJE .. CYRILLIC SMALL LETTER KOMI DZJE (16#00509#, 16#00509#), -- CYRILLIC SMALL LETTER KOMI LJE .. CYRILLIC SMALL LETTER KOMI LJE (16#0050B#, 16#0050B#), -- CYRILLIC SMALL LETTER KOMI NJE .. CYRILLIC SMALL LETTER KOMI NJE (16#0050D#, 16#0050D#), -- CYRILLIC SMALL LETTER KOMI SJE .. CYRILLIC SMALL LETTER KOMI SJE (16#0050F#, 16#0050F#), -- CYRILLIC SMALL LETTER KOMI TJE .. CYRILLIC SMALL LETTER KOMI TJE (16#00561#, 16#00586#), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LETTER FEH (16#01E01#, 16#01E01#), -- LATIN SMALL LETTER A WITH RING BELOW .. LATIN SMALL LETTER A WITH RING BELOW (16#01E03#, 16#01E03#), -- LATIN SMALL LETTER B WITH DOT ABOVE .. LATIN SMALL LETTER B WITH DOT ABOVE (16#01E05#, 16#01E05#), -- LATIN SMALL LETTER B WITH DOT BELOW .. LATIN SMALL LETTER B WITH DOT BELOW (16#01E07#, 16#01E07#), -- LATIN SMALL LETTER B WITH LINE BELOW .. LATIN SMALL LETTER B WITH LINE BELOW (16#01E09#, 16#01E09#), -- LATIN SMALL LETTER C WITH CEDILLA AND ACUTE .. LATIN SMALL LETTER C WITH CEDILLA AND ACUTE (16#01E0B#, 16#01E0B#), -- LATIN SMALL LETTER D WITH DOT ABOVE .. LATIN SMALL LETTER D WITH DOT ABOVE (16#01E0D#, 16#01E0D#), -- LATIN SMALL LETTER D WITH DOT BELOW .. LATIN SMALL LETTER D WITH DOT BELOW (16#01E0F#, 16#01E0F#), -- LATIN SMALL LETTER D WITH LINE BELOW .. 
LATIN SMALL LETTER D WITH LINE BELOW (16#01E11#, 16#01E11#), -- LATIN SMALL LETTER D WITH CEDILLA .. LATIN SMALL LETTER D WITH CEDILLA (16#01E13#, 16#01E13#), -- LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW (16#01E15#, 16#01E15#), -- LATIN SMALL LETTER E WITH MACRON AND GRAVE .. LATIN SMALL LETTER E WITH MACRON AND GRAVE (16#01E17#, 16#01E17#), -- LATIN SMALL LETTER E WITH MACRON AND ACUTE .. LATIN SMALL LETTER E WITH MACRON AND ACUTE (16#01E19#, 16#01E19#), -- LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW (16#01E1B#, 16#01E1B#), -- LATIN SMALL LETTER E WITH TILDE BELOW .. LATIN SMALL LETTER E WITH TILDE BELOW (16#01E1D#, 16#01E1D#), -- LATIN SMALL LETTER E WITH CEDILLA AND BREVE .. LATIN SMALL LETTER E WITH CEDILLA AND BREVE (16#01E1F#, 16#01E1F#), -- LATIN SMALL LETTER F WITH DOT ABOVE .. LATIN SMALL LETTER F WITH DOT ABOVE (16#01E21#, 16#01E21#), -- LATIN SMALL LETTER G WITH MACRON .. LATIN SMALL LETTER G WITH MACRON (16#01E23#, 16#01E23#), -- LATIN SMALL LETTER H WITH DOT ABOVE .. LATIN SMALL LETTER H WITH DOT ABOVE (16#01E25#, 16#01E25#), -- LATIN SMALL LETTER H WITH DOT BELOW .. LATIN SMALL LETTER H WITH DOT BELOW (16#01E27#, 16#01E27#), -- LATIN SMALL LETTER H WITH DIAERESIS .. LATIN SMALL LETTER H WITH DIAERESIS (16#01E29#, 16#01E29#), -- LATIN SMALL LETTER H WITH CEDILLA .. LATIN SMALL LETTER H WITH CEDILLA (16#01E2B#, 16#01E2B#), -- LATIN SMALL LETTER H WITH BREVE BELOW .. LATIN SMALL LETTER H WITH BREVE BELOW (16#01E2D#, 16#01E2D#), -- LATIN SMALL LETTER I WITH TILDE BELOW .. LATIN SMALL LETTER I WITH TILDE BELOW (16#01E2F#, 16#01E2F#), -- LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE (16#01E31#, 16#01E31#), -- LATIN SMALL LETTER K WITH ACUTE .. LATIN SMALL LETTER K WITH ACUTE (16#01E33#, 16#01E33#), -- LATIN SMALL LETTER K WITH DOT BELOW .. 
LATIN SMALL LETTER K WITH DOT BELOW (16#01E35#, 16#01E35#), -- LATIN SMALL LETTER K WITH LINE BELOW .. LATIN SMALL LETTER K WITH LINE BELOW (16#01E37#, 16#01E37#), -- LATIN SMALL LETTER L WITH DOT BELOW .. LATIN SMALL LETTER L WITH DOT BELOW (16#01E39#, 16#01E39#), -- LATIN SMALL LETTER L WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER L WITH DOT BELOW AND MACRON (16#01E3B#, 16#01E3B#), -- LATIN SMALL LETTER L WITH LINE BELOW .. LATIN SMALL LETTER L WITH LINE BELOW (16#01E3D#, 16#01E3D#), -- LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW (16#01E3F#, 16#01E3F#), -- LATIN SMALL LETTER M WITH ACUTE .. LATIN SMALL LETTER M WITH ACUTE (16#01E41#, 16#01E41#), -- LATIN SMALL LETTER M WITH DOT ABOVE .. LATIN SMALL LETTER M WITH DOT ABOVE (16#01E43#, 16#01E43#), -- LATIN SMALL LETTER M WITH DOT BELOW .. LATIN SMALL LETTER M WITH DOT BELOW (16#01E45#, 16#01E45#), -- LATIN SMALL LETTER N WITH DOT ABOVE .. LATIN SMALL LETTER N WITH DOT ABOVE (16#01E47#, 16#01E47#), -- LATIN SMALL LETTER N WITH DOT BELOW .. LATIN SMALL LETTER N WITH DOT BELOW (16#01E49#, 16#01E49#), -- LATIN SMALL LETTER N WITH LINE BELOW .. LATIN SMALL LETTER N WITH LINE BELOW (16#01E4B#, 16#01E4B#), -- LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW (16#01E4D#, 16#01E4D#), -- LATIN SMALL LETTER O WITH TILDE AND ACUTE .. LATIN SMALL LETTER O WITH TILDE AND ACUTE (16#01E4F#, 16#01E4F#), -- LATIN SMALL LETTER O WITH TILDE AND DIAERESIS .. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS (16#01E51#, 16#01E51#), -- LATIN SMALL LETTER O WITH MACRON AND GRAVE .. LATIN SMALL LETTER O WITH MACRON AND GRAVE (16#01E53#, 16#01E53#), -- LATIN SMALL LETTER O WITH MACRON AND ACUTE .. LATIN SMALL LETTER O WITH MACRON AND ACUTE (16#01E55#, 16#01E55#), -- LATIN SMALL LETTER P WITH ACUTE .. LATIN SMALL LETTER P WITH ACUTE (16#01E57#, 16#01E57#), -- LATIN SMALL LETTER P WITH DOT ABOVE .. 
LATIN SMALL LETTER P WITH DOT ABOVE (16#01E59#, 16#01E59#), -- LATIN SMALL LETTER R WITH DOT ABOVE .. LATIN SMALL LETTER R WITH DOT ABOVE (16#01E5B#, 16#01E5B#), -- LATIN SMALL LETTER R WITH DOT BELOW .. LATIN SMALL LETTER R WITH DOT BELOW (16#01E5D#, 16#01E5D#), -- LATIN SMALL LETTER R WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER R WITH DOT BELOW AND MACRON (16#01E5F#, 16#01E5F#), -- LATIN SMALL LETTER R WITH LINE BELOW .. LATIN SMALL LETTER R WITH LINE BELOW (16#01E61#, 16#01E61#), -- LATIN SMALL LETTER S WITH DOT ABOVE .. LATIN SMALL LETTER S WITH DOT ABOVE (16#01E63#, 16#01E63#), -- LATIN SMALL LETTER S WITH DOT BELOW .. LATIN SMALL LETTER S WITH DOT BELOW (16#01E65#, 16#01E65#), -- LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE .. LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE (16#01E67#, 16#01E67#), -- LATIN SMALL LETTER S WITH CARON AND DOT ABOVE .. LATIN SMALL LETTER S WITH CARON AND DOT ABOVE (16#01E69#, 16#01E69#), -- LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE .. LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE (16#01E6B#, 16#01E6B#), -- LATIN SMALL LETTER T WITH DOT ABOVE .. LATIN SMALL LETTER T WITH DOT ABOVE (16#01E6D#, 16#01E6D#), -- LATIN SMALL LETTER T WITH DOT BELOW .. LATIN SMALL LETTER T WITH DOT BELOW (16#01E6F#, 16#01E6F#), -- LATIN SMALL LETTER T WITH LINE BELOW .. LATIN SMALL LETTER T WITH LINE BELOW (16#01E71#, 16#01E71#), -- LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW (16#01E73#, 16#01E73#), -- LATIN SMALL LETTER U WITH DIAERESIS BELOW .. LATIN SMALL LETTER U WITH DIAERESIS BELOW (16#01E75#, 16#01E75#), -- LATIN SMALL LETTER U WITH TILDE BELOW .. LATIN SMALL LETTER U WITH TILDE BELOW (16#01E77#, 16#01E77#), -- LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW (16#01E79#, 16#01E79#), -- LATIN SMALL LETTER U WITH TILDE AND ACUTE .. LATIN SMALL LETTER U WITH TILDE AND ACUTE (16#01E7B#, 16#01E7B#), -- LATIN SMALL LETTER U WITH MACRON AND DIAERESIS .. 
LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
   (16#01E7D#, 16#01E7D#), -- LATIN SMALL LETTER V WITH TILDE .. LATIN SMALL LETTER V WITH TILDE
   (16#01E7F#, 16#01E7F#), -- LATIN SMALL LETTER V WITH DOT BELOW .. LATIN SMALL LETTER V WITH DOT BELOW
   (16#01E81#, 16#01E81#), -- LATIN SMALL LETTER W WITH GRAVE .. LATIN SMALL LETTER W WITH GRAVE
   (16#01E83#, 16#01E83#), -- LATIN SMALL LETTER W WITH ACUTE .. LATIN SMALL LETTER W WITH ACUTE
   (16#01E85#, 16#01E85#), -- LATIN SMALL LETTER W WITH DIAERESIS .. LATIN SMALL LETTER W WITH DIAERESIS
   (16#01E87#, 16#01E87#), -- LATIN SMALL LETTER W WITH DOT ABOVE .. LATIN SMALL LETTER W WITH DOT ABOVE
   (16#01E89#, 16#01E89#), -- LATIN SMALL LETTER W WITH DOT BELOW .. LATIN SMALL LETTER W WITH DOT BELOW
   (16#01E8B#, 16#01E8B#), -- LATIN SMALL LETTER X WITH DOT ABOVE .. LATIN SMALL LETTER X WITH DOT ABOVE
   (16#01E8D#, 16#01E8D#), -- LATIN SMALL LETTER X WITH DIAERESIS .. LATIN SMALL LETTER X WITH DIAERESIS
   (16#01E8F#, 16#01E8F#), -- LATIN SMALL LETTER Y WITH DOT ABOVE .. LATIN SMALL LETTER Y WITH DOT ABOVE
   (16#01E91#, 16#01E91#), -- LATIN SMALL LETTER Z WITH CIRCUMFLEX .. LATIN SMALL LETTER Z WITH CIRCUMFLEX
   (16#01E93#, 16#01E93#), -- LATIN SMALL LETTER Z WITH DOT BELOW .. LATIN SMALL LETTER Z WITH DOT BELOW
   (16#01E95#, 16#01E95#), -- LATIN SMALL LETTER Z WITH LINE BELOW .. LATIN SMALL LETTER Z WITH LINE BELOW
   (16#01E9B#, 16#01E9B#), -- LATIN SMALL LETTER LONG S WITH DOT ABOVE .. LATIN SMALL LETTER LONG S WITH DOT ABOVE
   (16#01EA1#, 16#01EA1#), -- LATIN SMALL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER A WITH DOT BELOW
   (16#01EA3#, 16#01EA3#), -- LATIN SMALL LETTER A WITH HOOK ABOVE .. LATIN SMALL LETTER A WITH HOOK ABOVE
   (16#01EA5#, 16#01EA5#), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
   (16#01EA7#, 16#01EA7#), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
   (16#01EA9#, 16#01EA9#), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
   (16#01EAB#, 16#01EAB#), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
   (16#01EAD#, 16#01EAD#), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
   (16#01EAF#, 16#01EAF#), -- LATIN SMALL LETTER A WITH BREVE AND ACUTE .. LATIN SMALL LETTER A WITH BREVE AND ACUTE
   (16#01EB1#, 16#01EB1#), -- LATIN SMALL LETTER A WITH BREVE AND GRAVE .. LATIN SMALL LETTER A WITH BREVE AND GRAVE
   (16#01EB3#, 16#01EB3#), -- LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE .. LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
   (16#01EB5#, 16#01EB5#), -- LATIN SMALL LETTER A WITH BREVE AND TILDE .. LATIN SMALL LETTER A WITH BREVE AND TILDE
   (16#01EB7#, 16#01EB7#), -- LATIN SMALL LETTER A WITH BREVE AND DOT BELOW .. LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
   (16#01EB9#, 16#01EB9#), -- LATIN SMALL LETTER E WITH DOT BELOW .. LATIN SMALL LETTER E WITH DOT BELOW
   (16#01EBB#, 16#01EBB#), -- LATIN SMALL LETTER E WITH HOOK ABOVE .. LATIN SMALL LETTER E WITH HOOK ABOVE
   (16#01EBD#, 16#01EBD#), -- LATIN SMALL LETTER E WITH TILDE .. LATIN SMALL LETTER E WITH TILDE
   (16#01EBF#, 16#01EBF#), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
   (16#01EC1#, 16#01EC1#), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
   (16#01EC3#, 16#01EC3#), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
   (16#01EC5#, 16#01EC5#), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
   (16#01EC7#, 16#01EC7#), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
   (16#01EC9#, 16#01EC9#), -- LATIN SMALL LETTER I WITH HOOK ABOVE .. LATIN SMALL LETTER I WITH HOOK ABOVE
   (16#01ECB#, 16#01ECB#), -- LATIN SMALL LETTER I WITH DOT BELOW .. LATIN SMALL LETTER I WITH DOT BELOW
   (16#01ECD#, 16#01ECD#), -- LATIN SMALL LETTER O WITH DOT BELOW .. LATIN SMALL LETTER O WITH DOT BELOW
   (16#01ECF#, 16#01ECF#), -- LATIN SMALL LETTER O WITH HOOK ABOVE .. LATIN SMALL LETTER O WITH HOOK ABOVE
   (16#01ED1#, 16#01ED1#), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
   (16#01ED3#, 16#01ED3#), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
   (16#01ED5#, 16#01ED5#), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
   (16#01ED7#, 16#01ED7#), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
   (16#01ED9#, 16#01ED9#), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
   (16#01EDB#, 16#01EDB#), -- LATIN SMALL LETTER O WITH HORN AND ACUTE .. LATIN SMALL LETTER O WITH HORN AND ACUTE
   (16#01EDD#, 16#01EDD#), -- LATIN SMALL LETTER O WITH HORN AND GRAVE .. LATIN SMALL LETTER O WITH HORN AND GRAVE
   (16#01EDF#, 16#01EDF#), -- LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
   (16#01EE1#, 16#01EE1#), -- LATIN SMALL LETTER O WITH HORN AND TILDE .. LATIN SMALL LETTER O WITH HORN AND TILDE
   (16#01EE3#, 16#01EE3#), -- LATIN SMALL LETTER O WITH HORN AND DOT BELOW .. LATIN SMALL LETTER O WITH HORN AND DOT BELOW
   (16#01EE5#, 16#01EE5#), -- LATIN SMALL LETTER U WITH DOT BELOW .. LATIN SMALL LETTER U WITH DOT BELOW
   (16#01EE7#, 16#01EE7#), -- LATIN SMALL LETTER U WITH HOOK ABOVE .. LATIN SMALL LETTER U WITH HOOK ABOVE
   (16#01EE9#, 16#01EE9#), -- LATIN SMALL LETTER U WITH HORN AND ACUTE .. LATIN SMALL LETTER U WITH HORN AND ACUTE
   (16#01EEB#, 16#01EEB#), -- LATIN SMALL LETTER U WITH HORN AND GRAVE .. LATIN SMALL LETTER U WITH HORN AND GRAVE
   (16#01EED#, 16#01EED#), -- LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE
   (16#01EEF#, 16#01EEF#), -- LATIN SMALL LETTER U WITH HORN AND TILDE .. LATIN SMALL LETTER U WITH HORN AND TILDE
   (16#01EF1#, 16#01EF1#), -- LATIN SMALL LETTER U WITH HORN AND DOT BELOW .. LATIN SMALL LETTER U WITH HORN AND DOT BELOW
   (16#01EF3#, 16#01EF3#), -- LATIN SMALL LETTER Y WITH GRAVE .. LATIN SMALL LETTER Y WITH GRAVE
   (16#01EF5#, 16#01EF5#), -- LATIN SMALL LETTER Y WITH DOT BELOW .. LATIN SMALL LETTER Y WITH DOT BELOW
   (16#01EF7#, 16#01EF7#), -- LATIN SMALL LETTER Y WITH HOOK ABOVE .. LATIN SMALL LETTER Y WITH HOOK ABOVE
   (16#01EF9#, 16#01EF9#), -- LATIN SMALL LETTER Y WITH TILDE .. LATIN SMALL LETTER Y WITH TILDE
   (16#01F00#, 16#01F07#), -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI
   (16#01F10#, 16#01F15#), -- GREEK SMALL LETTER EPSILON WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA
   (16#01F20#, 16#01F27#), -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI
   (16#01F30#, 16#01F37#), -- GREEK SMALL LETTER IOTA WITH PSILI .. GREEK SMALL LETTER IOTA WITH DASIA AND PERISPOMENI
   (16#01F40#, 16#01F45#), -- GREEK SMALL LETTER OMICRON WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA
   (16#01F51#, 16#01F51#), -- GREEK SMALL LETTER UPSILON WITH DASIA .. GREEK SMALL LETTER UPSILON WITH DASIA
   (16#01F53#, 16#01F53#), -- GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA
   (16#01F55#, 16#01F55#), -- GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA
   (16#01F57#, 16#01F57#), -- GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI
   (16#01F60#, 16#01F67#), -- GREEK SMALL LETTER OMEGA WITH PSILI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
   (16#01F70#, 16#01F71#), -- GREEK SMALL LETTER ALPHA WITH VARIA .. GREEK SMALL LETTER ALPHA WITH OXIA
   (16#01F72#, 16#01F75#), -- GREEK SMALL LETTER EPSILON WITH VARIA .. GREEK SMALL LETTER ETA WITH OXIA
   (16#01F76#, 16#01F77#), -- GREEK SMALL LETTER IOTA WITH VARIA .. GREEK SMALL LETTER IOTA WITH OXIA
   (16#01F78#, 16#01F79#), -- GREEK SMALL LETTER OMICRON WITH VARIA .. GREEK SMALL LETTER OMICRON WITH OXIA
   (16#01F7A#, 16#01F7B#), -- GREEK SMALL LETTER UPSILON WITH VARIA .. GREEK SMALL LETTER UPSILON WITH OXIA
   (16#01F7C#, 16#01F7D#), -- GREEK SMALL LETTER OMEGA WITH VARIA .. GREEK SMALL LETTER OMEGA WITH OXIA
   (16#01F80#, 16#01F87#), -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
   (16#01F90#, 16#01F97#), -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
   (16#01FA0#, 16#01FA7#), -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
   (16#01FB0#, 16#01FB1#), -- GREEK SMALL LETTER ALPHA WITH VRACHY .. GREEK SMALL LETTER ALPHA WITH MACRON
   (16#01FB3#, 16#01FB3#), -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
   (16#01FBE#, 16#01FBE#), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
   (16#01FC3#, 16#01FC3#), -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
   (16#01FD0#, 16#01FD1#), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH MACRON
   (16#01FE0#, 16#01FE1#), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK SMALL LETTER UPSILON WITH MACRON
   (16#01FE5#, 16#01FE5#), -- GREEK SMALL LETTER RHO WITH DASIA .. GREEK SMALL LETTER RHO WITH DASIA
   (16#01FF3#, 16#01FF3#), -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
   (16#0FF41#, 16#0FF5A#), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z
   (16#10428#, 16#1044D#)); -- DESERET SMALL LETTER LONG I .. DESERET SMALL LETTER ENG

Upper_Case_Adjust : constant array (Lower_Case_Letters'Range) of Integer := (
   -32, -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
   743, -- MICRO SIGN .. MICRO SIGN
   -32, -- LATIN SMALL LETTER A WITH GRAVE .. LATIN SMALL LETTER O WITH DIAERESIS
   -32, -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER THORN
   121, -- LATIN SMALL LETTER Y WITH DIAERESIS .. LATIN SMALL LETTER Y WITH DIAERESIS
   -1, -- LATIN SMALL LETTER A WITH MACRON .. LATIN SMALL LETTER A WITH MACRON
   -1, -- LATIN SMALL LETTER A WITH BREVE .. LATIN SMALL LETTER A WITH BREVE
   -1, -- LATIN SMALL LETTER A WITH OGONEK .. LATIN SMALL LETTER A WITH OGONEK
   -1, -- LATIN SMALL LETTER C WITH ACUTE .. LATIN SMALL LETTER C WITH ACUTE
   -1, -- LATIN SMALL LETTER C WITH CIRCUMFLEX .. LATIN SMALL LETTER C WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER C WITH DOT ABOVE .. LATIN SMALL LETTER C WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER C WITH CARON .. LATIN SMALL LETTER C WITH CARON
   -1, -- LATIN SMALL LETTER D WITH CARON .. LATIN SMALL LETTER D WITH CARON
   -1, -- LATIN SMALL LETTER D WITH STROKE .. LATIN SMALL LETTER D WITH STROKE
   -1, -- LATIN SMALL LETTER E WITH MACRON .. LATIN SMALL LETTER E WITH MACRON
   -1, -- LATIN SMALL LETTER E WITH BREVE .. LATIN SMALL LETTER E WITH BREVE
   -1, -- LATIN SMALL LETTER E WITH DOT ABOVE .. LATIN SMALL LETTER E WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK
   -1, -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON
   -1, -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE
   -1, -- LATIN SMALL LETTER G WITH DOT ABOVE .. LATIN SMALL LETTER G WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA
   -1, -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE
   -1, -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE
   -1, -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON
   -1, -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE
   -1, -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK
   -232, -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I
   -1, -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ
   -1, -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA
   -1, -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE
   -1, -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA
   -1, -- LATIN SMALL LETTER L WITH CARON .. LATIN SMALL LETTER L WITH CARON
   -1, -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT
   -1, -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE
   -1, -- LATIN SMALL LETTER N WITH ACUTE .. LATIN SMALL LETTER N WITH ACUTE
   -1, -- LATIN SMALL LETTER N WITH CEDILLA .. LATIN SMALL LETTER N WITH CEDILLA
   -1, -- LATIN SMALL LETTER N WITH CARON .. LATIN SMALL LETTER N WITH CARON
   -1, -- LATIN SMALL LETTER ENG .. LATIN SMALL LETTER ENG
   -1, -- LATIN SMALL LETTER O WITH MACRON .. LATIN SMALL LETTER O WITH MACRON
   -1, -- LATIN SMALL LETTER O WITH BREVE .. LATIN SMALL LETTER O WITH BREVE
   -1, -- LATIN SMALL LETTER O WITH DOUBLE ACUTE .. LATIN SMALL LETTER O WITH DOUBLE ACUTE
   -1, -- LATIN SMALL LIGATURE OE .. LATIN SMALL LIGATURE OE
   -1, -- LATIN SMALL LETTER R WITH ACUTE .. LATIN SMALL LETTER R WITH ACUTE
   -1, -- LATIN SMALL LETTER R WITH CEDILLA .. LATIN SMALL LETTER R WITH CEDILLA
   -1, -- LATIN SMALL LETTER R WITH CARON .. LATIN SMALL LETTER R WITH CARON
   -1, -- LATIN SMALL LETTER S WITH ACUTE .. LATIN SMALL LETTER S WITH ACUTE
   -1, -- LATIN SMALL LETTER S WITH CIRCUMFLEX .. LATIN SMALL LETTER S WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER S WITH CEDILLA .. LATIN SMALL LETTER S WITH CEDILLA
   -1, -- LATIN SMALL LETTER S WITH CARON .. LATIN SMALL LETTER S WITH CARON
   -1, -- LATIN SMALL LETTER T WITH CEDILLA .. LATIN SMALL LETTER T WITH CEDILLA
   -1, -- LATIN SMALL LETTER T WITH CARON .. LATIN SMALL LETTER T WITH CARON
   -1, -- LATIN SMALL LETTER T WITH STROKE .. LATIN SMALL LETTER T WITH STROKE
   -1, -- LATIN SMALL LETTER U WITH TILDE .. LATIN SMALL LETTER U WITH TILDE
   -1, -- LATIN SMALL LETTER U WITH MACRON .. LATIN SMALL LETTER U WITH MACRON
   -1, -- LATIN SMALL LETTER U WITH BREVE .. LATIN SMALL LETTER U WITH BREVE
   -1, -- LATIN SMALL LETTER U WITH RING ABOVE .. LATIN SMALL LETTER U WITH RING ABOVE
   -1, -- LATIN SMALL LETTER U WITH DOUBLE ACUTE .. LATIN SMALL LETTER U WITH DOUBLE ACUTE
   -1, -- LATIN SMALL LETTER U WITH OGONEK .. LATIN SMALL LETTER U WITH OGONEK
   -1, -- LATIN SMALL LETTER W WITH CIRCUMFLEX .. LATIN SMALL LETTER W WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER Y WITH CIRCUMFLEX .. LATIN SMALL LETTER Y WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER Z WITH ACUTE .. LATIN SMALL LETTER Z WITH ACUTE
   -1, -- LATIN SMALL LETTER Z WITH DOT ABOVE .. LATIN SMALL LETTER Z WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER Z WITH CARON .. LATIN SMALL LETTER Z WITH CARON
   -300, -- LATIN SMALL LETTER LONG S .. LATIN SMALL LETTER LONG S
   -1, -- LATIN SMALL LETTER B WITH TOPBAR .. LATIN SMALL LETTER B WITH TOPBAR
   -1, -- LATIN SMALL LETTER TONE SIX .. LATIN SMALL LETTER TONE SIX
   -1, -- LATIN SMALL LETTER C WITH HOOK .. LATIN SMALL LETTER C WITH HOOK
   -1, -- LATIN SMALL LETTER D WITH TOPBAR .. LATIN SMALL LETTER D WITH TOPBAR
   -1, -- LATIN SMALL LETTER F WITH HOOK .. LATIN SMALL LETTER F WITH HOOK
   97, -- LATIN SMALL LETTER HV .. LATIN SMALL LETTER HV
   -1, -- LATIN SMALL LETTER K WITH HOOK .. LATIN SMALL LETTER K WITH HOOK
   130, -- LATIN SMALL LETTER N WITH LONG RIGHT LEG .. LATIN SMALL LETTER N WITH LONG RIGHT LEG
   -1, -- LATIN SMALL LETTER O WITH HORN .. LATIN SMALL LETTER O WITH HORN
   -1, -- LATIN SMALL LETTER OI .. LATIN SMALL LETTER OI
   -1, -- LATIN SMALL LETTER P WITH HOOK .. LATIN SMALL LETTER P WITH HOOK
   -1, -- LATIN SMALL LETTER TONE TWO .. LATIN SMALL LETTER TONE TWO
   -1, -- LATIN SMALL LETTER T WITH HOOK .. LATIN SMALL LETTER T WITH HOOK
   -1, -- LATIN SMALL LETTER U WITH HORN .. LATIN SMALL LETTER U WITH HORN
   -1, -- LATIN SMALL LETTER Y WITH HOOK .. LATIN SMALL LETTER Y WITH HOOK
   -1, -- LATIN SMALL LETTER Z WITH STROKE .. LATIN SMALL LETTER Z WITH STROKE
   -1, -- LATIN SMALL LETTER EZH REVERSED .. LATIN SMALL LETTER EZH REVERSED
   -1, -- LATIN SMALL LETTER TONE FIVE .. LATIN SMALL LETTER TONE FIVE
   56, -- LATIN LETTER WYNN .. LATIN LETTER WYNN
   -1, -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
   -2, -- LATIN SMALL LETTER DZ WITH CARON .. LATIN SMALL LETTER DZ WITH CARON
   -1, -- LATIN CAPITAL LETTER L WITH SMALL LETTER J .. LATIN CAPITAL LETTER L WITH SMALL LETTER J
   -2, -- LATIN SMALL LETTER LJ .. LATIN SMALL LETTER LJ
   -1, -- LATIN CAPITAL LETTER N WITH SMALL LETTER J .. LATIN CAPITAL LETTER N WITH SMALL LETTER J
   -2, -- LATIN SMALL LETTER NJ .. LATIN SMALL LETTER NJ
   -1, -- LATIN SMALL LETTER A WITH CARON .. LATIN SMALL LETTER A WITH CARON
   -1, -- LATIN SMALL LETTER I WITH CARON .. LATIN SMALL LETTER I WITH CARON
   -1, -- LATIN SMALL LETTER O WITH CARON .. LATIN SMALL LETTER O WITH CARON
   -1, -- LATIN SMALL LETTER U WITH CARON .. LATIN SMALL LETTER U WITH CARON
   -1, -- LATIN SMALL LETTER U WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
   -1, -- LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
   -1, -- LATIN SMALL LETTER U WITH DIAERESIS AND CARON .. LATIN SMALL LETTER U WITH DIAERESIS AND CARON
   -1, -- LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE .. LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE
   -79, -- LATIN SMALL LETTER TURNED E .. LATIN SMALL LETTER TURNED E
   -1, -- LATIN SMALL LETTER A WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
   -1, -- LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
   -1, -- LATIN SMALL LETTER AE WITH MACRON .. LATIN SMALL LETTER AE WITH MACRON
   -1, -- LATIN SMALL LETTER G WITH STROKE .. LATIN SMALL LETTER G WITH STROKE
   -1, -- LATIN SMALL LETTER G WITH CARON .. LATIN SMALL LETTER G WITH CARON
   -1, -- LATIN SMALL LETTER K WITH CARON .. LATIN SMALL LETTER K WITH CARON
   -1, -- LATIN SMALL LETTER O WITH OGONEK .. LATIN SMALL LETTER O WITH OGONEK
   -1, -- LATIN SMALL LETTER O WITH OGONEK AND MACRON .. LATIN SMALL LETTER O WITH OGONEK AND MACRON
   -1, -- LATIN SMALL LETTER EZH WITH CARON .. LATIN SMALL LETTER EZH WITH CARON
   -1, -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z
   -2, -- LATIN SMALL LETTER DZ .. LATIN SMALL LETTER DZ
   -1, -- LATIN SMALL LETTER G WITH ACUTE .. LATIN SMALL LETTER G WITH ACUTE
   -1, -- LATIN SMALL LETTER N WITH GRAVE .. LATIN SMALL LETTER N WITH GRAVE
   -1, -- LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE .. LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
   -1, -- LATIN SMALL LETTER AE WITH ACUTE .. LATIN SMALL LETTER AE WITH ACUTE
   -1, -- LATIN SMALL LETTER O WITH STROKE AND ACUTE .. LATIN SMALL LETTER O WITH STROKE AND ACUTE
   -1, -- LATIN SMALL LETTER A WITH DOUBLE GRAVE .. LATIN SMALL LETTER A WITH DOUBLE GRAVE
   -1, -- LATIN SMALL LETTER A WITH INVERTED BREVE .. LATIN SMALL LETTER A WITH INVERTED BREVE
   -1, -- LATIN SMALL LETTER E WITH DOUBLE GRAVE .. LATIN SMALL LETTER E WITH DOUBLE GRAVE
   -1, -- LATIN SMALL LETTER E WITH INVERTED BREVE .. LATIN SMALL LETTER E WITH INVERTED BREVE
   -1, -- LATIN SMALL LETTER I WITH DOUBLE GRAVE .. LATIN SMALL LETTER I WITH DOUBLE GRAVE
   -1, -- LATIN SMALL LETTER I WITH INVERTED BREVE .. LATIN SMALL LETTER I WITH INVERTED BREVE
   -1, -- LATIN SMALL LETTER O WITH DOUBLE GRAVE .. LATIN SMALL LETTER O WITH DOUBLE GRAVE
   -1, -- LATIN SMALL LETTER O WITH INVERTED BREVE .. LATIN SMALL LETTER O WITH INVERTED BREVE
   -1, -- LATIN SMALL LETTER R WITH DOUBLE GRAVE .. LATIN SMALL LETTER R WITH DOUBLE GRAVE
   -1, -- LATIN SMALL LETTER R WITH INVERTED BREVE .. LATIN SMALL LETTER R WITH INVERTED BREVE
   -1, -- LATIN SMALL LETTER U WITH DOUBLE GRAVE .. LATIN SMALL LETTER U WITH DOUBLE GRAVE
   -1, -- LATIN SMALL LETTER U WITH INVERTED BREVE .. LATIN SMALL LETTER U WITH INVERTED BREVE
   -1, -- LATIN SMALL LETTER S WITH COMMA BELOW .. LATIN SMALL LETTER S WITH COMMA BELOW
   -1, -- LATIN SMALL LETTER T WITH COMMA BELOW .. LATIN SMALL LETTER T WITH COMMA BELOW
   -1, -- LATIN SMALL LETTER YOGH .. LATIN SMALL LETTER YOGH
   -1, -- LATIN SMALL LETTER H WITH CARON .. LATIN SMALL LETTER H WITH CARON
   -1, -- LATIN SMALL LETTER OU .. LATIN SMALL LETTER OU
   -1, -- LATIN SMALL LETTER Z WITH HOOK .. LATIN SMALL LETTER Z WITH HOOK
   -1, -- LATIN SMALL LETTER A WITH DOT ABOVE .. LATIN SMALL LETTER A WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER E WITH CEDILLA .. LATIN SMALL LETTER E WITH CEDILLA
   -1, -- LATIN SMALL LETTER O WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
   -1, -- LATIN SMALL LETTER O WITH TILDE AND MACRON .. LATIN SMALL LETTER O WITH TILDE AND MACRON
   -1, -- LATIN SMALL LETTER O WITH DOT ABOVE .. LATIN SMALL LETTER O WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON
   -1, -- LATIN SMALL LETTER Y WITH MACRON .. LATIN SMALL LETTER Y WITH MACRON
   -210, -- LATIN SMALL LETTER B WITH HOOK .. LATIN SMALL LETTER B WITH HOOK
   -206, -- LATIN SMALL LETTER OPEN O .. LATIN SMALL LETTER OPEN O
   -205, -- LATIN SMALL LETTER D WITH TAIL .. LATIN SMALL LETTER D WITH HOOK
   -202, -- LATIN SMALL LETTER SCHWA .. LATIN SMALL LETTER SCHWA
   -203, -- LATIN SMALL LETTER OPEN E .. LATIN SMALL LETTER OPEN E
   -205, -- LATIN SMALL LETTER G WITH HOOK .. LATIN SMALL LETTER G WITH HOOK
   -207, -- LATIN SMALL LETTER GAMMA .. LATIN SMALL LETTER GAMMA
   -209, -- LATIN SMALL LETTER I WITH STROKE .. LATIN SMALL LETTER I WITH STROKE
   -211, -- LATIN SMALL LETTER IOTA .. LATIN SMALL LETTER IOTA
   -211, -- LATIN SMALL LETTER TURNED M .. LATIN SMALL LETTER TURNED M
   -213, -- LATIN SMALL LETTER N WITH LEFT HOOK .. LATIN SMALL LETTER N WITH LEFT HOOK
   -214, -- LATIN SMALL LETTER BARRED O .. LATIN SMALL LETTER BARRED O
   -218, -- LATIN LETTER SMALL CAPITAL R .. LATIN LETTER SMALL CAPITAL R
   -218, -- LATIN SMALL LETTER ESH .. LATIN SMALL LETTER ESH
   -218, -- LATIN SMALL LETTER T WITH RETROFLEX HOOK .. LATIN SMALL LETTER T WITH RETROFLEX HOOK
   -217, -- LATIN SMALL LETTER UPSILON .. LATIN SMALL LETTER V WITH HOOK
   -219, -- LATIN SMALL LETTER EZH .. LATIN SMALL LETTER EZH
   -38, -- GREEK SMALL LETTER ALPHA WITH TONOS .. GREEK SMALL LETTER ALPHA WITH TONOS
   -37, -- GREEK SMALL LETTER EPSILON WITH TONOS .. GREEK SMALL LETTER IOTA WITH TONOS
   -32, -- GREEK SMALL LETTER ALPHA .. GREEK SMALL LETTER RHO
   -31, -- GREEK SMALL LETTER FINAL SIGMA .. GREEK SMALL LETTER FINAL SIGMA
   -32, -- GREEK SMALL LETTER SIGMA .. GREEK SMALL LETTER UPSILON WITH DIALYTIKA
   -64, -- GREEK SMALL LETTER OMICRON WITH TONOS .. GREEK SMALL LETTER OMICRON WITH TONOS
   -63, -- GREEK SMALL LETTER UPSILON WITH TONOS .. GREEK SMALL LETTER OMEGA WITH TONOS
   -62, -- GREEK BETA SYMBOL .. GREEK BETA SYMBOL
   -57, -- GREEK THETA SYMBOL .. GREEK THETA SYMBOL
   -47, -- GREEK PHI SYMBOL .. GREEK PHI SYMBOL
   -54, -- GREEK PI SYMBOL .. GREEK PI SYMBOL
   -1, -- GREEK SMALL LETTER ARCHAIC KOPPA .. GREEK SMALL LETTER ARCHAIC KOPPA
   -1, -- GREEK SMALL LETTER STIGMA .. GREEK SMALL LETTER STIGMA
   -1, -- GREEK SMALL LETTER DIGAMMA .. GREEK SMALL LETTER DIGAMMA
   -1, -- GREEK SMALL LETTER KOPPA .. GREEK SMALL LETTER KOPPA
   -1, -- GREEK SMALL LETTER SAMPI .. GREEK SMALL LETTER SAMPI
   -1, -- COPTIC SMALL LETTER SHEI .. COPTIC SMALL LETTER SHEI
   -1, -- COPTIC SMALL LETTER FEI .. COPTIC SMALL LETTER FEI
   -1, -- COPTIC SMALL LETTER KHEI .. COPTIC SMALL LETTER KHEI
   -1, -- COPTIC SMALL LETTER HORI .. COPTIC SMALL LETTER HORI
   -1, -- COPTIC SMALL LETTER GANGIA .. COPTIC SMALL LETTER GANGIA
   -1, -- COPTIC SMALL LETTER SHIMA .. COPTIC SMALL LETTER SHIMA
   -1, -- COPTIC SMALL LETTER DEI .. COPTIC SMALL LETTER DEI
   -86, -- GREEK KAPPA SYMBOL .. GREEK KAPPA SYMBOL
   -80, -- GREEK RHO SYMBOL .. GREEK RHO SYMBOL
   -79, -- GREEK LUNATE SIGMA SYMBOL .. GREEK LUNATE SIGMA SYMBOL
   -96, -- GREEK LUNATE EPSILON SYMBOL .. GREEK LUNATE EPSILON SYMBOL
   -32, -- CYRILLIC SMALL LETTER A .. CYRILLIC SMALL LETTER YA
   -80, -- CYRILLIC SMALL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER DZHE
   -1, -- CYRILLIC SMALL LETTER OMEGA .. CYRILLIC SMALL LETTER OMEGA
   -1, -- CYRILLIC SMALL LETTER YAT .. CYRILLIC SMALL LETTER YAT
   -1, -- CYRILLIC SMALL LETTER IOTIFIED E .. CYRILLIC SMALL LETTER IOTIFIED E
   -1, -- CYRILLIC SMALL LETTER LITTLE YUS .. CYRILLIC SMALL LETTER LITTLE YUS
   -1, -- CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS .. CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS
   -1, -- CYRILLIC SMALL LETTER BIG YUS .. CYRILLIC SMALL LETTER BIG YUS
   -1, -- CYRILLIC SMALL LETTER IOTIFIED BIG YUS .. CYRILLIC SMALL LETTER IOTIFIED BIG YUS
   -1, -- CYRILLIC SMALL LETTER KSI .. CYRILLIC SMALL LETTER KSI
   -1, -- CYRILLIC SMALL LETTER PSI .. CYRILLIC SMALL LETTER PSI
   -1, -- CYRILLIC SMALL LETTER FITA .. CYRILLIC SMALL LETTER FITA
   -1, -- CYRILLIC SMALL LETTER IZHITSA .. CYRILLIC SMALL LETTER IZHITSA
   -1, -- CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT .. CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
   -1, -- CYRILLIC SMALL LETTER UK .. CYRILLIC SMALL LETTER UK
   -1, -- CYRILLIC SMALL LETTER ROUND OMEGA .. CYRILLIC SMALL LETTER ROUND OMEGA
   -1, -- CYRILLIC SMALL LETTER OMEGA WITH TITLO .. CYRILLIC SMALL LETTER OMEGA WITH TITLO
   -1, -- CYRILLIC SMALL LETTER OT .. CYRILLIC SMALL LETTER OT
   -1, -- CYRILLIC SMALL LETTER KOPPA .. CYRILLIC SMALL LETTER KOPPA
   -1, -- CYRILLIC SMALL LETTER SHORT I WITH TAIL .. CYRILLIC SMALL LETTER SHORT I WITH TAIL
   -1, -- CYRILLIC SMALL LETTER SEMISOFT SIGN .. CYRILLIC SMALL LETTER SEMISOFT SIGN
   -1, -- CYRILLIC SMALL LETTER ER WITH TICK .. CYRILLIC SMALL LETTER ER WITH TICK
   -1, -- CYRILLIC SMALL LETTER GHE WITH UPTURN .. CYRILLIC SMALL LETTER GHE WITH UPTURN
   -1, -- CYRILLIC SMALL LETTER GHE WITH STROKE .. CYRILLIC SMALL LETTER GHE WITH STROKE
   -1, -- CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
   -1, -- CYRILLIC SMALL LETTER ZHE WITH DESCENDER .. CYRILLIC SMALL LETTER ZHE WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER ZE WITH DESCENDER .. CYRILLIC SMALL LETTER ZE WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER KA WITH DESCENDER .. CYRILLIC SMALL LETTER KA WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
   -1, -- CYRILLIC SMALL LETTER KA WITH STROKE .. CYRILLIC SMALL LETTER KA WITH STROKE
   -1, -- CYRILLIC SMALL LETTER BASHKIR KA .. CYRILLIC SMALL LETTER BASHKIR KA
   -1, -- CYRILLIC SMALL LETTER EN WITH DESCENDER .. CYRILLIC SMALL LETTER EN WITH DESCENDER
   -1, -- CYRILLIC SMALL LIGATURE EN GHE .. CYRILLIC SMALL LIGATURE EN GHE
   -1, -- CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
   -1, -- CYRILLIC SMALL LETTER ABKHASIAN HA .. CYRILLIC SMALL LETTER ABKHASIAN HA
   -1, -- CYRILLIC SMALL LETTER ES WITH DESCENDER .. CYRILLIC SMALL LETTER ES WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER TE WITH DESCENDER .. CYRILLIC SMALL LETTER TE WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER STRAIGHT U .. CYRILLIC SMALL LETTER STRAIGHT U
   -1, -- CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE .. CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
   -1, -- CYRILLIC SMALL LETTER HA WITH DESCENDER .. CYRILLIC SMALL LETTER HA WITH DESCENDER
   -1, -- CYRILLIC SMALL LIGATURE TE TSE .. CYRILLIC SMALL LIGATURE TE TSE
   -1, -- CYRILLIC SMALL LETTER CHE WITH DESCENDER .. CYRILLIC SMALL LETTER CHE WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
   -1, -- CYRILLIC SMALL LETTER SHHA .. CYRILLIC SMALL LETTER SHHA
   -1, -- CYRILLIC SMALL LETTER ABKHASIAN CHE .. CYRILLIC SMALL LETTER ABKHASIAN CHE
   -1, -- CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER .. CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
   -1, -- CYRILLIC SMALL LETTER ZHE WITH BREVE .. CYRILLIC SMALL LETTER ZHE WITH BREVE
   -1, -- CYRILLIC SMALL LETTER KA WITH HOOK .. CYRILLIC SMALL LETTER KA WITH HOOK
   -1, -- CYRILLIC SMALL LETTER EL WITH TAIL .. CYRILLIC SMALL LETTER EL WITH TAIL
   -1, -- CYRILLIC SMALL LETTER EN WITH HOOK .. CYRILLIC SMALL LETTER EN WITH HOOK
   -1, -- CYRILLIC SMALL LETTER EN WITH TAIL .. CYRILLIC SMALL LETTER EN WITH TAIL
   -1, -- CYRILLIC SMALL LETTER KHAKASSIAN CHE .. CYRILLIC SMALL LETTER KHAKASSIAN CHE
   -1, -- CYRILLIC SMALL LETTER EM WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL
   -1, -- CYRILLIC SMALL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER A WITH BREVE
   -1, -- CYRILLIC SMALL LETTER A WITH DIAERESIS .. CYRILLIC SMALL LETTER A WITH DIAERESIS
   -1, -- CYRILLIC SMALL LIGATURE A IE .. CYRILLIC SMALL LIGATURE A IE
   -1, -- CYRILLIC SMALL LETTER IE WITH BREVE .. CYRILLIC SMALL LETTER IE WITH BREVE
   -1, -- CYRILLIC SMALL LETTER SCHWA .. CYRILLIC SMALL LETTER SCHWA
   -1, -- CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS .. CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER ZHE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER ZE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZE WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER ABKHASIAN DZE .. CYRILLIC SMALL LETTER ABKHASIAN DZE
   -1, -- CYRILLIC SMALL LETTER I WITH MACRON .. CYRILLIC SMALL LETTER I WITH MACRON
   -1, -- CYRILLIC SMALL LETTER I WITH DIAERESIS .. CYRILLIC SMALL LETTER I WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER O WITH DIAERESIS .. CYRILLIC SMALL LETTER O WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER BARRED O .. CYRILLIC SMALL LETTER BARRED O
   -1, -- CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS .. CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER E WITH DIAERESIS .. CYRILLIC SMALL LETTER E WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER U WITH MACRON .. CYRILLIC SMALL LETTER U WITH MACRON
   -1, -- CYRILLIC SMALL LETTER U WITH DIAERESIS .. CYRILLIC SMALL LETTER U WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE .. CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
   -1, -- CYRILLIC SMALL LETTER CHE WITH DIAERESIS .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS
   -1, -- CYRILLIC SMALL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI DE
   -1, -- CYRILLIC SMALL LETTER KOMI DJE .. CYRILLIC SMALL LETTER KOMI DJE
   -1, -- CYRILLIC SMALL LETTER KOMI ZJE .. CYRILLIC SMALL LETTER KOMI ZJE
   -1, -- CYRILLIC SMALL LETTER KOMI DZJE .. CYRILLIC SMALL LETTER KOMI DZJE
   -1, -- CYRILLIC SMALL LETTER KOMI LJE .. CYRILLIC SMALL LETTER KOMI LJE
   -1, -- CYRILLIC SMALL LETTER KOMI NJE .. CYRILLIC SMALL LETTER KOMI NJE
   -1, -- CYRILLIC SMALL LETTER KOMI SJE .. CYRILLIC SMALL LETTER KOMI SJE
   -1, -- CYRILLIC SMALL LETTER KOMI TJE .. CYRILLIC SMALL LETTER KOMI TJE
   -48, -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LETTER FEH
   -1, -- LATIN SMALL LETTER A WITH RING BELOW .. LATIN SMALL LETTER A WITH RING BELOW
   -1, -- LATIN SMALL LETTER B WITH DOT ABOVE .. LATIN SMALL LETTER B WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER B WITH DOT BELOW .. LATIN SMALL LETTER B WITH DOT BELOW
   -1, -- LATIN SMALL LETTER B WITH LINE BELOW .. LATIN SMALL LETTER B WITH LINE BELOW
   -1, -- LATIN SMALL LETTER C WITH CEDILLA AND ACUTE .. LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
   -1, -- LATIN SMALL LETTER D WITH DOT ABOVE .. LATIN SMALL LETTER D WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER D WITH DOT BELOW .. LATIN SMALL LETTER D WITH DOT BELOW
   -1, -- LATIN SMALL LETTER D WITH LINE BELOW .. LATIN SMALL LETTER D WITH LINE BELOW
   -1, -- LATIN SMALL LETTER D WITH CEDILLA .. LATIN SMALL LETTER D WITH CEDILLA
   -1, -- LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW
   -1, -- LATIN SMALL LETTER E WITH MACRON AND GRAVE .. LATIN SMALL LETTER E WITH MACRON AND GRAVE
   -1, -- LATIN SMALL LETTER E WITH MACRON AND ACUTE .. LATIN SMALL LETTER E WITH MACRON AND ACUTE
   -1, -- LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW
   -1, -- LATIN SMALL LETTER E WITH TILDE BELOW .. LATIN SMALL LETTER E WITH TILDE BELOW
   -1, -- LATIN SMALL LETTER E WITH CEDILLA AND BREVE .. LATIN SMALL LETTER E WITH CEDILLA AND BREVE
   -1, -- LATIN SMALL LETTER F WITH DOT ABOVE .. LATIN SMALL LETTER F WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER G WITH MACRON .. LATIN SMALL LETTER G WITH MACRON
   -1, -- LATIN SMALL LETTER H WITH DOT ABOVE .. LATIN SMALL LETTER H WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER H WITH DOT BELOW .. LATIN SMALL LETTER H WITH DOT BELOW
   -1, -- LATIN SMALL LETTER H WITH DIAERESIS .. LATIN SMALL LETTER H WITH DIAERESIS
   -1, -- LATIN SMALL LETTER H WITH CEDILLA .. LATIN SMALL LETTER H WITH CEDILLA
   -1, -- LATIN SMALL LETTER H WITH BREVE BELOW .. LATIN SMALL LETTER H WITH BREVE BELOW
   -1, -- LATIN SMALL LETTER I WITH TILDE BELOW .. LATIN SMALL LETTER I WITH TILDE BELOW
   -1, -- LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
   -1, -- LATIN SMALL LETTER K WITH ACUTE .. LATIN SMALL LETTER K WITH ACUTE
   -1, -- LATIN SMALL LETTER K WITH DOT BELOW .. LATIN SMALL LETTER K WITH DOT BELOW
   -1, -- LATIN SMALL LETTER K WITH LINE BELOW .. LATIN SMALL LETTER K WITH LINE BELOW
   -1, -- LATIN SMALL LETTER L WITH DOT BELOW .. LATIN SMALL LETTER L WITH DOT BELOW
   -1, -- LATIN SMALL LETTER L WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER L WITH DOT BELOW AND MACRON
   -1, -- LATIN SMALL LETTER L WITH LINE BELOW .. LATIN SMALL LETTER L WITH LINE BELOW
   -1, -- LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW
   -1, -- LATIN SMALL LETTER M WITH ACUTE .. LATIN SMALL LETTER M WITH ACUTE
   -1, -- LATIN SMALL LETTER M WITH DOT ABOVE .. LATIN SMALL LETTER M WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER M WITH DOT BELOW .. LATIN SMALL LETTER M WITH DOT BELOW
   -1, -- LATIN SMALL LETTER N WITH DOT ABOVE .. LATIN SMALL LETTER N WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER N WITH DOT BELOW .. LATIN SMALL LETTER N WITH DOT BELOW
   -1, -- LATIN SMALL LETTER N WITH LINE BELOW .. LATIN SMALL LETTER N WITH LINE BELOW
   -1, -- LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW
   -1, -- LATIN SMALL LETTER O WITH TILDE AND ACUTE .. LATIN SMALL LETTER O WITH TILDE AND ACUTE
   -1, -- LATIN SMALL LETTER O WITH TILDE AND DIAERESIS .. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS
   -1, -- LATIN SMALL LETTER O WITH MACRON AND GRAVE .. LATIN SMALL LETTER O WITH MACRON AND GRAVE
   -1, -- LATIN SMALL LETTER O WITH MACRON AND ACUTE .. LATIN SMALL LETTER O WITH MACRON AND ACUTE
   -1, -- LATIN SMALL LETTER P WITH ACUTE .. LATIN SMALL LETTER P WITH ACUTE
   -1, -- LATIN SMALL LETTER P WITH DOT ABOVE .. LATIN SMALL LETTER P WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER R WITH DOT ABOVE .. LATIN SMALL LETTER R WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER R WITH DOT BELOW .. LATIN SMALL LETTER R WITH DOT BELOW
   -1, -- LATIN SMALL LETTER R WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
   -1, -- LATIN SMALL LETTER R WITH LINE BELOW .. LATIN SMALL LETTER R WITH LINE BELOW
   -1, -- LATIN SMALL LETTER S WITH DOT ABOVE .. LATIN SMALL LETTER S WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER S WITH DOT BELOW .. LATIN SMALL LETTER S WITH DOT BELOW
   -1, -- LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE .. LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE
   -1, -- LATIN SMALL LETTER S WITH CARON AND DOT ABOVE .. LATIN SMALL LETTER S WITH CARON AND DOT ABOVE
   -1, -- LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE .. LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
   -1, -- LATIN SMALL LETTER T WITH DOT ABOVE .. LATIN SMALL LETTER T WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER T WITH DOT BELOW .. LATIN SMALL LETTER T WITH DOT BELOW
   -1, -- LATIN SMALL LETTER T WITH LINE BELOW .. LATIN SMALL LETTER T WITH LINE BELOW
   -1, -- LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW
   -1, -- LATIN SMALL LETTER U WITH DIAERESIS BELOW .. LATIN SMALL LETTER U WITH DIAERESIS BELOW
   -1, -- LATIN SMALL LETTER U WITH TILDE BELOW .. LATIN SMALL LETTER U WITH TILDE BELOW
   -1, -- LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW
   -1, -- LATIN SMALL LETTER U WITH TILDE AND ACUTE .. LATIN SMALL LETTER U WITH TILDE AND ACUTE
   -1, -- LATIN SMALL LETTER U WITH MACRON AND DIAERESIS .. LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
   -1, -- LATIN SMALL LETTER V WITH TILDE .. LATIN SMALL LETTER V WITH TILDE
   -1, -- LATIN SMALL LETTER V WITH DOT BELOW .. LATIN SMALL LETTER V WITH DOT BELOW
   -1, -- LATIN SMALL LETTER W WITH GRAVE .. LATIN SMALL LETTER W WITH GRAVE
   -1, -- LATIN SMALL LETTER W WITH ACUTE .. LATIN SMALL LETTER W WITH ACUTE
   -1, -- LATIN SMALL LETTER W WITH DIAERESIS .. LATIN SMALL LETTER W WITH DIAERESIS
   -1, -- LATIN SMALL LETTER W WITH DOT ABOVE .. LATIN SMALL LETTER W WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER W WITH DOT BELOW .. LATIN SMALL LETTER W WITH DOT BELOW
   -1, -- LATIN SMALL LETTER X WITH DOT ABOVE .. LATIN SMALL LETTER X WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER X WITH DIAERESIS .. LATIN SMALL LETTER X WITH DIAERESIS
   -1, -- LATIN SMALL LETTER Y WITH DOT ABOVE .. LATIN SMALL LETTER Y WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER Z WITH CIRCUMFLEX .. LATIN SMALL LETTER Z WITH CIRCUMFLEX
   -1, -- LATIN SMALL LETTER Z WITH DOT BELOW .. LATIN SMALL LETTER Z WITH DOT BELOW
   -1, -- LATIN SMALL LETTER Z WITH LINE BELOW .. LATIN SMALL LETTER Z WITH LINE BELOW
   -59, -- LATIN SMALL LETTER LONG S WITH DOT ABOVE .. LATIN SMALL LETTER LONG S WITH DOT ABOVE
   -1, -- LATIN SMALL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER A WITH DOT BELOW
   -1, -- LATIN SMALL LETTER A WITH HOOK ABOVE .. LATIN SMALL LETTER A WITH HOOK ABOVE
   -1, -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
   -1, -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
   -1, -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
   -1, -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
   -1, -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
   -1, -- LATIN SMALL LETTER A WITH BREVE AND ACUTE .. LATIN SMALL LETTER A WITH BREVE AND ACUTE
   -1, -- LATIN SMALL LETTER A WITH BREVE AND GRAVE .. LATIN SMALL LETTER A WITH BREVE AND GRAVE
   -1, -- LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE .. LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
   -1, -- LATIN SMALL LETTER A WITH BREVE AND TILDE .. LATIN SMALL LETTER A WITH BREVE AND TILDE
   -1, -- LATIN SMALL LETTER A WITH BREVE AND DOT BELOW .. LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
   -1, -- LATIN SMALL LETTER E WITH DOT BELOW .. LATIN SMALL LETTER E WITH DOT BELOW
   -1, -- LATIN SMALL LETTER E WITH HOOK ABOVE .. LATIN SMALL LETTER E WITH HOOK ABOVE
   -1, -- LATIN SMALL LETTER E WITH TILDE .. LATIN SMALL LETTER E WITH TILDE
   -1, -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
   -1, -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
   -1, -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
   -1, -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
   -1, -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
   -1, -- LATIN SMALL LETTER I WITH HOOK ABOVE .. LATIN SMALL LETTER I WITH HOOK ABOVE
   -1, -- LATIN SMALL LETTER I WITH DOT BELOW .. LATIN SMALL LETTER I WITH DOT BELOW
   -1, -- LATIN SMALL LETTER O WITH DOT BELOW .. LATIN SMALL LETTER O WITH DOT BELOW
   -1, -- LATIN SMALL LETTER O WITH HOOK ABOVE .. LATIN SMALL LETTER O WITH HOOK ABOVE
   -1, -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
   -1, -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
   -1, -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
   -1, -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
   -1, -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
   -1, -- LATIN SMALL LETTER O WITH HORN AND ACUTE .. LATIN SMALL LETTER O WITH HORN AND ACUTE
   -1, -- LATIN SMALL LETTER O WITH HORN AND GRAVE .. LATIN SMALL LETTER O WITH HORN AND GRAVE
   -1, -- LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
   -1, -- LATIN SMALL LETTER O WITH HORN AND TILDE .. LATIN SMALL LETTER O WITH HORN AND TILDE
   -1, -- LATIN SMALL LETTER O WITH HORN AND DOT BELOW .. LATIN SMALL LETTER O WITH HORN AND DOT BELOW
   -1, -- LATIN SMALL LETTER U WITH DOT BELOW .. LATIN SMALL LETTER U WITH DOT BELOW
   -1, -- LATIN SMALL LETTER U WITH HOOK ABOVE .. LATIN SMALL LETTER U WITH HOOK ABOVE
   -1, -- LATIN SMALL LETTER U WITH HORN AND ACUTE .. LATIN SMALL LETTER U WITH HORN AND ACUTE
   -1, -- LATIN SMALL LETTER U WITH HORN AND GRAVE ..
LATIN SMALL LETTER U WITH HORN AND GRAVE -1, -- LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE -1, -- LATIN SMALL LETTER U WITH HORN AND TILDE .. LATIN SMALL LETTER U WITH HORN AND TILDE -1, -- LATIN SMALL LETTER U WITH HORN AND DOT BELOW .. LATIN SMALL LETTER U WITH HORN AND DOT BELOW -1, -- LATIN SMALL LETTER Y WITH GRAVE .. LATIN SMALL LETTER Y WITH GRAVE -1, -- LATIN SMALL LETTER Y WITH DOT BELOW .. LATIN SMALL LETTER Y WITH DOT BELOW -1, -- LATIN SMALL LETTER Y WITH HOOK ABOVE .. LATIN SMALL LETTER Y WITH HOOK ABOVE -1, -- LATIN SMALL LETTER Y WITH TILDE .. LATIN SMALL LETTER Y WITH TILDE 8, -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI 8, -- GREEK SMALL LETTER EPSILON WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA 8, -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI 8, -- GREEK SMALL LETTER IOTA WITH PSILI .. GREEK SMALL LETTER IOTA WITH DASIA AND PERISPOMENI 8, -- GREEK SMALL LETTER OMICRON WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA 8, -- GREEK SMALL LETTER UPSILON WITH DASIA .. GREEK SMALL LETTER UPSILON WITH DASIA 8, -- GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA 8, -- GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA 8, -- GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI 8, -- GREEK SMALL LETTER OMEGA WITH PSILI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI 74, -- GREEK SMALL LETTER ALPHA WITH VARIA .. GREEK SMALL LETTER ALPHA WITH OXIA 86, -- GREEK SMALL LETTER EPSILON WITH VARIA .. GREEK SMALL LETTER ETA WITH OXIA 100, -- GREEK SMALL LETTER IOTA WITH VARIA .. GREEK SMALL LETTER IOTA WITH OXIA 128, -- GREEK SMALL LETTER OMICRON WITH VARIA .. 
GREEK SMALL LETTER OMICRON WITH OXIA 112, -- GREEK SMALL LETTER UPSILON WITH VARIA .. GREEK SMALL LETTER UPSILON WITH OXIA 126, -- GREEK SMALL LETTER OMEGA WITH VARIA .. GREEK SMALL LETTER OMEGA WITH OXIA 8, -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI 8, -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI 8, -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI 8, -- GREEK SMALL LETTER ALPHA WITH VRACHY .. GREEK SMALL LETTER ALPHA WITH MACRON 9, -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI -7205, -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI 9, -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI 8, -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH MACRON 8, -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK SMALL LETTER UPSILON WITH MACRON 7, -- GREEK SMALL LETTER RHO WITH DASIA .. GREEK SMALL LETTER RHO WITH DASIA 9, -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI -32, -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z -40); -- DESERET SMALL LETTER LONG I .. DESERET SMALL LETTER ENG **************************************************************** From: Dan Eilers Sent: Saturday, January 29, 2005 8:51 PM Since suggesting that you post unicode tables, I have read http://www.unicode.org/copyright.html which states (paraphrased) that by downloading, copying, installing or otherwise using Unicode Inc.'s data files, you agree to include their copyright notice in any copies. 
****************************************************************

From: Robert Dewar
Sent: Sunday, January 29, 2005 7:29 AM

Right, if anyone posts unicode documents as such here, they should follow
that process. In my opinion this does not apply to technical use of the
standard. Dan, if you think otherwise, feel free to follow your own
inclination.

****************************************************************

From: Robert Dewar
Sent: Saturday, January 29, 2005 5:09 PM

My colleague Vincent Celier has been keeping me honest on the Unicode tables
by running a completely independent test based on an independently written
program analyzing the unicode data base. This is very helpful and averted a
potential disaster that might have befallen hapless Mongolian Ada programmers
looking forward to Ada 2005.

Vincent reported:

   I have run my test program again, and it found only one problem:

   180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
   Is_UTF_Space = FALSE

   MONGOLIAN VOWEL SEPARATOR should be in category Space, but it is not.

   -- Vincent

This error can be corrected by using the following updated version of the
space table. Apparently I had forgotten to regenerate this:

   -- The following table includes all characters considered spaces, i.e.
   -- all characters from the Unicode table with categories:

   --   Separator, Space (Zs)

   UTF_32_Spaces : constant UTF_32_Ranges := (
     (16#00020#, 16#00020#), -- SPACE .. SPACE
     (16#000A0#, 16#000A0#), -- NO-BREAK SPACE .. NO-BREAK SPACE
     (16#01680#, 16#01680#), -- OGHAM SPACE MARK .. OGHAM SPACE MARK
     (16#0180E#, 16#0180E#), -- MONGOLIAN VOWEL SEPARATOR .. MONGOLIAN VOWEL SEPARATOR
     (16#02000#, 16#0200B#), -- EN QUAD .. ZERO WIDTH SPACE
     (16#0202F#, 16#0202F#), -- NARROW NO-BREAK SPACE .. NARROW NO-BREAK SPACE
     (16#0205F#, 16#0205F#), -- MEDIUM MATHEMATICAL SPACE .. MEDIUM MATHEMATICAL SPACE
     (16#03000#, 16#03000#)); -- IDEOGRAPHIC SPACE ..
                              -- IDEOGRAPHIC SPACE

****************************************************************

From: Robert Dewar
Sent: Sunday, January 30, 2005 8:45 AM

Vincent Celier (who seems to be turning himself into a Unicode expert :-)
makes the following valid point:

In http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html:

   For backwards compatibility, in the file UnicodeData.txt a range is
   specified not by the form "X..Y", but by their start and end characters.
   In such cases, the names of characters in the range are algorithmically
   derivable. Surrogate code points and private use characters have no names.

In http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html:

   There are six special ranges of characters that are represented only by
   their start and end characters, since the properties in the file are
   uniform, except for code values (which are all sequential and assigned).

This means that the code tables I posted need adjustment with respect to:

> It seems to me that, when in the Unicode database we have <... First>
> and <... Last> that are of some category, then the whole range First ..
> Last of characters are of this category.
>
> So, these entries in the table UTF_32_Letters:
>
>  (16#03400#, 16#03400#), -- <CJK Ideograph Extension A, First> ..
>                             <CJK Ideograph Extension A, First>
>  (16#04DB5#, 16#04DB5#), -- <CJK Ideograph Extension A, Last> ..
>                             <CJK Ideograph Extension A, Last>
>
> should be replaced with this single entry:
>
>  (16#03400#, 16#04DB5#), -- <CJK Ideograph Extension A, First> ..
>                             <CJK Ideograph Extension A, Last>
>
> Similarly:
>
>  (16#04E00#, 16#04E00#), -- <CJK Ideograph, First> .. <CJK Ideograph, First>
>  (16#09FA5#, 16#09FA5#), -- <CJK Ideograph, Last> .. <CJK Ideograph, Last>
> =>
>  (16#04E00#, 16#09FA5#), -- <CJK Ideograph, First> .. <CJK Ideograph, Last>
>
>  (16#0AC00#, 16#0AC00#), -- <Hangul Syllable, First> .. <Hangul Syllable, First>
>  (16#0D7A3#, 16#0D7A3#), -- <Hangul Syllable, Last> .. <Hangul Syllable, Last>
> =>
>  (16#0AC00#, 16#0D7A3#), -- <Hangul Syllable, First> .. <Hangul Syllable, Last>
>
>  (16#20000#, 16#20000#), -- <CJK Ideograph Extension B, First> ..
>                             <CJK Ideograph Extension B, First>
>  (16#2A6D6#, 16#2A6D6#), -- <CJK Ideograph Extension B, Last> ..
>                             <CJK Ideograph Extension B, Last>
> =>
>  (16#20000#, 16#2A6D6#), -- <CJK Ideograph Extension B, First> ..
>                             <CJK Ideograph Extension B, Last>
>
> There are similar modifications to be done to the UTF_32_Non_Graphic table.
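
The effect of merging each First/Last pair into one range can be sketched as
follows. This is a minimal illustration, not the GNAT implementation: the
names Code_Point, Code_Range, Range_Table, Letters and In_Table are
hypothetical, the table is only an excerpt holding the four merged ranges
discussed in the message above, and a binary search over sorted inclusive
ranges is assumed as the lookup.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Range_Lookup_Demo is

   --  Hypothetical names; only the range bounds come from the message.
   type Code_Point is range 0 .. 16#7FFF_FFFF#;

   type Code_Range is record
      Lo, Hi : Code_Point;
   end record;

   type Range_Table is array (Positive range <>) of Code_Range;

   --  Excerpt only: the four merged First .. Last ranges, ascending.
   Letters : constant Range_Table :=
     ((16#03400#, 16#04DB5#),   --  CJK Ideograph Extension A
      (16#04E00#, 16#09FA5#),   --  CJK Ideograph
      (16#0AC00#, 16#0D7A3#),   --  Hangul Syllable
      (16#20000#, 16#2A6D6#));  --  CJK Ideograph Extension B

   --  Binary search over disjoint, ascending, inclusive ranges.
   function In_Table (U : Code_Point; T : Range_Table) return Boolean is
      Low  : Integer := T'First;
      High : Integer := T'Last;
      Mid  : Integer;
   begin
      while Low <= High loop
         Mid := Low + (High - Low) / 2;
         if U < T (Mid).Lo then
            High := Mid - 1;
         elsif U > T (Mid).Hi then
            Low := Mid + 1;
         else
            return True;
         end if;
      end loop;
      return False;
   end In_Table;

begin
   --  16#04E2D# lies strictly inside the merged CJK Ideograph range.
   Put_Line (Boolean'Image (In_Table (16#04E2D#, Letters)));
   --  16#04DB6# lies in the gap just after Extension A.
   Put_Line (Boolean'Image (In_Table (16#04DB6#, Letters)));
end Range_Lookup_Demo;
```

With the unmerged entries, only the end points themselves (for example
16#04E00# and 16#09FA5#) would be found; every code point strictly between
them, such as 16#04E2D#, would be classified as a non-letter.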
****************************************************************

From: Robert Dewar
Sent: Wednesday, February 2, 2005 11:24 AM

Seems like there are some obvious omissions

   Ada.Strings.Wide_Wide_Hash
   Ada.Strings.Wide_Wide_Unbounded.Hash

I also find oddly missing the case insensitive stuff for Wide_Wide.
Given that we took the decision to make this available for identifiers,
shouldn't we give run time access to this facility? I certainly intend
to do that in GNAT, but currently this is in a GNAT specific package,
and I am not sure that seems right.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, February 2, 2005 2:19 PM

> Seems like there are some obvious omissions
>
>    Ada.Strings.Wide_Wide_Hash
>    Ada.Strings.Wide_Wide_Unbounded.Hash

That's an integration issue of course (it's the intersection of two AIs,
AI-285 and AI-302). These packages don't belong in either AI. There are a
number of features like that. But I admit that this one isn't on any of my
lists, so it had just been overlooked. On the third hand, I haven't updated
the section in question yet, so I probably would have caught it at that time.

> I also find oddly missing the case insensitive stuff for Wide_Wide.
> Given that we took the decision to make this available for identifiers,
> shouldn't we give run time access to this facility? I certainly intend
> to do that in GNAT, but currently this is in a GNAT specific package,
> and I am not sure that seems right.

Certainly the case conversion mappings are defined in
Ada.Strings.Wide_Wide_Maps.Constants (i.e. Lower_Case_Map and Upper_Case_Map).
But it does seem odd that there aren't any similar facilities in
Ada.Characters.Handling.

****************************************************************

From: Robert Dewar
Sent: Wednesday, February 2, 2005 2:48 PM

> Certainly the case conversion mappings are defined in
> Ada.Strings.Wide_Wide_Maps.Constants (i.e. Lower_Case_Map and
> Upper_Case_Map).
> But it does seem odd that there aren't any similar
> facilities in Ada.Characters.Handling.

Since the spec in Ada.Characters.Handling is specifically thought out,
rather than derived with the "similar" wand, I took this as a deliberate
decision NOT to provide this facility here.

For sure, it would be junky to have the wide wide case
conversion tables present for programs that do no wide
character or wide wide character stuff.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, February 2, 2005 3:05 PM

> For sure, it would be junky to have the wide wide case
> conversion tables present for programs that do no wide
> character or wide wide character stuff.

Yes, it probably would make more sense for that to be a (different) child of
Ada.Characters. But it certainly seems odd not to provide it at all. I wonder
if that was deliberate or just an oversight? Pascal?

****************************************************************

From: Robert Dewar
Sent: Wednesday, February 2, 2005 2:47 PM

> Certainly the case conversion mappings are defined in
> Ada.Strings.Wide_Wide_Maps.Constants (i.e. Lower_Case_Map and
> Upper_Case_Map). But it does seem odd that there aren't any similar
> facilities in Ada.Characters.Handling.

But in the Ada 95 RM, the statement is that the stuff in the package
Ada.Strings.Wide_Maps.Wide_Constants provides "the same string handling
operations".

Are we really meant to interpolate case folding in Ada 95? Certainly we
didn't in GNAT, and I know of no test to the contrary, so in GNAT in the
Wide_Constants case, we just fold the Latin-1 set.

All we have in AI285 is that the package Wide_Wide_Constants "is similar" to
the Wide_Constants package. Now if you can deduce from this that the case
mapping is supposed to be the same as the rather arbitrary rules for
identifiers, I sure can't.

In particular, I would say that if fancy case folding is done by these
run-time routines, it should be done correctly (e.g. in Turkey) using
appropriate locale information.

I sure can't tell one way or another from the AI. In fact I just assumed that
the functions in Wide_Wide_Constants were functionally the same as those in
Constants.

We really need more definition here. What is the intention
in Ada 95? What is the reality in Ada 95 compilers? (for
Wide_Constants). What is the intention in Ada 2005 (for
Wide_Wide_Constants)?

****************************************************************

From: Randy Brukardt
Sent: Wednesday, February 2, 2005 3:14 PM

> We really need more definition here. What is the intention
> in Ada 95? What is the reality in Ada 95 compilers? (for
> Wide_Constants). What is the intention in Ada 2005 (for
> Wide_Wide_Constants)?

Good point. I'd naturally assumed that these tables did "the right thing",
but, as you point out, that would be incompatible (well, "inconsistent")
with Ada 95. (For the record, Janus/Ada just uses Latin-1 case conversion
for Wide_Constants.)

It certainly seems weird that the case folding that the compiler uses isn't
available to the user (especially since the code clearly has been written).

Another thing for AI-395, I guess.

****************************************************************

From: Robert Dewar
Sent: Wednesday, February 2, 2005 3:53 PM

> Good point. I'd naturally assumed that these tables did "the right thing",
> but, as you point out, that would be incompatible (well, "inconsistent")
> with Ada 95. (For the record, Janus/Ada just uses Latin-1 case conversion
> for Wide_Constants.)

I bet all other Ada 95 compilers do the same :-) Certainly GNAT does

> It certainly seems weird that the case folding that the compiler uses isn't
> available to the user (especially since the code clearly has been written).

But it is not clear that you want to use the same case folding conventions
as for identifiers, where it is critical to be non-locale dependent.

I think I would just leave well enough alone.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, February 2, 2005 4:51 PM

> But it is not clear that you want to use the same case folding conventions
> as for identifiers, where it is critical to be non-locale dependent.
>
> I think I would just leave well enough alone.

Well, OK, but that's not what some guy named Robert Dewar said this morning
when he brought up this topic:

> I also find oddly missing the case insensitive stuff for Wide_Wide.
> Given that we took the decision to make this available for identifiers,
> shouldn't we give run time access to this facility? I certainly intend
> to do that in GNAT, but currently this is in a GNAT specific package,
> and I am not sure that seems right.

Have you just changed your mind on this, or am I confused as to what you are
proposing??

****************************************************************

From: Robert Dewar
Sent: Wednesday, February 2, 2005 5:19 PM

> Have you just changed your mind on this, or am I confused as to what you are
> proposing??

What I would add is a new package, perhaps Ada.Wide_Wide_Characters_Handling
or somesuch, which provides the categorization rules used in the compiler.

Here is what GNAT provides:

package GNAT.UTF_32 is

   type UTF_32 is mod 2 ** 32;
   -- The actual allowed range is 16#00_0000# .. 16#01_FFFF#

   function Is_UTF_32_Letter (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Letter);
   -- Returns true iff U is a letter that can be used to start an identifier.
   -- This means that it is in one of the following categories:
   --   Letter, Uppercase (Lu)
   --   Letter, Lowercase (Ll)
   --   Letter, Titlecase (Lt)
   --   Letter, Modifier (Lm)
   --   Letter, Other (Lo)
   --   Number, Letter (Nl)

   function Is_UTF_32_Digit (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Digit);
   -- Returns true iff U is a digit that can be used to extend an identifier,
   -- which means it is in one of the following categories:
   --   Number, Decimal_Digit (Nd)

   function Is_UTF_32_Line_Terminator (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Line_Terminator);
   -- Returns true iff U is an allowed line terminator for source programs,
   -- which means it is in one of the following categories:
   --   Separator, Line (Zl)
   --   Separator, Paragraph (Zp)
   -- or that it is a conventional line terminator (CR, LF, VT, FF)

   function Is_UTF_32_Mark (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Mark);
   -- Returns true iff U is a mark character which can be used to extend
   -- an identifier. This means it is in one of the following categories:
   --   Mark, Non-Spacing (Mn)
   --   Mark, Spacing Combining (Mc)

   function Is_UTF_32_Other (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Other);
   -- Returns true iff U is an other format character, which means that it
   -- can be used to extend an identifier, but is ignored for the purposes of
   -- matching of identifiers. This means that it is in one of the following
   -- categories:
   --   Other, Format (Cf)

   function Is_UTF_32_Punctuation (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Punctuation);
   -- Returns true iff U is a punctuation character that can be used to
   -- separate pieces of an identifier.
   -- This means that it is in one of the
   -- following categories:
   --   Punctuation, Connector (Pc)

   function Is_UTF_32_Space (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Space);
   -- Returns true iff U is considered a space to be ignored, which means
   -- that it is in one of the following categories:
   --   Separator, Space (Zs)

   function Is_UTF_32_Non_Graphic (U : UTF_32) return Boolean;
   pragma Inline (Is_UTF_32_Non_Graphic);
   -- Returns true iff U is considered to be a non-graphic character,
   -- which means that it is in one of the following categories:
   --   Other, Control (Cc)
   --   Other, Private Use (Co)
   --   Other, Surrogate (Cs)
   --   Other, Format (Cf)
   --   Separator, Line (Zl)
   --   Separator, Paragraph (Zp)
   --
   -- Note that the Ada category format effector is subsumed by the above
   -- list of Unicode categories.

   function UTF_32_To_Upper_Case (U : UTF_32) return UTF_32;
   pragma Inline (UTF_32_To_Upper_Case);
   -- If U represents a lower case letter, returns the corresponding upper
   -- case letter, otherwise U is returned unchanged. The folding is locale
   -- independent as defined by documents referenced in the note in section
   -- 1 of ISO/IEC 10646:2003

end GNAT.UTF_32;

****************************************************************

From: Randy Brukardt
Sent: Wednesday, February 2, 2005 5:39 PM

> > Have you just changed your mind on this, or am I confused as to what you are
> > proposing??
>
> What I would add is a new package, perhaps
> Ada.Wide_Wide_Characters_Handling
> or somesuch, which provides the categorization rules used in the compiler.

OK, that's what I was proposing too. I would think the name should be
Ada.Characters.Wide_Wide_Handling. Ada.Wide_Wide_Characters.Handling would
work, too, but I don't see much point in adding another empty package to the
hierarchy. And it would be inconsistent to have this directly as a child of
Ada when all of the other similar packages are grandchildren. Topic for
discussion, I guess.

Upon reflection, I don't think that it would work to use the mapping
features, because the conversion isn't necessarily 1-to-1. So I agree it is
best to leave them as they are.

****************************************************************

From: Robert Dewar
Sent: Wednesday, February 2, 2005 6:34 PM

I agree with the name (Wide_Wide_Handling), and presumably you would want to
add Wide_Handling as well. I would suggest exactly the categorizations in
AI285, we don't want to spend more time on this. I thus withdraw my concern
for Turkish I with dot :-)

For sure, we need a clearer spec of the mapping units, to make sure that no
one else makes the very natural assumption you did.

****************************************************************

From: Pascal Leroy
Sent: Thursday, February 3, 2005 4:24 PM

> We really need more definition here. What is the intention
> in Ada 95?

Dunno, ask Bob/Tuck.

> What is the reality in Ada 95 compilers? (for
> Wide_Constants).

We only cover Latin-1. In other words, only the lower case Latin-1 letters
appear in Lower_Set, and only the Latin-1 upper case letters are transformed
by Lower_Case_Map (the other characters are unaffected).

> What is the intention in Ada 2005 for Wide_Wide_Constants.

My intention was that Wide_Wide_Constant would only cover the Latin-1 set,
just like Wide_Constant (I agree that the current RM is unclear here, but
it seems that implementations agree, so it's just a matter of fixing the
wording). The reason is that I wanted users to be able to move from
Wide_Anything to Wide_Wide_Anything with minimum semantic changes.

This being said, it looks to me like providing categorization and case
mapping at run-time would be a good idea. I'll try to add that to AI 395.
A few comments:

1 - I'd prefer to name the packages Ada.Wide_Characters.Handling and
Ada.Wide_Wide_Characters.Handling.
This way, Ada.Wide_Characters and Ada.Wide_Wide_Characters would act like
umbrellas under which implementers and/or users could add packages
appropriate to extended character sets and/or locale (collation, encoding,
etc.).

2 - I see no reason to restrict categorization to the categories defined in
2.1. We are talking program execution, and a program might be interested in
knowing that a character is a digit or a symbol, for instance. So I believe
we need to cover all the Unicode categories. This will require bigger
tables, but if you don't want these tables in your closure, don't with these
packages.

3 - The three case mappings defined by Unicode are lower case, upper case
and title case. I think it makes sense to support all three, as they are
non-trivial. I don't want to support what RM95 calls basic mapping (dropping
the diacriticals) because it's not a well-defined Unicode transformation.

At this point I can hear Robert yelling that he cannot share the tables he
has for the lexical analyzer, but come on, Robert, it's just another bunch
of tables ;-)

****************************************************************

From: Robert Dewar
Sent: Thursday, February 3, 2005 7:57 AM

> At this point I can hear Robert yelling that he cannot share the tables he
> has for the lexical analyzer, but come on, Robert, it's just another bunch
> of tables ;-)

Actually my comment is of a different kind. This is a new feature, with a
non-trivial implementation, and no user demand, being introduced much too
late in the process. Let's leave well enough alone for this go around, and
see if any demand develops.

I agree with everything technical you said, I just think it is too late.

****************************************************************

From: Bob Duff
Sent: Thursday, February 3, 2005 1:17 PM

> > We really need more definition here. What is the intention
> > in Ada 95?
>
> Dunno, ask Bob/Tuck.

I'm certainly no expert on these matters, and I suspect Tuck isn't either.
I've no idea how to answer the above question.

****************************************************************

From: Randy Brukardt
Sent: Thursday, February 3, 2005 2:44 PM

Bob wrote:

> I'm certainly no expert on these matters, and I suspect Tuck isn't
> either. I've no idea how to answer the above question.

The intent is crystal clear: see A.4.7(48). (I found that last night when
working on the section - yes, I had missed it too.)

Pascal said:

> My intention was that Wide_Wide_Constant would only cover the Latin-1 set,
> just like Wide_Constant (I agree that the current RM is unclear here, but
> it seems that implementations agree, so it's just a matter of fixing the
> wording). The reason is that I wanted users to be able to move from
> Wide_Anything to Wide_Wide_Anything with minimum semantic changes.

That's fine, but then why did you drop the note A.4.7(48) when you created
A.4.8? You copied everything else, so it would seem to have been an
intentional change. Or, I suppose you could have just failed to copy all of
the text.

Anyway, an analog to the note A.4.7(48) should be included in A.4.8, and
then all is well.

****************************************************************

From: Robert Dewar
Sent: Thursday, February 3, 2005 6:27 PM

This should not be a note, its content cannot be derived from the text in my
opinion. I would retain the note but make it normative for both Wide_ and
Wide_Wide_ cases.

****************************************************************

From: Pascal Leroy
Sent: Friday, February 4, 2005 3:07 AM

I must have forgotten to copy this piece of text. I certainly don't remember
any intentional change here. Given the amount of wording in this AI, the
cut-and-paste error is by far the most likely explanation.

****************************************************************

From: Robert Dewar
Sent: Friday, February 4, 2005 8:34 AM

Do you agree that while we are at it, we should make that note normative?
Otherwise we are depending on the meaning of "similar" and since we as Ada
experts got confused on the intent before seeing this note, we should make
sure that the normative text makes the intent clear.

This is really even more true for Ada 2005, where case folding for wide
characters has reared its ugly head, as well as character classification.
Given that the RM goes to so much effort to define this stuff, it is not
unnatural to expect that this would be available at run time, and
consequently not at all unnatural to expect the opposite of the note for
wide_wide character at least.

****************************************************************

From: Robert Dewar
Sent: Thursday, February 3, 2005 9:08 PM

This has:

   type char16_array is array (size_t range <>) of aliased char16_t;
   pragma Pack (char16_array);

It seems meaningless to pragma Pack an array with aliased characters since
the size of such components is fixed anyway.

****************************************************************

From: Randy Brukardt
Sent: Thursday, February 3, 2005 9:22 PM

Certainly the only requirement of "aliased" is to make the components
addressable. If a compiler typically aligns components on (32-bit) word
boundaries, that would certainly have addressable components, but they
wouldn't be packed (to 16-bits in this case). So the pragma is not quite
unnecessary (and it certainly isn't "meaningless"). I agree that on most
implementations, it wouldn't have any effect.

The same holds for the 32-bit array (although that seems even harder to
imagine how it could be unpacked).

****************************************************************

From: Robert Dewar
Sent: Thursday, February 3, 2005 9:31 PM

No the sizes have to be the same as well, because pointers don't know what
they are pointing to.

Take your example of components being aligned to 32-bits. You can only
imagine that making sense on a machine where it is easier to address 32-bit
aligned stuff.
But on such a machine aliased stand alone components would be allocated
32-bits anyway, so you would have to have 32-bit components.

Remember that packing things loses independence, so there is a real negative
effect, and in practice zero positive effect.

Furthermore, the real requirement should be that the array is laid out the
way C would lay it out. So the appropriate notation would be pragma
Convention (C, ...) rather than pragma Pack.

****************************************************************

From: Randy Brukardt
Sent: Thursday, February 3, 2005 9:52 PM

I should know better than to comment on things I barely understand. :-)

The real reason for the pragma Pack is that it is given on all of the
existing arrays in Interfaces.C: Char_Array and WChar_Array, both of which
also have aliased components. Char_Array also gives a 'Component_Size clause,
which seems like duplication to me - it would override pack in any case.

I agree that Convention (C,...) would make more sense, as would a
'Component_Size clause. But we tend to copy what's already in the Standard,
because if no one has complained, it must be right. :-)

****************************************************************

From: Robert Dewar
Sent: Thursday, February 3, 2005 9:57 PM

Are you sure this was aliased in the first version of the Ada 95
standard (I know we added aliased somewhere :-)

You don't want a component_size clause, that's inappropriate if
a convention (C, ...) is there. I really think a convention (C
is worth adding in any case, since it really gets to the heart
of the intention here!

****************************************************************

From: Randy Brukardt
Sent: Thursday, February 3, 2005 10:19 PM

> Are you sure this was aliased in the first version of the Ada 95
> standard (I know we added aliased somewhere :-)

Yes, it was Stream_Element_Array that it was added to. These had to be
aliased from the beginning; it would be hard to map a char* (or is that *char?
I use C just often enough to be dangerous :-) if they were not aliased. > You don't want a component_size clause, that's inappropriate if > a convention (C, ...) is there. I really think a convention (C > is worth adding in any case, since it really gets to the heart > of the intention here! Well, everything in Interfaces.C is defined to be C-compatible. I realize that's not *quite* the same, but it would seem that a Convention (C, ...) would be about as useful as Pack, and less obvious as to the intent. (It would be odd to put Convention only on a couple arrays, rather than everything in the package -- after all, the whole package is above C interfacing). I'd be more tempted to just delete all of the pragma Pack and call it a day. :-) **************************************************************** From: Robert Dewar Sent: Thursday, February 3, 2005 10:51 PM I agree with that temptation :-) **************************************************************** From: Pascal Leroy Sent: Friday, February 4, 2005 3:23 AM I am opposed to adding convention C there. It would have to be done for each and every type, it would make the spec harder to read, and everything in these packages is C-compatible anyway. I did put pragma Pack on the new arrays because the others had it. I agree that all these pragmas are unnecessary, but this seems insufficiently broken to change. Robert's argument about independent addressibility is formally true, but let's get real, no implementer who cares about C is going to make these components non-independently addressable (see in particular AARM 9.10(1.c)). **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 8:54 AM In that case, let's get real and remove the Pack. The trouble is that to me, the Pack overrides the C-compatible requirement. Now in practice we may be rescued by the aliased, but this seems indirect. 
Actually we may also be rescued if Wide_Wide_Character'Size is 32, but is that in fact the case? I can't find anything that explicitly says this is the case, and the statement in package Standard: -- The declaration of type Wide_Wide_Character is based on the full -- ISO/IEC 10646:2003 character set. The first 65536 positions have the -- same contents as type Wide_Character. See 3.5.2. type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF, ...); is not entirely clear, but I guess the full character set is indeed the range 16#0000_0000# to 16#FFFF_FFFF# rather than the defined range of 16#0000_0000# to 16#0010_FFFF#? I think we should clearly mandate the size to be 32 bits in package Standard. Well anyway a useful discussion, makes me realize that my implementation of UTF-8 is incomplete, since it handles only up to 16#10_FFFF#. To be fixed! **************************************************************** From: Pascal Leroy Sent: Friday, February 4, 2005 10:51 AM > Actually we may also be rescued if Wide_Wide_Character'Size > is 32, but is that in fact the case? I can't find anything > that explicitly says this is the case... And there is no such thing. 3.5.2(3.1/2) has: "The predefined type Wide_Wide_Character is a character type whose values correspond to the 2147483648 code positions of the ISO/IEC 10646:2003 character set." So this type only has 2**31 values, you have 128 groups of 256 planes of 256 rows of 256 cells. In the absence of a size clause, I would therefore expect Wide_Wide_Character'Size to be 31. We don't put size clauses for Character and Wide_Character, so I didn't put one for Wide_Wide_Character. But then Wide_Wide_String is a packed array of Wide_Wide_Character, so each character would occupy 31 bits! I don't think this is exactly what we want. Better put a size clause, then. 
> is not entirely clear, but I guess the full character set is > indeed the range 16#0000_0000# to 16#FFFF_FFFF# rather than > the defined range of 16#0000_0000# to 16#0010_FFFF#? The upper bound is actually 16#7FFF_FFFF#. Get it right the first time, you don't want to have to revise your compiler when they encode Klingon :-) **************************************************************** From: Bob Duff Sent: Friday, February 4, 2005 11:23 AM > I think we should clearly mandate the size to be 32 bits in package > Standard. Hmm. Good point! Does every 32-bit bit pattern represent a valid value of type W_W_C? I would think the answer should be "yes". For example, unchecked conversion of any 32-bit integer to W_W_C should not cause erroneous execution. And 'Value should not raise Constraint_Error if that code point does not exist as a character. I don't think we want W_W_C to be one of those evil "holey" enumeration types. If you agree with that, this implies that the standard should say that W_W_C has exactly 2**32 enumeration literals, which would imply that 'Size = 32. And I suppose most of them are written in italics, and can't be typed in a program (just like nul of type Character). Is that right? I suppose ISO adds new characters from time to time? Does the Ada standard automatically track this? Do we have to publish a new AI every time it happens? **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 12:04 PM > But then Wide_Wide_String is a packed array of Wide_Wide_Character, so > each character would occupy 31 bits! I don't think this is exactly what > we want. Better put a size clause, then. Right, I think so (a size clause is appropriate indeed). >>is not entirely clear, but I guess the full character set is >>indeed the range 16#0000_0000# to 16#FFFF_FFFF# rather than >>the defined range of 16#0000_0000# to 16#0010_FFFF#? > > The upper bound is actually 16#7FFF_FFFF#. 
Get it right the first time, > you don't want to have to revise your compiler when they encode Klingon > :-) I guess we should have tests in the suite for this range X := Wide_Wide_Character'Val (16#7FFF_FFFF#); -- OK X := Wide_Wide_Character'Val (16#FFFF_FFFF#); -- raise CE Actually I wonder whether we should not just make Wide_Wide_Character be 32-bits and be done with it, and just say that the first 2**31 values, correspond to the 10646 type and the upper half is implementation defined. It seems useful to be able to interchange raw bytes and characters, so why not raw words and 4-byte characters. As for Klingon, who knows what the case folding rules are? :-) Note the above discussion is even more reason to remove the pragma Pack from interfaces.c :-) **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 12:05 PM > Does every 32-bit bit pattern represent a valid value of type W_W_C? > I would think the answer should be "yes". For example, unchecked > conversion of any 32-bit integer to W_W_C should not cause erroneous > execution. And 'Value should not raise Constraint_Error if that code > point does not exist as a character. I agree > I don't think we want W_W_C to be one of those evil "holey" enumeration > types. Right > If you agree with that, this implies that the standard should say that > W_W_C has exactly 2**32 enumeration literals, which would imply that > 'Size = 32. And I suppose most of them are written in italics, and > can't be typed in a program (just like nul of type Character). > Is that right? Right > I suppose ISO adds new characters from time to time? > Does the Ada standard automatically track this? > Do we have to publish a new AI every time it happens? No, the AI is designed to avoid this The limitations on identifiers are a problem, but that's a self-created one. 
In GNAT we have two modes for identifiers, the old one we have always allowed (all wide chars allowed, no case equivalence), and the new complex rules in AI-285. So Klingon programmers can still use Klingon letters in identifiers in GNAT using the old method. **************************************************************** From: Pascal Leroy Sent: Friday, February 4, 2005 4:05 PM > I guess we should have tests in the suite for this range > > X := Wide_Wide_Character'Val (16#7FFF_FFFF#); -- OK > X := Wide_Wide_Character'Val (16#FFFF_FFFF#); -- raise CE Agreed. > Actually I wonder whether we should not just make > Wide_Wide_Character be 32-bits and be done with it, and just > say that the first 2**31 values, correspond to the 10646 type > and the upper half is implementation defined. No, we want to stick to the 10646 as closely as possible, if only for political reasons. Plus, why invite non-portability in the upper half when we really don't need it. By adding a size clause, we pretty much ensure that the right thing happens, but that the range complies with 10646. Note that the type is not "holey", so it doesn't create nasty erroneousness. And if you want to interchange W_W_C, just do an unchecked conversion to Integer_32, it will work. **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 4:38 PM > No, we want to stick to the 10646 as closely as possible, if only for > political reasons. Plus, why invite non-portability in the upper half > when we really don't need it. By adding a size clause, we pretty much > ensure that the right thing happens, but that the range complies with > 10646. I think this is overplaying the political card. There is nothing in C that would limit the range of the 32 bit char type, so why should there be in Ada? There is nothing in 10646 that requires this limitation. Please let's not get into a mode of letting Ada make things into a pain in the neck when it is not necessary.
There is no real non-portability that results either (if you think there is, show me a program, remember that Image is non-portable in any case). > > Note that the type is not "holey", so it doesn't create nasty > erroneousness. Sure it does, if you do an unchecked conversion from a 32 bit type, then you get erroneousity if the value is out of range. I really think this is a significant mistake. I see nothing anywhere that would require us to introduce this obvious inconvenience. One thing to remember here is that although Wide_Wide_Character is nominally 10646 there is nothing much at run time that makes it so (at compile time we have the extremely irritating restrictions on character and string literals, which are also a mistake). In practice you should be able to use it for any 32-bit character coding, just as in the 8-bit case. Nominally the Ada Character type is ISO Latin-1, but almost no Ada users use it that way, instead they use it to represent whatever the native encoding of their environment is (almost never Latin-1 on a PC for example), and everything works out just fine. I must say these comments about political motivations sure explain what I regard as a bit of over-enthusiasm in trying to over-support 10646 with things that no other programming languages are doing. I see no reason for Ada to shoot itself in the foot! > And if you want to interchange W_W_C, just do an unchecked conversion to > Integer_32, it will work. I have no idea what this is supposed to mean, but I don't think it is relevant. **************************************************************** From: Bob Duff Sent: Friday, February 4, 2005 4:53 PM > I think this is overplaying the political card. There is nothing in C > that would limit the range of the 32 bit char type, so why should there > be in Ada? I'm with Robert on this. If W_W_C has 2**31 values, then I can't safely read a text file without tripping over erroneousness! That seems pretty bad to me.
If the file is encoded as a sequence of 32-bit values (that's all UTF-32 is, right?), it can certainly contain bit patterns that are out of bounds. I think UTF-8 and/or the "compressed" representation (the sliding-windows thing) allow to represent out-of-range values, too. (I'm not sure about that -- anybody know for sure?) We're not violating the unicode standard by having 2**32 values. That's a perfectly reasonable way for Ada to represent the characters of that standard. (Robert's analogy with C is apt.) Note that all the out-of-bounds values will have untypable enumeration literals (italics), so there's no portability issue. I was not suggesting that implementations be allowed to provide impl-def literals for those values. Note that if there are 2**32 values as I suggested, you can't easily *generate* files containing bad characters, but you can easily *process* such files. That's the way it should be, right? **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 5:41 PM > If the file is encoded as a sequence of 32-bit values (that's all UTF-32 > is, right?), it can certainly contain bit patterns that are out of > bounds. I think UTF-8 and/or the "compressed" representation (the > sliding-windows thing) allow to represent out-of-range values, too. > (I'm not sure about that -- anybody know for sure?) Actually, this is a problem. UTF-8 has no way of representing values greater than 31-bits, so if we do allow "upper half" wide wide characters they cannot be encoded in UTF-8 form. However, I think this is not terrible it just means that if you try to output such a value in UTF-8 form, you get an exception. **************************************************************** From: Randy Brukardt Sent: Friday, February 4, 2005 5:59 PM > Actually, this is a problem. 
UTF-8 has no way of representing values > greater than 31-bits, so if we do allow "upper half" wide wide characters > they cannot be encoded in UTF-8 form. However, I think this is not terrible > it just means that if you try to output such a value in UTF-8 form, you get > an exception. It could hardly be a problem for the Standard, which never even mentions UTF-8; implementation-defined stuff can do what it wants. **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 6:06 PM Are you sure there is no connection between 10646 and UTF-8? That's not what I have read in several different places. **************************************************************** From: Randy Brukardt Sent: Friday, February 4, 2005 6:17 PM I meant our standard (Ada); it never mentions UTF-8, even in AARM notes. That means doing anything in UTF-8 is implementation-defined from the perspective of Ada. Certainly there is no requirement for Wide_Wide_Wide_Text_IO to write in UTF-8. **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 7:19 PM What an extraordinary head-in-the-sand attitude :-) Reminds me of Algol-60 booting on all I/O. These 32-bit characters are only viable if there is a reasonably standardized encoding mechanism. UTF-8 is really the only reasonable candidate. You can't just ignore this. It would be like designing the language with very nice graphic characters and then saying it is up to the implementations to find out how to represent programs, nothing to do with us. Hmm! come to think of it, the Algol-60 folks did that as well :-) **************************************************************** From: Randy Brukardt Sent: Friday, February 4, 2005 8:51 PM > It would be like designing the language with very nice graphic characters and > then saying it is up to the implementations to find out how to represent > programs, nothing to do with us. Hmm!
come to think of it, the Algol-60 > folks did that as well :-) Humm, this certainly applies to Ada 95, and applies as well to Ada 2005. Certainly, 2.1(18), and the representation sentences of 2.1(4/2) and 2.1(5/2) are unchanged in Ada 2005. It seems to me that this is exactly the issue that Dan was worried about: a backdoor requirement to support UTF-8 everywhere. If it really is a requirement, it should be specified in 2.1 (and debated as such). Anyway, I'm not sure which Robert Dewar this is. I know that the Robert Dewar I know was very opposed to any sort of runtime UTF-8 support back when that was discussed in October 2002. I think that's one of the reasons that we didn't wade into encoding. And the Robert Dewar I know always has been very vocal that Ada shouldn't require a source representation. Perhaps that Robert Dewar has been replaced by a newer model? Seriously, what do you think the Ada standard should say here that it doesn't currently say? Generally, we've avoided questions of representations. **************************************************************** From: Robert Dewar Sent: Friday, February 4, 2005 10:07 PM > Anyway, I'm not sure which Robert Dewar this is. I know that the Robert > Dewar I know was very opposed to any sort of runtime UTF-8 support back when > that was discussed in October 2002. I think that's one of the reasons that > we didn't wade into encoding. And the Robert Dewar I know always has been > very vocal that Ada shouldn't require a source representation. Perhaps that > Robert Dewar has been replaced by a newer model? I am more consistent than you think. I think it is fine for the standard to say nothing about representation. BUT, and it is a big but, there must be one or more fairly reasonable and obvious way of handling the representation. 
Although theoretically it is not required, I can't imagine an Ada compiler not supporting stream files in ASCII, and likewise for Ada 2005 I can't imagine a compiler not supporting UTF-8 (GNAT has supported UTF-8 for years). What representation do you have in mind for supporting full 32-bit characters? **************************************************************** From: Randy Brukardt Sent: Friday, February 4, 2005 10:43 PM > Although theoretically it is not required, I can't imagine an Ada > compiler not supporting stream files in ASCII, and likewise for Ada 2005 > I can't imagine a compiler not supporting UTF-8 (GNAT has supported > UTF-8 for years). If that's true (and I agree it is), then the Standard should stop the charade and require UTF-8 as a source representation. But that wasn't the issue that you raised. You asked what happened if someone *output * a character value > 2**31. That certainly has nothing to do with source representation; it's purely a runtime issue. > What representation do you have in mind for supporting full 32-bit > characters? Where? In memory, 32-bits, I would expect. (Although many applications need in-memory UTF-8 support, because the space used by rarely used 32-bit characters is too much. It's also needed for files as well, if it's impractical to read the file with Text_IO - as in one string component in a larger database record. But that doesn't map to Very_Wide_String.) In Double_Wide_Text_IO, probably the same unless there was a customer demand for something else. **************************************************************** From: Pascal Leroy Sent: Saturday, February 5, 2005 3:40 AM > If W_W_C has 2**31 values, then I can't safely read a text > file without tripping over erroneousness! That seems pretty > bad to me. I would expect Wide_Wide_Text_IO to raise Data_Error if it finds a 32-bit element that is not in W_W_C.
Similarly, I would expect Wide_Wide_Text_IO to raise Data_Error if it reads a UTF-8 file that is improperly encoded (assuming that you support UTF-8 files). I am not sure if this can be deduced from A.10.6(10), so a clarification might help, but I don't see any erroneousness here. Now if you read a file of 32-bit elements using, say, Stream_IO, and you uncheck convert the result to W_W_C, sure, you can get erroneousness, but you don't have my sympathy: you should be doing some checking on the data you get, or you should be using Wide_Wide_Character'Val. But then the situation is not different from, say, Boolean: if you read a raw byte and trust it to be a boolean, you have a problem. So I don't see what this erroneousness fuss is about. **************************************************************** From: Robert Dewar Sent: Saturday, February 5, 2005 8:23 AM Another issue with only allowing the limited range for WWC is that it means there are various constraint checks at run time. These seem undesirable for three reasons. 1. They waste time in the normal case 2. They achieve no useful result 3. They force the programmer to be prepared to handle CE's in situations where this is inconvenient and unexpected. We also have the unfinished business of whether the Interfaces.C routine To_Ada that translates from char32_t to WWC can raise CE. In my view, if it can, then that clearly points out a weakness that the two types do not properly correspond. Note that in Ada 83, the same mistake was made. Character was defined to be 7 bits, and some compilers did annoying checks to make sure that upper half characters were not used. This was a real pain. It is important to realize that this pain had nothing whatever to do with Latin-1. What people wanted was an 8-bit character code where neither the compiler nor the run-time interfered with the representations.
They know what graphic corresponds to 16#A5# in their environment and the RM cannot possibly know, the best the RM can do is not get in the way. Well for 8-bit Character, the RM has lots of details about Latin-1, but most programmers can completely ignore this. For example, in my windows environment, if I want an upper case enya (is that the spelling, I mean the upper case N with a tilde), then I output Character'Val (16#A5#) and it works fine. The fact that the formal model of Ada thinks I am putting out a Yen_Sign is of no earthly relevance to me. Let's not repeat the Ada 83 mistake with Ada 2005. We can make sure the RM supports 10646, but let's not have it get in the way of supporting arbitrary 32-bit character sets at run time. It's bad enough to restrict the contents of string and character literals, but at least that you can get around with WWC'Value etc. But restricting the range is fatal. Even suppressing checks won't work, since the run time will still be riddled with unwanted checks, and the optimizer is always likely to do unhelpful things. **************************************************************** From: Pascal Leroy Sent: Saturday, February 5, 2005 3:47 AM > Actually, this is a problem. UTF-8 has no way of representing > values greater than 31-bits, so if we do allow "upper half" > wide wide characters they cannot be encoded in UTF-8 form. > However, I think this is not terrible it just means that if > you try to output such a value in UTF-8 form, you get an exception. But then why would this value exist in the W_W_C type in the first place? You cannot write it in a UTF-8 file, surely you cannot read it from a UTF-8 file. The same is presumably true for the other "standard" formats: UTF-16 and UCS-4, which are all intended to cover only 31 bits: I don't see why Wide_Wide_Text_IO should allow the creation of incorrect UCS-4 file. So the only advantage of having a 32-bit value internally is presumably to make Unchecked_Conversion safe.
But since this is a type that has attributes 'Pos and 'Val, I don't see why you would use Unchecked_Conversion at all. **************************************************************** From: Robert Dewar Sent: Saturday, February 5, 2005 7:52 AM Let's put it another way round. Pascal, you are proposing a restriction that some of us find a pain in the neck which will cause trouble. We have given our reasons (you did not even respond to my point about arbitrary 32-bit encodings). How about you give some reason for putting *in* the restriction. So far, all we have is a (to my mind pretty bogus) "political" argument that the restriction makes us better conform to 10646. I have said why I think this is bogus a) 10646 requires support of certain stuff, nowhere does it say that languages cannot support more. b) other languages do not implement this restriction Let me give a concrete example of why this will cause trouble. In C, the type char32_t can be used to store arbitrary 32-bit values. The routines in interfaces-c allow such values to be converted to wide_wide_character. Do you really want checks all over the place here and raising of CE if the "conversion" fails. I think not. **************************************************************** From: Robert Dewar Sent: Saturday, February 5, 2005 7:57 AM > But then why would this value exist in the W_W_C type in the first place? > You cannot write it in a UTF-8 file, surely you cannot read it from a > UTF-8 file. The same is presumably true for the other "standard" formats: > UTF-16 and UCS-4, which are all intended to cover only 31 bits: I don't > see why Wide_Wide_Text_IO should allow the creation of incorrect UCS-4 > file. So the only advantage of having a 32-bit value internally is > presumably to make Unchecked_Conversion safe. But since this is a type > that has attributes 'Pos and 'Val, I don't see why you would use > Unchecked_Conversion at all.
First, I have a very realistic scenario for getting WWC values of this type, namely by conversion from C types char32_t. Second, you miss the point I think about unchecked conversion. It's quite a normal programming practice to represent arbitrary sequences of bytes as type Character. Suppose in C you have a sequence of bytes that mixes real character values, and arbitrary integer values. Since C makes no difference between these types, such a mixture is quite reasonable. How do you map that in Ada? You have two choices: 1. Treat as array of character, either by unchecked conversion or simply by acquiring say an address of string from C. Then when you want an integer value, use Character'Pos. 2. Treat as array of unsigned byte, again either by unchecked conversion or simply by acquiring an address. Then when you want a character value, use Character'Val. Neither of these is fully satisfactory as a mapping for the C type, but both are workable. In practice approach 1 is likely to be more convenient, since it retains string literals, and access to the string functions. Well in the 32-bit case, exactly the same situation will arise, but you are arbitrarily insisting that for the 32-bit case, only approach 2 will be used. I find that annoying. That's what the "unchecked conversion" issue is all about. It's about low level mucking with foreign data among other things. **************************************************************** From: Pascal Leroy Sent: Monday, February 7, 2005 8:56 AM > How about you give some reason for putting *in* the > restriction. Ok, let me try to articulate my reasoning better. We have an external definition of a character set that happens to have 2**31 values. Why it was chosen that way by the Unicode folks is irrelevant. What we are trying to do is model this definition in Ada. We have of course many possible ways to do this.
For the sake of the argument, we could use a 4-field record representing the structuring of the Unicode character space in terms of group, plane, row and cell. We would do that if we thought that this structuring was the most important property of the Unicode character set, and that applications would need to access it constantly. Evidently we believe that the structuring of the character space is of minor importance, and that operations like comparison, range, iteration, and of course string literals are essential. So we choose a character type. Whether we choose a character type with 2**32 values or a character type with 2**31 values (and size 32) depends largely on what we think are the most important operations from a user's perspective. - Robert and Bob apparently say that low-level operations like raw I/O, Unchecked_Conversion and interfacing with C will be prevalent for WWC, and therefore want to avoid the erroneousness that might arise if an invalid bit pattern is assigned to an object of WWC. - Pascal thinks that the major reason why users might want to use WWC at runtime is internationalization of applications. In this perspective, they may have to do rather fancy processing on WWC, such as encoding/decoding, sorting/collating, normalization, case conversion, locale-dependent text processing, etc. For this kind of algorithm, having 2**31 extraneous literals would be a nuisance. Users would have to constantly check if some WWC is one of the extraneous values, and failure to do so might lead to subtle bugs (not technically erroneousness, but C_E that would be raised once in a blue moon). To take but one example, on a 32-bit architecture, computations using WWC'Pos in conjunction with operators of root_integer might raise C_E because WWC'Pos would exceed the range of signed arithmetic. I'll put this question in AI 395 and it will be discussed in Paris. > Let me give a concrete example of why this will cause > trouble. 
In C, the type char32_t can be used to store > arbitrary 32-bit values. The routines in interfaces-c > allow such values to be converted to wide_wide_character. > Do you really want checks all over the place here and > raising of CE if the "conversion" fails. I think not. On this specific point, if you read carefully the C document, you'll see that char32_t has *at least* 32 bits. It is guaranteed to provide enough storage for a Unicode character, but I can imagine that on a 64-bit architecture where access to 32-bit quantities would be inefficient, this type might have 64 bits. So strictly speaking, a type WWC with 2**32 values does not necessarily eliminate the checks and/or erroneousness. (I admit that on most architectures it will.) **************************************************************** From: Robert I. Eachus Sent: Monday, February 7, 2005 11:00 AM >Whether we choose a character type with 2**32 values or a character type >with 2**31 values (and size 32) depends largely on what we think are the >most important operations from a user's perspective. For characters in Character and Wide_Character, Wide_Wide_Character'Val(Wide_Character'Pos(X)) should be the same character. This requires that Wide_Wide_Character'Pos(Wide_Wide_Character'First) = 0. This is guaranteed by RM 3.5.1(7). So far so good. But what about Wide_Wide_Character'Pos? It returns a _universal_integer_ so the fact that some values don't fit in Integer is acceptable, but could cause just as many potential conversion headaches if Wide_Wide_Character is defined to have 2**32 values as would Unchecked_Conversion if Wide_Wide_Character has only 2**31 values. I guess this means that I am slightly in favor of 2**31 values, but Wide_Wide_Character'Size = 32. (I'd rather use 'Pos and 'Val for conversions than Unchecked_Conversion where possible.)
**************************************************************** From: Robert Dewar Sent: Monday, February 7, 2005 11:41 AM > - Robert and Bob apparently say that low-level operations like raw I/O, > Unchecked_Conversion and interfacing with C will be prevalent for WWC, and > therefore want to avoid the erroneousness that might arise if an invalid > bit pattern is assigned to an object of WWC. No, we are not saying that this is the prevalent use, just one useful use, which can be accommodated perfectly fine (without compromising in any way the objectives of the AI). > - Pascal thinks that the major reason why users might want to use WWC at > runtime is internationalization of applications. In this perspective, > they may have to do rather fancy processing on WWC, such as > encoding/decoding, sorting/collating, normalization, case conversion, > locale-dependent text processing, etc. For this kind of algorithm, having > 2**31 extraneous literals would be a nuisance. Users would have to > constantly check if some WWC is one of the extraneous values, and failure > to do so might lead to subtle bugs (not technically erroneousness, but C_E > that would be raised once in a blue moon). To take but one example, on a > 32-bit architecture, computations using WWC'Pos in conjunction with > operators of root_integer might raise C_E because WWC'Pos would exceed the > range of signed arithmetic. Why would someone using WWC in a strictly 10646 manner ever have out of range values? Only by doing something peculiar or erroneous. > I'll put this question in AI 395 and it will be discussed in Paris. > >>>Let me give a concrete example of why this will cause >>trouble. In C, the type char32_t can be used to store >>arbitrary 32-bit values. The routines in interfaces-c >>allow such values to be converted to wide_wide_character. >>Do you really want checks all over the place here and >>raising of CE if the "conversion" fails. I think not.
> > On this specific point, if you read carefully the C document, you'll see > that char32_t has *at least* 32 bits. It is guaranteed to provide enough > storage for a Unicode character, but I can imagine that on a 64-bit > architecture where access to 32-bit quantities would be inefficient, this > type might have 64 bits. Please don't imagine, this type will be 32-bits on all architectures, just like int (no one makes int 64 bits!) There are NO architectures on which access to 32-bit quantities is inefficient. Such an architecture would be an ill-designed unusable joke. > So strictly speaking, a type WWC with 2**32 > values does not necessarily eliminate the checks and/or erroneousness. (I > admit that on most architectures it will.) It will on all architectures, unless you can produce one counterexample, it is really unreasonable to argue by appealing to an imaginary situation. The Interfaces.C unit is about the practical world. In your imaginary world, C pointers might be pairs consisting of base address and offset and be totally incompatible with Ada pointers, but in practice this is not the case, so we don't worry about it. **************************************************************** From: Bob Duff Sent: Monday, February 7, 2005 12:02 PM > Ok, let me try to articulate my reasoning better. Thank you. > - Robert and Bob apparently say that low-level operations like raw I/O, > Unchecked_Conversion and interfacing with C will be prevalent for WWC, and > therefore want to avoid the erroneousness that might arise if an invalid > bit pattern is assigned to an object of WWC. My main concern is input. (All input is "raw", I suppose.) I write code that reads sequences of Characters. I use Text_IO. I use Direct_IO. But mainly, I use my own concoction that is "lean and mean", and interfaces fairly directly (and portably!) to the OS. (A couple of weeks ago, I more-than-doubled the speed of one application by switching from Direct_IO to the "lean and mean" thing.)
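The 'Pos/'Val conversion described in the message above can be sketched in Ada as follows. This is only an illustration, not part of the thread: the procedure name and the sample code point 16#00E9# are hypothetical choices, and it assumes an Ada 2005 compiler that provides Wide_Wide_Character.

```ada
with Ada.Text_IO;

--  Hypothetical sketch: widening a Wide_Character through its position
--  number, rather than through Unchecked_Conversion.
procedure Pos_Val_Round_Trip is
   --  Sample value chosen for illustration (LATIN SMALL LETTER E WITH ACUTE).
   W  : constant Wide_Character := Wide_Character'Val (16#00E9#);

   --  Convert via 'Pos and 'Val; 'Pos yields a universal_integer.
   WW : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (Wide_Character'Pos (W));
begin
   --  The position number is preserved by the conversion.
   --  (Assertions may need to be enabled explicitly, e.g. by a compiler switch.)
   pragma Assert (Wide_Wide_Character'Pos (WW) = Wide_Character'Pos (W));

   --  The first value of the type has position number 0.
   pragma Assert (Wide_Wide_Character'Pos (Wide_Wide_Character'First) = 0);

   Ada.Text_IO.Put_Line ("Round trip preserved the position number");
end Pos_Val_Round_Trip;
```

Because 'Val performs a range check, an out-of-range position would raise Constraint_Error here instead of producing an invalid representation, which is the trade-off against Unchecked_Conversion that the thread is debating.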
I want all of these things to be able to read arbitrary data without being erroneous or raising exceptions. It seems to me reasonable to want the same thing for Wide_Wide_Ever_So_Wide_Characters. You suggested Data_Error at one point. That seems like a big change. For plain old Character, Text_IO uses Data_Error for things like malformed floats, not for inability to represent the basic character set being read from the file. > - Pascal thinks that the major reason why users might want to use WWC at > runtime is internationalization of applications. In this perspective, > they may have to do rather fancy processing on WWC, such as > encoding/decoding, sorting/collating, normalization, case conversion, > locale-dependent text processing, etc. I agree that all that stuff is useful. But two points: First, you've got to get data in from files, even in programs that do all of the above. Second, I don't think Data_Error is making life easier for such applications. Now they have to catch Data_Error (and do something about it, necessarily losing information -- you can't tell what the data *is*), and *also* look the thing up in some tables to see if it's a defined code point. (Or more likely, introduce a bug by forgetting about the Data_Error.) I think most of the processing you mention would be *easier* with the suggested 2**32 character literals. My lean and mean package certainly would not want to be raising Data_Error under any circumstances! With 2**31 literals, I think I'd end up ignoring W_W_C, and using My_Wide_Wide_Character is mod 2**32 instead. (And yes, I'd be annoyed that I don't get string literals.) >... To take but one example, on a > 32-bit architecture, computations using WWC'Pos in conjunction with > operators of root_integer might raise C_E because WWC'Pos would exceed the > range of signed arithmetic. That's a very good point, and I don't have a good answer. 
Ada is broken in this regard, and the only good way to fix it is to support arbitrary-range integers, which ain't gonna happen! (I've always been annoyed that I'm not allowed to say "type T is range 1..10**100;". On some implementations.)

****************************************************************

From: Tucker Taft
Sent: Monday, February 7, 2005 12:24 PM

One point perhaps worth mentioning is that just about no-one will be doing I/O of 32-bit characters directly -- it is hopelessly space inefficient. Some kind of variable-length encoding will be used (e.g. UTF-8). So I wonder whether it will make as much difference. I suspect when converting *to* UTF-8 it will be nice to know you have values in the range 0..2**31-1.

****************************************************************

From: Robert Dewar
Sent: Monday, February 7, 2005 12:37 PM

Well of course in the general case you won't know this, e.g. if you got values from char32_t, or other strangeness is going on. If your program is well behaved and keeps to 31-bit, then this would be true even if WWC actually used 32-bits.

****************************************************************

From: Robert Dewar
Sent: Monday, February 7, 2005 1:04 PM

My main concern here is not to repeat the mistakes of Ada 83 and Ada 95. In both cases, we tied the definitions of the character types far too closely to little used standards, and as a result ended up with a bunch of really annoying restrictions. Examples are

1. only 7-bit characters in Ada 83, when all the world was using 8-bits for all sorts of things.

2. only Latin-1 in Ada 95, when in practice, the upper half is used for other things in almost all environments. It is for example an annoyance that you cannot conveniently put all the Windows graphic characters in a string literal.

That's why I would prefer to be permissive, and accommodate the standard, but not restrict to it.
In particular

a) I would allow 32 bits in wide wide character

b) I would allow arbitrary chars with codes > 16#FF# in string and character literals (with the possible exception of line/para terminators).

For sure we do not want to introduce incompatibilities in programs that do not even use WWC, and the rejection of 16#AD# is a real mistake. Minimally we should correct that by allowing other_format characters in strings and characters, but that still leaves incompatibilities with wide_string compared to Ada 95.

Who knows what use people will make of the available upper half in 32-bit character mode in the next ten years? It is sure to be conveniently available in C, since we have 32-bits there for sure.

****************************************************************

From: Robert Dewar
Sent: Monday, February 7, 2005 7:36 PM

> So far so good. But what about Wide_Wide_Character'Pos? It returns a
> _universal_integer_ so the fact that some values don't fit in Integer is
> acceptable, but could cause just as many potential conversion headaches
> if Wide_Wide_Character is defined to have 2**32 values as would
> Unchecked_Conversion if Wide_Wide_Character has only 2**31 values.

I don't see a problem, universal integer is going to be 64 bits in any reasonable implementation anyway.

> I guess this means that I am slightly in favor of 2**31 values, but
> Wide_Wide_Character'Size = 32. (I'd rather use 'Pos and 'Val for
> conversions than Unchecked_Conversion where possible.)

I don't see what problem you are solving by this choice. Can you give sample code showing the "potential .. headaches".

****************************************************************

From: Robert I. Eachus
Sent: Tuesday, February 8, 2005 1:00 AM

> I don't see what problem you are solving by this choice. Can you
> give sample code showing the "potential .. headaches".

Is it really that hard to imagine?
Okay, say you have a file that claims to be ISO10646/Unicode text in some language, let's say Chinese, encoded in four octets per character. So you read the file with an instance of Integer_IO into an array Data_In. To protect against garbage values, you now go through with

   for I in Data_In'Range loop
      if Data_In(I) not in 1..Wide_Wide_Character'Pos(Wide_Wide_Character'Last)
      then Report_Some_Error;
      end if;
   end loop;

What happens if Wide_Wide_Character'Pos(Wide_Wide_Character'Last) = 2**32-1? It is too late at night for me to figure out whether the current 11.6 and 4.6(28) allow compilers to raise *Constraint_Error* in the if statement. ;-)

So in practice that is the net effect of making Wide_Wide_Character'Last 2**32-1 instead of 2**31-1, the data type that a user should choose when converting to an integer type is Interfaces.Unsigned_32 instead of Integer. Is this a big deal? No. But that is what that word *slightly* means above, I think that being able to convert the result of Wide_Wide_Character'Pos to Integer without worrying about exceptions is a bit more elegant. I certainly won't cry and moan if the choice is to use the larger range, but as I said, I have a slight preference.

Oh, and notice that this has nothing to do with the data safety issue as such. IMHO, it is much safer to put constants in programs in the most meaningful way. I could have written if Data_In(I) < 1 then..., in fact, the compiler may change the code into exactly that. But the way I wrote it is much more informative--and more likely to be right if my understanding of the range of Wide_Wide_Character is wrong, or if it changes.

****************************************************************

From: Pascal Leroy
Sent: Tuesday, February 8, 2005 3:05 AM

> I don't see a problem, universal integer is going to be 64
> bits in any reasonable implementation anyway.
I am not quite sure how to interpret this sentence, since *universal_integer* has no run-time representation, and covers the infinite set of integer numbers. I suppose that you mean *root_integer*, not universal_integer above. If that's the case, then that's a ludicrously bogus assertion (do I sound like RBKD? I try ;-). In our technology root_integer has 32 bits on 32-bit platforms, so overflows involving root_integer are a very real problem. And yes, I do claim that this is a perfectly reasonable implementation, and we are not going to change that. (Of course, on 64-bit platforms root_integer has 64 bits.) **************************************************************** From: Pascal Leroy Sent: Tuesday, February 8, 2005 3:44 AM > My main concern is input. (All input is "raw", I suppose.) > I write code that reads sequences of Characters. I use > Text_IO. I use Direct_IO. But mainly, I use my own > concoction that is "lean and mean", and interfaces fairly > directly (and portably!) to the OS. (A couple of weeks ago, > I more-than-doubled the speed of one application by switching > from Direct_IO to the "lean and mean" thing.) I want all of > these things to be able to read arbitrary data without being > erroneous or raising exceptions. As Tuck pointed out, I don't think it works that way in real life. The vast majority of files containing WWC are *not* going to be made of raw 32-bit elements. They will be encoded in one way or another (UTF-8 being apparently the most common format these days). So Wide_Wide_Bob_IO will need to read *bytes* and feed them to some decoding state machine. If that state machine discovers that the external file is not well formed, it could either report the error (by raising an exception, returning a status, etc) or try to recover (which could mean erroneousness). At any rate, you cannot just ignore the problem, and I don't really see that having 2**32 literals would help. 
> It seems to me reasonable to want the same thing for > Wide_Wide_Ever_So_Wide_Characters. > > You suggested Data_Error at one point. That seems like a big > change. For plain old Character, Text_IO uses Data_Error for > things like malformed floats, not for inability to represent > the basic character set being read from the file. Hmm. I wonder how your implementation of Wide_Text_IO works, because this is not really a new problem. In our implementation of Wide_Text_IO, you can specify the encoding of the external file using the Form parameter. Then the bytes read from the external file go through an appropriate decoder, and if the decoder discovers that the external file is malformed it raises Data_Error. Surely this is a case where we want to raise an exception, right? Ada is not about reading random junk from external files. When I wrote this, it seemed that Data_Error was the right exception, because it means "there is something rotten in the external file". We could argue about that, but that's a detail. The point is that you can get an exception when reading a file of Wide_Characters, even though that type has 2**16 elements. Even for Wide_Character, the raw file format that you imagine doesn't exist in practice. The closest would be the UCS-2 format, but even this format starts with a signature that indicates the endianness of the machine used to produce the file, and our implementation raises Data_Error if the signature is not well-formed. **************************************************************** From: Robert Dewar Sent: Tuesday, February 8, 2005 8:16 AM > for I in Data_In'Range loop > if Data_In(I) not in 1..Wide_Wide_Character'Pos(Wide_Wide_Character'Last) > then Report_Some_Error; > end if; > end loop; This code is plain wrong, it can be optimized away, presumably what you are trying to do is to do a 'Valid test, but if so, please use 'Valid. The above code, if executed will work. 
Compilers can always raise CE because of silly capacity limitations, but in practice all compilers support decent sized universal integer at run time. If we have 32 bit characters and you want to check they are in 31 bit range, you can do things in a far simpler way. If you must read in integer values, which seems a clear mistake if you have 32-bit characters, then you can simply test for negative. Better is to read them into WWC and just do: subtype ISO_WWC is WWC range 0 .. WWC'Val(16#7FFF_FFFF); note that for many purposes, you probably want 16#10_FFFF# rather than the full range anyway. Now you just do if Char not in ISO_WWC then ... The above messing with Pos seems awkward to me. > Oh, and notice that this has nothing to do with the data safety issue as > such. IMHO, it is much safer to put constants in programs in the most > meaningful way. I could have written if Data_In(I) < 1 then..., in > fact, the compiler may change the code into exactly that. But the way I > wrote it is much more informative--and more likely to be right if my > understanding of the range of Wide_Wide_Character is wrong, or if it > changes. But it is just plain wrong to read 32-bit unsigned values into Integer, why would any programmer make this mistake? **************************************************************** From: Robert Dewar Sent: Tuesday, February 8, 2005 8:19 AM > I suppose that you mean *root_integer*, not universal_integer above. If > that's the case, then that's a ludicrously bogus assertion (do I sound > like RBKD? I try ;-). In our technology root_integer has 32 bits on > 32-bit platforms, so overflows involving root_integer are a very real > problem. And yes, I do claim that this is a perfectly reasonable > implementation, and we are not going to change that. (Of course, on > 64-bit platforms root_integer has 64 bits.) 
OK, I am surprised, this means for instance that you get into trouble with 'Size on a 32-bit platform, since on such platforms objects can easily have sizes in bits greater than 2**31. To limit objects to sizes of 2**31 or less is a serious limitation in my view.

Anyway, true, if you are working on a platform with this limitation, you will have to be careful using Pos. I certainly see this limitation as an argument for not wanting to implement 32-bit characters, too bad ...

****************************************************************

From: Robert Dewar
Sent: Tuesday, February 8, 2005 10:23 AM

Thinking this over, the fact that a major important implementation does not allow the 'Pos attribute to be applied to WWC if we go 32 bits, is a really serious problem. I really think that it is essential that 'Pos work fine with Wide_Wide_Character. Obviously it is unreasonable to require 64-bit universal integer just for this purpose (and in fact the implementation of 64-bit universal integer is non-trivial on 32-bit machines if you want to avoid unnecessary 64-bit inefficiency).

So, given this information combined with the points made by Eachus, I have changed my mind and think that 31-bit is appropriate for Wide_Wide_Character. (less work for me, that's what I already implemented :-)

We do need to decide what happens with out-of-range char32_t values. I would think CE must be raised. Annoying to do all those junk tests, but in practice this will be seldom used anyway.

****************************************************************

From: Robert Dewar
Sent: Tuesday, February 8, 2005 10:25 AM

Oh, and I do mean root integer. Sorry, can't get out of the habit of thinking of this as universal integer. I have grokked the to-me peculiar terminology in Ada 95, which made sense if you really had Integer'Class but otherwise seems peculiar, anyway, sorry for the confusion there.
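
[Editor's sketch, not part of the original mail.] The "junk tests" mentioned above for out-of-range 32-bit values could look roughly like the following. This is only a sketch under stated assumptions: it uses Interfaces.Unsigned_32 as a stand-in for a C 32-bit character value, the names Checked_Conversion_Demo and To_WWC are hypothetical, and it assumes a Wide_Wide_Character type with 2**31 values as favored at this point in the thread.

```ada
with Ada.Text_IO;
with Interfaces;

--  Hypothetical demo; To_WWC is not an RM-defined function.
procedure Checked_Conversion_Demo is
   use type Interfaces.Unsigned_32;

   --  Convert a raw 32-bit code to Wide_Wide_Character, raising
   --  Constraint_Error for codes outside 0 .. 2**31 - 1.
   function To_WWC (Code : Interfaces.Unsigned_32) return Wide_Wide_Character is
   begin
      if Code > 16#7FFF_FFFF# then
         raise Constraint_Error with "code outside 0 .. 2**31 - 1";
      end if;
      return Wide_Wide_Character'Val (Code);
   end To_WWC;

   C        : Wide_Wide_Character;
   Rejected : Boolean := False;
begin
   --  An in-range code converts and keeps its position number.
   C := To_WWC (16#10_FFFF#);
   if Wide_Wide_Character'Pos (C) /= 16#10_FFFF# then
      raise Program_Error;
   end if;

   --  A code with the top bit set is rejected.
   begin
      C := To_WWC (16#8000_0000#);
   exception
      when Constraint_Error =>
         Rejected := True;
   end;

   if not Rejected then
      raise Program_Error;
   end if;

   Ada.Text_IO.Put_Line ("in-range code accepted, out-of-range code rejected");
end Checked_Conversion_Demo;
```

The explicit range test makes the intent visible; with a 2**31-value type, 'Val alone is also required to raise Constraint_Error when no value has the given position number.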
**************************************************************** From: Pascal Leroy Sent: Tuesday, February 8, 2005 10:43 AM > Obviously it is unreasonable to require > 64-bit universal integer just for this purpose (and in fact > the implementation of 64-bit universal integer is non-trivial > on 32-bit machines if you want to avoid unnecessary 64-bit > inefficiency). Right. We gave some thought to that matter at some point and concluded that it was not an afternoon project. > We do need to decide what happens with out of range char_32t > values. I would think CE must be raised. Annoying to do all > those junk tests, but in practice this will be seldom used anyway. I agree that C_E is the right choice. And I believe that Data_Error is appropriate if you get one of these in a file. **************************************************************** From: Robert Dewar Sent: Tuesday, February 8, 2005 10:53 AM > I agree that C_E is the right choice. And I believe that Data_Error is > appropriate if you get one of these in a file. You mean for Text_IO. As always no checks are required for stream io, sequential io or direct io. **************************************************************** From: Robert I. Eachus Sent: Tuesday, February 8, 2005 3:46 PM > This code is plain wrong, it can be optimized away, presumably what you > are trying to do is to do a 'Valid test, but if so, please use 'Valid. > The above code, if executed will work. But it is not wrong if compiled on a machine with say a 36-bit Integer type. I see no reason not to go to the extra effort of writing what may be "junk" code, if it makes the code easier to port. > Better is to read them into WWC and just do: > > subtype ISO_WWC is WWC range 0 .. WWC'Val(16#7FFF_FFFF); > > note that for many purposes, you probably want 16#10_FFFF# rather > than the full range anyway. > > Now you just do > > if Char not in ISO_WWC then ... > > The above messing with Pos seems awkward to me. 
Actually that should be:

   subtype ISO_WWC is Wide_Wide_Character range
     Wide_Wide_Character'First .. Wide_Wide_Character'Val(16#7FFF_FFFF#);

or

   subtype Unicode is Wide_Wide_Character range
     Wide_Wide_Character'First .. Wide_Wide_Character'Val(16#10_FFFF#);

But the first case is unnecessary if Wide_Wide_Character has 2**31 values. Again a slight argument in favor of doing things that way.

> But it is just plain wrong to read 32-bit unsigned values into Integer,
> why would any programmer make this mistake?

Robert, I think you are ignoring my comments to the effect that all this is minor. As I said, I would use Interfaces.Unsigned_32, instead of Integer if Wide_Wide_Character had more than 2**31 values. Since I would prefer to write Integer instead of Interfaces.Unsigned_32, that translates into a mild preference for the narrower range.

****************************************************************

From: Dan Eilers
Sent: Friday, February 4, 2005 5:11 PM

Is it intended that other-format characters not be allowed in string literals used as operator_symbols?

   x: integer := abs(3);    -- other-format allowed in abs
   y: integer := "abs"(3);  -- other-format not allowed in "abs"

****************************************************************

From: Robert Dewar
Sent: Friday, February 4, 2005 6:04 PM

Definitely we do NOT want to allow other-format characters in this context, since you cannot tell at lexical analysis time the difference between string literals and operator symbols, and you definitely do not want such lexical rules to have to be resolved later. There is nothing in the AI that suggests any such intention.

****************************************************************

From: Randy Brukardt
Sent: Friday, February 4, 2005 6:07 PM

If we allow other_format characters in string literals (as AI-395 suggests), then clearly there is no issue here. If we don't allow them in string literals, I don't see how or why we should allow them in identifiers (of course, that's a recommendation that is out of our hands).
It's most important to be consistent here. (Whatever we do will be "wrong" to someone.)

****************************************************************

From: Robert Dewar
Sent: Friday, February 4, 2005 7:17 PM

> If we allow other_format characters in string literals (as AI-395 suggests),
> then clearly there is no issue here.

Yes, there is, we won't strip them out in string literals, but can we use "a[Mongolian goodness knows what other format char]bs" for the absolute value operator?

****************************************************************

From: Pascal Leroy
Sent: Saturday, February 5, 2005 4:15 AM

Surely a clarification is required, because 6.1(10) is unclear (what does "correspond" mean?). I suppose that the clarification could go either way, but my view would be that you take the sequence of characters from the string literal verbatim, and see if that sequence is appropriate for a reserved word.

As an example, say that - is the infamous soft hyphen. Starting from the string literal "a-bs", you construct the token a-bs, you strip the other_format characters giving abs, and you end up with the reserved word abs.

In other words, the answer to your question would be yes in my view.

****************************************************************

From: Robert Dewar
Sent: Saturday, February 5, 2005 8:05 AM

My goodness, there seems no end to silliness here! What *possible* reasonable justification can you give for it being *useful* to have soft hyphens in the middle of the reserved word abs in a string?

For reserved words in normal use, we do it because they are not distinguishable from identifiers, and though completely and totally useless to have soft hyphens in between the i and f of IF, it's presumably harmless, and we really need this rule to avoid implementation nonsense, given the (somewhat dubious) decision (which we follow because part of the standard) to allow them in identifiers.
For reserved words in strings, the situation is exactly the opposite, strings normally do NOT ignore other-format characters. But when used as operators, you propose they do. This means a completely separate circuit in the compiler to do this stripping.

Right, it's relatively easy to add this nonsense. Probably won't take more than ten minutes to implement, but it is annoying to even spend the ten minutes because:

a) it is a completely useless feature

b) it will waste time in the lexical analyzer for cases where wide characters are not used at all (since you still have to check for them in this special situation).

So far, the wide wide character nonsense, though nastily expensive (binary searches on giant tables etc), only affects the programs (which won't exist in practice) that use wide wide characters in identifiers.

Can't we get out of this mode of fascination with wide wide characters, and get into the mode of doing a reasonable minimal implementation of the standard. You are proposing a new feature here, without a shred of input that says it is useful to anyone at all.

****************************************************************

From: Pascal Leroy
Sent: Saturday, February 5, 2005 10:59 AM

> My goodness, there seems no end to silliness here! What
> *possible* reasonable justification can you give for it being
> *useful* to have soft hyphens in the middle of the reserved
> word abs in a string.

No need to get all excited, I don't care one way or another, I could flip a coin, as long as things are well-defined in the RM.

My assumption was that a compiler would have somewhere a routine to clean up an identifier (by removing other_format) and convert it to upper case. It seemed simpler implementation-wise to say that for operators you would obtain the sequence of characters from the string literal and pass it to the clean-and-upper-case routine. If you think that's misguided, fine.
At any rate, there is no user benefit one way or another, so we should do what's easiest for implementations.

****************************************************************

From: Dan Eilers
Sent: Saturday, February 5, 2005 1:36 PM

> It seemed simpler implementation-wise to say that for operators you would
> obtain the sequence of characters from the string literal and pass it to
> the clean-and-upper-case routine.

I think you would need to clarify exactly when the "cleaning" occurs, so it is clear what happens for "**".

****************************************************************

From: Robert Dewar
Sent: Saturday, February 5, 2005 4:19 PM

> No need to get all excited, I don't care one way or another, I could flip
> a coin, as long as things are well-defined in the RM.

Sorry for getting excited (well you know me, it is not really excitement, more a matter of argument style). Once at an ARG meeting JDI made a proposal and I said "That's the most ludicrous junky proposal I have heard in quite a while". JDI then said "Ah, now that Robert has pointed out that this is ludicrous and junky, I see I must be wrong" :-) :-)

Once I had a funny discussion with someone (can't remember who) on the ARG who assumed that JDI and I did not get along, when in fact we are close friends.

I guess there is one way in which I am quite different from Jean. I argue a point of view energetically, but if I can't convince a reasonable majority, I figure it is either because my point of view is wrong, or I am incompetent to present it. Either way, no point in pursuing things :-)

> My assumption was that a compiler would have somewhere a routine to clean
> up an identifier (by removing other_format) and convert it to upper case.
> It seemed simpler implementation-wise to say that for operators you would
> obtain the sequence of characters from the string literal and pass it to
> the clean-and-upper-case routine. If you think that's misguided, fine.
That's extra overhead, because the more natural way of doing things is to discard other-format characters and do the case folding as an identifier is scanned. This is important, since otherwise you are going to have extra overhead of copying and pay a price even for the case of identifiers with no wide characters.

> At any rate, there is no user benefit one way or another, so we should do
> what's easiest for implementations.

Well surely what is easiest for implementations is to do what they do now for Ada 95. The rules for identifiers have changed, the rules for operator symbols do not need to!

****************************************************************

From: Robert Dewar
Sent: Saturday, February 5, 2005 4:54 PM

> I think you would need to clarify exactly when the "cleaning"
> occurs, so it is clear what happens for "**".

The easiest clarification is to just say that other format characters are either disallowed in strings or significant, period.

****************************************************************

From: Pascal Leroy
Sent: Monday, February 7, 2005 4:15 AM

> I think you would need to clarify exactly when the "cleaning"
> occurs, so it is clear what happens for "**".

You're right, "**" is problematic because the sequence of characters is not appropriate for an identifier. All the more justification for following Robert here: if you put an other_format in the string literal, it does not match one of the acceptable operator names. So "*-*" or "a-bs" are not operator_symbols.

****************************************************************

From: Robert Dewar
Sent: Saturday, February 5, 2005 4:48 PM

I am a little worried about ambiguities introduced by new definitions in packages like Ada.Characters.Handling and Interfaces.C.
Our ASIS builds got blown with a message:

gnatelim-nodes.adb:252:17: ambiguous expression (cannot resolve "To_String")
gnatelim-nodes.adb:252:17: possible interpretation at a-chahan.ads:116
gnatelim-nodes.adb:252:17: possible interpretation at a-chahan.ads:112

I did not look at the sources, but the declarations in question are:

   function To_String
     (Item       : Wide_Wide_String;
      Substitute : Character := ' ') return String;

   function To_String
     (Item       : Wide_String;
      Substitute : Character := ' ') return String;

Now it is true that ASIS is one of the few applications to use wide_character and wide_string types. Furthermore, the code in this particular case is absurd:

   return To_String ("EMPTY KEY!!!");

but it is possible to imagine legitimate cases, e.g. something like

   To_String ("["A325"]"); -- get encoded version of wide char

Wouldn't it be better to add these new declarations to a new child package Ada.Wide_Wide_Characters.Handling? In that case we could put all the wonderful 10646 categorization stuff there too either now or later.

****************************************************************

From: Pascal Leroy
Sent: Monday, February 7, 2005 4:45 AM

Well, it's clear to me that Ada.Characters.Handling was botched in Ada 95 in the sense that all the operations that involve the Wide_ types should have been part of a child named Ada.Wide_Characters.Handling. Then we could just have replicated the structure for the Wide_Wide_ types.

We could follow your suggestion, and if we did we would have for consistency to create a child Ada.Wide_Characters.Handling, too. Unfortunately, the Wide_ operations would have to remain in Ada.Characters.Handling and be renamed in Ada.Wide_Characters.Handling. This would give the impression that the Wide_ types are actually "more important" than the Wide_Wide_ types.

My view is that Wide_Character should probably not be used in new applications (ignoring ASIS).
In this area, there are two categories of applications: those (the vast majority) which do not care about i18n, and use Character and will continue to ignore the fancy character sets; and those which do care about i18n. The latter category should really be using Wide_Wide_Character, not Wide_Character. The reason is that the BMP is where the Unicode guys stuffed the most frequently used characters, but the choice of what went into the BMP was somewhat arbitrary. If you want to do a good job of supporting Asian languages, you must also handle the characters in plane 2 (SIP), i.e. use Wide_Wide_Character.

I don't feel strongly, but it seems rather insufficiently broken.

> but it is possible to imagine legitimate cases, e.g.
> something like To_String ("["A325"]"); -- get encoded
> version of wide char

I don't quite understand this example, btw. It seems that this call would return " " anyway, so it's not exactly useful.

****************************************************************

From: Robert Dewar
Sent: Monday, February 7, 2005 11:36 AM

> Well, it's clear to me that Ada.Characters.Handling was botched in Ada 95
> in the sense that all the operations that involve the Wide_ types should
> have been part of a child named Ada.Wide_Characters.Handling. Then we
> could just have replicated the structure for the Wide_Wide_ types.

Fine but it is not our mission to improve the Ada 95 standard by creating gratuitous incompatibilities. We only introduce non-upward-compatible changes if there is a really good argument. Here you have no good technical argument other than a matter of consistency and taste. That's not good enough for causing troubles by introducing incompatibilities.

>
> We could follow your suggestion, and if we did we would have for
> consistency to create a child Ada.Wide_Characters.Handling, too.
If you like, it's harmless (though fairly useless) to do so.

> Unfortunately, the Wide_ operations would have to remain in
> Ada.Characters.Handling and be renamed in Ada.Wide_Characters.Handling.

I agree, but given your willingness to introduce incompatible change I am
a little surprised at your statement of this as obvious.

> This would give the impression that the Wide_ types are actually "more
> important" than the Wide_Wide_ types.

So what? In any case it's true, Wide_ is more important because of
compatibility issues and current usage. Wide_Wide is not being put in
because of overriding demand from real users!

>
> My view is that Wide_Character should probably not be used in new
> applications (ignoring ASIS).

But compatibility is about worrying about new applications, and why
ignore ASIS? The majority of our users using Wide_Character are doing so
in an ASIS context.

I really think a case has not been made for introducing incompatibilities.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, April 13, 2005 8:05 PM

The wording in 6.1(10) now says:

   The sequence of characters in an operator_symbol shall be identical,
   after conversion to upper case, to the sequence of characters for one
   of the six classes of operators defined in clause 4.5 (in upper case).
   Spaces are not allowed. One or more characters in category other_format
   may be inserted after any graphic_character in the operator_symbol if
   the operator_symbol is a reserved word.

This seems wrong, as an operator_symbol is a string literal. And how
anything with quotes around it could ever "be identical" to "the sequence
of characters for one" of "operators defined in clause 4.5" is beyond me.
The original wording isn't much better, as it talks about "correspond to"
one of the operators (whatever that means).

I suggest adding "and removing the surrounding quotation marks" after
"conversion to upper case". The last sentence needs fixing, too.

   The sequence of characters in an operator_symbol shall be identical,
   after conversion to upper case and removing the surrounding quotation
   marks, to the sequence of characters for one of the six classes of
   operators defined in clause 4.5 (in upper case). Spaces are not
   allowed. One or more characters in category other_format may be
   inserted after any graphic_character in the operator_symbol other than
   the surrounding quotation marks if the operator_symbol is a reserved
   word.

Maybe there is a better way to do it, but I can't think of it off-hand.

****************************************************************
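
For illustration only, the revised 6.1(10) wording suggested above can be
read as the check sketched below. This is a rough sketch under stated
assumptions, not proposed wording and not how any compiler does it. The
names Is_Operator_Symbol, Show and Operator_Symbol_Sketch are hypothetical.
The sketch works on type String only, so the one other_format character it
can see is SOFT HYPHEN (16#AD#), and it assumes the surrounding quotation
marks have already been removed by the caller.

```ada
with Ada.Characters.Handling;
with Ada.Characters.Latin_1;
with Ada.Text_IO;

procedure Operator_Symbol_Sketch is

   --  Assumption: in Latin-1 the only other_format character is
   --  SOFT HYPHEN, so that is the only one this sketch recognizes.
   SHY : constant Character := Ada.Characters.Latin_1.Soft_Hyphen;

   --  Hypothetical helper. Contents is the text of the string literal
   --  with the surrounding quotation marks already removed.
   function Is_Operator_Symbol (Contents : String) return Boolean is
      Folded     : String (1 .. Contents'Length);
      Last       : Natural := 0;
      Had_Format : Boolean := False;
   begin
      for I in Contents'Range loop
         if Contents (I) = SHY then
            --  An other_format character must follow a graphic_character
            --  other than the opening quotation mark.
            if I = Contents'First then
               return False;
            end if;
            Had_Format := True;
         else
            --  Keep everything else, converted to upper case.
            Last := Last + 1;
            Folded (Last) :=
              Ada.Characters.Handling.To_Upper (Contents (I));
         end if;
      end loop;

      declare
         Op : constant String := Folded (1 .. Last);
      begin
         --  Operators that are reserved words: other_format allowed.
         if Op = "AND" or else Op = "OR" or else Op = "XOR"
           or else Op = "MOD" or else Op = "REM"
           or else Op = "ABS" or else Op = "NOT"
         then
            return True;
         end if;

         --  Remaining operators: no other_format characters allowed.
         --  Spaces never match, so they are rejected implicitly.
         return not Had_Format
           and then (Op = "=" or else Op = "/=" or else Op = "<"
                     or else Op = "<=" or else Op = ">"
                     or else Op = ">=" or else Op = "+"
                     or else Op = "-" or else Op = "&"
                     or else Op = "*" or else Op = "/"
                     or else Op = "**");
      end;
   end Is_Operator_Symbol;

   procedure Show (Label : String; Contents : String) is
   begin
      Ada.Text_IO.Put_Line
        (Label & " => " & Boolean'Image (Is_Operator_Symbol (Contents)));
   end Show;

begin
   Show ("and",       "and");               --  TRUE
   Show ("a<SHY>nd",  'a' & SHY & "nd");    --  TRUE
   Show ("<SHY>and",  SHY & "and");         --  FALSE
   Show ("+<SHY>",    '+' & SHY);           --  FALSE
   Show ("/=",        "/=");                --  TRUE
   Show ("a nd",      "a nd");              --  FALSE
end Operator_Symbol_Sketch;
```

The sketch strips the other_format characters before comparing, which is
one reading of "identical ... after conversion to upper case" together
with the last sentence of the wording. A Wide_Wide_String version would
need a classification routine for category other_format, which is not
assumed here.

****************************************************************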