!standard 2.1(1)                                    05-01-26  AI95-00285/13
!standard 2.1(2)
!standard 2.1(3)
!standard 2.1(4)
!standard 2.1(5)
!standard 2.1(7)
!standard 2.1(8)
!standard 2.1(9)
!standard 2.1(10)
!standard 2.1(11)
!standard 2.1(12)
!standard 2.1(13)
!standard 2.1(14)
!standard 2.1(15)
!standard 2.1(16)
!standard 2.1(17)
!standard 0.3(32)
!standard 0.3(34)
!standard 1.1.4(14)
!standard 1.2(8/1)
!standard 2.2(03)
!standard 2.2(04)
!standard 2.2(05)
!standard 2.2(08)
!standard 2.2(09)
!standard 2.3(02)
!standard 2.3(03)
!standard 2.3(05)
!standard 2.3(06)
!standard 2.6(06)
!standard 3.5(27)
!standard 3.5(30-34)
!standard 3.5(37)
!standard 3.5(39)
!standard 3.5(43-51)
!standard 3.5(55)
!standard 3.5(56)
!standard 3.5(59)
!standard 3.5.2(2)
!standard 3.5.2(3)
!standard 3.5.2(4)
!standard 3.5.2(5)
!standard 3.6.3(2)
!standard 3.6.3(4)
!standard A.1(36)
!standard A.1(42)
!standard A.1(49)
!standard A.3(1)
!standard A.3.2(13)
!standard A.3.2(14)
!standard A.3.2(16)
!standard A.3.2(18)
!standard A.3.2(42)
!standard A.3.2(43)
!standard A.3.2(44)
!standard A.3.2(45)
!standard A.3.2(46)
!standard A.3.2(47)
!standard A.3.2(48)
!standard A.3.2(49)
!standard A.4(1)
!standard A.4.1(4)
!standard A.4.8(1)
!standard A.6(1)
!standard A.7(4)
!standard A.7(10)
!standard A.7(13)
!standard A.7(15)
!standard A.11(00)
!standard A.11(01)
!standard A.11(02)
!standard A.11(03)
!standard A.12(01)
!standard A.12.4(01)
!standard B.3(39)
!standard B.3(40)
!standard B.3(60)
!standard C.5(07)
!standard F(04)
!standard F.3(01)
!standard F.3(06)
!standard F.3(19)
!standard F.3(20)
!standard F.3.5(01)
!standard G.1.5(01)
!standard H.4(20)
!class amendment 02-01-23
!status Amendment 200Y 04-09-27
!status WG9 approved 04-11-18
!status ARG Approved 8-0-1 04-09-17
!status work item 02-09-24
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Support for 16-bit and 32-bit characters

!summary

Support is added for program text using the entire set of characters from ISO/IEC 10646:2003, and for operating on
characters outside of the BMP at run-time.

!problem

SC22 directed its working groups to provide support for the ISO/IEC 10646 character set. Resolution 02-24 "Recommendation on Coded Character Sets Support" of the SC22 2002 plenary states:

"JTC 1/SC 22 believes that programming languages should offer the appropriate support for ISO/IEC 10646, and the Unicode character set where appropriate."

Moreover, ISO/IEC 10646:2003 makes use of planes other than the BMP.

!proposal

The essence of this proposal is to allow the source of the program to be written using 16-bit characters (from the BMP) or 32-bit characters. Also, it makes it possible to operate on 32-bit characters at run-time.

The main difficulty in supporting characters beyond Row 00 of the BMP in the program text is to define how identifiers and literals are built (which characters are letters, digits, etc.) and to define the lower/upper case equivalence rules. Fortunately, the Unicode Consortium has already done most of the work for us, so it's only a matter of defining how we want to piggyback on their categorization and conversion rules.

Unicode defines a "character database" which describes all the properties of each character. The most important property for our purposes is the "General Category". General categories are disjoint.
The following categories are of interest for describing Ada program text:

- Letter, Uppercase -- e.g., LATIN CAPITAL LETTER A
- Letter, Lowercase -- e.g., LATIN SMALL LETTER A
- Letter, Titlecase -- e.g., LATIN CAPITAL LETTER L WITH SMALL LETTER J
- Letter, Modifier -- e.g., MODIFIER LETTER APOSTROPHE
- Letter, Other -- e.g., HEBREW LETTER ALEF
- Mark, Non-Spacing -- e.g., COMBINING GRAVE ACCENT
- Mark, Spacing Combining -- e.g., MUSICAL SYMBOL COMBINING AUGMENTATION DOT
- Number, Decimal Digit -- e.g., DIGIT ZERO
- Number, Letter -- e.g., ROMAN NUMERAL TWO
- Other, Control -- e.g., NULL
- Other, Format -- e.g., ACTIVATE ARABIC FORM SHAPING
- Other, Private Use -- e.g.,
- Other, Surrogate -- e.g.,
- Punctuation, Connector -- e.g., LOW LINE
- Separator, Space -- e.g., SPACE
- Separator, Line -- e.g., LINE SEPARATOR
- Separator, Paragraph -- e.g., PARAGRAPH SEPARATOR

(See http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#General_Category_Values for details on the categorization.)

In paragraph 2.1 we define a non-terminal of the grammar for each of the above categories, e.g., letter_uppercase, letter_lowercase, etc.

The characters in the category other_format are effectively ignored in most lexical elements, with the exception that they are illegal in string_literals and character_literals.

Throughout the syntax rules, we specify which characters are allowed for the lexical elements. For instance, the E in the exponent part of a numeric literal may not be a "GREEK CAPITAL LETTER EPSILON", even though a capital E and a capital epsilon look very much the same. Similar considerations apply to the extended digits, the point, etc.
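To make the categories listed above concrete, here is a minimal sketch that queries a few of them at run-time. It assumes the package Ada.Wide_Wide_Characters.Handling, which is not part of this AI (it was standardized later, in Ada 2012), and the procedure name Category_Demo and the helper Report are purely illustrative. Its Is_Letter and Is_Mark functions each group several General Categories, so the sketch shows the grouping rather than the individual categories.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;  --  Assumption: Ada 2012 or later

procedure Category_Demo is
   package H renames Ada.Wide_Wide_Characters.Handling;

   --  Print which of four category groups the character C belongs to.
   procedure Report (Name : String; C : Wide_Wide_Character) is
   begin
      Ada.Text_IO.Put_Line
        (Name
         & ": letter=" & Boolean'Image (H.Is_Letter (C))
         & " mark=" & Boolean'Image (H.Is_Mark (C))
         & " connector=" & Boolean'Image (H.Is_Punctuation_Connector (C))
         & " format=" & Boolean'Image (H.Is_Other_Format (C)));
   end Report;
begin
   Report ("LATIN CAPITAL LETTER A", 'A');
   Report ("HEBREW LETTER ALEF", Wide_Wide_Character'Val (16#05D0#));
   Report ("COMBINING GRAVE ACCENT", Wide_Wide_Character'Val (16#0300#));
   Report ("LOW LINE", '_');
   Report ("ZERO WIDTH JOINER", Wide_Wide_Character'Val (16#200D#));
end Category_Demo;
```

The characters outside Row 00 are built from their code positions so that the sketch does not depend on the (implementation-defined) encoding of the source text.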
Unicode proposes to define identifiers for programming languages as follows (see annex 7 of UAX #15 at http://www.unicode.org/reports/tr15/tr15-23.html#Programming_Language_Identifiers):

   identifier ::= identifier_start {identifier_start | identifier_extend}

   identifier_start ::= letter_uppercase |
                        letter_lowercase |
                        letter_titlecase |
                        letter_modifier |
                        letter_other |
                        number_letter

   identifier_extend ::= mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         punctuation_connector |
                         other_format

This definition was made with C in mind, and is not exactly appropriate for Ada, as it would allow consecutive underlines. Because the underline is the only character of Row 00 of the BMP which is a punctuation_connector, it seems sensible to remain close to the existing syntax rules of 2.3(2-3), and to use the following definitions:

   identifier_start ::= letter_uppercase |
                        letter_lowercase |
                        letter_titlecase |
                        letter_modifier |
                        letter_other |
                        number_letter

   identifier_extend ::= identifier_start |
                         mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         other_format

   identifier ::= identifier_start {[punctuation_connector] identifier_extend}

Unicode recommends that, before storing or comparing identifiers, the following transformations be applied:

o Characters in category other_format are filtered out.

o For languages which have case insensitive identifiers, Normalization Form KC is applied (see http://www.unicode.org/reports/tr15/tr15-23.html#Specification). This is to ensure that identifiers which look visually the same are considered as identical, even if they are composed of different characters.

o _Full_ case folding, as described in the table http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt, is used to find the uppercase version of each character.

We decided not to apply Normalization Form KC, as there seems to be insufficient experience with using normalization forms.
This seems to be a lose-lose situation anyway: without normalization, texts that look alike don't have the same meaning; with normalization the widely available text tools like grep, awk, etc. don't work. We follow the lead of C# (ECMA-334) in specifying that a program which is not in Normalization Form KC has an implementation-defined effect. This ensures that a program text which is normalized is portable. It also allows an implementation to provide useful support for non-normalized texts if appropriate in a particular computing environment (in that case, the implementation must document how it handles such texts).

Unicode doesn't provide guidance for the composition of numeric literals, so we don't change them. The use of the digits at positions 16#30# to 16#39# is universal in computer science, and allowing digits from other cultures could cause confusion while bringing little benefit.

The definition and role of format_effectors are modified to include the characters at positions 16#85#, 16#2028# and 16#2029#. These characters may be used to terminate lines, as recommended by section 5.8 of Unicode 4.0 (see http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G10213).

Note that characters in category other_format are forbidden in character_literals and string_literals, because their sole purpose is to affect the presentation of characters. If a program needs to operate on these characters, it can do that by using Wide_Wide_Character'Val (...).

Private use characters are not considered to be graphic characters (even though for some applications they may actually turn out to be graphic). The reason is that we wouldn't be able to define the normalization and case folding rules for these characters, so it seems better to disallow them, except in comments where they cannot do any harm.
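As noted above, a character in category other_format cannot be written in a character_literal or string_literal, but it can be obtained from its code position. The following is a minimal sketch of that approach, assuming the Wide_Wide_Character and Wide_Wide_String declarations proposed here; the procedure name Format_Char_Demo and the choice of ZERO WIDTH JOINER (16#200D#) are illustrative only.

```ada
with Ada.Text_IO;

procedure Format_Char_Demo is
   --  ZERO WIDTH JOINER is in category other_format, so it may not
   --  appear in a literal; it is built from its code position instead.
   ZWJ : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#200D#);

   --  A three-character string with the joiner in the middle.
   Text : constant Wide_Wide_String := (1 => 'a', 2 => ZWJ, 3 => 'b');
begin
   Ada.Text_IO.Put_Line ("Length:" & Integer'Image (Text'Length));
   Ada.Text_IO.Put_Line
     ("Code position of middle character:"
      & Integer'Image (Wide_Wide_Character'Pos (Text (2))));
end Format_Char_Demo;
```

An aggregate is used rather than a string_literal precisely because the literal form is not available for this character.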
We are removing 3.5.2(5) since an implementation may want to provide a nonstandard mode where the set of graphic characters is not a proper subset of that defined in ISO/IEC 10646:2003, for instance to deal with private use characters. We don't want to prevent implementations from doing anything useful. This paragraph has no force anyway, since in a nonstandard mode an implementation may do pretty much what it likes.

In order to represent 32-bit characters at run-time, we add new declarations to Standard. We also provide the following new predefined packages for 32-bit characters:

   Ada.Strings.Wide_Wide_Bounded
   Ada.Strings.Wide_Wide_Fixed
   Ada.Strings.Wide_Wide_Maps
   Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants
   Ada.Strings.Wide_Wide_Unbounded
   Ada.Wide_Wide_Text_IO
   Ada.Wide_Wide_Text_IO.Text_Streams
   Ada.Wide_Wide_Text_IO.Complex_IO
   Ada.Wide_Wide_Text_IO.Editing

These packages are similar to their Wide_ equivalents, with Wide_Wide_ substituted for Wide_ everywhere. In addition the following declaration is present in Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants:

   Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;

It contains each Wide_Wide_Character value in the BMP of ISO/IEC 10646:2003.

The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width are also provided. Their definition is similar to that of Wide_Image, Wide_Value and Wide_Width, respectively, with Wide_Character and Wide_String replaced by Wide_Wide_Character and Wide_Wide_String.

Note that the dynamic semantics of a number of operations (attribute Value, procedures Get in Text_IO, procedures Trim in the string packages, etc.) are defined in terms of "space" and "blank". A space is the character at position 16#20# and a blank is either a space or a horizontal tabulation. We are not changing the definition of space or blank, so characters like NO-BREAK SPACE or IDEOGRAPHIC SPACE are not considered to be space or blank in this context.
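The consequence of keeping the definition of space unchanged can be sketched with the new Wide_Wide_Value attribute. This is an illustration under the rules described above, not a normative example; the procedure name Blank_Demo and the constant names are hypothetical.

```ada
with Ada.Text_IO;

procedure Blank_Demo is
   --  NO-BREAK SPACE, code position 16#A0#; not a space for 'Value.
   NBSP : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#A0#);

   --  Padded only with characters at position 16#20#.
   Padded : constant Wide_Wide_String := "  42 ";

   --  Same digits, but preceded by a NO-BREAK SPACE.
   With_NBSP : constant Wide_Wide_String := (1 => NBSP, 2 => '4', 3 => '2');
begin
   --  Leading and trailing spaces are ignored, so this call succeeds.
   Ada.Text_IO.Put_Line
     ("Padded:" & Integer'Image (Integer'Wide_Wide_Value (Padded)));

   --  NO-BREAK SPACE is not ignored, so the string does not have the
   --  syntax of an integer literal and Constraint_Error is expected.
   begin
      Ada.Text_IO.Put_Line
        ("With_NBSP:" & Integer'Image (Integer'Wide_Wide_Value (With_NBSP)));
   exception
      when Constraint_Error =>
         Ada.Text_IO.Put_Line ("With_NBSP: Constraint_Error");
   end;
end Blank_Demo;
```

A program that wants to accept such characters as padding would have to strip them itself before calling the attribute.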
SC22/WG14 is planning to include support for Unicode 16- and 32-bit characters in C. Their proposal is presented in ISO/IEC TR 19769:2004 (http://www.open-std.org/jtc1/sc22//WG14/www/docs/n1040.pdf). In order to provide compatibility with the upcoming C standard, new types are added to Interfaces.C that correspond to C char16_t and char32_t. It is recognized that adding new declarations to predefined units can cause incompatibilities, but it is thought that the new identifiers are unlikely to conflict with existing code. There has been considerable discussion in the ARG regarding the best reference material to use for this AI. ISO/IEC 10646:2003, ISO/IEC TR 10176 (4th edition) and Unicode 4.0 are all relevant. To clarify the matter, we presented an earlier version of this AI (version 8) to the September 2004 SC22 plenary meeting in Jeju, Korea (document N3758). SC22 passed the following resolution: "Resolution 04-15: Coded Character Sets: JTC 1/SC 22 agrees that the proposed implementation of coded character set support described in document N 3758 agrees with the principles for coded character set support previously adopted by SC 22, notably resolution 02-24. JTC 1/SC 22 instructs WG 9 to consider referencing ISO/IEC TR 10176 Annex A in the revision of the Ada language standard." The AARM note in section 2.1(4-14) of the wording explains why we decided to use ISO/IEC 10646:2003 and Unicode 4.0 instead of ISO/IEC TR 10176. !wording In Introduction (32) change: ... Character, [and Wide_Character]{Wide_Character, and Wide_Wide_Character} ... In Introduction (34) change: ... String [and Wide_String]{, Wide_String, and Wide_Wide_String} ... Add after 1.1.4(14): The delimiters, compound delimiters, reserved words, and numeric_literals are exclusively made of the characters whose code position is between 16#20# and 16#7E#, inclusively. The special characters for which names are defined in this International Standard (see 2.1) belong to the same range. 
[For example, the character E in the definition of exponent is the character whose name is "LATIN CAPITAL LETTER E", not "GREEK CAPITAL LETTER EPSILON".] Replace 1.2(8) by: ISO/IEC 10646:2003, Information technology - Universal Multiple-Octet Coded Character Set (UCS) Replace 2.1(1) by: The characters whose code position is 16#FFFE# or 16#FFFF# are not allowed anywhere in the text of a program. The characters in categories other_control, other_private_use, and other_surrogate are only allowed in comments. Delete 2.1(2-3). Replace 2.1(4-14) by: The character repertoire for the text of an Ada program consists of the collection of characters described by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. The coded representation for these characters is implementation defined (it need not be a representation defined within ISO/IEC 10646:2003). The semantics of an Ada program whose text is not in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) are implementation defined. The description of the language definition in this International Standard uses the character properties General Category, Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of the documents referenced by the note in section 1 of ISO/IEC 10646:2003. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified. The categories of characters are defined as follows: letter_uppercase Any character whose General Category is defined to be "Letter, Uppercase". letter_lowercase Any character whose General Category is defined to be "Letter, Lowercase". letter_titlecase Any character whose General Category is defined to be "Letter, Titlecase". letter_modifier Any character whose General Category is defined to be "Letter, Modifier". letter_other Any character whose General Category is defined to be "Letter, Other". 
mark_non_spacing Any character whose General Category is defined to be "Mark, Non-Spacing". mark_spacing_combining Any character whose General Category is defined to be "Mark, Spacing Combining". number_decimal_digit Any character whose General Category is defined to be "Number, Decimal Digit". number_letter Any character whose General Category is defined to be "Number, Letter". other_control Any character whose General Category is defined to be "Other, Control". other_format Any character whose General Category is defined to be "Other, Format". other_private_use Any character whose General Category is defined to be "Other, Private Use". other_surrogate Any character whose General Category is defined to be "Other, Surrogate". punctuation_connector Any character whose General Category is defined to be "Punctuation, Connector". separator_space Any character whose General Category is defined to be "Separator, Space". separator_line Any character whose General Category is defined to be "Separator, Line". separator_paragraph Any character whose General Category is defined to be "Separator, Paragraph". format_effector The characters whose code position is 16#09# (CHARACTER TABULATION), 16#0A# (LINE FEED(LF)), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED(FF)), 16#0D# (CARRIAGE RETURN(CR)), 16#85# (NEXT LINE(NEL)), and the characters in categories separator_line and separator_paragraph. The names mentioned in parentheses in this list are not defined by ISO/IEC 10646:2003; they are only used for convenience in this International Standard. graphic_character Any character which is not in the categories other_control, other_private_use, other_surrogate, other_format, format_effector, and whose code position is neither 16#FFFE# nor 16#FFFF#. 
AARM NOTE We considered basing the definition of lexical elements on Annex A of ISO/IEC TR 10176 (4th edition), which lists the characters which should be supported in identifiers for all programming languages, but we finally decided against this option. Note that it is not our intent to diverge from ISO/IEC TR 10176, except to the extent that ISO/IEC TR 10176 itself diverges from ISO/IEC 10646:2003 (which is the case at the time of this writing). More precisely, we intend to align strictly with ISO/IEC 10646:2003. It must be noted that ISO/IEC TR 10176 is a Technical Report while ISO/IEC 10646:2003 is a Standard. If one has to make a choice, one should conform with the Standard rather than with the Technical Report. And, it turns out that one *must* make a choice because there are important differences between the two: o ISO/IEC TR 10176 is still based on ISO/IEC 10646:2000 while ISO/IEC 10646:2003 has already been published for a year. o There are considerable differences between the two editions of ISO/IEC 10646, notably in supporting characters beyond the BMP (this might be significant for some languages, e.g. Korean). o ISO/IEC TR 10176 is a moving target. It is in its fourth edition already, and nevertheless needs additional revision to catch up with ISO/IEC 10646:2003. We cannot afford to revise the Ada language and the vendors cannot afford to change the compilers each time ISO/IEC TR 10176 changes. And we cannot afford to delay the adoption of our amendment until ISO/IEC TR 10176 has been revised; we would run out of interest, money, and the ISO time table before then. o ISO/IEC TR 10176 does not define case conversion tables, which are essential for a case-insensitive language like Ada. To get case conversion tables, we would have to reference either ISO/IEC 10646:2003 or Unicode, or we would have to invent our own. 
For the purpose of defining the lexical elements of the language, we need character properties like categorization, as well as case conversion tables. These are mentioned in ISO/IEC 10646:2003 as useful for implementations, with a reference to Unicode. Machine-readable tables are available on the web at URLs: http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt with an explanatory document found at URL: http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html The actual text of the standard only makes specific references to the corresponding clauses of ISO/IEC 10646:2003, not to Unicode. END AARM NOTE Change 2.1(15): Replace the leading sentence by: The following names are used when referring to certain characters (the first name is that given in ISO/IEC 10646:2003): Replace "symbol" by "graphic symbol" in the column headers. Delete the last four characters (Ada doesn't use brackets). Add ! Exclamation Mark and % Percent Sign to the table. Replace the AARM Note: This table serves to show the correspondence between ISO/IEC 10646:2003 names and the graphic symbols (glyphs) used in this International Standard. These are the characters that play a special role in the syntax of Ada. Delete 2.1(16). Delete 2.1(17). Replace 2.2(3-5) by: In some cases an explicit _separator_ is required to separate adjacent lexical elements. A separator is any of a separator_space, a format_effector or the end of a line, as follows: o A separator_space is a separator except within a comment, a string_literal, or a character_literal. o Character Tabulation is a separator except within a comment. 
Replace 2.2(8) by:

A delimiter is either one of the following characters:

Replace 2.3(2-3) by:

   identifier_start ::= letter_uppercase |
                        letter_lowercase |
                        letter_titlecase |
                        letter_modifier |
                        letter_other |
                        number_letter

   identifier_extend ::= identifier_start |
                         mark_non_spacing |
                         mark_spacing_combining |
                         number_decimal_digit |
                         other_format

   identifier ::= identifier_start {[punctuation_connector] identifier_extend}

Replace 2.3(5) by:

Two identifiers are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order):

o The characters in category other_format are eliminated.

o Locale-independent full case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003, is applied to obtain the uppercase version of each character.

Add a note after 2.3(6):

Identifiers differing only in the use of corresponding upper and lower case letters are considered the same.

Add after 2.6(6):

No modification is performed on the sequence of characters in a string_literal.

Replace 3.5(28-29) by:

S'Wide_Wide_Image
   S'Wide_Wide_Image denotes a function with the following specification:

      function S'Wide_Wide_Image (Arg : S'Base) return Wide_Wide_String;

Add after 3.5(34):

S'Wide_Image
   S'Wide_Image denotes a function with the following specification:

      function S'Wide_Image (Arg : S'Base) return Wide_String;

   The function returns an image of the value of Arg as a Wide_String. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Wide_Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).

Replace 3.5(37) by:

The function returns an image of the value of Arg as a String. The lower bound of the result is one.
The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).

Add after 3.5(37):

S'Wide_Wide_Width
   S'Wide_Wide_Width denotes the maximum length of a Wide_Wide_String returned by S'Wide_Wide_Image over all the values of S. It denotes zero for a subtype that has a null range. Its type is universal_integer.

Replace 3.5(40-45) by:

S'Wide_Wide_Value
   S'Wide_Wide_Value denotes a function with the following specification:

      function S'Wide_Wide_Value (Arg : Wide_Wide_String) return S'Base;

   This function returns a value given an image of the value as a Wide_Wide_String, ignoring any leading or trailing spaces.

   For the evaluation of a call on S'Wide_Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Wide_Image for a nongraphic character of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised.

   For the evaluation of a call on S'Wide_Wide_Value for an integer subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an integer literal, with an optional leading sign character (plus or minus for a signed type; only plus for a modular type), and the corresponding numeric value belongs to the base range of the type of S, then that value is the result; otherwise Constraint_Error is raised.
For the evaluation of a call on S'Wide_Wide_Value for a real subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of one of the following:

Add after 3.5(51):

S'Wide_Value
   S'Wide_Value denotes a function with the following specification:

      function S'Wide_Value (Arg : Wide_String) return S'Base;

   This function returns a value given an image of the value as a Wide_String, ignoring any leading or trailing spaces.

   For the evaluation of a call on S'Wide_Value for an enumeration subtype S, if the sequence of characters of the parameter (ignoring leading and trailing spaces) has the syntax of an enumeration literal and if it corresponds to a literal of the type of S (or corresponds to the result of S'Wide_Image for a value of the type), the result is the corresponding enumeration value; otherwise Constraint_Error is raised. For a numeric subtype S, the evaluation of a call on S'Wide_Value with Arg of type Wide_String is equivalent to a call on S'Wide_Wide_Value for a corresponding Arg of type Wide_Wide_String.

At the end of 3.5(55) change:

... to a call on [S'Wide_Value]{S'Wide_Wide_Value} for a corresponding Arg of type [Wide_String]{Wide_Wide_String}.

In 3.5(56) change:

... {Wide_Wide_Value,} Wide_Value, Value, {Wide_Wide_Image,} Wide_Image, and Image ...

In 3.5(59) change:

... as [does]{do} S'Wide_Value (S'Wide_Image (V)) {and S'Wide_Wide_Value (S'Wide_Wide_Image (V))} {None of these expressions}[Neither expression] ever...

In the middle of 3.5.2(2), change:

... the attributes [(Wide_)Image and (Wide_)Value]{Image, Wide_Image, Wide_Wide_Image, Value, Wide_Value, and Wide_Wide_Value}

Replace 3.5.2(3) with:

The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding character_literal in Wide_Character.
The first 256 values of Wide_Character have the same character_literal or language-defined name as defined for Character. Each of the graphic_characters has a corresponding character_literal.

The predefined type Wide_Wide_Character is a character type whose values correspond to the 2147483648 code positions of the ISO/IEC 10646:2003 character set. Each of the graphic_characters has a corresponding character_literal in Wide_Wide_Character. The first 65536 values of Wide_Wide_Character have the same character_literal or language-defined name as defined for Wide_Character.

In types Wide_Character and Wide_Wide_Character, the characters whose code positions are 16#FFFE# and 16#FFFF# are assigned the language-defined names FFFE and FFFF. The other characters whose code position is larger than 16#FF# and which are not graphic_characters have language-defined names which are formed by appending to the string "Character_" the representation of their code position in hexadecimal as eight extended digits. As with other language-defined names, these names are usable only with the attributes (Wide_)Wide_Image and (Wide_)Wide_Value; they are not usable as enumeration literals.

In 3.5.2(4) change:

... Character [and Wide_Character]{, Wide_Character, and Wide_Wide_Character} ...

Delete 3.5.2(5).

Replace 3.6.3(2) by:

There are three predefined string types, String, Wide_String, and Wide_Wide_String, each indexed by the value of the predefined subtype Positive; these are declared in the visible part of package Standard:

Replace 3.6.3(4) by:

   type String is array (Positive range <>) of Character;
   type Wide_String is array (Positive range <>) of Wide_Character;
   type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;

Fix the list in A(2/1).

Add in the middle of A.1(36):

   -- The declaration of type Wide_Wide_Character is based on the full
   -- ISO/IEC 10646:2003 character set. The first 65536 positions have the
   -- same contents as type Wide_Character. See 3.5.2.
   type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF, ...);

Add after A.1(42):

   type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
   pragma Pack (Wide_Wide_String);

   -- The predefined operators for this type correspond to those for String.

Replace the beginning of A.1(49) by:

In each of the types Character [and Wide_Character]{, Wide_Character, and Wide_Wide_Character} ...

In A.3(1) change:

... with Wide_Character {and Wide_Wide_Character} data ...

In A.3.2(13) change:

... between {Wide_Wide_Character, } Wide_Character{,} ...

Add after A.3.2(14):

   function Is_Character (Item : in Wide_Wide_Character) return Boolean;
   function Is_String (Item : in Wide_Wide_String) return Boolean;
   function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean;
   function Is_Wide_String (Item : in Wide_Wide_String) return Boolean;

Add after A.3.2(16):

   function To_Character (Item : in Wide_Wide_Character;
                          Substitute : in Character := ' ') return Character;
   function To_String (Item : in Wide_Wide_String;
                       Substitute : in Character := ' ') return String;

Add after A.3.2(18):

   function To_Wide_Character (Item : in Wide_Wide_Character;
                               Substitute : in Wide_Character := ' ')
      return Wide_Character;
   function To_Wide_String (Item : in Wide_Wide_String;
                            Substitute : in Wide_Character := ' ')
      return Wide_String;

   function To_Wide_Wide_Character (Item : in Character)
      return Wide_Wide_Character;
   function To_Wide_Wide_String (Item : in String)
      return Wide_Wide_String;
   function To_Wide_Wide_Character (Item : in Wide_Character)
      return Wide_Wide_Character;
   function To_Wide_Wide_String (Item : in Wide_String)
      return Wide_Wide_String;

Replace A.3.2(42-48) by:

The following functions test Wide_Wide_Character or Wide_Character values for membership in Wide_Character or Character, or convert between corresponding characters of Wide_Wide_Character, Wide_Character, and Character.
function Is_Character (Item : in Wide_Character) return Boolean; Returns True if Wide_Character'Pos(Item) <= Character'Pos(Character'Last). function Is_Character (Item : in Wide_Wide_Character) return Boolean; Returns True if Wide_Wide_Character'Pos(Item) <= Character'Pos(Character'Last). function Is_Wide_Character (Item : in Wide_Wide_Character) return Boolean; Returns True if Wide_Wide_Character'Pos(Item) <= Wide_Character'Pos(Wide_Character'Last). function Is_String (Item : in Wide_String) return Boolean; function Is_String (Item : in Wide_Wide_String) return Boolean; Returns True if Is_Character(Item(I)) is True for each I in Item'Range. function Is_Wide_String (Item : in Wide_Wide_String) return Boolean; Returns True if Is_Wide_Character(Item(I)) is True for each I in Item'Range. function To_Character (Item : in Wide_Character; Substitute : in Character := ' ') return Character; function To_Character (Item : in Wide_Wide_Character; Substitute : in Character := ' ') return Character; Returns the Character corresponding to Item if Is_Character(Item), and returns the Substitute Character otherwise. function To_Wide_Character (Item : in Character) return Wide_Character; Returns the Wide_Character X such that Character'Pos(Item) = Wide_Character'Pos (X). function To_Wide_Character (Item : in Wide_Wide_Character; Substitute : in Wide_Character := ' ') return Wide_Character; Returns the Wide_Character corresponding to Item if Is_Wide_Character(Item), and returns the Substitute Wide_Character otherwise. function To_Wide_Wide_Character (Item : in Character) return Wide_Wide_Character; Returns the Wide_Wide_Character X such that Character'Pos(Item) = Wide_Wide_Character'Pos (X). function To_Wide_Wide_Character (Item : in Wide_Character) return Wide_Wide_Character; Returns the Wide_Wide_Character X such that Wide_Character'Pos(Item) = Wide_Wide_Character'Pos (X). 
function To_String (Item : in Wide_String; Substitute : in Character := ' ') return String; function To_String (Item : in Wide_Wide_String; Substitute : in Character := ' ') return String; Returns the String whose range is 1..Item'Length and each of whose elements is given by To_Character of the corresponding element in Item. function To_Wide_String (Item : in String) return Wide_String; Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item. function To_Wide_String (Item : in Wide_Wide_String; Substitute : in Wide_Character := ' ') return Wide_String; Returns the Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Character of the corresponding element in Item with the given Substitute Wide_Character. function To_Wide_Wide_String (Item : in String) return Wide_Wide_String; function To_Wide_Wide_String (Item : in Wide_String) return Wide_Wide_String; Returns the Wide_Wide_String whose range is 1..Item'Length and each of whose elements is given by To_Wide_Wide_Character of the corresponding element in Item. Delete A.3.2(49). In A.4(1) change: ... [both] String [and Wide_String]{, Wide_String, and Wide_Wide_String} ... Add after A.4.1(4): Wide_Wide_Space : constant Wide_Wide_Character := ' '; Add after A.4.7 a new section, A.4.8: A.4.8 Wide_Wide_String Handling Facilities for handling strings of Wide_Wide_Character components are found in the packages Strings.Wide_Wide_Maps, Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, Strings.Wide_Wide_Unbounded, and Strings.Wide_Wide_Maps.Wide_Wide_Constants. They provide the same string-handling operations as the corresponding packages for strings of Character components. Static Semantics The package Strings.Wide_Wide_Maps has the following declaration. 
package Ada.Strings.Wide_Wide_Maps is pragma Preelaborate(Wide_Wide_Maps); -- Representation for a set of Wide_Wide_Character values: type Wide_Wide_Character_Set is private; Null_Set : constant Wide_Wide_Character_Set; type Wide_Wide_Character_Range is record Low : Wide_Wide_Character; High : Wide_Wide_Character; end record; -- Represents Wide_Wide_Character range Low..High type Wide_Wide_Character_Ranges is array (Positive range <>) of Wide_Wide_Character_Range; function To_Set (Ranges : in Wide_Wide_Character_Ranges) return Wide_Wide_Character_Set; function To_Set (Span : in Wide_Wide_Character_Range) return Wide_Wide_Character_Set; function To_Ranges (Set : in Wide_Wide_Character_Set) return Wide_Wide_Character_Ranges; function "=" (Left, Right : in Wide_Wide_Character_Set) return Boolean; function "not" (Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set; function "and" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set; function "or" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set; function "xor" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set; function "-" (Left, Right : in Wide_Wide_Character_Set) return Wide_Wide_Character_Set; function Is_In (Element : in Wide_Wide_Character; Set : in Wide_Wide_Character_Set) return Boolean; function Is_Subset (Elements : in Wide_Wide_Character_Set; Set : in Wide_Wide_Character_Set) return Boolean; function "<=" (Left : in Wide_Wide_Character_Set; Right : in Wide_Wide_Character_Set) return Boolean renames Is_Subset; -- Alternative representation for a set of Wide_Wide_Character values: subtype Wide_Wide_Character_Sequence is Wide_Wide_String; function To_Set (Sequence : in Wide_Wide_Character_Sequence) return Wide_Wide_Character_Set; function To_Set (Singleton : in Wide_Wide_Character) return Wide_Wide_Character_Set; function To_Sequence (Set : in Wide_Wide_Character_Set) return Wide_Wide_Character_Sequence; -- Representation for a 
Wide_Wide_Character to Wide_Wide_Character -- mapping: type Wide_Wide_Character_Mapping is private; function Value (Map : in Wide_Wide_Character_Mapping; Element : in Wide_Wide_Character) return Wide_Wide_Character; Identity : constant Wide_Wide_Character_Mapping; function To_Mapping (From, To : in Wide_Wide_Character_Sequence) return Wide_Wide_Character_Mapping; function To_Domain (Map : in Wide_Wide_Character_Mapping) return Wide_Wide_Character_Sequence; function To_Range (Map : in Wide_Wide_Character_Mapping) return Wide_Wide_Character_Sequence; type Wide_Wide_Character_Mapping_Function is access function (From : in Wide_Wide_Character) return Wide_Wide_Character; private ... -- not specified by the language end Ada.Strings.Wide_Wide_Maps; The context clause for each of the packages Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, and Strings.Wide_Wide_Unbounded identifies Strings.Wide_Wide_Maps instead of Strings.Maps. For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and Strings.Maps.Constants the corresponding wide wide string package has the same contents except that o Wide_Wide_Space replaces Space o Wide_Wide_Character replaces Character o Wide_Wide_String replaces String o Wide_Wide_Character_Set replaces Character_Set o Wide_Wide_Character_Mapping replaces Character_Mapping o Wide_Wide_Character_Mapping_Function replaces Character_Mapping_Function o Wide_Wide_Maps replaces Maps o Bounded_Wide_Wide_String replaces Bounded_String o Null_Bounded_Wide_Wide_String replaces Null_Bounded_String o To_Bounded_Wide_Wide_String replaces To_Bounded_String o To_Wide_Wide_String replaces To_String o Unbounded_Wide_Wide_String replaces Unbounded_String o Null_Unbounded_Wide_Wide_String replaces Null_Unbounded_String o Wide_Wide_String_Access replaces String_Access o To_Unbounded_Wide_Wide_String replaces To_Unbounded_String The following additional declarations are present in Strings.Wide_Wide_Maps.Wide_Wide_Constants: Character_Set : 
constant Wide_Wide_Maps.Wide_Wide_Character_Set; -- Contains each Wide_Wide_Character value WWC such that Characters.Handling.Is_Character(WWC) is True Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set; -- Contains each Wide_Wide_Character value WWC such that -- Characters.Handling.Is_Wide_Character (WWC) is True [Author's note: the preceding comment is missing ".Handling" in A.4.7(46).] NOTES If a null Wide_Wide_Character_Mapping_Function is passed to any of the Wide_Wide_String handling subprograms, Constraint_Error is propagated. In A.6(1) change: ... packages Text_IO [and Wide_Text_IO]{, Wide_Text_IO, and Wide_Wide_Text_IO} ... In A.7(4) change: ... data, [and] Wide_Text_IO for Wide_Character and Wide_String data {, and Wide_Wide_Text_IO for Wide_Wide_Character and Wide_Wide_String data} ... In A.7(10) change: ... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, and Stream_IO ... In A.7(13) change: ... Direct_IO, Text_IO [and Wide_Text_IO]{, Wide_Text_IO, and Wide_Wide_Text_IO} ... In A.7(15) change: ... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, and Stream_IO ... Replace A.11 by: A.11 Wide Text Input-Output and Wide Wide Text Input-Output The packages Wide_Text_IO and Wide_Wide_Text_IO provide facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters (or wide wide characters) grouped into lines, and as a sequence of lines grouped into pages. Static Semantics The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String. 
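The string-handling packages of A.4.8 and the package Wide_Wide_Text_IO can be tried together in a few lines. The following is a minimal sketch, assuming an implementation that already provides these packages; the names Wide_Wide_Strings_Sketch, Greek_Lower, To_Latin, and Text are hypothetical. The final output contains only Latin-1 characters, so it does not depend on the implementation-defined external encoding of wide wide characters.

```ada
with Ada.Strings.Wide_Wide_Fixed;
with Ada.Strings.Wide_Wide_Maps;  use Ada.Strings.Wide_Wide_Maps;
with Ada.Wide_Wide_Text_IO;

procedure Wide_Wide_Strings_Sketch is
   package Fixed renames Ada.Strings.Wide_Wide_Fixed;
   package WWIO  renames Ada.Wide_Wide_Text_IO;

   --  Code points are written with 'Val so the source stays plain ASCII.
   Alpha : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#03B1#);
   Omega : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#03C9#);

   --  Set holding the range GREEK SMALL LETTER ALPHA .. OMEGA.
   Greek_Lower : constant Wide_Wide_Character_Set :=
     To_Set (Wide_Wide_Character_Range'(Low => Alpha, High => Omega));

   --  Mapping that rewrites alpha as the Latin letter 'a'.
   To_Latin : constant Wide_Wide_Character_Mapping :=
     To_Mapping (From => (1 => Alpha), To => "a");

   Text : constant Wide_Wide_String := "x = " & Alpha & Alpha & ";";
begin
   --  Two elements of Text belong to the set.
   WWIO.Put_Line (Natural'Wide_Wide_Image (Fixed.Count (Text, Greek_Lower)));

   --  The first of them is at index 5.
   WWIO.Put_Line (Natural'Wide_Wide_Image (Fixed.Index (Text, Greek_Lower)));

   --  After translation only Latin-1 characters remain: x = aa;
   WWIO.Put_Line (Fixed.Translate (Text, To_Latin));
end Wide_Wide_Strings_Sketch;
```

The operations used here (To_Set, To_Mapping, Count, Index, Translate) are the Wide_Wide counterparts of the Strings.Maps and Strings.Fixed operations, obtained by the replacement rules listed above.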
Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO. The specification of package Wide_Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Wide_Character, and any occurrence of String is replaced by Wide_Wide_String. Nongeneric equivalents of Wide_Wide_Text_IO.Integer_IO and Wide_Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Wide_Text_IO, Ada.Long_Integer_Wide_Wide_Text_IO, Ada.Float_Wide_Wide_Text_IO, Ada.Long_Float_Wide_Wide_Text_IO. In A.12(1) change: ... Text_IO.Text_Streams [and Wide_Text_IO.Text_Streams]{, Wide_Text_IO.Text_Streams, and Wide_Wide_Text_IO.Text_Streams} ... Add a new section after A.12.3: A.12.4 The Package Wide_Wide_Text_IO.Text_Streams The package Wide_Wide_Text_IO.Text_Streams provides a function for treating a wide wide text file as a stream. Static Semantics The library package Wide_Wide_Text_IO.Text_Streams has the following declaration: with Ada.Streams; package Ada.Wide_Wide_Text_IO.Text_Streams is type Stream_Access is access all Streams.Root_Stream_Type'Class; function Stream (File : in File_Type) return Stream_Access; end Ada.Wide_Wide_Text_IO.Text_Streams; The Stream function has the same effect as the corresponding function in Streams.Stream_IO. Add after B.3(39): -- ISO/IEC 10646:2003 compatible types defined by SC22/WG14 document N1010. 
type char16_t is <implementation-defined character type>; char16_nul : constant char16_t := implementation-defined; function To_C (Item : in Wide_Character) return char16_t; function To_Ada (Item : in char16_t) return Wide_Character; type char16_array is array (size_t range <>) of aliased char16_t; pragma Pack(char16_array); function Is_Nul_Terminated (Item : in char16_array) return Boolean; function To_C (Item : in Wide_String; Append_Nul : in Boolean := True) return char16_array; function To_Ada (Item : in char16_array; Trim_Nul : in Boolean := True) return Wide_String; procedure To_C (Item : in Wide_String; Target : out char16_array; Count : out size_t; Append_Nul : in Boolean := True); procedure To_Ada (Item : in char16_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True); type char32_t is <implementation-defined character type>; char32_nul : constant char32_t := implementation-defined; function To_C (Item : in Wide_Wide_Character) return char32_t; function To_Ada (Item : in char32_t) return Wide_Wide_Character; type char32_array is array (size_t range <>) of aliased char32_t; pragma Pack(char32_array); function Is_Nul_Terminated (Item : in char32_array) return Boolean; function To_C (Item : in Wide_Wide_String; Append_Nul : in Boolean := True) return char32_array; function To_Ada (Item : in char32_array; Trim_Nul : in Boolean := True) return Wide_Wide_String; procedure To_C (Item : in Wide_Wide_String; Target : out char32_array; Count : out size_t; Append_Nul : in Boolean := True); procedure To_Ada (Item : in char32_array; Target : out Wide_Wide_String; Count : out Natural; Trim_Nul : in Boolean := True); In B.3(43) change: The types int, short, long, unsigned, ptrdiff_t, size_t, double, char [, and wchar_t]{, wchar_t, char16_t, and char32_t} correspond respectively to the C types having the same names. Add after B.3(60): function Is_Nul_Terminated (Item : in char16_array) return Boolean; The result of Is_Nul_Terminated is True if Item contains char16_nul, and is False otherwise. 
function To_C (Item : in Wide_Character) return char16_t; function To_Ada (Item : in char16_t) return Wide_Character; To_C and To_Ada provide mappings between the Ada and C 16-bit character types. function To_C (Item : in Wide_String; Append_Nul : in Boolean := True) return char16_array; function To_Ada (Item : in char16_array; Trim_Nul : in Boolean := True) return Wide_String; procedure To_C (Item : in Wide_String; Target : out char16_array; Count : out size_t; Append_Nul : in Boolean := True); procedure To_Ada (Item : in char16_array; Target : out Wide_String; Count : out Natural; Trim_Nul : in Boolean := True); The To_C and To_Ada subprograms that convert between Wide_String and char16_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that char16_nul is used instead of nul. function Is_Nul_Terminated (Item : in char32_array) return Boolean; The result of Is_Nul_Terminated is True if Item contains char32_nul, and is False otherwise. function To_C (Item : in Wide_Wide_Character) return char32_t; function To_Ada (Item : in char32_t) return Wide_Wide_Character; To_C and To_Ada provide mappings between the Ada and C 32-bit character types. function To_C (Item : in Wide_Wide_String; Append_Nul : in Boolean := True) return char32_array; function To_Ada (Item : in char32_array; Trim_Nul : in Boolean := True) return Wide_Wide_String; procedure To_C (Item : in Wide_Wide_String; Target : out char32_array; Count : out size_t; Append_Nul : in Boolean := True); procedure To_Ada (Item : in char32_array; Target : out Wide_Wide_String; Count : out Natural; Trim_Nul : in Boolean := True); The To_C and To_Ada subprograms that convert between Wide_Wide_String and char32_array have analogous effects to the To_C and To_Ada subprograms that convert between String and char_array, except that char32_nul is used instead of nul. 
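A round trip through the 32-bit interface types illustrates the roles of Append_Nul and Trim_Nul. This is a minimal sketch, assuming an implementation whose Interfaces.C already declares char32_t, char32_array, char32_nul, and the associated To_C, To_Ada, and Is_Nul_Terminated subprograms as proposed here; Char32_Sketch, Ada_Text, C_Text, and Back are illustrative names. The 16-bit case would look the same with Wide_String, char16_array, and char16_nul.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with Interfaces.C; use Interfaces.C;

procedure Char32_Sketch is
   --  One character outside the BMP (U+1D11E) followed by 'A'.
   Ada_Text : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#1D11E#), 2 => 'A');

   --  Append_Nul defaults to True, so one extra element is added.
   C_Text : constant char32_array := To_C (Ada_Text);

   --  Trim_Nul defaults to True, so the terminator is dropped again.
   Back : constant Wide_Wide_String := To_Ada (C_Text);
begin
   --  Two characters plus the terminator.
   Put_Line (Integer'Image (C_Text'Length));

   --  The array contains char32_nul, and it is the last element.
   Put_Line (Boolean'Image (Is_Nul_Terminated (C_Text)));         --  TRUE
   Put_Line (Boolean'Image (C_Text (C_Text'Last) = char32_nul));  --  TRUE

   --  The round trip restores the original Ada string.
   Put_Line (Boolean'Image (Back = Ada_Text));                    --  TRUE
end Char32_Sketch;
```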
At the beginning of C.5(7) change: If the pragma applies to an enumeration type, then the semantics of the Wide_Wide_Image and Wide_Wide_Value attributes are implementation defined for that type; the semantics of Image, Wide_Image, Value, and Wide_Value are still defined in terms of Wide_Wide_Image and Wide_Wide_Value... In F(4) change: ... Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing} ... At the beginning of F.3(1) change: The child packages Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing}... Add at the end of F.3(6): ... For Wide_Wide_Text_IO.Editing their types are Wide_Wide_String and Wide_Wide_Character, respectively. In F.3(19) change: ... Text_IO.Decimal_IO [and Wide_Text_IO.Decimal_IO]{, Wide_Text_IO.Decimal_IO and Wide_Wide_Text_IO.Decimal_IO} In F.3(20) change: ... [both] for {all of} Text_IO.Editing [and Wide_Text_IO.Editing]{, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing} ... Add a new section after F.3.4: F.3.5 The Package Wide_Wide_Text_IO.Editing Static Semantics The child package Wide_Wide_Text_IO.Editing has the same contents as Text_IO.Editing, except that: o each occurrence of Character is replaced by Wide_Wide_Character, o each occurrence of Text_IO is replaced by Wide_Wide_Text_IO, o the subtype of Default_Currency is Wide_Wide_String rather than String, and o each occurrence of String in the generic package Decimal_Output is replaced by Wide_Wide_String. NOTES Each of the functions Wide_Wide_Text_IO.Editing.Valid, To_Picture, and Pic_String has String (versus Wide_Wide_String) as its parameter or result subtype, since a picture String is not localizable. Add a new section after G.1.4: G.1.5 The Package Wide_Wide_Text_IO.Complex_IO Static Semantics Implementations shall also provide the generic library package Wide_Wide_Text_IO.Complex_IO. 
Its declaration is obtained from that of Text_IO.Complex_IO by systematically replacing Text_IO by Wide_Wide_Text_IO and String by Wide_Wide_String; the description of its behavior is obtained by additionally replacing references to particular characters (commas, parentheses, etc.) by those for the corresponding wide wide characters. In H.4(20) change: ... Text_IO, Wide_Text_IO {, Wide_Wide_Text_IO}, or Stream_IO ... Fix annex K. [Author's note: I'm pretty sure it's auto-generated...] !discussion See proposal. !example An example would show identifiers using characters from the CJKV ideographs or from non-Latin alphabets (Cyrillic, Greek, Arabic, etc.). But that's hard to do in a Latin-1, plain text file... !comment Introduction clause !corrigendum 0.3(32) @drepl An enumeration type defines an ordered set of distinct enumeration literals, for example a list of states or an alphabet of characters. The enumeration types Boolean, Character, and Wide_Character are predefined. @dby An enumeration type defines an ordered set of distinct enumeration literals, for example a list of states or an alphabet of characters. The enumeration types Boolean, Character, Wide_Character, and Wide_Wide_Character are predefined. !comment Introduction clause !corrigendum 0.3(34) @drepl Composite types allow definitions of structured objects with related components. The composite types in the language include arrays and records. An array is an object with indexed components of the same type. A record is an object with named components of possibly different types. Task and protected types are also forms of composite types. The array types String and Wide_String are predefined. @dby Composite types allow definitions of structured objects with related components. The composite types in the language include arrays and records. An array is an object with indexed components of the same type. A record is an object with named components of possibly different types. 
Task and protected types are also forms of composite types. The array types String, Wide_String, and Wide_Wide_String are predefined. !corrigendum 1.1.4(14) @dinsa @xbullet@fa and @i@fa are both equivalent to @fa alone.> @dinst The delimiters, compound delimiters, reserved words, and @fas are exclusively made of the characters whose code position is between 16#20# and 16#7E#, inclusively. The special characters for which names are defined in this International Standard (see 2.1) belong to the same range. For example, the character E in the definition of exponent is the character whose name is "LATIN CAPITAL LETTER E", not "GREEK CAPITAL LETTER EPSILON". !corrigendum 1.2(8/1) @drepl ISO/IEC 10646-1:1993, @i, supplemented by Technical Corrigendum 1:1996. @dby ISO/IEC 10646:2003, @i !corrigendum 2.1(1) @drepl The only characters allowed outside of @fas are the @fas and @fas. @dby The characters whose code position is 16#FFFE# or 16#FFFF# are not allowed anywhere in the text of a program. The characters in categories @fa, @fa, and @fa are only allowed in @fas. !corrigendum 2.1(2) @ddel @xcode<@fa> !corrigendum 2.1(3) @ddel @xcode<@fa> !corrigendum 2.1(4) @drepl The character repertoire for the text of an Ada program consists of the collection of characters called the Basic Multilingual Plane (BMP) of the ISO 10646 Universal Multiple-Octet Coded Character Set, plus a set of @fas and, in comments only, a set of @fas; the coded representation for these characters is implementation defined (it need not be a representation defined within ISO-10646-1). @dby The character repertoire for the text of an Ada program consists of the collection of characters described by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. The coded representation for these characters is implementation defined (it need not be a representation defined within ISO/IEC 10646:2003). 
The semantics of an Ada program whose text is not in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) are implementation defined. !corrigendum 2.1(5) @drepl The description of the language definition in this International Standard uses the graphic symbols defined for Row 00: Basic Latin and Row 00: Latin-1 Supplement of the ISO 10646 BMP; these correspond to the graphic symbols of ISO 8859-1 (Latin-1); no graphic symbols are used in this International Standard for characters outside of Row 00 of the BMP. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified. @dby The description of the language definition in this International Standard uses the character properties General Category, Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of the documents referenced by the note in section 1 of ISO/IEC 10646:2003. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified. 
!corrigendum 2.1(7) @ddel @xhang<@xterm<@fa> @fa> !corrigendum 2.1(8) @drepl @xhang<@xterm<@fa> Any character of Row 00 of ISO 10646 BMP whose name begins ``Latin Capital Letter''.> @dby @xhang<@xterm<@fa> Any character whose General Category is defined to be "Letter, Uppercase".> !corrigendum 2.1(9) @drepl @xhang<@xterm<@fa> Any character of Row 00 of ISO 10646 BMP whose name begins ``Latin Small Letter''.> @dby @xhang<@xterm<@fa> Any character whose General Category is defined to be "Letter, Lowercase".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Letter, Titlecase".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Letter, Modifier".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Letter, Other".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Mark, Non-Spacing".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Mark, Spacing Combining".> !corrigendum 2.1(10) @drepl @xhang<@xterm<@fa> One of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.> @dby @xhang<@xterm<@fa> Any character whose General Category is defined to be "Number, Decimal Digit".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Number, Letter".> !corrigendum 2.1(11) @ddel @xhang<@xterm<@fa> The character of ISO 10646 BMP named ``Space''.> !corrigendum 2.1(12) @drepl @xhang<@xterm<@fa> Any character of the ISO 10646 BMP that is not reserved for a control function, and is not the @fa, an @fa, or a @fa.> @dby @xhang<@xterm<@fa> Any character whose General Category is defined to be "Other, Control".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Other, Format".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Other, Private Use".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Other, Surrogate".> @xhang<@xterm<@fa> Any character whose General Category is defined to be 
"Punctuation, Connector".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Separator, Space".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Separator, Line".> @xhang<@xterm<@fa> Any character whose General Category is defined to be "Separator, Paragraph".> !corrigendum 2.1(13) @drepl @xhang<@xterm<@fa> The control functions of ISO 6429 called character tabulation (HT), line tabulation (VT), carriage return (CR), line feed (LF), and form feed (FF). @dby @xhang<@xterm<@fa> The characters whose code position is 16#09# (CHARACTER TABULATION), 16#0A# (LINE FEED(LF)), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED(FF)), 16#0D# (CARRIAGE RETURN(CR)), 16#85# (NEXT LINE(NEL)), and the characters in categories @fa and @fa. The names mentioned in parentheses in this list are not defined by ISO/IEC 10646:2003; they are only used for convenience in this International Standard. !corrigendum 2.1(14) @drepl @xhang<@xterm<@fa> Any control function, other than a @fa, that is allowed in a comment; the set of @fas allowed in comments is implementation defined.> @dby @xhang<@xterm<@fa> Any character which is not in the categories @fa, @fa, @fa, @fa, @fa, and whose code position is neither 16#FFFE# nor 16#FFFF#.> !corrigendum 2.1(15) @drepl The following names are used when referring to certain @fas: @dby The following names are used when referring to certain characters (the first name is that given in ISO/IEC 10646:2003): !comment I'm not going to try to update the table here, as it would be very !comment difficult to format properly. Moreover, there is nothing important !comment wrong with it. I'll make the suggested changes as editorial !comment corrections in the Standard. !corrigendum 2.1(16) @ddel In a nonstandard mode, the implementation may support a different character repertoire; in particular, the set of characters that are considered @fas can be extended or changed to conform to local conventions. 
!corrigendum 2.1(17) @ddel @xindent<@s9<1 Every code position of ISO 10646 BMP that is not reserved for a control function is defined to be a graphic_character by this International Standard. This includes all code positions other than 0000 - 001F, 007F - 009F, and FFFE - FFFF.>> !corrigendum 2.2(3) @drepl In some cases an explicit @i is required to separate adjacent lexical elements. A separator is any of a space character, a format effector, or the end of a line, as follows: @dby In some cases an explicit @i is required to separate adjacent lexical elements. A separator is any of a @fa, a @fa or the end of a line, as follows: !corrigendum 2.2(4) @drepl @xbullet, a @fa, or a @fa.> @dby @xbullet is a separator except within a @fa, a @fa, or a @fa.> !corrigendum 2.2(5) @drepl @xbullet.> @dby @xbullet.> !corrigendum 2.2(8) @drepl A delimiter is either one of the following special characters: @dby A delimiter is either one of the following characters: !corrigendum 2.3(02) @drepl @xcode<@fa> @dby @xcode<@fa> @xcode<@fa> @xcode<@fa> !corrigendum 2.3(03) @ddel @xcode<@fa> !corrigendum 2.3(05) @drepl All characters of an @fa are significant, including any underline character. @fas differing only in the use of corresponding upper and lower case letters are considered the same. @dby Two @fas are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order): @xbullet are eliminated.> @xbullet !corrigendum 2.3(06) @dinsa In a nonstandard mode, an implementation may support other upper/lower case equivalence rules for @fas, to accommodate local conventions. @dinst @xindent<@s9s differing only in the use of corresponding upper and lower case letters are considered the same.>> !corrigendum 2.6(06) @dinsa A @i is a @fa with no @fas between the quotation marks. @dinst No modification is performed on the sequence of characters in a @fa. 
!corrigendum 3.5(27) @dinsa @xindent @dinss @xhang<@xterm S'Wide_Wide_Image denotes a function with the following specification:> @xcode< @b S'Wide_Wide_Image(@i : S'Base) @b Wide_Wide_String> @xindent of the value of @i, that is, a sequence of characters representing the value in display form. The lower bound of the result is one.> @xindent @xindent (a value of a character type that has no enumeration literal associated with it), the result is a corresponding language-defined or implementation-defined name in upper case (for example, the image of the nongraphic character identified as @i is "NUL" -- the quotes are not part of the image).> @xindent @xindent !corrigendum 3.5(30) @drepl @xindent of the value of @i, that is, a sequence of characters representing the value in display form. The lower bound of the result is one.> @dby @xindent as a Wide_String, that is, a sequence of characters representing the value in display form. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Wide_Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).> !corrigendum 3.5(31) @ddel @xindent !corrigendum 3.5(32) @ddel @xindent (a value of a character type that has no enumeration literal associated with it), the result is a corresponding language-defined or implementation-defined name in upper case (for example, the image of the nongraphic character identified as @i is "NUL" -- the quotes are not part of the image).> !corrigendum 3.5(33) @ddel @xindent !corrigendum 3.5(34) @ddel @xindent !corrigendum 3.5(37) @drepl @xindent as a String. The lower bound of the result is one. 
The image has the same sequence of graphic characters as that defined for S'Wide_Image if all the graphic characters are defined in Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Image for the same value of @i).> @dby @xindent as a String. The lower bound of the result is one. The image has the same sequence of characters as defined for S'Wide_Wide_Image if all the graphic characters are defined in Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of @i).> @xhang<@xterm S'Wide_Wide_Width denotes the maximum length of a Wide_Wide_String returned by S'Wide_Wide_Image over all the values of S. It denotes zero for a subtype that has a null range. Its type is @i.> !corrigendum 3.5(39) @dinsa @xhang<@xterm S'Width denotes the maximum length of a String returned by S'Image over all values of the subtype S. It denotes zero for a subtype that has a null range. Its type is @i.> @dinss @xhang<@xterm S'Wide_Wide_Value denotes a function with the following specification:> @xcode< @b S'Wide_Wide_Value(@i : Wide_Wide_String) @b S'Base> @xindent @xindent @xindent @xindent @xinbull<@fa> @xinbull<@fa.[@fa]> @xinbull<.numeral[exponent]> @xinbull<@fa#@fa.#[@fa]> @xinbull<@fa#.@fa#[@fa]> @xindent !corrigendum 3.5(43) @drepl @xindent @dby @xindent of type Wide_String is equivalent to a call on S'Wide_Wide_Value for a corresponding @i of type Wide_Wide_String. 
> !corrigendum 3.5(44) @ddel @xindent !corrigendum 3.5(45) @ddel @xindent !corrigendum 3.5(46) @ddel @xinbull<@fa> !corrigendum 3.5(47) @ddel @xinbull<@fa.[@fa]> !corrigendum 3.5(48) @ddel @xinbull<.numeral[exponent]> !corrigendum 3.5(49) @ddel @xinbull<@fa#@fa.#[@fa]> !corrigendum 3.5(50) @ddel @xinbull<@fa#.@fa#[@fa]> !corrigendum 3.5(51) @ddel @xindent !corrigendum 3.5(55) @drepl @xindent of type String is equivalent to a call on S'Wide_Value for a corresponding @i of type Wide_String.> @dby @xindent of type String is equivalent to a call on S'Wide_Wide_Value for a corresponding @i of type Wide_Wide_String.> !corrigendum 3.5(56) @drepl An implementation may extend the Wide_Value, Value, Wide_Image, and Image attributes of a floating point type to support special values such as infinities and NaNs. @dby An implementation may extend the Wide_Wide_Value, Wide_Value, Value, Wide_Wide_Image, Wide_Image, and Image attributes of a floating point type to support special values such as infinities and NaNs. !corrigendum 3.5(59) @drepl @xindent<@s9<21 For any value V (including any nongraphic character) of an enumeration subtype S, S'Value(S'Image(V)) equals V, as does S'Wide_Value(S'Wide_Image(V)). Neither expression ever raises Constraint_Error.>> @dby @xindent<@s9<21 For any value V (including any nongraphic character) of an enumeration subtype S, S'Value(S'Image(V)) equals V, as do S'Wide_Value(S'Wide_Image(V)) and S'Wide_Wide_Value(S'Wide_Wide_Image(V)). None of these expressions ever raise Constraint_Error.>> !corrigendum 3.5.2(2) @drepl The predefined type Character is a character type whose values correspond to the 256 code positions of Row 00 (also known as Latin-1) of the ISO 10646 Basic Multilingual Plane (BMP). Each of the graphic characters of Row 00 of the BMP has a corresponding @fa in Character. 
Each of the nongraphic positions of Row 00 (0000-001F and 007F-009F) has a corresponding language-defined name, which is not usable as an enumeration literal, but which is usable with the attributes (Wide_)Image and (Wide_)Value; these names are given in the definition of type Character in A.1, ``The Package Standard'', but are set in @fa. @dby The predefined type Character is a character type whose values correspond to the 256 code positions of Row 00 (also known as Latin-1) of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of Row 00 of the BMP has a corresponding @fa in Character. Each of the nongraphic positions of Row 00 (0000-001F and 007F-009F) has a corresponding language-defined name, which is not usable as an enumeration literal, but which is usable with the attributes Image, Wide_Image, Wide_Wide_Image, Value, Wide_Value, and Wide_Wide_Value; these names are given in the definition of type Character in A.1, ``The Package Standard'', but are set in @fa. !corrigendum 3.5.2(3) @drepl The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO 10646 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding @fa in Wide_Character. The first 256 values of Wide_Character have the same @fa or language-defined name as defined for Character. The last 2 values of Wide_Character correspond to the nongraphic positions FFFE and FFFF of the BMP, and are assigned the language-defined names @i and @i. As with the other language-defined names for nongraphic characters, the names @i and @i are usable only with the attributes (Wide_)Image and (Wide_)Value; they are not usable as enumeration literals. All other values of Wide_Character are considered graphic characters, and have a corresponding @fa. 
@dby The predefined type Wide_Character is a character type whose values correspond to the 65536 code positions of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP). Each of the graphic characters of the BMP has a corresponding @fa in Wide_Character. The first 256 values of Wide_Character have the same @fa or language-defined name as defined for Character. Each of the @fas has a corresponding @fa. The predefined type Wide_Wide_Character is a character type whose values correspond to the 2147483648 code positions of the ISO/IEC 10646:2003 character set. Each of the @fas has a corresponding @fa in Wide_Wide_Character. The first 65536 values of Wide_Wide_Character have the same @fa or language-defined name as defined for Wide_Character. In types Wide_Character and Wide_Wide_Character, the characters whose code positions are 16#FFFE# and 16#FFFF# are assigned the language-defined names @i and @i. The other characters whose code position is larger than 16#FF# and which are not @fas have language-defined names which are formed by appending to the string "Character_" the representation of their code position in hexadecimal as eight extended digits. As with other language-defined names, these names are usable only with the attributes (Wide_)Wide_Image and (Wide_)Wide_Value; they are not usable as enumeration literals. !corrigendum 3.5.2(4) @drepl In a nonstandard mode, an implementation may provide other interpretations for the predefined types Character and Wide_Character, to conform to local conventions. @dby In a nonstandard mode, an implementation may provide other interpretations for the predefined types Character, Wide_Character, and Wide_Wide_Character to conform to local conventions. !corrigendum 3.5.2(5) @ddel If an implementation supports a mode with alternative interpretations for Character and Wide_Character, the set of graphic characters of Character should nevertheless remain a proper subset of the set of graphic characters of Wide_Character. 
Any character set ``localizations'' should be reflected in the results of the subprograms defined in the language-defined package Characters.Handling (see A.3) available in such a mode. In a mode with an alternative interpretation of Character, the implementation should also support a corresponding change in what is a legal @fa. !corrigendum 3.6.3(2) @drepl There are two predefined string types, String and Wide_String, each indexed by values of the predefined subtype Positive; these are declared in the visible part of package Standard: @dby There are three predefined string types, String, Wide_String, and Wide_Wide_String, each indexed by the value of the predefined subtype Positive; these are declared in the visible part of package Standard: !corrigendum 3.6.3(4) @drepl @xcode<@b String @b (Positive @b <@>) @b Character; @b Wide_String @b (Positive @b <@>) @b Wide_Character;> @dby @xcode<@b String @b (Positive @b <@>) @b Character; @b Wide_String @b (Positive @b <@>) @b Wide_Character; @b Wide_Wide_String @b (Positive @b <@>) @b Wide_Wide_Character;> !corrigendum A.1(36) @drepl @xcode< -- @ft<@i> -- @ft<@i> -- @ft<@i> -- @ft<@i> @b Wide_Character @b (@i, @i ... @i, @i); @b ASCII @b ... @b ASCII; --@ft<@i>> @dby @xcode< -- @ft<@i> -- @ft<@i> -- @ft<@i> -- @ft<@i> @b Wide_Character @b (@i, @i ... @i, @i); -- @ft<@i> -- @ft<@i> -- @ft<@i> @b Wide_Wide_Character @b (@i, @i ... @i, @i, ...); @b ASCII @b ... @b ASCII; --@ft<@i>> !corrigendum A.1(42) @drepl @xcode< -- @ft<@i>> @dby @xcode< -- @ft<@i> @b Wide_Wide_String @b (Positive @b <@>) of Wide_Wide_Character; @b Pack (Wide_Wide_String); -- @ft<@i>> !corrigendum A.1(49) @drepl In each of the types Character and Wide_Character, the character literals for the space character (position 32) and the non-breaking space character (position 160) correspond to different values. Unless indicated otherwise, each occurrence of the character literal ' ' in this International Standard refers to the space character. 
Similarly, the character literals for hyphen (position 45) and soft hyphen (position 173) correspond to different values. Unless indicated otherwise, each occurrence of the character literal '-' in this International Standard refers to the hyphen character. @dby In each of the types Character, Wide_Character, and Wide_Wide_Character, the character literals for the space character (position 32) and the non-breaking space character (position 160) correspond to different values. Unless indicated otherwise, each occurrence of the character literal ' ' in this International Standard refers to the space character. Similarly, the character literals for hyphen (position 45) and soft hyphen (position 173) correspond to different values. Unless indicated otherwise, each occurrence of the character literal '-' in this International Standard refers to the hyphen character. !corrigendum A.3(1) @drepl This clause presents the packages related to character processing: an empty pure package Characters and child packages Characters.Handling and Characters.Latin_1. The package Characters.Handling provides classification and conversion functions for Character data, and some simple functions for dealing with Wide_Character data. The child package Characters.Latin_1 declares a set of constants initialized to values of type Character. @dby This clause presents the packages related to character processing: an empty pure package Characters and child packages Characters.Handling and Characters.Latin_1. The package Characters.Handling provides classification and conversion functions for Character data, and some simple functions for dealing with Wide_Character and Wide_Wide_Character data. The child package Characters.Latin_1 declares a set of constants initialized to values of type Character. 
!corrigendum A.3.2(13) @drepl @xcode< --@ft<@i>> @dby @xcode< --@ft<@i>> !corrigendum A.3.2(14) @dinsa @xcode< @b Is_Character (Item : @b Wide_Character) @b Boolean; @b Is_String (Item : @b Wide_String) @b Boolean;> @dinst @xcode< @b Is_Character (Item : @b Wide_Wide_Character) @b Boolean; @b Is_String (Item : @b Wide_Wide_String) @b Boolean; @b Is_Wide_Character (Item : @b Wide_Wide_Character) @b Boolean; @b Is_Wide_String (Item : @b Wide_Wide_String) @b Boolean;> !corrigendum A.3.2(16) @dinsa @xcode< @b To_String (Item : @b Wide_String; Substitute : @b Character := ' ') @b String;> @dinst @xcode< @b To_Character (Item : @b Wide_Wide_Character; Substitute : @b Character := ' ') @b Character; @b To_String (Item : @b Wide_Wide_String; Substitute : @b Character := ' ') @b String;> !corrigendum A.3.2(18) @dinsa @xcode< @b To_Wide_String (Item : @b String) @b Wide_String;> @dinss @xcode< @b To_Wide_Character (Item : @b Wide_Wide_Character; Substitute : @b Wide_Character := ' ') @b Wide_Character; @b To_Wide_String (Item : @b Wide_Wide_String; Substitute : @b Wide_Character := ' ') @b Wide_String; @b To_Wide_Wide_Character (Item : @b Character) @b Wide_Wide_Character; @b To_Wide_Wide_String (Item : @b String) @b Wide_Wide_String; @b To_Wide_Wide_Character (Item : @b Wide_Character) @b Wide_Wide_Character; @b To_Wide_Wide_String (Item : @b Wide_String) @b Wide_Wide_String;> !corrigendum A.3.2(42) @drepl The following set of functions test Wide_Character values for membership in Character, or convert between corresponding characters of Wide_Character and Character. @dby The following functions test Wide_Wide_Character or Wide_Character values for membership in Wide_Character or Character, or convert between corresponding characters of Wide_Wide_Character, Wide_Character, and Character. 
!corrigendum A.3.2(43) @drepl @xhang<@xterm Returns True if Wide_Character'Pos(Item) <= Character'Pos(Character'Last).> @dby @xcode<@b Is_Character (Item : @b Wide_Character) @b Boolean;> @xindent @xcode<@b Is_Character (Item : @b Wide_Wide_Character) @b Boolean;> @xindent @xcode<@b Is_Wide_Character (Item : @b Wide_Wide_Character) @b Boolean;> @xindent !corrigendum A.3.2(44) @drepl @xhang<@xterm Returns True if Is_Character(Item(I)) is True for each I in Item'Range.> @dby @xcode<@b Is_String (Item : @b Wide_String) @b Boolean; @b Is_String (Item : @b Wide_Wide_String) @b Boolean;> @xindent @xcode<@b Is_Wide_String (Item : @b Wide_Wide_String) @b Boolean;> @xindent !corrigendum A.3.2(45) @drepl @xhang<@xterm Returns the Character corresponding to Item if Is_Character(Item), and returns the Substitute Character otherwise.> @dby @xcode<@b To_Character (Item : @b Wide_Character; Substitute : @b Character := ' ') @b Character; @b To_Character (Item : @b Wide_Wide_Character; Substitute : @b Character := ' ') @b Character;> @xindent @xcode<@b To_Wide_Character (Item : @b Character) @b Wide_Character;> @xindent @xcode<@b To_Wide_Character (Item : @b Wide_Wide_Character; Substitute : @b Wide_Character := ' ') @b Wide_Character;> @xindent @xcode<@b To_Wide_Wide_Character (Item : @b Character) @b Wide_Wide_Character;> @xindent @xcode<@b To_Wide_Wide_Character (Item : @b Wide_Character) @b Wide_Wide_Character;> @xindent !corrigendum A.3.2(46) @drepl @xhang<@xterm Returns the String whose range is 1..Item'Length and each of whose elements is given by To_Character of the corresponding element in Item.> @dby @xcode<@b To_String (Item : @b Wide_String; Substitute : @b Character := ' ') @b String; @b To_String (Item : @b Wide_Wide_String; Substitute : @b Character := ' ') @b String;> @xindent @xcode<@b To_Wide_String (Item : @b String) @b Wide_String;> @xindent @xcode<@b To_Wide_String (Item : @b Wide_Wide_String; Substitute : @b Wide_Character := ' ') @b Wide_String;> @xindent 
@xcode<@b To_Wide_Wide_String (Item : @b String) @b Wide_Wide_String; @b To_Wide_Wide_String (Item : @b Wide_String) @b Wide_Wide_String;> @xindent !corrigendum A.3.2(47) @ddel @xhang<@xterm Returns the Wide_Character X such that Character'Pos(Item) = Wide_Character'Pos(X).> !corrigendum A.3.2(48) @ddel @xhang<@xterm !corrigendum A.3.2(49) @ddel If an implementation provides a localized definition of Character or Wide_Character, then the effects of the subprograms in Characters.Handling should reflect the localizations. See also 3.5.2. !corrigendum A.4(1) @drepl This clause presents the specifications of the package Strings and several child packages, which provide facilities for dealing with string data. Fixed-length, bounded-length, and unbounded-length strings are supported, for both String and Wide_String. The string-handling subprograms include searches for pattern strings and for characters in program-specified sets, translation (via a character-to-character mapping), and transformation (replacing, inserting, overwriting, and deleting of substrings). @dby This clause presents the specifications of the package Strings and several child packages, which provide facilities for dealing with string data. Fixed-length, bounded-length, and unbounded-length strings are supported, for String, Wide_String, and Wide_Wide_String. The string-handling subprograms include searches for pattern strings and for characters in program-specified sets, translation (via a character-to-character mapping), and transformation (replacing, inserting, overwriting, and deleting of substrings). 
!corrigendum A.4.1(4) @drepl @xcode< Space : @b Character := ' '; Wide_Space : @b Wide_Character := ' ';> @dby @xcode< Space : @b Character := ' '; Wide_Space : @b Wide_Character := ' '; Wide_Wide_Space : @b Wide_Wide_Character := ' ';> !corrigendum A.4.7(46) @drepl @xcode< Character_Set : @b Wide_Maps.Wide_Character_Set; -- @ft<@i>> @dby @xcode< Character_Set : @b Wide_Maps.Wide_Character_Set; -- @ft<@i>> !comment Updated for AI-161 change. !corrigendum A.4.8(01) @dinsc Facilities for handling strings of Wide_Wide_Character elements are found in the packages Strings.Wide_Wide_Maps, Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, Strings.Wide_Wide_Unbounded, and Strings.Wide_Wide_Maps.Wide_Wide_Constants. They provide the same string-handling operations as the corresponding packages for strings of Character elements. @i<@s8> The package Strings.Wide_Wide_Maps has the following declaration. @xcode<@b Ada.Strings.Wide_Wide_Maps @b @b Preelaborate(Wide_Wide_Maps); -- Representation for a set of Wide_Wide_Character values: @b Wide_Wide_Character_Set @b; @b Preelaborable_Initialization(Wide_Wide_Character_Set); Null_Set : @b Wide_Wide_Character_Set; @b Wide_Wide_Character_Range @b @b Low : Wide_Wide_Character; High : Wide_Wide_Character; @b; -- @ft<@i> @b Wide_Wide_Character_Ranges @b (Positive @b <@>) @b Wide_Wide_Character_Range; @b To_Set (Ranges : @b Wide_Wide_Character_Ranges) @b Wide_Wide_Character_Set; @b To_Set (Span : @b Wide_Wide_Character_Range) @b Wide_Wide_Character_Set; @b To_Ranges (Set : @b Wide_Wide_Character_Set) @b Wide_Wide_Character_Ranges; @b "=" (Left, Right : @b Wide_Wide_Character_Set) @b Boolean; @b "@b" (Right : @b Wide_Wide_Character_Set) @b Wide_Wide_Character_Set; @b "@b" (Left, Right : @b Wide_Wide_Character_Set) @b Wide_Wide_Character_Set; @b "@b" (Left, Right : @b Wide_Wide_Character_Set) @b Wide_Wide_Character_Set; @b "@b" (Left, Right : @b Wide_Wide_Character_Set) @b Wide_Wide_Character_Set; @b "-" (Left, Right : @b 
Wide_Wide_Character_Set) @b Wide_Wide_Character_Set; @b Is_In (Element : @b Wide_Wide_Character; Set : @b Wide_Wide_Character_Set) @b Boolean; @b Is_Subset (Elements : @b Wide_Wide_Character_Set; Set : @b Wide_Wide_Character_Set) @b Boolean; @b "<=" (Left : @b Wide_Wide_Character_Set; Right : @b Wide_Wide_Character_Set) @b Boolean @b Is_Subset; -- @ft<@i> sub@b Wide_Wide_Character_Sequence @b Wide_Wide_String; @b To_Set (Sequence : @b Wide_Wide_Character_Sequence) @b Wide_Wide_Character_Set; @b To_Set (Singleton : @b Wide_Wide_Character) @b Wide_Wide_Character_Set; @b To_Sequence (Set : @b Wide_Wide_Character_Set) @b Wide_Wide_Character_Sequence; -- @ft<@i> -- @ft<@i> @b Wide_Wide_Character_Mapping @b; @b Preelaborable_Initialization(Wide_Wide_Character_Mapping); @b Value (Map : @b Wide_Wide_Character_Mapping; Element : @b Wide_Wide_Character) @b Wide_Wide_Character; Identity : @b Wide_Wide_Character_Mapping; @b To_Mapping (From, To : @b Wide_Wide_Character_Sequence) @b Wide_Wide_Character_Mapping; @b To_Domain (Map : @b Wide_Wide_Character_Mapping) @b Wide_Wide_Character_Sequence; @b To_Range (Map : @b Wide_Wide_Character_Mapping) @b Wide_Wide_Character_Sequence; @b Wide_Wide_Character_Mapping_Function @b @b (From : @b Wide_Wide_Character) @b Wide_Wide_Character; @b ... -- @ft<@i> @b Ada.Strings.Wide_Wide_Maps;> The context clause for each of the packages Strings.Wide_Wide_Fixed, Strings.Wide_Wide_Bounded, and Strings.Wide_Wide_Unbounded identifies Strings.Wide_Wide_Maps instead of Strings.Maps. 
For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and Strings.Maps.Constants the corresponding wide wide string package has the same contents except that @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet @xbullet The following additional declarations are present in Strings.Wide_Wide_Maps.Wide_Wide_Constants: @xcode< Character_Set : @b Wide_Wide_Maps.Wide_Wide_Character_Set; -- @ft<@i> -- @ft<@i> Wide_Character_Set : @b Wide_Wide_Maps.Wide_Wide_Character_Set; -- @ft<@i> --@ft<@i< Characters.Handling.Is_Wide_Character(WWC) is True>>> @xindent<@s9> !corrigendum A.6(01) @drepl Input-output is provided through language-defined packages, each of which is a child of the root package Ada. The generic packages Sequential_IO and Direct_IO define input-output operations applicable to files containing elements of a given type. The generic package Storage_IO supports reading from and writing to an in-memory buffer. Additional operations for text input-output are supplied in the packages Text_IO and Wide_Text_IO. Heterogeneous input-output is provided through the child packages Streams.Stream_IO and Text_IO.Text_Streams (see also 13.13). The package IO_Exceptions defines the exceptions needed by the predefined input-output packages. @dby Input-output is provided through language-defined packages, each of which is a child of the root package Ada. The generic packages Sequential_IO and Direct_IO define input-output operations applicable to files containing elements of a given type. The generic package Storage_IO supports reading from and writing to an in-memory buffer. Additional operations for text input-output are supplied in the packages Text_IO, Wide_Text_IO, and Wide_Wide_Text_IO. Heterogeneous input-output is provided through the child packages Streams.Stream_IO and Text_IO.Text_Streams (see also 13.13). 
The package IO_Exceptions defines the exceptions needed by the predefined input-output packages. !corrigendum A.7(04) @drepl Input-output for direct access files is likewise defined by a generic package called Direct_IO. Input-output in human-readable form is defined by the (nongeneric) packages Text_IO for Character and String data, and Wide_Text_IO for Wide_Character and Wide_String data. Input-output for files containing streams of elements representing values of possibly different types is defined by means of the (nongeneric) package Streams.Stream_IO. @dby Input-output for direct access files is likewise defined by a generic package called Direct_IO. Input-output in human-readable form is defined by the (nongeneric) packages Text_IO for Character and String data, Wide_Text_IO for Wide_Character and Wide_String data, and Wide_Wide_Text_IO for Wide_Wide_Character and Wide_Wide_String data. Input-output for files containing streams of elements representing values of possibly different types is defined by means of the (nongeneric) package Streams.Stream_IO. !corrigendum A.7(10) @drepl @xcode<@b File_Mode @b (In_File, Out_File, Append_File); -- @ft<@i>> @dby @xcode<@b File_Mode @b (In_File, Out_File, Append_File); -- @ft<@i>> !corrigendum A.7(13) @drepl Several file management operations are common to Sequential_IO, Direct_IO, Text_IO, and Wide_Text_IO. These operations are described in subclause A.8.2 for sequential and direct files. Any additional effects concerning text input-output are described in subclause A.10.2. @dby Several file management operations are common to Sequential_IO, Direct_IO, Text_IO, Wide_Text_IO, and Wide_Wide_Text_IO. These operations are described in subclause A.8.2 for sequential and direct files. Any additional effects concerning text input-output are described in subclause A.10.2. !corrigendum A.7(15) @drepl @xindent<@s9<18 Each instantiation of the generic packages Sequential_IO and Direct_IO declares a different type File_Type. 
In the case of Text_IO, Wide_Text_IO, and Streams.Stream_IO, the corresponding type File_Type is unique.>> @dby @xindent<@s9<18 Each instantiation of the generic packages Sequential_IO and Direct_IO declares a different type File_Type. In the case of Text_IO, Wide_Text_IO, Wide_Wide_Text_IO, and Streams.Stream_IO, the corresponding type File_Type is unique.>> !corrigendum A.11(00) @drepl Wide Text Input-Output @dby Wide Text Input-Output and Wide Wide Text Input-Output !corrigendum A.11(01) @drepl The package Wide_Text_IO provides facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters grouped into lines, and as a sequence of lines grouped into pages. @dby The packages Wide_Text_IO and Wide_Wide_Text_IO provide facilities for input and output in human-readable form. Each file is read or written sequentially, as a sequence of wide characters (or wide wide characters) grouped into lines, and as a sequence of lines grouped into pages. !corrigendum A.11(02) @drepl The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String. @dby The specification of package Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Character, and any occurrence of String is replaced by Wide_String. Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO.
!corrigendum A.11(03) @drepl Nongeneric equivalents of Wide_Text_IO.Integer_IO and Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Float_Wide_Text_IO, Ada.Long_Float_Wide_Text_IO. @dby The specification of package Wide_Wide_Text_IO is the same as that for Text_IO, except that in each Get, Look_Ahead, Get_Immediate, Get_Line, Put, and Put_Line procedure, any occurrence of Character is replaced by Wide_Wide_Character, and any occurrence of String is replaced by Wide_Wide_String. Nongeneric equivalents of Wide_Wide_Text_IO.Integer_IO and Wide_Wide_Text_IO.Float_IO are provided (as for Text_IO) for each predefined numeric type, with names such as Ada.Integer_Wide_Wide_Text_IO, Ada.Long_Integer_Wide_Wide_Text_IO, Ada.Float_Wide_Wide_Text_IO, Ada.Long_Float_Wide_Wide_Text_IO. !corrigendum A.12(01) @drepl The packages Streams.Stream_IO, Text_IO.Text_Streams, and Wide_Text_IO.Text_Streams provide stream-oriented operations on files. @dby The packages Streams.Stream_IO, Text_IO.Text_Streams, Wide_Text_IO.Text_Streams, and Wide_Wide_Text_IO.Text_Streams provide stream-oriented operations on files. !corrigendum A.12.4(01) @dinsc The package Wide_Wide_Text_IO.Text_Streams provides a function for treating a wide wide text file as a stream. @i<@s8> The library package Wide_Wide_Text_IO.Text_Streams has the following declaration: @xcode<@b Ada.Streams; @b Ada.Wide_Wide_Text_IO.Text_Streams @b @b Stream_Access @b Streams.Root_Stream_Type'Class; @b Stream (File : @b File_Type) @b Stream_Access; @b Ada.Wide_Wide_Text_IO.Text_Streams;> The Stream function has the same effect as the corresponding function in Streams.Stream_IO. 
!corrigendum B.3(39) @dinsa @xcode< @b To_Ada (Item : @b wchar_array; Target : @b Wide_String; Count : @b Natural; Trim_Nul : @b Boolean := True);> @dinss @xcode< -- @ft<@i> @b char16_t @b @ft<@i<>>; char16_nul : @b char16_t := @ft<@i<>; @b To_C (Item : @b Wide_Character) @b char16_t; @b To_Ada (Item : @b char16_t) @b Wide_Character; @b char16_array @b (size_t @b <@>) @b char16_t; @b Pack(char16_array); @b Is_Nul_Terminated (Item : @b char16_array) @b Boolean; @b To_C (Item : @b Wide_String; Append_Nul : @b Boolean := True) @b char16_array; @b To_Ada (Item : @b char16_array; Trim_Nul : @b Boolean := True) @b Wide_String; @b To_C (Item : @b Wide_String; Target : @b char16_array; Count : @b size_t; Append_Nul : @b Boolean := True); @b To_Ada (Item : @b char16_array; Target : @b Wide_String; Count : @b Natural; Trim_Nul : @b Boolean := True); @b char32_t @b @ft<@i<>>; char32_nul : @b char32_t := @ft<@i<>; @b To_C (Item : @b Wide_Wide_Character) @b char32_t; @b To_Ada (Item : @b char32_t) @b Wide_Wide_Character; @b char32_array @b (size_t @b <@>) @b char32_t; @b Pack(char32_array); @b Is_Nul_Terminated (Item : @b char32_array) @b Boolean; @b To_C (Item : @b Wide_Wide_String; Append_Nul : @b Boolean := True) @b char32_array; @b To_Ada (Item : @b char32_array; Trim_Nul : @b Boolean := True) @b Wide_Wide_String; @b To_C (Item : @b Wide_Wide_String; Target : @b char32_array; Count : @b size_t; Append_Nul : @b Boolean := True); @b To_Ada (Item : @b char32_array; Target : @b Wide_Wide_String; Count : @b Natural; Trim_Nul : @b Boolean := True);> !corrigendum B.3(43) @drepl The types int, short, long, unsigned, ptrdiff_t, size_t, double, char, and wchar_t correspond respectively to the C types having the same names. The types signed_char, unsigned_short, unsigned_long, unsigned_char, C_float, and long_double correspond respectively to the C types signed char, unsigned short, unsigned long, unsigned char, float, and long double. 
@dby The types int, short, long, unsigned, ptrdiff_t, size_t, double, char, wchar_t, char16_t, and char32_t correspond respectively to the C types having the same names. The types signed_char, unsigned_short, unsigned_long, unsigned_char, C_float, and long_double correspond respectively to the C types signed char, unsigned short, unsigned long, unsigned char, float, and long double. !corrigendum B.3(60) @dinsa @xindent @dinss @xcode<@b Is_Nul_Terminated (Item : @b char16_array) @b Boolean;> @xindent @xcode<@b To_C (Item : @b Wide_Character) @b char16_t; @b To_Ada (Item : @b char16_t ) @b Wide_Character;> @xindent @xcode<@b To_C (Item : @b Wide_String; Append_Nul : @b Boolean := True) @b char16_array; @b To_Ada (Item : @b char16_array; Trim_Nul : @b Boolean := True) @b Wide_String; @b To_C (Item : @b Wide_String; Target : @b char16_array; Count : @b size_t; Append_Nul : @b Boolean := True); @b To_Ada (Item : @b char16_array; Target : @b Wide_String; Count : @b Natural; Trim_Nul : @b Boolean := True);> @xindent @xcode<@b Is_Nul_Terminated (Item : @b char32_array) @b Boolean;> @xindent @xcode<@b To_C (Item : @b Wide_Wide_Character) @b char32_t; @b To_Ada (Item : @b char32_t ) @b Wide_Wide_Character;> @xindent @xcode<@b To_C (Item : @b Wide_Wide_String; Append_Nul : @b Boolean := True) @b char32_array; @b To_Ada (Item : @b char32_array; Trim_Nul : @b Boolean := True) @b Wide_Wide_String; @b To_C (Item : @b Wide_Wide_String; Target : @b char32_array; Count : @b size_t; Append_Nul : @b Boolean := True); @b To_Ada (Item : @b char32_array; Target : @b Wide_Wide_String; Count : @b Natural; Trim_Nul : @b Boolean := True);> @xindent !corrigendum C.5(7) @drepl If the pragma applies to an enumeration type, then the semantics of the Wide_Image and Wide_Value attributes are implementation defined for that type; the semantics of Image and Value are still defined in terms of Wide_Image and Wide_Value. In addition, the semantics of Text_IO.Enumeration_IO are implementation defined. 
If the pragma applies to a tagged type, then the semantics of the Tags.Expanded_Name function are implementation defined for that type. If the pragma applies to an exception, then the semantics of the Exceptions.Exception_Name function are implementation defined for that exception. @dby If the pragma applies to an enumeration type, then the semantics of the Wide_Wide_Image and Wide_Wide_Value attributes are implementation defined for that type; the semantics of Image, Wide_Image, Value, and Wide_Value are still defined in terms of Wide_Wide_Image and Wide_Wide_Value. In addition, the semantics of Text_IO.Enumeration_IO are implementation defined. If the pragma applies to a tagged type, then the semantics of the Tags.Expanded_Name function are implementation defined for that type. If the pragma applies to an exception, then the semantics of the Exceptions.Exception_Name function are implementation defined for that exception. !corrigendum F(4) @drepl @xbullet @dby @xbullet !corrigendum F.3(1) @drepl The child packages Text_IO.Editing and Wide_Text_IO.Editing provide localizable formatted text output, known as @i , for decimal types. An edited output string is a function of a numeric value, program-specifiable locale elements, and a format control value. The numeric value is of some decimal type. The locale elements are: @dby The child packages Text_IO.Editing, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing provide localizable formatted text output, known as @i, for decimal types. An edited output string is a function of a numeric value, program-specifiable locale elements, and a format control value. The numeric value is of some decimal type. The locale elements are: !corrigendum F.3(6) @drepl For Text_IO.Editing the edited output and currency strings are of type String, and the locale characters are of type Character. For Wide_Text_IO.Editing their types are Wide_String and Wide_Character, respectively. 
@dby For Text_IO.Editing the edited output and currency strings are of type String, and the locale characters are of type Character. For Wide_Text_IO.Editing their types are Wide_String and Wide_Character, respectively. For Wide_Wide_Text_IO.Editing their types are Wide_Wide_String and Wide_Wide_Character, respectively. !corrigendum F.3(19) @drepl The generic packages Text_IO.Decimal_IO and Wide_Text_IO.Decimal_IO (see A.10.9, ''Input-Output for Real Types'') provide text input and non-edited text output for decimal types. @dby The generic packages Text_IO.Decimal_IO, Wide_Text_IO.Decimal_IO, and Wide_Wide_Text_IO.Decimal_IO (see A.10.9, ''Input-Output for Real Types'') provide text input and non-edited text output for decimal types. !corrigendum F.3(20) @drepl @xindent<@s9<2 A picture String is of type Standard.String, both for Text_IO.Editing and Wide_Text_IO.Editing.>> @dby @xindent<@s9<2 A picture String is of type Standard.String, for all of Text_IO.Editing, Wide_Text_IO.Editing, and Wide_Wide_Text_IO.Editing.>> !corrigendum F.3.5(01) @dinsc @i<@s8> The child package Wide_Wide_Text_IO.Editing has the same contents as Text_IO.Editing, except that: @xbullet @xbullet @xbullet @xindent<@s9> !corrigendum G.1.5(01) @dinsc @i<@s8> Implementations shall also provide the generic library package Wide_Wide_Text_IO.Complex_IO. Its declaration is obtained from that of Text_IO.Complex_IO by systematically replacing Text_IO by Wide_Wide_Text_IO and String by Wide_Wide_String; the description of its behavior is obtained by additionally replacing references to particular characters (commas, parentheses, etc.) by those for the corresponding wide wide characters. 
!corrigendum H.4(20) @drepl @xhang<@xterm Semantic dependence on any of the library units Sequential_IO, Direct_IO, Text_IO, Wide_Text_IO, or Stream_IO is not allowed.> @dby @xhang<@xterm Semantic dependence on any of the library units Sequential_IO, Direct_IO, Text_IO, Wide_Text_IO, Wide_Wide_Text_IO, or Stream_IO is not allowed.>

!ACATS test

ACATS tests need to be constructed for these facilities.

!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002 4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a variant of the Latin 1 character set that is designated Latin 9. A web page describing Latin 9 can be viewed at:

http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor differences between Latin 1 and Latin 9:

ISO Latin 9 as compared with ISO Latin 1

The ISO Latin 9 (ISO 8859-15) character set differs from the well-known ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro sign and some national letters used e.g. in French and Finnish have been introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9, analogous to Ada.Characters.Latin_1, to define character constants for this new character set.

Robert Dewar asked me to post the following remarks from him re Latin-9 and Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official names of all characters, and I have copied its abbreviated naming style for the new characters in Latin-9.

I have a gripe with the RM here. The setup for Ada.Characters.Latin_1 is to have separate packages for separate character sets, which makes perfectly good sense:

27 An implementation may provide additional packages as children of Ada.Characters, to declare names for the symbols of the local character set or other character sets.
But for Characters.Handling, we have the odd statement:

49 If an implementation provides a localized definition of Character or Wide_Character, then the effects of the subprograms in Characters.Handling should reflect the localizations. See also 3.5.2.

which implies that some mysterious transformation happens on this package (under what circumstances?)

I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd for the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program, whereas the design for Ada.Characters.Latin_1 etc seems to accommodate this.

My recommendation is that an implementation generate separate packages, called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002 5:05 PM

> The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
> ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
> sign and some national letters used e.g. in French and Finnish have been
> introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and that we now can write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002 1:44 AM

Better put that on the agenda for the next ARG. Ada 2005 should use Latin 9 rather than Latin 1. A minor change. Might be a few incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002 12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to write French makes it look very interesting to me.
On the other hand, it is not too useful for Ada to support Latin 9 if the OSes don't: if I emit the character OE and it prints out as 1/4 on my screen, I didn't gain much. So while I agree that we should consider supporting Latin 9 _in_addition_ to Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I am ready to bet that we will still have Latin 1 OSes ten years from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002 1:33 AM

It was somewhat of a jokey suggestion as I am sure you are aware.

Indeed I had a big problem when writing my book and displaying the type Character. I wrote it in QuarkXpress on a PC and it was fine. The publishers moved it to a Mac before printing and some characters came out wrong. One of them came out as a picture of an apple. Moreover, someone had bitten a lump out of it. So much for standards I thought.

But supporting Latin-9 would be nice. All those adverts on the Paris Metro for eating an oeuf can then be printed properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002 1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual where type Character lives. I think Randy had some trouble with the updated RM, too. At least we didn't try to show type Wide_Character in its full glory. ;-)

7-bit ascii will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002 2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose from. ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g.
Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling > being a renaming of Ada.Characters.Handling_Latin_1 perhaps?) That makes sense. But I think the RM statement you complain about is envisioning a nonstandard version of Standard.[Wide_]Character, which is a separate issue. I don't see that as a big deal -- if you don't think it's a good idea, don't implement any such thing. I tend to agree that compiler switches and the like shouldn't normally be meddling with the semantics of packages Standard and Characters.Handling without a very good reason. ************************************************************* From: Florian Weimer Sent: Friday, January 18, 2002 6:58 AM > But I think the RM statement you complain about is envisioning a > nonstandard version of Standard.[Wide_]Character, which is a separate > issue. If you use Latin 9 for Standard.Character, this is certainly a nonstandard version, and Ada.Characters.Handling has to be modified to remain useful. ************************************************************* From: Florian Weimer Sent: Friday, January 18, 2002 6:58 AM > Better put that on the agenda for the next ARG. Ada 2005 > should use Latin 9 rather than Latin 1. A minor change. > Might be a few incompatibilities. I disagree. With Latin 9, the mapping from Character to Wide_Character is less straightforward, and this could have unexpected results. OTOH, it seems that Wide_Character is not widely used (unless you are forced to do so by ASIS), so this might not matter much. In addition, we really should add Wide_Wide_Character (which covers the sixteen additional planes), or make Wide_Character itself wider. Otherwise, using Unicode with standard Ada will be rather painful. ************************************************************* From: Florian Weimer Sent: Saturday, April 20, 2002 3:18 AM ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits, and 10646-2:2001 allocates characters outside the Basic Multilingual Plane.
Not too long ago, quite a few people assumed that characters beyond the BMP would be interesting only for rather esoteric scholarly use (Linear B is a perfect example). However, we now have got at least different sets of code positions outside the BMP which will see more widespread use eventually: the mathematical alphabets and Plane 14 Language Tags (which are required to make some Japanese people happy who fear that Japanese characters are rendered using Chinese glyphs). Therefore, I think Ada 200X should somehow support characters outside the BMP. A few random thoughts (sorry, I'm probably not using strict ISO 10646 terminology): * Several major vendors have adopted ISO 10646-1:1993 early, using a 16 bit representation for characters (i.e. wchar_t in C is 16 bits). These vendors include Sun (Java) and Microsoft (Windows), and probably most proprietary UNIX vendors. These vendor implementations now cover the code positions beyond the BMP using UTF-16, which uses surrogate pairs (a single character is represented using two 16 bit values from reserved ranges in the BMP). UTF-16 has got a few drawbacks: the ordering (in terms UCS code positions) is no longer lexicographic (which leads us to such brain damage as CESU-8), dealing with individual characters is complicated, and you cannot implement the C wide character functions properly. For Ada, numerous changes would be required if we want to expose the UTF-16 representation to programmers, for example by declaring Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no longer be arrays of characters indexed by position). 
GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS characters in a single 32 bit value, that is, UTF-32), and while this is certainly not the "industry standard" (it is encouraged by ISO 9899:1999, though), I really hope we can use this approach (UTF-32 internal representation) for Ada, as it simplifies things considerably, especially if we want to add character properties support (see below). * We could add Wide_Wide_Character and Wide_Wide_String types to package Standard (and extending the Ada.Strings hierarchy), which are encoded in UTF-32. I don't know if this is necessary. IIRC, Robert Dewar once told that the only applications using Wide_Character are based on ASIS, where using Wide_Character is not really voluntary. Maybe it is possible to bump Wide_Character'Size to 32 bits instead, without really breaking backwards compatibility. Of course, we would need a way to convert UTF-32 strings to UTF-16 strings and vice versa (the UTF-16 string type could become a second-class citizen, though, without full support in the Ada.Strings hierarchy). * External representation of UCS characters is rapidly moving towards UTF-8 (especially in Internet standards). Ada should provide an interface for converting between the wide string type(s) and UTF-8 octet sequences. It should be possible to use string literals where UTF-8 strings are expected. * Supporting higher levels of Unicode (e.g. accessing the character properties database, normalization forms) would be interesting, too. Such documents will eventually follow in the ISO 10646 series, but I don't know if the ISO standard will be ready for Ada 200X. Currently, only the Unicode Consortium has standardized or documented issues like character properties or terminal behavior in detail. I don't know how ISO reacts if ISO standards refer to competing standardization efforts.
IEEE POSIX.1 (and probably, or already, ISO POSIX) standardizes the BSD sockets interface, and not OSI, so maybe this isn't an issue. In any case, this point is mostly a library issue which can be addressed by a community implementation effort, it does not require changes in the Ada language (adding Wide_Wide_Character does, for example). ************************************************************* From: Pascal Leroy Sent: Monday, April 22, 2002 8:32 AM > ISO 10636-1:2000 extends the Universal Character Set beyond 16 bits, > and 10646-2:2001 allocates characters outside the Basic Multilingual > Plane. > > Therefore, I think Ada 200X should somehow support characters outside > the BMP. The normalization of new character sets (both as part of 10646 and of 8859) was actually discussed at the last ARG meeting, and I was given an action item to somehow integrate them in the language, probably as some kind of amendment AI. > A few random thoughts (sorry, I'm probably not using strict ISO 10646 > terminology): > > * Several major vendors have adopted ISO 10646-1:1993 early, using a > 16 bit representation for characters (i.e. wchar_t in C is 16 > bits). Which is fine as it maps directly to Ada's wide character. I still think that we want to retain the capacity of using 16-bit blobs to represent characters in the BMP, as 99.5% of practical applications will only need the BMP. > For Ada, numerous changes would be required if we want to expose the > UTF-16 representation to programmers, for example by declaring > Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no > longer be arrays of characters indexed by position). Changes to Wide_Character and Wide_String are pretty much out of the question. 
On the other hand, the type that is intended for interfacing with C is Interfaces.C.wchar_array, and it would be straightforward to provide (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C compiler does). > I really hope we can use this approach (UTF-32 > internal representation) for Ada, as it simplifies things > considerably, especially if we want to add character properties > support (see below). I would think that we would want to use UCS-4, since it's an ISO standard. Moreover, UTF-32 has a number of consistency rules (eg code points below 16#10ffff#) which seem irrelevant for internal manipulation of strings. > * We could add Wide_Wide_Character and Wide_Wide_String types to > pacakge Standard (and extending the Ada.Strings hierarchy), which > are encoded in UTF-32. Wide_Wide_ types seem like the natural way to add this capability to the language, except that some compilers may not be quite prepared to deal with enumeration types with 2 ** 32 literals (ours isn't). > (the UTF-16 string type could become a > second-class citizen, though, without full support in the Ada.Strings > hierarchy). As far as I can tell, there is no support for UTF-16, only for UCS-2. Anyway, I don't think it is reasonable to force applications to go to the full 32-bit overhead just because they use, say, the french OE ligature. > * External representation of UCS characters is rapidly moving > towards UTF-8 (especially in Internet standards). > > Ada should provide an interface for converting between the wide string > type(s) and UTF-8 octet sequences. It should be possible to use > string literals where UTF-8 strings are expected. External representation is best handled by Text_IO and friends, typically by using a form parameter to specify the encoding (and there are many more encodings than just UCS and UTF). 
The ARG won't get into the business of specifying the details of the form parameter, so this is something that will remain non-portable for the foreseeable future. (Where do we stop? Do we want to require all validated compilers to support UTF-8? What about the chinese Big5 or the JIS encodings?) > * Supporting higher levels of Unicode (e.g. accessing the character > properties database, normalization forms) would be interesting, > too. We certainly don't want to get into that business. The designers of Ada 95 wisely decided to lump all of the characters in the range 16#0100# .. 16#FFFD# into the category special_character, so that they don't have to decide which is a letter, a number, etc. Similarly they didn't provide classification functions or upper/lower conversions for wide characters. This seems reasonable if we don't want to have to amend Ada each time a bunch of characters are added to 10646. ************************************************************* From: Nick Roberts Sent: Wednesday, April 24, 2002 7:31 PM > Therefore, I think Ada 200X should somehow support characters outside > the BMP. I agree. > GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS > characters in a single 32 bit value, that is, UTF-32), and while this is > certainly not the "industry standard" (it is encouraged by ISO 9899:1999, > though), I really hope we can use this approach (UTF-32 internal > representation) for Ada, as it simplifies things considerably, especially > if we want to add character properties support (see below). I agree very strongly! > * We could add Wide_Wide_Character and Wide_Wide_String types to > pacakge Standard (and extending the Ada.Strings hierarchy), which > are encoded in UTF-32. I must say I would prefer the identifiers Universal_Character and Universal_String. I see the logic of Wide_Wide_ but it seems clumsy! > I don't know if this is necessary. 
IIRC, Robert Dewar once told that > the only applications using Wide_Character are based on ASIS, where > using Wide_Character is not really voluntary. Maybe it is possible > to bump Wide_Character'Size to 32 bits instead, without really > breaking backwards compatibility. I disagree with this idea. > Of course, we would need a way to convert UTF-32 strings to UTF-16 > strings and vice versa (the UTF-16 string type could become a > second-class citizen, though, without full support in the Ada.Strings > hierarchy). Possibly these support packages should be in an optional annex. > * External representation of UCS characters is rapidly moving > towards UTF-8 (especially in Internet standards). > > Ada should provide an interface for converting between the wide string > type(s) and UTF-8 octet sequences. It should be possible to use string > literals where UTF-8 strings are expected. > > * Supporting higher levels of Unicode (e.g. accessing the character > properties database, normalization forms) would be interesting, > too. Again, perhaps all this should really be in (or moved into) an optional annex. ************************************************************* From: Robert Dewar Sent: Wednesday, April 24, 2002 9:50 PM I suspect that the work on wide_wide_character will in practice turn out to be nearly useless in the short or medium term. We certainly put in a lot of work in GNAT in implementing wide character with many different representation schemes, but this feature has been very little used (ASIS being the main use :-). In practice I think the 16-bit character type defined in Ada now will be adequate for almost all use, and I see no reason in requiring implementations to go beyond this in the absence of real market demand. Yes, it's fun to talk about character set issues (after all I was chair of the CRG, so I appreciate this), but there is no point in increasing implementation burdens unless it's really valuable.
I would just give clear permission for an implementation to add additional character types in standard (indeed that permission exists today in Ada 95), and leave it at that. ************************************************************* From: John Barnes Sent: Thursday, April 25, 2002 1:46 AM The BSI is looking at character set issues across languages and your message reminded me of the CRG. Was there ever a final report that I could refer to? ************************************************************* From: Robert Dewar Sent: Thursday, April 26, 2002 10:25 PM I think there was a final report, perhaps Jim could track it down. ************************************************************* From: Randy Brukardt Sent: Thursday, April 25, 2002 3:44 PM > We certainly put in a lot of work in GNAT in implementing wide > character with many different representation schemes, but this > feature has been very little used (ASIS being the main use :-). To add another data point: Claw was designed so that a wide character version could be easily created. But we've never implemented that version, mainly because we've never had a paying customer ask for it. So I have to wonder how important "Really_Wide_Character" would be. ************************************************************* From: Florian Weimer Sent: Saturday, May 18, 2002 5:41 AM > I suspect that the work on wide_wide_character will in practice turn > out to be nearly useless in the short or medium term. Using Ada for internationalized applications on GNU systems (using GNU facilities) almost requires 32 bit Wide_Wide_Character support, since GNU uses a 32 bit wchar_t internally. (See a similar discussion on the GCC development list.) ************************************************************* From: Robert Dewar Sent: Saturday, May 18, 2002 7:32 AM We have seen zero demand for such functionality, so would not invest any time at all in either design or implementation work here. 
If such a feature is added to Ada, I would definitely suggest it be optional. ************************************************************* From: Florian Weimer Sent: Saturday, May 18, 2002 6:00 AM >> * Several major vendors have adopted ISO 10646-1:1993 early, using a >> 16 bit representation for characters (i.e. wchar_t in C is 16 >> bits). > > Which is fine as it maps directly to Ada's wide character. I still think > that we want to retain the capacity of using 16-bit blobs to represent > characters in the BMP, as 99.5% of practical applications will only need the > BMP. Quite a few people have already changed their minds about the 99.5% figure (mathematical characters and Plane 14 Language being the reason). Maybe it's true for the character count, but I doubt it for the application count. > Changes to Wide_Character and Wide_String are pretty much out of the > question. Okay, accepted. > On the other hand, the type that is intended for interfacing with > C is Interfaces.C.wchar_array, and it would be straightforward to provide > (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit > Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C > compiler does). I doubt that C compilers can use UTF-16 for wchar_t. You cannot apply iswlower() to a single surrogate character. :-/ > I would think that we would want to use UCS-4, since it's an ISO standard. > Moreover, UTF-32 has a number of consistency rules (eg code points below > 16#10ffff#) which seem irrelevant for internal manipulation of strings. Yes, UCS-4 is indeed the correct encoding form to use. >> * We could add Wide_Wide_Character and Wide_Wide_String types to >> pacakge Standard (and extending the Ada.Strings hierarchy), which >> are encoded in UTF-32. > > Wide_Wide_ types seem like the natural way to add this capability to the > language, except that some compilers may not be quite prepared to deal with > enumeration types with 2 ** 32 literals (ours isn't). 
Ah, this could be a problem indeed, together with the large universal_integer returned by Wide_Wide_Character'Pos. >> (the UTF-16 string type could become a >> second-class citizen, though, without full support in the Ada.Strings >> hierarchy). > > As far as I can tell, there is no support for UTF-16, only for UCS-2. At the moment, yes, but I think we need some UTF-16 support, too, because many operating system interfaces use it. > Anyway, I don't think it is reasonable to force applications to go to the > full 32-bit overhead just because they use, say, the french OE ligature. Most people apparently refuse to use Wide_Character, too, for the same reason. They either go for ISO 8859-15 or Windows 1252, or don't use the OE ligature at all. > External representation is best handled by Text_IO and friends, typically by > using a form parameter to specify the encoding (and there are many more > encodings than just UCS and UTF). There was a recent discussion to add other I/O facilities. UTF-8 is becoming more and more common in the Internet context, and often, you can determine the encoding of a file only after reading the first couple of lines (think of a MIME-encoded mail message). Furthermore, UTF-8 already plays an important role in interacting with other libraries (not written in Ada). > (Where do we stop? Do we want to require all validated compilers to > support UTF-8? Yes, why not? Why shall all compilers support ISO 8859-1? Why UCS-2? > What about the chinese Big5 or the JIS encodings?) If there is support for UCS-4, handling these encodings could be performed by a mechanism similar to POSIX iconv(). ************************************************************* From: Robert Dewar Sent: Saturday, May 18, 2002 7:43 AM > Yes, why not? Why shall all compilers support ISO 8859-1? Why UCS-2? Why not = because there is no real demand. Especially this time around we need to be very careful not to require things that no one is really interested in. 
If we do this, the vendors will simply ignore any new standard. In fact I think if there is a new standard, it will only be implemented as a result of direct customer interest in features in this standard. The value of formal conformance and validation has largely disappeared from the Ada marketplace at this stage (in terms of customer demand). That's not to say that the Ada marketplace is not very vital and dynamic, we get dozens of requests for enhancements from our users every month, but there is precious little intersection between the things users seem to need and want and these kind of discussions. In GNAT, we put a lot of effort into implementing multiple character sets (we just added the new Latin set with the Euro symbol, because customers needed that for example). Some of it has been useful (like this Euro addition), but mostly these features are of entertainment and advertising value only. In fact the only serious user that we have for Wide_Character and Wide_String is us (from ASIS :-) One thing to remember here is that very little is needed in the way of language support for fancy character sets (most of the effort in GNAT for example for 8-bit sets is in csets, which gives proper case mapping for identifiers, and it is easy enough to add new tables to this -- someone contributed a new Cyrillic table just a few months ago). Most of the issues are representational issues, and the Ada standard has nothing to say about source representation (and this should not change in any new standard). ************************************************************* From: Pascal Leroy Sent: Tuesday, May 21, 2002 4:03 AM > > Which is fine as it maps directly to Ada's wide character. I still think > > that we want to retain the capacity of using 16-bit blobs to represent > > characters in the BMP, as 99.5% of practical applications will only need the > > BMP. 
> > Quite a few people have already changed their minds about the 99.5% > figure (mathematical characters and Plane 14 Language being the > reason). Maybe it's true for the character count, but I doubt it for > the application count. Remember, we are talking Ada applications here. There are probably many applications out there that deal with mathematical symbols or with Tengwar, but I doubt that they are written in Ada. > > External representation is best handled by Text_IO and friends, typically by > > using a form parameter to specify the encoding (and there are many more > > encodings than just UCS and UTF). > > There was a recent discussion to add other I/O facilities. UTF-8 is > becoming more and more common in the Internet context, and often, you > can determine the encoding of a file only after reading the first > couple of lines (think of a MIME-encoded mail message). Furthermore, > UTF-8 already plays an important role in interacting with other > libraries (not written in Ada). Maybe we need a predefined unit to convert UCS-2 to/from UTF-8. But then such conversion functions could easily be written by the user, too, or provided by some public domain stuff. > > (Where do we stop? Do we want to require all validated compilers to > > support UTF-8? > > Yes, why not? Why shall all compilers support ISO 8859-1? Why UCS-2? You don't sell many compilers if you don't support 8859-1. As for UCS-2, well, that's pretty much the default representation of wide characters anyway. Other than that, it would seem that we should let the market decide. Speaking for Rational, we have had wide character support for about 7 years, and I don't recall seeing a single bug report or request for enhancement on this topic. This may indicate that our technology is perfect, but there are other explanation ;-) . 
(As a matter of fact we probably have very few licenses installed in countries where 8859-1 is not sufficient to write the native language -- ignoring the problem with the OE ligature in French.) One option would be to add Wide_Wide_Character in a new annex, and let users decide if they want their vendors to support this annex. Of course, chances are that nobody would care, in which case that would be a lot of standardization effort for nothing. ************************************************************* From: Robert Dewar Sent: Tuesday, May 21, 2002 4:39 AM I agree with everything Pascal had to say about wide character. We do have one Japanese customer using wide characters, and as I mentioned earlier, ASIS uses wide strings to represent source texts, but other than that, we have heard very little about wide strings. The only real input we have got from customers on character set issues was the request to support Latin-9 with the new Euro symbol and we got contributed tables for Cyrillic from a Russian enthusiast (not a customer, but it seemed a harmless addition :-) ************************************************************* From: Florian Weimer Sent: Tuesday, May 21, 2002 1:42 PM > I agree with everything Pascal had to say about wide character. We do have > one Japanese customer using wide characters, and as I mentioned earlier, > ASIS uses wide strings to represent source texts, but other than that, > we have heard very little about wide strings. I guess this customer doesn't use Wide_Character in the way it was intended (for storing ISO 10646 code position), so this example is a bit dubious. > The only real input we have got from customers on character set > issues was the request to support Latin-9 with the new Euro symbol Even in this rather innocent case, Wide_Character is no longer using UCS-2 with GNAT. ************************************************************* From: Michael F. 
Yoder Sent: Monday, October 21, 2002 10:58 AM This is one of the items on my homework list. UTF = UCS Transformation Format. UCS = Universal Multiple-Octet Coded Character Set. I guess the MOC is silent. :-) UTF-8 encodes 31-bit values as 8-bit values, as follows.

0xxxxxxx                         encodes itself (the coding is ASCII-compatible)
110xxxxx 10Y                     encodes xxxxxY where Y stands for yyyyyy
1110xxxx 10Y 10Z                 encodes xxxxYZ
11110xxx 10Y 10Z 10U             encodes xxxYZU
111110xx 10Y 10Z 10U 10V         encodes xxYZUV
1111110x 10Y 10Z 10U 10V 10W     encodes xYZUVW

The octets 11111110 and 11111111 aren't used in the encoding. So, excepting these 2, octets starting with 11 are headers, those starting with 10 are trailers, and those starting with 0 are singletons. It's forbidden to use the redundant encodings (you must use the shortest encoding allowed). There are security reasons for this, aside from the fact that doing so breaks the string search property mentioned below. The encoding is self-synchronizing: if you start in the middle of a string of octets, you skip octets of the form 10xxxxxx to get to the next start of character. If the encoding is proper, string searches for an encoded pattern within an encoded string will work as desired to yield all occurrences of the pattern. (For case-folded searches and the like this only works if the string is mapped before being converted to UTF-8.) ************************************************************* From: Robert Dewar Sent: Monday, October 21, 2002 11:03 AM Is anyone using UTF-8 encoding with Ada. We have some customers using wide character encodings but none to our knowledge uses UTF-8. ************************************************************* From: Robert A. Duff Sent: Monday, October 21, 2002 11:43 AM > It's forbidden to use the redundant encodings (you must use the shortest > encoding allowed). There are security reasons for this,... I'm curious: why is that? (Not quite curious enough to go RTFM. ;-)) >...
aside from the > fact that doing so breaks the string search property mentioned below. Yes, I understand that. ************************************************************* From: Michael F. Yoder Sent: Monday, October 21, 2002 1:15 PM This problem is one my previous employer is having to deal with. Basically, it's that redundant encodings can be used to sneak things past filters if the redundant encodings aren't rejected; if redundant encodings are allowed, writing (say) a regular expression that will match exactly all possible encoded forms is a pain, is error-prone, and is probably significantly slower to check. Here's a contrived case. A program reads a command, and if it's the special command 'shazam' it checks the user's authorization; otherwise it passes on the command unmodified, because all other commands are safe. If there's a redundant encoding of 'shazam' that the filter misses, an unauthorized user can bypass the checking if he can arrange to supply that encoding. ************************************************************* From: Michael F. Yoder Sent: Thursday, October 24, 2002 5:46 PM This is the easy part of my homework. The identifier character ranges are defined in terms of multiple character categories (see below), so I can't get the harder part without a little coding. This is using Unicode version 3.2. A "space" is itself a normative category. It is anything in the range U+2000 to U+200B, plus 5 other scattered characters. A "separator" is any space plus the two characters "Line Separator" U+2028 and "Paragraph Separator" U+2029. These are each in a normative category containing just 1 value. A "decimal digit" is itself a normative category. There are 25 ranges of these, 23 including the digits 0 through 9 and 2 with only the digits 1 through 9. (These two scripts use the ASCII zero rather than encoding a separate one.) Five of these ranges are above U+FFFF, that is, out of the BMP (their character descriptions all start with "mathematical"). 
The digits 1 through 9 in these scripts don't in general look much like our 1 through 9. The rules for identifiers say (I'm condensing and interpreting) that the syntax for identifiers should start with their basic definition and fiddle it as appropriate to include extra characters (for Ada, that means underscore). Their basic definition is identifier ::= id-start { id-start | id-extend } id-start is any letter (which come in 5 subcategories) or a "letter number." There are a lot of letters outside the BMP, including the large range "CJK Ideograph Extension B." id-extend is decimal digits plus nonspacing marks, spacing combining marks, connector punctuation, and formatting codes. ************************************************************* From: Robert Dewar Sent: Thursday, October 24, 2002 7:19 PM I am completely confused, why are we discussing this exactly? Can you be clear as to the goals of this discussion? ************************************************************* From: Randy Brukardt Sent: Thursday, October 24, 2002 2:50 PM I know I don't count, :-) but I've had several requests to extend my spam filter to support UTF-8 encodings. Because I'm not asking for any money for the filter, and I haven't had any significant amount of UTF-8 mail, I haven't done anything about it yet. But it seems likely that I will need to do this at some point (I've seen occasional UTF-8 encoded mail, but not enough good mail that handling it manually is a problem.) ************************************************************* From: Robert Dewar Sent: Thursday, October 24, 2002 4:29 PM Oh sure, UTF-8 encoded spam is common indeed, but that was not what I was talking about (unless you have some spam messages written in Ada source code :-) ************************************************************* From: Randy Brukardt Sent: Thursday, October 24, 2002 4:59 PM I think you misunderstand. I have written an anti-spam plugin for the IMS mailserver that I use.
It is written in Ada, of course, and I've had requests for it to be able to handle UTF-8 encoded mail. For me, it's fine to treat such mail as all spam, but that is not true for some of the other users of it. (I've made it available to the community of IMS users, as they have made many useful plugins available that I have been using for years.) In order to properly support UTF-8 mail, I'd need at least to convert the search patterns (in Latin-1, of course) into UTF-8. I'd also need to verify that the rules that Mike noted are followed (a common trick of spammers is to violate basic encoding rules, as most decoders don't check. But the illegal encodings tend to get ignored by filters, because they don't match exactly. That was one of the prime reasons I wrote the plugin in the first place, because a lot of spam is now coming encoded in one way or another, and thus is not picked up by a plain text scan). ************************************************************* From: Robert Dewar Sent: Thursday, October 24, 2002 7:17 PM Oh! I was confused then, I thought this was something to do with Ada. ************************************************************* From: Randy Brukardt Sent: Thursday, October 24, 2002 7:46 PM Of course it has to do with Ada. You asked "Is anyone using UTF-8 encoding with Ada." And I answered that I have an Ada program that needs to process UTF-8 text (but doesn't yet). And I tried to explain what the program is and why it needs to process UTF-8 text and what support from Ada would be valuable. Perhaps I should have just answered your original question "Yes"? :-) ************************************************************* From: Robert Dewar Sent: Thursday, October 24, 2002 8:09 PM Sorry, when I meant "using UTF-8 encoding with Ada", I was talking about language features for wide character representation. The fact that your program is in Ada does not seem to be particularly informative. 
I am completely confused here. What ARG-related language problem is this
thread addressing?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 8:32 PM

As I recall, one of the facets of UTF-8 support in Ada would be the
equivalent of Ada.Characters.Handling for UTF-8 represented Strings. Those
operations would be valuable for this application, particularly
To_Wide_String (UTF_8_String) or To_UTF_8_String (String).

A UTF-8 Text_IO would also be valuable, although I'd find that overkill for
this application (usually the text has to be decoded to UTF-8 from some
7-bit representation anyway).

I'm not sure where else UTF-8 would appear in the standard. Source
representation and external file representations are outside of the scope
of the standard. The regular string operations seem to work for most (all?)
operations. Everything else seems to already be covered by the existing
wide character support.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 8:45 PM

Well, harmless I suppose, but I doubt worth the effort. Again, I would
generate packages on the basis of packages that exist, have proved useful,
and are actually widely used. It seems a mistake to get into the "here's a
neat idea for a package that would help with something I happen to be
doing".

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002 5:46 PM

> I am completely confused here, what ARG-related language
> problem is this thread addressing?

Kiyoshi Ishihata stated at the last meeting that there was an interest in
some countries in being able to write programs as much as possible in
native languages, the primary deficit in this regard being that identifiers
are entirely in Latin-1 characters.
He didn't specify which countries to my recollection, but Japan, Russia,
China, and India are obvious cases where the commonly used scripts are
disjoint from Latin-1.

The information being supplied is exploratory in nature: the idea is to
find out how hard it would be to extend existing compilers so as to satisfy
all the national groups at once, and whether and to what extent the ARG
should be involved in specifying standards for such extensions.

There was a separate issue involving the fact that ISO 10646-n (I forget
what n is) now has mapped characters outside the BMP. This had to happen,
given that the code now maps some 70,000 Han characters.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002 8:54 PM

Well I would just allow arbitrary wide characters in identifiers. Why not?
It does not cause any problems. GNAT has implemented an option for this
forever. I would specify that there is no upper/lower case equivalence in
this case, since otherwise you get into a huge mess that is simply not
worth the effort.

*************************************************************

From: Tucker Taft
Sent: Thursday, October 24, 2002 10:10 PM

I suggest you read the ARG minutes when they are available. Kiyoshi
indicated specifically that they wanted to restrict usage to characters
that "make sense" as identifier characters. I will admit I was in your camp
that the simplest is to just allow anything. However, I will leave it to
Kiyoshi to explain his reasoning. He certainly knows more than I do about
the requirements. You should perhaps discuss it directly with Kiyoshi if
you don't agree.

Mike indicated that UTF-8 encoding makes it easy to support even very wide
characters in identifiers, because it provides a canonical representation,
as a stream of bytes. We asked him to share his knowledge in this area, so
we didn't all have to become experts in ISO-10646 to evaluate the
implementation issues in this area.
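
To make the point about UTF-8 as a canonical byte stream concrete, here is
a minimal sketch of how one code point maps to its byte sequence. It is an
illustration only: the names UTF8_Sketch, Code_Point, Byte_Array, Encode,
Tail, and Put_Bytes are hypothetical and come neither from this AI nor from
any proposed package, and the sketch assumes nothing beyond package
Interfaces. It also does not reject surrogate code points, which a real
encoder would have to do.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF8_Sketch is

   --  All names here are hypothetical, chosen for illustration only.
   --  Surrogate code points (16#D800# .. 16#DFFF#) are not rejected,
   --  which a real encoder would need to do.
   subtype Code_Point is Unsigned_32 range 0 .. 16#10_FFFF#;

   type Byte_Array is array (Positive range <>) of Unsigned_8;

   --  Encode one code point as 1 to 4 UTF-8 bytes.
   function Encode (C : Code_Point) return Byte_Array is

      --  Continuation byte: 2#10xxxxxx#, where the x bits are the
      --  6 bits of C that start at bit position Shift.
      function Tail (Shift : Natural) return Unsigned_8 is
      begin
         return Unsigned_8 (16#80# or (Shift_Right (C, Shift) and 16#3F#));
      end Tail;

   begin
      if C < 16#80# then
         return (1 => Unsigned_8 (C));
      elsif C < 16#800# then
         return (Unsigned_8 (16#C0# or Shift_Right (C, 6)), Tail (0));
      elsif C < 16#1_0000# then
         return (Unsigned_8 (16#E0# or Shift_Right (C, 12)),
                 Tail (6), Tail (0));
      else
         return (Unsigned_8 (16#F0# or Shift_Right (C, 18)),
                 Tail (12), Tail (6), Tail (0));
      end if;
   end Encode;

   Hex : constant String := "0123456789ABCDEF";

   --  Print each byte as two hexadecimal digits.
   procedure Put_Bytes (B : Byte_Array) is
   begin
      for I in B'Range loop
         Ada.Text_IO.Put (Hex (Natural (B (I) / 16) + 1));
         Ada.Text_IO.Put (Hex (Natural (B (I) mod 16) + 1));
         Ada.Text_IO.Put (' ');
      end loop;
      Ada.Text_IO.New_Line;
   end Put_Bytes;

begin
   Put_Bytes (Encode (16#41#));      --  LATIN CAPITAL LETTER A
   Put_Bytes (Encode (16#E9#));      --  LATIN SMALL LETTER E WITH ACUTE
   Put_Bytes (Encode (16#3042#));    --  HIRAGANA LETTER A
   Put_Bytes (Encode (16#2_0000#));  --  first code point of plane 2
end UTF8_Sketch;
```

Because each code point has exactly one shortest encoding, two identifiers
spelled with the same characters yield the same bytes, which is what makes
a byte-wise comparison in a symbol table workable. Case equivalence is a
separate question and is not addressed by this sketch.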
*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002 10:29 PM

Here are my notes on the Wide_Character in identifiers issue, which will be
turned into the minutes.

"What about full source representation of the language in Wide_Character?
Kiyoshi reports that there is a push in SC22 to allow full wide characters
in identifiers. How do you define which characters are letters? How do you
define case equivalence? Mike says just use "letter" in the character
standard. But this is likely to be very complex in the compiler and in the
run-time. Tucker suggests that anything outside of row 00 be treated as a
letter. Kiyoshi says that would not be acceptable to Japan, which is
preparing a standard for which characters are allowed in identifiers."

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 4:11 AM

> I suggest you read the ARG minutes when they are available. Kiyoshi
> indicated specifically that they wanted to restrict usage to
> characters that "make sense" as identifier characters. I will admit
> I was in your camp that the simplest is to just allow anything.
> However, I will leave it to Kiyoshi to explain his reasoning.
> He certainly knows more than I do about the requirements. You
> should perhaps discuss it directly with Kiyoshi if you don't agree.

I would leave such restrictions up to either local coding standards,
enforced e.g. by ASIS tools, or enforced by compiler restrictions. Getting
into what makes sense in different languages is way way out of scope. (I
speak as the former chair of the CRG; character issues are very difficult
to deal with.)

In the context of the CRG work, we spent ages discussing the issue of
whether E and E-acute should be equivalent in identifiers, and came to the
conclusion that the answer might be different in different languages.

There is no point in adding a huge nationally dependent mess here.
Indeed I would consider in the ISO standard saying specifically that
national bodies are welcome to devise local sub-standards for identifiers
and character set requirements and leave it at that.

I perfectly well understand where Kiyoshi is coming from. I am sure he
feels as strongly that only certain characters be used as Jean Ichbiah felt
about the E/E-acute issue. But it just is not practical for the
international standard to get into the business of deciding what are and
what are not useful identifier names in all the languages of the world, or
even just for the P members :-)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 4:16 AM

OK, so great, very appropriate, there can be a Japanese National standard
that specifies that for Ada compilers to meet this standard, there must be
a mode in which identifiers are only allowed to contain bla bla characters.

Other countries in the world are free to devise similar national standards,
but I fail to see why they should be a matter for an international
standard.

What would be marginally useful in the international standard would be to
devise a general framework for those national standards, and make it clear
that it is an acceptable thing for Ada compilers to implement one or more
of these standards.

Frankly I think that the standard already does that, but it would be fine
to make it explicit. GNAT for example allows lots of localization of
identifier character sets, e.g. Latin-2, Cyrillic etc.

*************************************************************

From: Pascal Leroy
Sent: Friday, October 25, 2002 6:54 AM

> But it just is not practical for the international
> standard to get into the business of deciding what are and what are not
> useful identifier names in all the languages of the world...

It has certainly never been the intent to have the ARG discuss the
identifier characters for all the languages in the world.
However, there is an ISO working group in charge of developing and
maintaining the ISO 10646 standard, and the intent was to piggyback on the
work done there. 10646 defines precisely what is a character (and so yes, E
and E-acute are distinct, as are uppercase A and uppercase alpha, even
though they really look the same), what is a letter, a digit, how the
uppercase/lowercase conversions work, etc. I see no reason why the Ada
standard couldn't use these definitions. (And Mike gave us a feeling of
what this would look like, and it doesn't seem unreasonably complicated to
me.)

Note that Java does exactly that, and defines letters and digits in a way
which is derived from Unicode (itself a close approximation to 10646). I
don't see why Ada would lag behind in this area: it would not be a big
implementation effort, and it would improve usability of the language.

I don't buy the notion that national bodies have a role to play here
(except of course that they probably want to influence 10646). It's already
hard to define one language standard and ensure that it's implemented with
a minimum of consistency, I don't see how users or implementers could live
with the coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".

Pascal

PS: Note that the E vs. E-acute discussion is moot, since this is already
settled by Latin-1 and yes, they are different.

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 7:55 PM

> I don't buy the notion that national bodies have a role to play here (except
> of course that they probably want to influence 10646). It's already hard to
> define one language standard and ensure that it's implemented with a minimum
> of consistency, I don't see how users or implementers could live with the
> coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".
Well GNAT implements lots of different localized character sets, and no one
seems to have dropped dead :-)

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002 9:13 AM

> Kiyoshi Ishihata stated at the last meeting that there was an interest
> in some countries in being able to write programs as much as possible in
> native languages, the primary deficit in this regard being that
> identifiers are entirely in Latin-1 characters.

Yes, but it was also mentioned at the meeting that SC22 is trying to get
programming languages to do something-or-other related to this. I.e. allow
31-bit characters in identifiers, and have some uniformity across
programming languages about which characters are allowed in identifiers.

I suppose WG9 is supposed to "obey" SC22 on this point?

By the way, let's mention the AI number being discussed in these messages,
so we don't get the "What the heck are you talking about?" kinds of
messages from Robert or others who might have missed part of the
discussion. ;-) I believe Pascal raised the issue many months ago, and it
has an AI number, and one can presumably search for that AI number in the
meeting minutes (once Randy publishes them).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 8:32 PM

I tried, I could not find the AI number on this one.

Of course if there are uniform rules at the SC22 level, then it is fine to
adopt them in Ada. I just think it is not something we should expend our
own very limited resources on.

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002 8:59 PM

This was discussed as part of AI-285, which started life as an AI about
Latin-9. That discussion took up the entire afternoon of the third day of
the meeting.

These other issues came up since it was felt that better Wide_Character
support would (might?)
make it unnecessary for the standard to directly deal with Latin-9.
(Implementations still would have to, in all likelihood.)

There are a lot of notes in this area, and I haven't gotten that far in the
minutes yet. So my summary might be suspect... (And I haven't posted the
mail yet, either, but it's likely that it will all go on AI-285.)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 9:12 PM

> This was discussed as part of AI-285, which started life as an AI about
> Latin-9. That discussion took up the entire afternoon of the third day of
> the meeting.

Be careful not to be eaten alive by character discussions. It was quite
intentional that we banned discussion of these issues from the main group
in the Ada 9X effort and shoveled them off to the CRG. Spending one of six
sessions on this issue alone to me says that things are already getting out
of control :-) I quite understand how this happens (remember I was chair of
the CRG!)

> > These other issues came up since it was felt that better Wide_Character
> > support would (might?) make it unnecessary for the standard to directly deal
> > with Latin-9. (Implementations still would have to, in all likelihood.)

Well of course in practice Latin-9 is barely interesting, it just
introduces a different name for the Euro character. But for sure most
computing with Ada will be done using Latin-9 whatever the Ada standard
says :-)

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002 10:14 PM

Well, it sounds worse than it is. The afternoon session of the last day is
typically short. We didn't get back from lunch until about 2:15, and we
adjourned at 3:28. Still, I probably would have dozed off during this
discussion if I hadn't been taking notes...

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002 9:19 AM

I agree that the ARG should not spend time thinking about characters. And
we should not add all kinds of verbiage about character sets to the RM.

But if there is a character-set standard that can be simply referred to,
why not? Apparently, there *is* a definition of which 31-bit characters are
"letters". I thought the intent was to simply refer to that definition
(which of course changes from year to year).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002 8:45 PM

Probably that's reasonable, although I worry that this will generate a lot
of busy work in implementations for extraordinarily little gain.

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002 9:58 AM

Yes. The purpose of Mike Yoder's "homework assignment" was to determine how
difficult it is to write the "Is_Letter" function that the Ada lexer would
need. And a case conversion routine, I guess. And how inefficient these
would have to be. (People at the meeting were concerned about huge
character-set tables having to be in the compiler.)

I'm not at all interested in these character set issues. If folks can make
an AI that is trivial to implement (efficiently), and invokes all
character-set junk by reference to other standards, then I suppose it's OK
with me.

[ Insert my usual rant about what's important, here. ;-) ]

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002 10:14 AM

I agree with Bob in all respects, including the parenthetical comment.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, November 27, 2002 4:27 AM

During the last meeting we discussed the possibility of allowing any
Unicode character (er, I mean, ISO 10646) in Ada source.
Some people were concerned that the classification tables and the uppercase
translation tables would be huge and complex to produce.

Mike Y provided some input on this topic a while back, but since I (and
probably other people) prefer to see the real tables, I spent a couple of
hours writing a little Ada program to parse the Unicode database and spit
out aggregates for these tables.

I am attaching to this message three classification tables (letters,
digits, and spaces) as well as the table that converts to uppercase. The
latter is the largest one, and it only has 419 entries, for a total of 5028
bytes. And that's with a representation that is not particularly compact: a
more space-efficient representation could be obtained for instance by
storing the ranges as (First, Length) instead of (First, Last).

The tables would change slightly depending on the rules that we choose
(e.g. for the syntax of identifiers) but their size would not be
substantially modified.

This demonstrates two things:

1 - The tables are easy to produce from the Unicode database.
2 - The tables are small.

---

Digits : constant Ranges := (
   (16#30#, 16#39#), -- DIGIT ZERO .. DIGIT NINE
   (16#B2#, 16#B3#), -- SUPERSCRIPT TWO .. SUPERSCRIPT THREE
   (16#B9#, 16#B9#), -- SUPERSCRIPT ONE .. SUPERSCRIPT ONE
   (16#660#, 16#669#), -- ARABIC-INDIC DIGIT ZERO .. ARABIC-INDIC DIGIT NINE
   (16#6F0#, 16#6F9#), -- EXTENDED ARABIC-INDIC DIGIT ZERO .. EXTENDED ARABIC-INDIC DIGIT NINE
   (16#966#, 16#96F#), -- DEVANAGARI DIGIT ZERO .. DEVANAGARI DIGIT NINE
   (16#9E6#, 16#9EF#), -- BENGALI DIGIT ZERO .. BENGALI DIGIT NINE
   (16#A66#, 16#A6F#), -- GURMUKHI DIGIT ZERO .. GURMUKHI DIGIT NINE
   (16#AE6#, 16#AEF#), -- GUJARATI DIGIT ZERO .. GUJARATI DIGIT NINE
   (16#B66#, 16#B6F#), -- ORIYA DIGIT ZERO .. ORIYA DIGIT NINE
   (16#BE7#, 16#BEF#), -- TAMIL DIGIT ONE .. TAMIL DIGIT NINE
   (16#C66#, 16#C6F#), -- TELUGU DIGIT ZERO .. TELUGU DIGIT NINE
   (16#CE6#, 16#CEF#), -- KANNADA DIGIT ZERO ..
      -- KANNADA DIGIT NINE
   (16#D66#, 16#D6F#), -- MALAYALAM DIGIT ZERO .. MALAYALAM DIGIT NINE
   (16#E50#, 16#E59#), -- THAI DIGIT ZERO .. THAI DIGIT NINE
   (16#ED0#, 16#ED9#), -- LAO DIGIT ZERO .. LAO DIGIT NINE
   (16#F20#, 16#F29#), -- TIBETAN DIGIT ZERO .. TIBETAN DIGIT NINE
   (16#1040#, 16#1049#), -- MYANMAR DIGIT ZERO .. MYANMAR DIGIT NINE
   (16#1369#, 16#1371#), -- ETHIOPIC DIGIT ONE .. ETHIOPIC DIGIT NINE
   (16#17E0#, 16#17E9#), -- KHMER DIGIT ZERO .. KHMER DIGIT NINE
   (16#1810#, 16#1819#), -- MONGOLIAN DIGIT ZERO .. MONGOLIAN DIGIT NINE
   (16#2070#, 16#2070#), -- SUPERSCRIPT ZERO .. SUPERSCRIPT ZERO
   (16#2074#, 16#2079#), -- SUPERSCRIPT FOUR .. SUPERSCRIPT NINE
   (16#2080#, 16#2089#), -- SUBSCRIPT ZERO .. SUBSCRIPT NINE
   (16#FF10#, 16#FF19#), -- FULLWIDTH DIGIT ZERO .. FULLWIDTH DIGIT NINE
   (16#1D7CE#, 16#1D7FF#) -- MATHEMATICAL BOLD DIGIT ZERO .. MATHEMATICAL MONOSPACE DIGIT NINE
);

---

Letters : constant Ranges := (
   (16#41#, 16#5A#), -- LATIN CAPITAL LETTER A .. LATIN CAPITAL LETTER Z
   (16#61#, 16#7A#), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
   (16#AA#, 16#AA#), -- FEMININE ORDINAL INDICATOR .. FEMININE ORDINAL INDICATOR
   (16#B5#, 16#B5#), -- MICRO SIGN .. MICRO SIGN
   (16#BA#, 16#BA#), -- MASCULINE ORDINAL INDICATOR .. MASCULINE ORDINAL INDICATOR
   (16#C0#, 16#D6#), -- LATIN CAPITAL LETTER A WITH GRAVE .. LATIN CAPITAL LETTER O WITH DIAERESIS
   (16#D8#, 16#F6#), -- LATIN CAPITAL LETTER O WITH STROKE .. LATIN SMALL LETTER O WITH DIAERESIS
   (16#F8#, 16#2B8#), -- LATIN SMALL LETTER O WITH STROKE .. MODIFIER LETTER SMALL Y
   (16#2BB#, 16#2C1#), -- MODIFIER LETTER TURNED COMMA .. MODIFIER LETTER REVERSED GLOTTAL STOP
   (16#2D0#, 16#2D1#), -- MODIFIER LETTER TRIANGULAR COLON .. MODIFIER LETTER HALF TRIANGULAR COLON
   (16#2E0#, 16#2E4#), -- MODIFIER LETTER SMALL GAMMA .. MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
   (16#2EE#, 16#2EE#), -- MODIFIER LETTER DOUBLE APOSTROPHE .. MODIFIER LETTER DOUBLE APOSTROPHE
   (16#37A#, 16#37A#), -- GREEK YPOGEGRAMMENI ..
      -- GREEK YPOGEGRAMMENI
   (16#386#, 16#386#), -- GREEK CAPITAL LETTER ALPHA WITH TONOS .. GREEK CAPITAL LETTER ALPHA WITH TONOS
   (16#388#, 16#3F5#), -- GREEK CAPITAL LETTER EPSILON WITH TONOS .. GREEK LUNATE EPSILON SYMBOL
   (16#400#, 16#481#), -- CYRILLIC CAPITAL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER KOPPA
   (16#48A#, 16#559#), -- CYRILLIC CAPITAL LETTER SHORT I WITH TAIL .. ARMENIAN MODIFIER LETTER LEFT HALF RING
   (16#561#, 16#587#), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LIGATURE ECH YIWN
   (16#5D0#, 16#5F2#), -- HEBREW LETTER ALEF .. HEBREW LIGATURE YIDDISH DOUBLE YOD
   (16#621#, 16#64A#), -- ARABIC LETTER HAMZA .. ARABIC LETTER YEH
   (16#66E#, 16#66F#), -- ARABIC LETTER DOTLESS BEH .. ARABIC LETTER DOTLESS QAF
   (16#671#, 16#6D3#), -- ARABIC LETTER ALEF WASLA .. ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
   (16#6D5#, 16#6D5#), -- ARABIC LETTER AE .. ARABIC LETTER AE
   (16#6E5#, 16#6E6#), -- ARABIC SMALL WAW .. ARABIC SMALL YEH
   (16#6FA#, 16#6FC#), -- ARABIC LETTER SHEEN WITH DOT BELOW .. ARABIC LETTER GHAIN WITH DOT BELOW
   (16#710#, 16#710#), -- SYRIAC LETTER ALAPH .. SYRIAC LETTER ALAPH
   (16#712#, 16#72C#), -- SYRIAC LETTER BETH .. SYRIAC LETTER TAW
   (16#780#, 16#7A5#), -- THAANA LETTER HAA .. THAANA LETTER WAAVU
   (16#7B1#, 16#7B1#), -- THAANA LETTER NAA .. THAANA LETTER NAA
   (16#905#, 16#939#), -- DEVANAGARI LETTER A .. DEVANAGARI LETTER HA
   (16#93D#, 16#93D#), -- DEVANAGARI SIGN AVAGRAHA .. DEVANAGARI SIGN AVAGRAHA
   (16#950#, 16#950#), -- DEVANAGARI OM .. DEVANAGARI OM
   (16#958#, 16#961#), -- DEVANAGARI LETTER QA .. DEVANAGARI LETTER VOCALIC LL
   (16#985#, 16#9B9#), -- BENGALI LETTER A .. BENGALI LETTER HA
   (16#9DC#, 16#9E1#), -- BENGALI LETTER RRA .. BENGALI LETTER VOCALIC LL
   (16#9F0#, 16#9F1#), -- BENGALI LETTER RA WITH MIDDLE DIAGONAL .. BENGALI LETTER RA WITH LOWER DIAGONAL
   (16#A05#, 16#A39#), -- GURMUKHI LETTER A .. GURMUKHI LETTER HA
   (16#A59#, 16#A5E#), -- GURMUKHI LETTER KHHA .. GURMUKHI LETTER FA
   (16#A72#, 16#A74#), -- GURMUKHI IRI ..
      -- GURMUKHI EK ONKAR
   (16#A85#, 16#AB9#), -- GUJARATI LETTER A .. GUJARATI LETTER HA
   (16#ABD#, 16#ABD#), -- GUJARATI SIGN AVAGRAHA .. GUJARATI SIGN AVAGRAHA
   (16#AD0#, 16#AE0#), -- GUJARATI OM .. GUJARATI LETTER VOCALIC RR
   (16#B05#, 16#B39#), -- ORIYA LETTER A .. ORIYA LETTER HA
   (16#B3D#, 16#B3D#), -- ORIYA SIGN AVAGRAHA .. ORIYA SIGN AVAGRAHA
   (16#B5C#, 16#B61#), -- ORIYA LETTER RRA .. ORIYA LETTER VOCALIC LL
   (16#B83#, 16#BB9#), -- TAMIL SIGN VISARGA .. TAMIL LETTER HA
   (16#C05#, 16#C39#), -- TELUGU LETTER A .. TELUGU LETTER HA
   (16#C60#, 16#C61#), -- TELUGU LETTER VOCALIC RR .. TELUGU LETTER VOCALIC LL
   (16#C85#, 16#CB9#), -- KANNADA LETTER A .. KANNADA LETTER HA
   (16#CDE#, 16#CE1#), -- KANNADA LETTER FA .. KANNADA LETTER VOCALIC LL
   (16#D05#, 16#D39#), -- MALAYALAM LETTER A .. MALAYALAM LETTER HA
   (16#D60#, 16#D61#), -- MALAYALAM LETTER VOCALIC RR .. MALAYALAM LETTER VOCALIC LL
   (16#D85#, 16#DC6#), -- SINHALA LETTER AYANNA .. SINHALA LETTER FAYANNA
   (16#E01#, 16#E30#), -- THAI CHARACTER KO KAI .. THAI CHARACTER SARA A
   (16#E32#, 16#E33#), -- THAI CHARACTER SARA AA .. THAI CHARACTER SARA AM
   (16#E40#, 16#E46#), -- THAI CHARACTER SARA E .. THAI CHARACTER MAIYAMOK
   (16#E81#, 16#EB0#), -- LAO LETTER KO .. LAO VOWEL SIGN A
   (16#EB2#, 16#EB3#), -- LAO VOWEL SIGN AA .. LAO VOWEL SIGN AM
   (16#EBD#, 16#EC6#), -- LAO SEMIVOWEL SIGN NYO .. LAO KO LA
   (16#EDC#, 16#F00#), -- LAO HO NO .. TIBETAN SYLLABLE OM
   (16#F40#, 16#F6A#), -- TIBETAN LETTER KA .. TIBETAN LETTER FIXED-FORM RA
   (16#F88#, 16#F8B#), -- TIBETAN SIGN LCE TSA CAN .. TIBETAN SIGN GRU MED RGYINGS
   (16#1000#, 16#102A#), -- MYANMAR LETTER KA .. MYANMAR LETTER AU
   (16#1050#, 16#1055#), -- MYANMAR LETTER SHA .. MYANMAR LETTER VOCALIC LL
   (16#10A0#, 16#10F8#), -- GEORGIAN CAPITAL LETTER AN .. GEORGIAN LETTER ELIFI
   (16#1100#, 16#135A#), -- HANGUL CHOSEONG KIYEOK .. ETHIOPIC SYLLABLE FYA
   (16#13A0#, 16#166C#), -- CHEROKEE LETTER A .. CANADIAN SYLLABICS CARRIER TTSA
   (16#166F#, 16#1676#), -- CANADIAN SYLLABICS QAI ..
      -- CANADIAN SYLLABICS NNGAA
   (16#1681#, 16#169A#), -- OGHAM LETTER BEITH .. OGHAM LETTER PEITH
   (16#16A0#, 16#16EA#), -- RUNIC LETTER FEHU FEOH FE F .. RUNIC LETTER X
   (16#1700#, 16#1711#), -- TAGALOG LETTER A .. TAGALOG LETTER HA
   (16#1720#, 16#1731#), -- HANUNOO LETTER A .. HANUNOO LETTER HA
   (16#1740#, 16#1751#), -- BUHID LETTER A .. BUHID LETTER HA
   (16#1760#, 16#1770#), -- TAGBANWA LETTER A .. TAGBANWA LETTER SA
   (16#1780#, 16#17B3#), -- KHMER LETTER KA .. KHMER INDEPENDENT VOWEL QAU
   (16#17D7#, 16#17D7#), -- KHMER SIGN LEK TOO .. KHMER SIGN LEK TOO
   (16#17DC#, 16#17DC#), -- KHMER SIGN AVAKRAHASANYA .. KHMER SIGN AVAKRAHASANYA
   (16#1820#, 16#18A8#), -- MONGOLIAN LETTER A .. MONGOLIAN LETTER MANCHU ALI GALI BHA
   (16#1E00#, 16#1FBC#), -- LATIN CAPITAL LETTER A WITH RING BELOW .. GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
   (16#1FBE#, 16#1FBE#), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
   (16#1FC2#, 16#1FCC#), -- GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
   (16#1FD0#, 16#1FDB#), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK CAPITAL LETTER IOTA WITH OXIA
   (16#1FE0#, 16#1FEC#), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK CAPITAL LETTER RHO WITH DASIA
   (16#1FF2#, 16#1FFC#), -- GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
   (16#2071#, 16#2071#), -- SUPERSCRIPT LATIN SMALL LETTER I .. SUPERSCRIPT LATIN SMALL LETTER I
   (16#207F#, 16#207F#), -- SUPERSCRIPT LATIN SMALL LETTER N .. SUPERSCRIPT LATIN SMALL LETTER N
   (16#2102#, 16#2102#), -- DOUBLE-STRUCK CAPITAL C .. DOUBLE-STRUCK CAPITAL C
   (16#2107#, 16#2107#), -- EULER CONSTANT .. EULER CONSTANT
   (16#210A#, 16#2113#), -- SCRIPT SMALL G .. SCRIPT SMALL L
   (16#2115#, 16#2115#), -- DOUBLE-STRUCK CAPITAL N .. DOUBLE-STRUCK CAPITAL N
   (16#2119#, 16#211D#), -- DOUBLE-STRUCK CAPITAL P .. DOUBLE-STRUCK CAPITAL R
   (16#2124#, 16#2124#), -- DOUBLE-STRUCK CAPITAL Z .. DOUBLE-STRUCK CAPITAL Z
   (16#2126#, 16#2126#), -- OHM SIGN ..
      -- OHM SIGN
   (16#2128#, 16#2128#), -- BLACK-LETTER CAPITAL Z .. BLACK-LETTER CAPITAL Z
   (16#212A#, 16#212D#), -- KELVIN SIGN .. BLACK-LETTER CAPITAL C
   (16#212F#, 16#2131#), -- SCRIPT SMALL E .. SCRIPT CAPITAL F
   (16#2133#, 16#2139#), -- SCRIPT CAPITAL M .. INFORMATION SOURCE
   (16#213D#, 16#213F#), -- DOUBLE-STRUCK SMALL GAMMA .. DOUBLE-STRUCK CAPITAL PI
   (16#2145#, 16#2149#), -- DOUBLE-STRUCK ITALIC CAPITAL D .. DOUBLE-STRUCK ITALIC SMALL J
   (16#3005#, 16#3006#), -- IDEOGRAPHIC ITERATION MARK .. IDEOGRAPHIC CLOSING MARK
   (16#3031#, 16#3035#), -- VERTICAL KANA REPEAT MARK .. VERTICAL KANA REPEAT MARK LOWER HALF
   (16#303B#, 16#303C#), -- VERTICAL IDEOGRAPHIC ITERATION MARK .. MASU MARK
   (16#3041#, 16#3096#), -- HIRAGANA LETTER SMALL A .. HIRAGANA LETTER SMALL KE
   (16#309D#, 16#309F#), -- HIRAGANA ITERATION MARK .. HIRAGANA DIGRAPH YORI
   (16#30A1#, 16#30FA#), -- KATAKANA LETTER SMALL A .. KATAKANA LETTER VO
   (16#30FC#, 16#318E#), -- KATAKANA-HIRAGANA PROLONGED SOUND MARK .. HANGUL LETTER ARAEAE
   (16#31A0#, 16#31FF#), -- BOPOMOFO LETTER BU .. KATAKANA LETTER SMALL RO
   (16#3400#, 16#A48C#), -- .. YI SYLLABLE YYR
   (16#AC00#, 16#D7A3#), -- ..
   (16#F900#, 16#FB1D#), -- CJK COMPATIBILITY IDEOGRAPH-F900 .. HEBREW LETTER YOD WITH HIRIQ
   (16#FB1F#, 16#FB28#), -- HEBREW LIGATURE YIDDISH YOD YOD PATAH .. HEBREW LETTER WIDE TAV
   (16#FB2A#, 16#FD3D#), -- HEBREW LETTER SHIN WITH SHIN DOT .. ARABIC LIGATURE ALEF WITH FATHATAN ISOLATED FORM
   (16#FD50#, 16#FDFB#), -- ARABIC LIGATURE TEH WITH JEEM WITH MEEM INITIAL FORM .. ARABIC LIGATURE JALLAJALALOUHOU
   (16#FE70#, 16#FEFC#), -- ARABIC FATHATAN ISOLATED FORM .. ARABIC LIGATURE LAM WITH ALEF FINAL FORM
   (16#FF21#, 16#FF3A#), -- FULLWIDTH LATIN CAPITAL LETTER A .. FULLWIDTH LATIN CAPITAL LETTER Z
   (16#FF41#, 16#FF5A#), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z
   (16#FF66#, 16#FFDC#), -- HALFWIDTH KATAKANA LETTER WO .. HALFWIDTH HANGUL LETTER I
   (16#10300#, 16#1031E#), -- OLD ITALIC LETTER A ..
      -- OLD ITALIC LETTER UU
   (16#10330#, 16#10349#), -- GOTHIC LETTER AHSA .. GOTHIC LETTER OTHAL
   (16#10400#, 16#1044D#), -- DESERET CAPITAL LETTER LONG I .. DESERET SMALL LETTER ENG
   (16#1D400#, 16#1D6C0#), -- MATHEMATICAL BOLD CAPITAL A .. MATHEMATICAL BOLD CAPITAL OMEGA
   (16#1D6C2#, 16#1D6DA#), -- MATHEMATICAL BOLD SMALL ALPHA .. MATHEMATICAL BOLD SMALL OMEGA
   (16#1D6DC#, 16#1D6FA#), -- MATHEMATICAL BOLD EPSILON SYMBOL .. MATHEMATICAL ITALIC CAPITAL OMEGA
   (16#1D6FC#, 16#1D714#), -- MATHEMATICAL ITALIC SMALL ALPHA .. MATHEMATICAL ITALIC SMALL OMEGA
   (16#1D716#, 16#1D734#), -- MATHEMATICAL ITALIC EPSILON SYMBOL .. MATHEMATICAL BOLD ITALIC CAPITAL OMEGA
   (16#1D736#, 16#1D74E#), -- MATHEMATICAL BOLD ITALIC SMALL ALPHA .. MATHEMATICAL BOLD ITALIC SMALL OMEGA
   (16#1D750#, 16#1D76E#), -- MATHEMATICAL BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA
   (16#1D770#, 16#1D788#), -- MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD SMALL OMEGA
   (16#1D78A#, 16#1D7A8#), -- MATHEMATICAL SANS-SERIF BOLD EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA
   (16#1D7AA#, 16#1D7C2#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL OMEGA
   (16#1D7C4#, 16#1D7C9#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL
   (16#20000#, 16#2FA1D#) -- .. CJK COMPATIBILITY IDEOGRAPH-2FA1D
);

---

Spaces : constant Ranges := (
   (16#20#, 16#20#), -- SPACE .. SPACE
   (16#A0#, 16#A0#), -- NO-BREAK SPACE .. NO-BREAK SPACE
   (16#1680#, 16#1680#), -- OGHAM SPACE MARK .. OGHAM SPACE MARK
   (16#2000#, 16#200B#), -- EN QUAD .. ZERO WIDTH SPACE
   (16#202F#, 16#202F#), -- NARROW NO-BREAK SPACE .. NARROW NO-BREAK SPACE
   (16#205F#, 16#205F#), -- MEDIUM MATHEMATICAL SPACE .. MEDIUM MATHEMATICAL SPACE
   (16#3000#, 16#3000#) -- IDEOGRAPHIC SPACE .. IDEOGRAPHIC SPACE
);

---

Uppercase_Mapping : constant Mapping_Ranges := (
   (16#61#, 16#7A#, -32), -- LATIN SMALL LETTER A ..
      -- LATIN SMALL LETTER Z
   (16#B5#, 16#B5#, 743), -- MICRO SIGN .. MICRO SIGN
   (16#E0#, 16#F6#, -32), -- LATIN SMALL LETTER A WITH GRAVE .. LATIN SMALL LETTER O WITH DIAERESIS
   (16#F8#, 16#FE#, -32), -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER THORN
   (16#FF#, 16#FF#, 121), -- LATIN SMALL LETTER Y WITH DIAERESIS .. LATIN SMALL LETTER Y WITH DIAERESIS
   (16#101#, 16#101#, -1), -- LATIN SMALL LETTER A WITH MACRON .. LATIN SMALL LETTER A WITH MACRON
   (16#103#, 16#103#, -1), -- LATIN SMALL LETTER A WITH BREVE .. LATIN SMALL LETTER A WITH BREVE
   (16#105#, 16#105#, -1), -- LATIN SMALL LETTER A WITH OGONEK .. LATIN SMALL LETTER A WITH OGONEK
   (16#107#, 16#107#, -1), -- LATIN SMALL LETTER C WITH ACUTE .. LATIN SMALL LETTER C WITH ACUTE
   (16#109#, 16#109#, -1), -- LATIN SMALL LETTER C WITH CIRCUMFLEX .. LATIN SMALL LETTER C WITH CIRCUMFLEX
   (16#10B#, 16#10B#, -1), -- LATIN SMALL LETTER C WITH DOT ABOVE .. LATIN SMALL LETTER C WITH DOT ABOVE
   (16#10D#, 16#10D#, -1), -- LATIN SMALL LETTER C WITH CARON .. LATIN SMALL LETTER C WITH CARON
   (16#10F#, 16#10F#, -1), -- LATIN SMALL LETTER D WITH CARON .. LATIN SMALL LETTER D WITH CARON
   (16#111#, 16#111#, -1), -- LATIN SMALL LETTER D WITH STROKE .. LATIN SMALL LETTER D WITH STROKE
   (16#113#, 16#113#, -1), -- LATIN SMALL LETTER E WITH MACRON .. LATIN SMALL LETTER E WITH MACRON
   (16#115#, 16#115#, -1), -- LATIN SMALL LETTER E WITH BREVE .. LATIN SMALL LETTER E WITH BREVE
   (16#117#, 16#117#, -1), -- LATIN SMALL LETTER E WITH DOT ABOVE .. LATIN SMALL LETTER E WITH DOT ABOVE
   (16#119#, 16#119#, -1), -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK
   (16#11B#, 16#11B#, -1), -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON
   (16#11D#, 16#11D#, -1), -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX
   (16#11F#, 16#11F#, -1), -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE
   (16#121#, 16#121#, -1), -- LATIN SMALL LETTER G WITH DOT ABOVE ..
      -- LATIN SMALL LETTER G WITH DOT ABOVE
   (16#123#, 16#123#, -1), -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA
   (16#125#, 16#125#, -1), -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX
   (16#127#, 16#127#, -1), -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE
   (16#129#, 16#129#, -1), -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE
   (16#12B#, 16#12B#, -1), -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON
   (16#12D#, 16#12D#, -1), -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE
   (16#12F#, 16#12F#, -1), -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK
   (16#131#, 16#131#, -232), -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I
   (16#133#, 16#133#, -1), -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ
   (16#135#, 16#135#, -1), -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX
   (16#137#, 16#137#, -1), -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA
   (16#13A#, 16#13A#, -1), -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE
   (16#13C#, 16#13C#, -1), -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA
   (16#13E#, 16#13E#, -1), -- LATIN SMALL LETTER L WITH CARON .. LATIN SMALL LETTER L WITH CARON
   (16#140#, 16#140#, -1), -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT
   (16#142#, 16#142#, -1), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE
   (16#144#, 16#144#, -1), -- LATIN SMALL LETTER N WITH ACUTE .. LATIN SMALL LETTER N WITH ACUTE
   (16#146#, 16#146#, -1), -- LATIN SMALL LETTER N WITH CEDILLA .. LATIN SMALL LETTER N WITH CEDILLA
   (16#148#, 16#148#, -1), -- LATIN SMALL LETTER N WITH CARON .. LATIN SMALL LETTER N WITH CARON
   (16#14B#, 16#14B#, -1), -- LATIN SMALL LETTER ENG .. LATIN SMALL LETTER ENG
   (16#14D#, 16#14D#, -1), -- LATIN SMALL LETTER O WITH MACRON ..
LATIN SMALL LETTER O WITH MACRON (16#14F#, 16#14F#, -1), -- LATIN SMALL LETTER O WITH BREVE .. LATIN SMALL LETTER O WITH BREVE (16#151#, 16#151#, -1), -- LATIN SMALL LETTER O WITH DOUBLE ACUTE .. LATIN SMALL LETTER O WITH DOUBLE ACUTE (16#153#, 16#153#, -1), -- LATIN SMALL LIGATURE OE .. LATIN SMALL LIGATURE OE (16#155#, 16#155#, -1), -- LATIN SMALL LETTER R WITH ACUTE .. LATIN SMALL LETTER R WITH ACUTE (16#157#, 16#157#, -1), -- LATIN SMALL LETTER R WITH CEDILLA .. LATIN SMALL LETTER R WITH CEDILLA (16#159#, 16#159#, -1), -- LATIN SMALL LETTER R WITH CARON .. LATIN SMALL LETTER R WITH CARON (16#15B#, 16#15B#, -1), -- LATIN SMALL LETTER S WITH ACUTE .. LATIN SMALL LETTER S WITH ACUTE (16#15D#, 16#15D#, -1), -- LATIN SMALL LETTER S WITH CIRCUMFLEX .. LATIN SMALL LETTER S WITH CIRCUMFLEX (16#15F#, 16#15F#, -1), -- LATIN SMALL LETTER S WITH CEDILLA .. LATIN SMALL LETTER S WITH CEDILLA (16#161#, 16#161#, -1), -- LATIN SMALL LETTER S WITH CARON .. LATIN SMALL LETTER S WITH CARON (16#163#, 16#163#, -1), -- LATIN SMALL LETTER T WITH CEDILLA .. LATIN SMALL LETTER T WITH CEDILLA (16#165#, 16#165#, -1), -- LATIN SMALL LETTER T WITH CARON .. LATIN SMALL LETTER T WITH CARON (16#167#, 16#167#, -1), -- LATIN SMALL LETTER T WITH STROKE .. LATIN SMALL LETTER T WITH STROKE (16#169#, 16#169#, -1), -- LATIN SMALL LETTER U WITH TILDE .. LATIN SMALL LETTER U WITH TILDE (16#16B#, 16#16B#, -1), -- LATIN SMALL LETTER U WITH MACRON .. LATIN SMALL LETTER U WITH MACRON (16#16D#, 16#16D#, -1), -- LATIN SMALL LETTER U WITH BREVE .. LATIN SMALL LETTER U WITH BREVE (16#16F#, 16#16F#, -1), -- LATIN SMALL LETTER U WITH RING ABOVE .. LATIN SMALL LETTER U WITH RING ABOVE (16#171#, 16#171#, -1), -- LATIN SMALL LETTER U WITH DOUBLE ACUTE .. LATIN SMALL LETTER U WITH DOUBLE ACUTE (16#173#, 16#173#, -1), -- LATIN SMALL LETTER U WITH OGONEK .. LATIN SMALL LETTER U WITH OGONEK (16#175#, 16#175#, -1), -- LATIN SMALL LETTER W WITH CIRCUMFLEX .. 
LATIN SMALL LETTER W WITH CIRCUMFLEX (16#177#, 16#177#, -1), -- LATIN SMALL LETTER Y WITH CIRCUMFLEX .. LATIN SMALL LETTER Y WITH CIRCUMFLEX (16#17A#, 16#17A#, -1), -- LATIN SMALL LETTER Z WITH ACUTE .. LATIN SMALL LETTER Z WITH ACUTE (16#17C#, 16#17C#, -1), -- LATIN SMALL LETTER Z WITH DOT ABOVE .. LATIN SMALL LETTER Z WITH DOT ABOVE (16#17E#, 16#17E#, -1), -- LATIN SMALL LETTER Z WITH CARON .. LATIN SMALL LETTER Z WITH CARON (16#17F#, 16#17F#, -300), -- LATIN SMALL LETTER LONG S .. LATIN SMALL LETTER LONG S (16#183#, 16#183#, -1), -- LATIN SMALL LETTER B WITH TOPBAR .. LATIN SMALL LETTER B WITH TOPBAR (16#185#, 16#185#, -1), -- LATIN SMALL LETTER TONE SIX .. LATIN SMALL LETTER TONE SIX (16#188#, 16#188#, -1), -- LATIN SMALL LETTER C WITH HOOK .. LATIN SMALL LETTER C WITH HOOK (16#18C#, 16#18C#, -1), -- LATIN SMALL LETTER D WITH TOPBAR .. LATIN SMALL LETTER D WITH TOPBAR (16#192#, 16#192#, -1), -- LATIN SMALL LETTER F WITH HOOK .. LATIN SMALL LETTER F WITH HOOK (16#195#, 16#195#, 97), -- LATIN SMALL LETTER HV .. LATIN SMALL LETTER HV (16#199#, 16#199#, -1), -- LATIN SMALL LETTER K WITH HOOK .. LATIN SMALL LETTER K WITH HOOK (16#19E#, 16#19E#, 130), -- LATIN SMALL LETTER N WITH LONG RIGHT LEG .. LATIN SMALL LETTER N WITH LONG RIGHT LEG (16#1A1#, 16#1A1#, -1), -- LATIN SMALL LETTER O WITH HORN .. LATIN SMALL LETTER O WITH HORN (16#1A3#, 16#1A3#, -1), -- LATIN SMALL LETTER OI .. LATIN SMALL LETTER OI (16#1A5#, 16#1A5#, -1), -- LATIN SMALL LETTER P WITH HOOK .. LATIN SMALL LETTER P WITH HOOK (16#1A8#, 16#1A8#, -1), -- LATIN SMALL LETTER TONE TWO .. LATIN SMALL LETTER TONE TWO (16#1AD#, 16#1AD#, -1), -- LATIN SMALL LETTER T WITH HOOK .. LATIN SMALL LETTER T WITH HOOK (16#1B0#, 16#1B0#, -1), -- LATIN SMALL LETTER U WITH HORN .. LATIN SMALL LETTER U WITH HORN (16#1B4#, 16#1B4#, -1), -- LATIN SMALL LETTER Y WITH HOOK .. LATIN SMALL LETTER Y WITH HOOK (16#1B6#, 16#1B6#, -1), -- LATIN SMALL LETTER Z WITH STROKE .. 
LATIN SMALL LETTER Z WITH STROKE (16#1B9#, 16#1B9#, -1), -- LATIN SMALL LETTER EZH REVERSED .. LATIN SMALL LETTER EZH REVERSED (16#1BD#, 16#1BD#, -1), -- LATIN SMALL LETTER TONE FIVE .. LATIN SMALL LETTER TONE FIVE (16#1BF#, 16#1BF#, 56), -- LATIN LETTER WYNN .. LATIN LETTER WYNN (16#1C5#, 16#1C5#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (16#1C6#, 16#1C6#, -2), -- LATIN SMALL LETTER DZ WITH CARON .. LATIN SMALL LETTER DZ WITH CARON (16#1C8#, 16#1C8#, -1), -- LATIN CAPITAL LETTER L WITH SMALL LETTER J .. LATIN CAPITAL LETTER L WITH SMALL LETTER J (16#1C9#, 16#1C9#, -2), -- LATIN SMALL LETTER LJ .. LATIN SMALL LETTER LJ (16#1CB#, 16#1CB#, -1), -- LATIN CAPITAL LETTER N WITH SMALL LETTER J .. LATIN CAPITAL LETTER N WITH SMALL LETTER J (16#1CC#, 16#1CC#, -2), -- LATIN SMALL LETTER NJ .. LATIN SMALL LETTER NJ (16#1CE#, 16#1CE#, -1), -- LATIN SMALL LETTER A WITH CARON .. LATIN SMALL LETTER A WITH CARON (16#1D0#, 16#1D0#, -1), -- LATIN SMALL LETTER I WITH CARON .. LATIN SMALL LETTER I WITH CARON (16#1D2#, 16#1D2#, -1), -- LATIN SMALL LETTER O WITH CARON .. LATIN SMALL LETTER O WITH CARON (16#1D4#, 16#1D4#, -1), -- LATIN SMALL LETTER U WITH CARON .. LATIN SMALL LETTER U WITH CARON (16#1D6#, 16#1D6#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER U WITH DIAERESIS AND MACRON (16#1D8#, 16#1D8#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE (16#1DA#, 16#1DA#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND CARON .. LATIN SMALL LETTER U WITH DIAERESIS AND CARON (16#1DC#, 16#1DC#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE .. LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE (16#1DD#, 16#1DD#, -79), -- LATIN SMALL LETTER TURNED E .. LATIN SMALL LETTER TURNED E (16#1DF#, 16#1DF#, -1), -- LATIN SMALL LETTER A WITH DIAERESIS AND MACRON .. 
LATIN SMALL LETTER A WITH DIAERESIS AND MACRON (16#1E1#, 16#1E1#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON (16#1E3#, 16#1E3#, -1), -- LATIN SMALL LETTER AE WITH MACRON .. LATIN SMALL LETTER AE WITH MACRON (16#1E5#, 16#1E5#, -1), -- LATIN SMALL LETTER G WITH STROKE .. LATIN SMALL LETTER G WITH STROKE (16#1E7#, 16#1E7#, -1), -- LATIN SMALL LETTER G WITH CARON .. LATIN SMALL LETTER G WITH CARON (16#1E9#, 16#1E9#, -1), -- LATIN SMALL LETTER K WITH CARON .. LATIN SMALL LETTER K WITH CARON (16#1EB#, 16#1EB#, -1), -- LATIN SMALL LETTER O WITH OGONEK .. LATIN SMALL LETTER O WITH OGONEK (16#1ED#, 16#1ED#, -1), -- LATIN SMALL LETTER O WITH OGONEK AND MACRON .. LATIN SMALL LETTER O WITH OGONEK AND MACRON (16#1EF#, 16#1EF#, -1), -- LATIN SMALL LETTER EZH WITH CARON .. LATIN SMALL LETTER EZH WITH CARON (16#1F2#, 16#1F2#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z (16#1F3#, 16#1F3#, -2), -- LATIN SMALL LETTER DZ .. LATIN SMALL LETTER DZ (16#1F5#, 16#1F5#, -1), -- LATIN SMALL LETTER G WITH ACUTE .. LATIN SMALL LETTER G WITH ACUTE (16#1F9#, 16#1F9#, -1), -- LATIN SMALL LETTER N WITH GRAVE .. LATIN SMALL LETTER N WITH GRAVE (16#1FB#, 16#1FB#, -1), -- LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE .. LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE (16#1FD#, 16#1FD#, -1), -- LATIN SMALL LETTER AE WITH ACUTE .. LATIN SMALL LETTER AE WITH ACUTE (16#1FF#, 16#1FF#, -1), -- LATIN SMALL LETTER O WITH STROKE AND ACUTE .. LATIN SMALL LETTER O WITH STROKE AND ACUTE (16#201#, 16#201#, -1), -- LATIN SMALL LETTER A WITH DOUBLE GRAVE .. LATIN SMALL LETTER A WITH DOUBLE GRAVE (16#203#, 16#203#, -1), -- LATIN SMALL LETTER A WITH INVERTED BREVE .. LATIN SMALL LETTER A WITH INVERTED BREVE (16#205#, 16#205#, -1), -- LATIN SMALL LETTER E WITH DOUBLE GRAVE .. LATIN SMALL LETTER E WITH DOUBLE GRAVE (16#207#, 16#207#, -1), -- LATIN SMALL LETTER E WITH INVERTED BREVE .. 
LATIN SMALL LETTER E WITH INVERTED BREVE (16#209#, 16#209#, -1), -- LATIN SMALL LETTER I WITH DOUBLE GRAVE .. LATIN SMALL LETTER I WITH DOUBLE GRAVE (16#20B#, 16#20B#, -1), -- LATIN SMALL LETTER I WITH INVERTED BREVE .. LATIN SMALL LETTER I WITH INVERTED BREVE (16#20D#, 16#20D#, -1), -- LATIN SMALL LETTER O WITH DOUBLE GRAVE .. LATIN SMALL LETTER O WITH DOUBLE GRAVE (16#20F#, 16#20F#, -1), -- LATIN SMALL LETTER O WITH INVERTED BREVE .. LATIN SMALL LETTER O WITH INVERTED BREVE (16#211#, 16#211#, -1), -- LATIN SMALL LETTER R WITH DOUBLE GRAVE .. LATIN SMALL LETTER R WITH DOUBLE GRAVE (16#213#, 16#213#, -1), -- LATIN SMALL LETTER R WITH INVERTED BREVE .. LATIN SMALL LETTER R WITH INVERTED BREVE (16#215#, 16#215#, -1), -- LATIN SMALL LETTER U WITH DOUBLE GRAVE .. LATIN SMALL LETTER U WITH DOUBLE GRAVE (16#217#, 16#217#, -1), -- LATIN SMALL LETTER U WITH INVERTED BREVE .. LATIN SMALL LETTER U WITH INVERTED BREVE (16#219#, 16#219#, -1), -- LATIN SMALL LETTER S WITH COMMA BELOW .. LATIN SMALL LETTER S WITH COMMA BELOW (16#21B#, 16#21B#, -1), -- LATIN SMALL LETTER T WITH COMMA BELOW .. LATIN SMALL LETTER T WITH COMMA BELOW (16#21D#, 16#21D#, -1), -- LATIN SMALL LETTER YOGH .. LATIN SMALL LETTER YOGH (16#21F#, 16#21F#, -1), -- LATIN SMALL LETTER H WITH CARON .. LATIN SMALL LETTER H WITH CARON (16#223#, 16#223#, -1), -- LATIN SMALL LETTER OU .. LATIN SMALL LETTER OU (16#225#, 16#225#, -1), -- LATIN SMALL LETTER Z WITH HOOK .. LATIN SMALL LETTER Z WITH HOOK (16#227#, 16#227#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE .. LATIN SMALL LETTER A WITH DOT ABOVE (16#229#, 16#229#, -1), -- LATIN SMALL LETTER E WITH CEDILLA .. LATIN SMALL LETTER E WITH CEDILLA (16#22B#, 16#22B#, -1), -- LATIN SMALL LETTER O WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER O WITH DIAERESIS AND MACRON (16#22D#, 16#22D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND MACRON .. LATIN SMALL LETTER O WITH TILDE AND MACRON (16#22F#, 16#22F#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE .. 
LATIN SMALL LETTER O WITH DOT ABOVE (16#231#, 16#231#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON (16#233#, 16#233#, -1), -- LATIN SMALL LETTER Y WITH MACRON .. LATIN SMALL LETTER Y WITH MACRON (16#253#, 16#253#, -210), -- LATIN SMALL LETTER B WITH HOOK .. LATIN SMALL LETTER B WITH HOOK (16#254#, 16#254#, -206), -- LATIN SMALL LETTER OPEN O .. LATIN SMALL LETTER OPEN O (16#256#, 16#257#, -205), -- LATIN SMALL LETTER D WITH TAIL .. LATIN SMALL LETTER D WITH HOOK (16#259#, 16#259#, -202), -- LATIN SMALL LETTER SCHWA .. LATIN SMALL LETTER SCHWA (16#25B#, 16#25B#, -203), -- LATIN SMALL LETTER OPEN E .. LATIN SMALL LETTER OPEN E (16#260#, 16#260#, -205), -- LATIN SMALL LETTER G WITH HOOK .. LATIN SMALL LETTER G WITH HOOK (16#263#, 16#263#, -207), -- LATIN SMALL LETTER GAMMA .. LATIN SMALL LETTER GAMMA (16#268#, 16#268#, -209), -- LATIN SMALL LETTER I WITH STROKE .. LATIN SMALL LETTER I WITH STROKE (16#269#, 16#269#, -211), -- LATIN SMALL LETTER IOTA .. LATIN SMALL LETTER IOTA (16#26F#, 16#26F#, -211), -- LATIN SMALL LETTER TURNED M .. LATIN SMALL LETTER TURNED M (16#272#, 16#272#, -213), -- LATIN SMALL LETTER N WITH LEFT HOOK .. LATIN SMALL LETTER N WITH LEFT HOOK (16#275#, 16#275#, -214), -- LATIN SMALL LETTER BARRED O .. LATIN SMALL LETTER BARRED O (16#280#, 16#280#, -218), -- LATIN LETTER SMALL CAPITAL R .. LATIN LETTER SMALL CAPITAL R (16#283#, 16#283#, -218), -- LATIN SMALL LETTER ESH .. LATIN SMALL LETTER ESH (16#288#, 16#288#, -218), -- LATIN SMALL LETTER T WITH RETROFLEX HOOK .. LATIN SMALL LETTER T WITH RETROFLEX HOOK (16#28A#, 16#28B#, -217), -- LATIN SMALL LETTER UPSILON .. LATIN SMALL LETTER V WITH HOOK (16#292#, 16#292#, -219), -- LATIN SMALL LETTER EZH .. LATIN SMALL LETTER EZH (16#3AC#, 16#3AC#, -38), -- GREEK SMALL LETTER ALPHA WITH TONOS .. GREEK SMALL LETTER ALPHA WITH TONOS (16#3AD#, 16#3AF#, -37), -- GREEK SMALL LETTER EPSILON WITH TONOS .. 
GREEK SMALL LETTER IOTA WITH TONOS (16#3B1#, 16#3C1#, -32), -- GREEK SMALL LETTER ALPHA .. GREEK SMALL LETTER RHO (16#3C2#, 16#3C2#, -31), -- GREEK SMALL LETTER FINAL SIGMA .. GREEK SMALL LETTER FINAL SIGMA (16#3C3#, 16#3CB#, -32), -- GREEK SMALL LETTER SIGMA .. GREEK SMALL LETTER UPSILON WITH DIALYTIKA (16#3CC#, 16#3CC#, -64), -- GREEK SMALL LETTER OMICRON WITH TONOS .. GREEK SMALL LETTER OMICRON WITH TONOS (16#3CD#, 16#3CE#, -63), -- GREEK SMALL LETTER UPSILON WITH TONOS .. GREEK SMALL LETTER OMEGA WITH TONOS (16#3D0#, 16#3D0#, -62), -- GREEK BETA SYMBOL .. GREEK BETA SYMBOL (16#3D1#, 16#3D1#, -57), -- GREEK THETA SYMBOL .. GREEK THETA SYMBOL (16#3D5#, 16#3D5#, -47), -- GREEK PHI SYMBOL .. GREEK PHI SYMBOL (16#3D6#, 16#3D6#, -54), -- GREEK PI SYMBOL .. GREEK PI SYMBOL (16#3D9#, 16#3D9#, -1), -- GREEK SMALL LETTER ARCHAIC KOPPA .. GREEK SMALL LETTER ARCHAIC KOPPA (16#3DB#, 16#3DB#, -1), -- GREEK SMALL LETTER STIGMA .. GREEK SMALL LETTER STIGMA (16#3DD#, 16#3DD#, -1), -- GREEK SMALL LETTER DIGAMMA .. GREEK SMALL LETTER DIGAMMA (16#3DF#, 16#3DF#, -1), -- GREEK SMALL LETTER KOPPA .. GREEK SMALL LETTER KOPPA (16#3E1#, 16#3E1#, -1), -- GREEK SMALL LETTER SAMPI .. GREEK SMALL LETTER SAMPI (16#3E3#, 16#3E3#, -1), -- COPTIC SMALL LETTER SHEI .. COPTIC SMALL LETTER SHEI (16#3E5#, 16#3E5#, -1), -- COPTIC SMALL LETTER FEI .. COPTIC SMALL LETTER FEI (16#3E7#, 16#3E7#, -1), -- COPTIC SMALL LETTER KHEI .. COPTIC SMALL LETTER KHEI (16#3E9#, 16#3E9#, -1), -- COPTIC SMALL LETTER HORI .. COPTIC SMALL LETTER HORI (16#3EB#, 16#3EB#, -1), -- COPTIC SMALL LETTER GANGIA .. COPTIC SMALL LETTER GANGIA (16#3ED#, 16#3ED#, -1), -- COPTIC SMALL LETTER SHIMA .. COPTIC SMALL LETTER SHIMA (16#3EF#, 16#3EF#, -1), -- COPTIC SMALL LETTER DEI .. COPTIC SMALL LETTER DEI (16#3F0#, 16#3F0#, -86), -- GREEK KAPPA SYMBOL .. GREEK KAPPA SYMBOL (16#3F1#, 16#3F1#, -80), -- GREEK RHO SYMBOL .. GREEK RHO SYMBOL (16#3F2#, 16#3F2#, -79), -- GREEK LUNATE SIGMA SYMBOL .. 
GREEK LUNATE SIGMA SYMBOL (16#3F5#, 16#3F5#, -96), -- GREEK LUNATE EPSILON SYMBOL .. GREEK LUNATE EPSILON SYMBOL (16#430#, 16#44F#, -32), -- CYRILLIC SMALL LETTER A .. CYRILLIC SMALL LETTER YA (16#450#, 16#45F#, -80), -- CYRILLIC SMALL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER DZHE (16#461#, 16#461#, -1), -- CYRILLIC SMALL LETTER OMEGA .. CYRILLIC SMALL LETTER OMEGA (16#463#, 16#463#, -1), -- CYRILLIC SMALL LETTER YAT .. CYRILLIC SMALL LETTER YAT (16#465#, 16#465#, -1), -- CYRILLIC SMALL LETTER IOTIFIED E .. CYRILLIC SMALL LETTER IOTIFIED E (16#467#, 16#467#, -1), -- CYRILLIC SMALL LETTER LITTLE YUS .. CYRILLIC SMALL LETTER LITTLE YUS (16#469#, 16#469#, -1), -- CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS .. CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS (16#46B#, 16#46B#, -1), -- CYRILLIC SMALL LETTER BIG YUS .. CYRILLIC SMALL LETTER BIG YUS (16#46D#, 16#46D#, -1), -- CYRILLIC SMALL LETTER IOTIFIED BIG YUS .. CYRILLIC SMALL LETTER IOTIFIED BIG YUS (16#46F#, 16#46F#, -1), -- CYRILLIC SMALL LETTER KSI .. CYRILLIC SMALL LETTER KSI (16#471#, 16#471#, -1), -- CYRILLIC SMALL LETTER PSI .. CYRILLIC SMALL LETTER PSI (16#473#, 16#473#, -1), -- CYRILLIC SMALL LETTER FITA .. CYRILLIC SMALL LETTER FITA (16#475#, 16#475#, -1), -- CYRILLIC SMALL LETTER IZHITSA .. CYRILLIC SMALL LETTER IZHITSA (16#477#, 16#477#, -1), -- CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT .. CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT (16#479#, 16#479#, -1), -- CYRILLIC SMALL LETTER UK .. CYRILLIC SMALL LETTER UK (16#47B#, 16#47B#, -1), -- CYRILLIC SMALL LETTER ROUND OMEGA .. CYRILLIC SMALL LETTER ROUND OMEGA (16#47D#, 16#47D#, -1), -- CYRILLIC SMALL LETTER OMEGA WITH TITLO .. CYRILLIC SMALL LETTER OMEGA WITH TITLO (16#47F#, 16#47F#, -1), -- CYRILLIC SMALL LETTER OT .. CYRILLIC SMALL LETTER OT (16#481#, 16#481#, -1), -- CYRILLIC SMALL LETTER KOPPA .. CYRILLIC SMALL LETTER KOPPA (16#48B#, 16#48B#, -1), -- CYRILLIC SMALL LETTER SHORT I WITH TAIL .. 
CYRILLIC SMALL LETTER SHORT I WITH TAIL (16#48D#, 16#48D#, -1), -- CYRILLIC SMALL LETTER SEMISOFT SIGN .. CYRILLIC SMALL LETTER SEMISOFT SIGN (16#48F#, 16#48F#, -1), -- CYRILLIC SMALL LETTER ER WITH TICK .. CYRILLIC SMALL LETTER ER WITH TICK (16#491#, 16#491#, -1), -- CYRILLIC SMALL LETTER GHE WITH UPTURN .. CYRILLIC SMALL LETTER GHE WITH UPTURN (16#493#, 16#493#, -1), -- CYRILLIC SMALL LETTER GHE WITH STROKE .. CYRILLIC SMALL LETTER GHE WITH STROKE (16#495#, 16#495#, -1), -- CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK (16#497#, 16#497#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DESCENDER .. CYRILLIC SMALL LETTER ZHE WITH DESCENDER (16#499#, 16#499#, -1), -- CYRILLIC SMALL LETTER ZE WITH DESCENDER .. CYRILLIC SMALL LETTER ZE WITH DESCENDER (16#49B#, 16#49B#, -1), -- CYRILLIC SMALL LETTER KA WITH DESCENDER .. CYRILLIC SMALL LETTER KA WITH DESCENDER (16#49D#, 16#49D#, -1), -- CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE (16#49F#, 16#49F#, -1), -- CYRILLIC SMALL LETTER KA WITH STROKE .. CYRILLIC SMALL LETTER KA WITH STROKE (16#4A1#, 16#4A1#, -1), -- CYRILLIC SMALL LETTER BASHKIR KA .. CYRILLIC SMALL LETTER BASHKIR KA (16#4A3#, 16#4A3#, -1), -- CYRILLIC SMALL LETTER EN WITH DESCENDER .. CYRILLIC SMALL LETTER EN WITH DESCENDER (16#4A5#, 16#4A5#, -1), -- CYRILLIC SMALL LIGATURE EN GHE .. CYRILLIC SMALL LIGATURE EN GHE (16#4A7#, 16#4A7#, -1), -- CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK (16#4A9#, 16#4A9#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN HA .. CYRILLIC SMALL LETTER ABKHASIAN HA (16#4AB#, 16#4AB#, -1), -- CYRILLIC SMALL LETTER ES WITH DESCENDER .. CYRILLIC SMALL LETTER ES WITH DESCENDER (16#4AD#, 16#4AD#, -1), -- CYRILLIC SMALL LETTER TE WITH DESCENDER .. CYRILLIC SMALL LETTER TE WITH DESCENDER (16#4AF#, 16#4AF#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U .. 
CYRILLIC SMALL LETTER STRAIGHT U (16#4B1#, 16#4B1#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE .. CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE (16#4B3#, 16#4B3#, -1), -- CYRILLIC SMALL LETTER HA WITH DESCENDER .. CYRILLIC SMALL LETTER HA WITH DESCENDER (16#4B5#, 16#4B5#, -1), -- CYRILLIC SMALL LIGATURE TE TSE .. CYRILLIC SMALL LIGATURE TE TSE (16#4B7#, 16#4B7#, -1), -- CYRILLIC SMALL LETTER CHE WITH DESCENDER .. CYRILLIC SMALL LETTER CHE WITH DESCENDER (16#4B9#, 16#4B9#, -1), -- CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE (16#4BB#, 16#4BB#, -1), -- CYRILLIC SMALL LETTER SHHA .. CYRILLIC SMALL LETTER SHHA (16#4BD#, 16#4BD#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE .. CYRILLIC SMALL LETTER ABKHASIAN CHE (16#4BF#, 16#4BF#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER .. CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER (16#4C2#, 16#4C2#, -1), -- CYRILLIC SMALL LETTER ZHE WITH BREVE .. CYRILLIC SMALL LETTER ZHE WITH BREVE (16#4C4#, 16#4C4#, -1), -- CYRILLIC SMALL LETTER KA WITH HOOK .. CYRILLIC SMALL LETTER KA WITH HOOK (16#4C6#, 16#4C6#, -1), -- CYRILLIC SMALL LETTER EL WITH TAIL .. CYRILLIC SMALL LETTER EL WITH TAIL (16#4C8#, 16#4C8#, -1), -- CYRILLIC SMALL LETTER EN WITH HOOK .. CYRILLIC SMALL LETTER EN WITH HOOK (16#4CA#, 16#4CA#, -1), -- CYRILLIC SMALL LETTER EN WITH TAIL .. CYRILLIC SMALL LETTER EN WITH TAIL (16#4CC#, 16#4CC#, -1), -- CYRILLIC SMALL LETTER KHAKASSIAN CHE .. CYRILLIC SMALL LETTER KHAKASSIAN CHE (16#4CE#, 16#4CE#, -1), -- CYRILLIC SMALL LETTER EM WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL (16#4D1#, 16#4D1#, -1), -- CYRILLIC SMALL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER A WITH BREVE (16#4D3#, 16#4D3#, -1), -- CYRILLIC SMALL LETTER A WITH DIAERESIS .. CYRILLIC SMALL LETTER A WITH DIAERESIS (16#4D5#, 16#4D5#, -1), -- CYRILLIC SMALL LIGATURE A IE .. CYRILLIC SMALL LIGATURE A IE (16#4D7#, 16#4D7#, -1), -- CYRILLIC SMALL LETTER IE WITH BREVE .. 
CYRILLIC SMALL LETTER IE WITH BREVE (16#4D9#, 16#4D9#, -1), -- CYRILLIC SMALL LETTER SCHWA .. CYRILLIC SMALL LETTER SCHWA (16#4DB#, 16#4DB#, -1), -- CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS .. CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS (16#4DD#, 16#4DD#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZHE WITH DIAERESIS (16#4DF#, 16#4DF#, -1), -- CYRILLIC SMALL LETTER ZE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZE WITH DIAERESIS (16#4E1#, 16#4E1#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN DZE .. CYRILLIC SMALL LETTER ABKHASIAN DZE (16#4E3#, 16#4E3#, -1), -- CYRILLIC SMALL LETTER I WITH MACRON .. CYRILLIC SMALL LETTER I WITH MACRON (16#4E5#, 16#4E5#, -1), -- CYRILLIC SMALL LETTER I WITH DIAERESIS .. CYRILLIC SMALL LETTER I WITH DIAERESIS (16#4E7#, 16#4E7#, -1), -- CYRILLIC SMALL LETTER O WITH DIAERESIS .. CYRILLIC SMALL LETTER O WITH DIAERESIS (16#4E9#, 16#4E9#, -1), -- CYRILLIC SMALL LETTER BARRED O .. CYRILLIC SMALL LETTER BARRED O (16#4EB#, 16#4EB#, -1), -- CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS .. CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS (16#4ED#, 16#4ED#, -1), -- CYRILLIC SMALL LETTER E WITH DIAERESIS .. CYRILLIC SMALL LETTER E WITH DIAERESIS (16#4EF#, 16#4EF#, -1), -- CYRILLIC SMALL LETTER U WITH MACRON .. CYRILLIC SMALL LETTER U WITH MACRON (16#4F1#, 16#4F1#, -1), -- CYRILLIC SMALL LETTER U WITH DIAERESIS .. CYRILLIC SMALL LETTER U WITH DIAERESIS (16#4F3#, 16#4F3#, -1), -- CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE .. CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE (16#4F5#, 16#4F5#, -1), -- CYRILLIC SMALL LETTER CHE WITH DIAERESIS .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS (16#4F9#, 16#4F9#, -1), -- CYRILLIC SMALL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS (16#501#, 16#501#, -1), -- CYRILLIC SMALL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI DE (16#503#, 16#503#, -1), -- CYRILLIC SMALL LETTER KOMI DJE .. CYRILLIC SMALL LETTER KOMI DJE (16#505#, 16#505#, -1), -- CYRILLIC SMALL LETTER KOMI ZJE .. 
CYRILLIC SMALL LETTER KOMI ZJE (16#507#, 16#507#, -1), -- CYRILLIC SMALL LETTER KOMI DZJE .. CYRILLIC SMALL LETTER KOMI DZJE (16#509#, 16#509#, -1), -- CYRILLIC SMALL LETTER KOMI LJE .. CYRILLIC SMALL LETTER KOMI LJE (16#50B#, 16#50B#, -1), -- CYRILLIC SMALL LETTER KOMI NJE .. CYRILLIC SMALL LETTER KOMI NJE (16#50D#, 16#50D#, -1), -- CYRILLIC SMALL LETTER KOMI SJE .. CYRILLIC SMALL LETTER KOMI SJE (16#50F#, 16#50F#, -1), -- CYRILLIC SMALL LETTER KOMI TJE .. CYRILLIC SMALL LETTER KOMI TJE (16#561#, 16#586#, -48), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LETTER FEH (16#1E01#, 16#1E01#, -1), -- LATIN SMALL LETTER A WITH RING BELOW .. LATIN SMALL LETTER A WITH RING BELOW (16#1E03#, 16#1E03#, -1), -- LATIN SMALL LETTER B WITH DOT ABOVE .. LATIN SMALL LETTER B WITH DOT ABOVE (16#1E05#, 16#1E05#, -1), -- LATIN SMALL LETTER B WITH DOT BELOW .. LATIN SMALL LETTER B WITH DOT BELOW (16#1E07#, 16#1E07#, -1), -- LATIN SMALL LETTER B WITH LINE BELOW .. LATIN SMALL LETTER B WITH LINE BELOW (16#1E09#, 16#1E09#, -1), -- LATIN SMALL LETTER C WITH CEDILLA AND ACUTE .. LATIN SMALL LETTER C WITH CEDILLA AND ACUTE (16#1E0B#, 16#1E0B#, -1), -- LATIN SMALL LETTER D WITH DOT ABOVE .. LATIN SMALL LETTER D WITH DOT ABOVE (16#1E0D#, 16#1E0D#, -1), -- LATIN SMALL LETTER D WITH DOT BELOW .. LATIN SMALL LETTER D WITH DOT BELOW (16#1E0F#, 16#1E0F#, -1), -- LATIN SMALL LETTER D WITH LINE BELOW .. LATIN SMALL LETTER D WITH LINE BELOW (16#1E11#, 16#1E11#, -1), -- LATIN SMALL LETTER D WITH CEDILLA .. LATIN SMALL LETTER D WITH CEDILLA (16#1E13#, 16#1E13#, -1), -- LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW (16#1E15#, 16#1E15#, -1), -- LATIN SMALL LETTER E WITH MACRON AND GRAVE .. LATIN SMALL LETTER E WITH MACRON AND GRAVE (16#1E17#, 16#1E17#, -1), -- LATIN SMALL LETTER E WITH MACRON AND ACUTE .. LATIN SMALL LETTER E WITH MACRON AND ACUTE (16#1E19#, 16#1E19#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW .. 
LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW (16#1E1B#, 16#1E1B#, -1), -- LATIN SMALL LETTER E WITH TILDE BELOW .. LATIN SMALL LETTER E WITH TILDE BELOW (16#1E1D#, 16#1E1D#, -1), -- LATIN SMALL LETTER E WITH CEDILLA AND BREVE .. LATIN SMALL LETTER E WITH CEDILLA AND BREVE (16#1E1F#, 16#1E1F#, -1), -- LATIN SMALL LETTER F WITH DOT ABOVE .. LATIN SMALL LETTER F WITH DOT ABOVE (16#1E21#, 16#1E21#, -1), -- LATIN SMALL LETTER G WITH MACRON .. LATIN SMALL LETTER G WITH MACRON (16#1E23#, 16#1E23#, -1), -- LATIN SMALL LETTER H WITH DOT ABOVE .. LATIN SMALL LETTER H WITH DOT ABOVE (16#1E25#, 16#1E25#, -1), -- LATIN SMALL LETTER H WITH DOT BELOW .. LATIN SMALL LETTER H WITH DOT BELOW (16#1E27#, 16#1E27#, -1), -- LATIN SMALL LETTER H WITH DIAERESIS .. LATIN SMALL LETTER H WITH DIAERESIS (16#1E29#, 16#1E29#, -1), -- LATIN SMALL LETTER H WITH CEDILLA .. LATIN SMALL LETTER H WITH CEDILLA (16#1E2B#, 16#1E2B#, -1), -- LATIN SMALL LETTER H WITH BREVE BELOW .. LATIN SMALL LETTER H WITH BREVE BELOW (16#1E2D#, 16#1E2D#, -1), -- LATIN SMALL LETTER I WITH TILDE BELOW .. LATIN SMALL LETTER I WITH TILDE BELOW (16#1E2F#, 16#1E2F#, -1), -- LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE (16#1E31#, 16#1E31#, -1), -- LATIN SMALL LETTER K WITH ACUTE .. LATIN SMALL LETTER K WITH ACUTE (16#1E33#, 16#1E33#, -1), -- LATIN SMALL LETTER K WITH DOT BELOW .. LATIN SMALL LETTER K WITH DOT BELOW (16#1E35#, 16#1E35#, -1), -- LATIN SMALL LETTER K WITH LINE BELOW .. LATIN SMALL LETTER K WITH LINE BELOW (16#1E37#, 16#1E37#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW .. LATIN SMALL LETTER L WITH DOT BELOW (16#1E39#, 16#1E39#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER L WITH DOT BELOW AND MACRON (16#1E3B#, 16#1E3B#, -1), -- LATIN SMALL LETTER L WITH LINE BELOW .. LATIN SMALL LETTER L WITH LINE BELOW (16#1E3D#, 16#1E3D#, -1), -- LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW .. 
LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW (16#1E3F#, 16#1E3F#, -1), -- LATIN SMALL LETTER M WITH ACUTE .. LATIN SMALL LETTER M WITH ACUTE (16#1E41#, 16#1E41#, -1), -- LATIN SMALL LETTER M WITH DOT ABOVE .. LATIN SMALL LETTER M WITH DOT ABOVE (16#1E43#, 16#1E43#, -1), -- LATIN SMALL LETTER M WITH DOT BELOW .. LATIN SMALL LETTER M WITH DOT BELOW (16#1E45#, 16#1E45#, -1), -- LATIN SMALL LETTER N WITH DOT ABOVE .. LATIN SMALL LETTER N WITH DOT ABOVE (16#1E47#, 16#1E47#, -1), -- LATIN SMALL LETTER N WITH DOT BELOW .. LATIN SMALL LETTER N WITH DOT BELOW (16#1E49#, 16#1E49#, -1), -- LATIN SMALL LETTER N WITH LINE BELOW .. LATIN SMALL LETTER N WITH LINE BELOW (16#1E4B#, 16#1E4B#, -1), -- LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW (16#1E4D#, 16#1E4D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND ACUTE .. LATIN SMALL LETTER O WITH TILDE AND ACUTE (16#1E4F#, 16#1E4F#, -1), -- LATIN SMALL LETTER O WITH TILDE AND DIAERESIS .. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS (16#1E51#, 16#1E51#, -1), -- LATIN SMALL LETTER O WITH MACRON AND GRAVE .. LATIN SMALL LETTER O WITH MACRON AND GRAVE (16#1E53#, 16#1E53#, -1), -- LATIN SMALL LETTER O WITH MACRON AND ACUTE .. LATIN SMALL LETTER O WITH MACRON AND ACUTE (16#1E55#, 16#1E55#, -1), -- LATIN SMALL LETTER P WITH ACUTE .. LATIN SMALL LETTER P WITH ACUTE (16#1E57#, 16#1E57#, -1), -- LATIN SMALL LETTER P WITH DOT ABOVE .. LATIN SMALL LETTER P WITH DOT ABOVE (16#1E59#, 16#1E59#, -1), -- LATIN SMALL LETTER R WITH DOT ABOVE .. LATIN SMALL LETTER R WITH DOT ABOVE (16#1E5B#, 16#1E5B#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW .. LATIN SMALL LETTER R WITH DOT BELOW (16#1E5D#, 16#1E5D#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER R WITH DOT BELOW AND MACRON (16#1E5F#, 16#1E5F#, -1), -- LATIN SMALL LETTER R WITH LINE BELOW .. LATIN SMALL LETTER R WITH LINE BELOW (16#1E61#, 16#1E61#, -1), -- LATIN SMALL LETTER S WITH DOT ABOVE .. 
LATIN SMALL LETTER S WITH DOT ABOVE (16#1E63#, 16#1E63#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW .. LATIN SMALL LETTER S WITH DOT BELOW (16#1E65#, 16#1E65#, -1), -- LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE .. LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE (16#1E67#, 16#1E67#, -1), -- LATIN SMALL LETTER S WITH CARON AND DOT ABOVE .. LATIN SMALL LETTER S WITH CARON AND DOT ABOVE (16#1E69#, 16#1E69#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE .. LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE (16#1E6B#, 16#1E6B#, -1), -- LATIN SMALL LETTER T WITH DOT ABOVE .. LATIN SMALL LETTER T WITH DOT ABOVE (16#1E6D#, 16#1E6D#, -1), -- LATIN SMALL LETTER T WITH DOT BELOW .. LATIN SMALL LETTER T WITH DOT BELOW (16#1E6F#, 16#1E6F#, -1), -- LATIN SMALL LETTER T WITH LINE BELOW .. LATIN SMALL LETTER T WITH LINE BELOW (16#1E71#, 16#1E71#, -1), -- LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW (16#1E73#, 16#1E73#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS BELOW .. LATIN SMALL LETTER U WITH DIAERESIS BELOW (16#1E75#, 16#1E75#, -1), -- LATIN SMALL LETTER U WITH TILDE BELOW .. LATIN SMALL LETTER U WITH TILDE BELOW (16#1E77#, 16#1E77#, -1), -- LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW (16#1E79#, 16#1E79#, -1), -- LATIN SMALL LETTER U WITH TILDE AND ACUTE .. LATIN SMALL LETTER U WITH TILDE AND ACUTE (16#1E7B#, 16#1E7B#, -1), -- LATIN SMALL LETTER U WITH MACRON AND DIAERESIS .. LATIN SMALL LETTER U WITH MACRON AND DIAERESIS (16#1E7D#, 16#1E7D#, -1), -- LATIN SMALL LETTER V WITH TILDE .. LATIN SMALL LETTER V WITH TILDE (16#1E7F#, 16#1E7F#, -1), -- LATIN SMALL LETTER V WITH DOT BELOW .. LATIN SMALL LETTER V WITH DOT BELOW (16#1E81#, 16#1E81#, -1), -- LATIN SMALL LETTER W WITH GRAVE .. LATIN SMALL LETTER W WITH GRAVE (16#1E83#, 16#1E83#, -1), -- LATIN SMALL LETTER W WITH ACUTE .. LATIN SMALL LETTER W WITH ACUTE (16#1E85#, 16#1E85#, -1), -- LATIN SMALL LETTER W WITH DIAERESIS .. 
LATIN SMALL LETTER W WITH DIAERESIS (16#1E87#, 16#1E87#, -1), -- LATIN SMALL LETTER W WITH DOT ABOVE .. LATIN SMALL LETTER W WITH DOT ABOVE (16#1E89#, 16#1E89#, -1), -- LATIN SMALL LETTER W WITH DOT BELOW .. LATIN SMALL LETTER W WITH DOT BELOW (16#1E8B#, 16#1E8B#, -1), -- LATIN SMALL LETTER X WITH DOT ABOVE .. LATIN SMALL LETTER X WITH DOT ABOVE (16#1E8D#, 16#1E8D#, -1), -- LATIN SMALL LETTER X WITH DIAERESIS .. LATIN SMALL LETTER X WITH DIAERESIS (16#1E8F#, 16#1E8F#, -1), -- LATIN SMALL LETTER Y WITH DOT ABOVE .. LATIN SMALL LETTER Y WITH DOT ABOVE (16#1E91#, 16#1E91#, -1), -- LATIN SMALL LETTER Z WITH CIRCUMFLEX .. LATIN SMALL LETTER Z WITH CIRCUMFLEX (16#1E93#, 16#1E93#, -1), -- LATIN SMALL LETTER Z WITH DOT BELOW .. LATIN SMALL LETTER Z WITH DOT BELOW (16#1E95#, 16#1E95#, -1), -- LATIN SMALL LETTER Z WITH LINE BELOW .. LATIN SMALL LETTER Z WITH LINE BELOW (16#1E9B#, 16#1E9B#, -59), -- LATIN SMALL LETTER LONG S WITH DOT ABOVE .. LATIN SMALL LETTER LONG S WITH DOT ABOVE (16#1EA1#, 16#1EA1#, -1), -- LATIN SMALL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER A WITH DOT BELOW (16#1EA3#, 16#1EA3#, -1), -- LATIN SMALL LETTER A WITH HOOK ABOVE .. LATIN SMALL LETTER A WITH HOOK ABOVE (16#1EA5#, 16#1EA5#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE (16#1EA7#, 16#1EA7#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE (16#1EA9#, 16#1EA9#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE (16#1EAB#, 16#1EAB#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE (16#1EAD#, 16#1EAD#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW (16#1EAF#, 16#1EAF#, -1), -- LATIN SMALL LETTER A WITH BREVE AND ACUTE .. 
LATIN SMALL LETTER A WITH BREVE AND ACUTE (16#1EB1#, 16#1EB1#, -1), -- LATIN SMALL LETTER A WITH BREVE AND GRAVE .. LATIN SMALL LETTER A WITH BREVE AND GRAVE (16#1EB3#, 16#1EB3#, -1), -- LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE .. LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE (16#1EB5#, 16#1EB5#, -1), -- LATIN SMALL LETTER A WITH BREVE AND TILDE .. LATIN SMALL LETTER A WITH BREVE AND TILDE (16#1EB7#, 16#1EB7#, -1), -- LATIN SMALL LETTER A WITH BREVE AND DOT BELOW .. LATIN SMALL LETTER A WITH BREVE AND DOT BELOW (16#1EB9#, 16#1EB9#, -1), -- LATIN SMALL LETTER E WITH DOT BELOW .. LATIN SMALL LETTER E WITH DOT BELOW (16#1EBB#, 16#1EBB#, -1), -- LATIN SMALL LETTER E WITH HOOK ABOVE .. LATIN SMALL LETTER E WITH HOOK ABOVE (16#1EBD#, 16#1EBD#, -1), -- LATIN SMALL LETTER E WITH TILDE .. LATIN SMALL LETTER E WITH TILDE (16#1EBF#, 16#1EBF#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (16#1EC1#, 16#1EC1#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE (16#1EC3#, 16#1EC3#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE (16#1EC5#, 16#1EC5#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE (16#1EC7#, 16#1EC7#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW (16#1EC9#, 16#1EC9#, -1), -- LATIN SMALL LETTER I WITH HOOK ABOVE .. LATIN SMALL LETTER I WITH HOOK ABOVE (16#1ECB#, 16#1ECB#, -1), -- LATIN SMALL LETTER I WITH DOT BELOW .. LATIN SMALL LETTER I WITH DOT BELOW (16#1ECD#, 16#1ECD#, -1), -- LATIN SMALL LETTER O WITH DOT BELOW .. LATIN SMALL LETTER O WITH DOT BELOW (16#1ECF#, 16#1ECF#, -1), -- LATIN SMALL LETTER O WITH HOOK ABOVE .. LATIN SMALL LETTER O WITH HOOK ABOVE (16#1ED1#, 16#1ED1#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE .. 
LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE (16#1ED3#, 16#1ED3#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE (16#1ED5#, 16#1ED5#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE (16#1ED7#, 16#1ED7#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE (16#1ED9#, 16#1ED9#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW (16#1EDB#, 16#1EDB#, -1), -- LATIN SMALL LETTER O WITH HORN AND ACUTE .. LATIN SMALL LETTER O WITH HORN AND ACUTE (16#1EDD#, 16#1EDD#, -1), -- LATIN SMALL LETTER O WITH HORN AND GRAVE .. LATIN SMALL LETTER O WITH HORN AND GRAVE (16#1EDF#, 16#1EDF#, -1), -- LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE (16#1EE1#, 16#1EE1#, -1), -- LATIN SMALL LETTER O WITH HORN AND TILDE .. LATIN SMALL LETTER O WITH HORN AND TILDE (16#1EE3#, 16#1EE3#, -1), -- LATIN SMALL LETTER O WITH HORN AND DOT BELOW .. LATIN SMALL LETTER O WITH HORN AND DOT BELOW (16#1EE5#, 16#1EE5#, -1), -- LATIN SMALL LETTER U WITH DOT BELOW .. LATIN SMALL LETTER U WITH DOT BELOW (16#1EE7#, 16#1EE7#, -1), -- LATIN SMALL LETTER U WITH HOOK ABOVE .. LATIN SMALL LETTER U WITH HOOK ABOVE (16#1EE9#, 16#1EE9#, -1), -- LATIN SMALL LETTER U WITH HORN AND ACUTE .. LATIN SMALL LETTER U WITH HORN AND ACUTE (16#1EEB#, 16#1EEB#, -1), -- LATIN SMALL LETTER U WITH HORN AND GRAVE .. LATIN SMALL LETTER U WITH HORN AND GRAVE (16#1EED#, 16#1EED#, -1), -- LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE (16#1EEF#, 16#1EEF#, -1), -- LATIN SMALL LETTER U WITH HORN AND TILDE .. LATIN SMALL LETTER U WITH HORN AND TILDE (16#1EF1#, 16#1EF1#, -1), -- LATIN SMALL LETTER U WITH HORN AND DOT BELOW .. 
LATIN SMALL LETTER U WITH HORN AND DOT BELOW (16#1EF3#, 16#1EF3#, -1), -- LATIN SMALL LETTER Y WITH GRAVE .. LATIN SMALL LETTER Y WITH GRAVE (16#1EF5#, 16#1EF5#, -1), -- LATIN SMALL LETTER Y WITH DOT BELOW .. LATIN SMALL LETTER Y WITH DOT BELOW (16#1EF7#, 16#1EF7#, -1), -- LATIN SMALL LETTER Y WITH HOOK ABOVE .. LATIN SMALL LETTER Y WITH HOOK ABOVE (16#1EF9#, 16#1EF9#, -1), -- LATIN SMALL LETTER Y WITH TILDE .. LATIN SMALL LETTER Y WITH TILDE (16#1F00#, 16#1F07#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI (16#1F10#, 16#1F15#, 8), -- GREEK SMALL LETTER EPSILON WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA (16#1F20#, 16#1F27#, 8), -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI (16#1F30#, 16#1F37#, 8), -- GREEK SMALL LETTER IOTA WITH PSILI .. GREEK SMALL LETTER IOTA WITH DASIA AND PERISPOMENI (16#1F40#, 16#1F45#, 8), -- GREEK SMALL LETTER OMICRON WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA (16#1F51#, 16#1F51#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA .. GREEK SMALL LETTER UPSILON WITH DASIA (16#1F53#, 16#1F53#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA (16#1F55#, 16#1F55#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA (16#1F57#, 16#1F57#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI (16#1F60#, 16#1F67#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI (16#1F70#, 16#1F71#, 74), -- GREEK SMALL LETTER ALPHA WITH VARIA .. GREEK SMALL LETTER ALPHA WITH OXIA (16#1F72#, 16#1F75#, 86), -- GREEK SMALL LETTER EPSILON WITH VARIA .. GREEK SMALL LETTER ETA WITH OXIA (16#1F76#, 16#1F77#, 100), -- GREEK SMALL LETTER IOTA WITH VARIA .. 
GREEK SMALL LETTER IOTA WITH OXIA (16#1F78#, 16#1F79#, 128), -- GREEK SMALL LETTER OMICRON WITH VARIA .. GREEK SMALL LETTER OMICRON WITH OXIA (16#1F7A#, 16#1F7B#, 112), -- GREEK SMALL LETTER UPSILON WITH VARIA .. GREEK SMALL LETTER UPSILON WITH OXIA (16#1F7C#, 16#1F7D#, 126), -- GREEK SMALL LETTER OMEGA WITH VARIA .. GREEK SMALL LETTER OMEGA WITH OXIA (16#1F80#, 16#1F87#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI (16#1F90#, 16#1F97#, 8), -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI (16#1FA0#, 16#1FA7#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI (16#1FB0#, 16#1FB1#, 8), -- GREEK SMALL LETTER ALPHA WITH VRACHY .. GREEK SMALL LETTER ALPHA WITH MACRON (16#1FB3#, 16#1FB3#, 9), -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI (16#1FBE#, 16#1FBE#, -7205), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI (16#1FC3#, 16#1FC3#, 9), -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI (16#1FD0#, 16#1FD1#, 8), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH MACRON (16#1FE0#, 16#1FE1#, 8), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK SMALL LETTER UPSILON WITH MACRON (16#1FE5#, 16#1FE5#, 7), -- GREEK SMALL LETTER RHO WITH DASIA .. GREEK SMALL LETTER RHO WITH DASIA (16#1FF3#, 16#1FF3#, 9), -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI (16#FF41#, 16#FF5A#, -32), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z (16#10428#, 16#1044D#, -40) -- DESERET SMALL LETTER LONG I .. DESERET SMALL LETTER ENG ); ************************************************************* From: Randy Brukardt Sent: Wednesday, November 27, 2002 11:01 AM Thanks for doing this. 
Where are you finding the information that you are using to do this? A quick search of the net didn't turn up anything machine-readable... ************************************************************* From: Michael F. Yoder Sent: Wednesday, November 27, 2002 12:20 PM The root link is www.unicode.org and the "latest version" link goes to http://www.unicode.org/unicode/reports/tr28/ The "this version" link at the top goes to a page with some relevant stuff. The page with the machine-readable files for V3.2 is: http://www.unicode.org/Public/UNIDATA/ . The current organization seems to be harder to navigate than it used to be; I'm unsure why. N.B. version 3.2 of Unicode claims to be "fully synchronized" with ISO 10646, so it is strongly preferable to earlier versions. ************************************************************* From: Robert I. Eachus Sent: Tuesday, March 18, 2003 12:56 AM I hate to reopen the character set can of worms, but I think we need to do it. In effect Latin 1 is being replaced by Latin 9 (ISO 8859-15). Latin 9 adds the Euro sign, OE ligatures, and S and Z with caron, and capital Y with diaeresis to Latin 1, removing the currency symbol, broken bar, some accents and the vulgar fractions. See http://www.cs.tut.fi/~jkorpela/latin9.html for a fuller explanation. Latin 9 is slowly being adopted. Of course some countries in the Euro zone are already using a "localized" version of Latin 1 with the currency sign representation looking suspiciously like a Euro symbol. So we could decide to leave this issue to Ada-1Z or whatever. However, I think that at the least we should add a Latin9 package to Ada with the correct character names. What else should or could be done? One possibility would be to redefine Ada.Characters.Handling to correctly treat seven new codes as lower or upper case characters. I would much prefer to go for the whole nine yards so we never need to do this again. 
Add an enumeration type Sets to Ada.Characters, or if you prefer Character_Sets. It should enumerate all the ISO 8859 character sets. (If you want to be clever, we could start with ISO646 so that Sets'Pos(N) = ISO 8859-N.) In any case we should allow implementations to extend the type. This would allow both for new ISO 8859 character sets, and for Unicode, EBCDIC, IBM code pages, and so on. Now add procedure Set_Default_Character_Set, and function Current_Character_Set to Ada.Characters.Handling. (Or if you prefer to Ada.Characters.) As far as I am concerned the only required behavior for Set_Default_Character_Set should be to accept an argument of Latin_1. It is probably a day or two of work to modify the functions in Ada.Characters.Handling to support all the current ISO 8859 mappings. It is at least ten times harder to actually test all possible combinations of character set and Ada.Characters.Handling functions. It could be another five to ten times that to add tests to the validation suite, with very little practical effect. This is why I favor a minimalist approach to the requirements. (National bodies can of course require supporting other values for Character_Sets. For example, the Japanese national body could require Shift-JIS support if they felt like it, without requiring that compilers that comply with the Japanese national standard be incompatible with ISO 8652, and without the ARG spending all of its time on character set issues.) What about names in Ada programs? ARGH! If your compiler is written in Ada and uses Ada.Characters.Handling, modifying the compiler is not a problem. Defining what it means to compile a program written using a non-Latin-1 character set threatens to expand clause 2 (Lexical Elements) to the size of a small telephone directory. 
I would prefer to just modify 2.1 to direct people to ISO 10646-1, which is the size of a large telephone directory, plus currently five amendments, for the meaning of lexical elements in non-Latin_1 source representations, and let national bodies decide what they want to define locally. ************************************************************* From: Pascal Leroy Sent: Tuesday, March 18, 2003 2:15 AM > I hate to reopen the character set can of worms, but I think we need to > do it. In effect Latin 1 is being replaced by Latin 9 (ISO 8859-15). > Latin 9 adds the Euro sign, OE ligatures, and S and Z with caron, and > capital Y with diaeresis to Latin 1, removing the currency symbol, broken > bar, some accents and the vulgar fractions. See > http://www.cs.tut.fi/~jkorpela/latin9.html for a fuller explanation. This issue was discussed at some length as part of AI 285/01 (of which I am the editor). It is clear that adding support for Latin-9 in Ada.Characters (and children) is relatively straightforward. However there is the much nastier question of type Standard.Character, (which has pretty much to remain Latin-1 if you don't want to introduce awful incompatibilities) and of the interactions between what happens at compile-time and what happens at run-time. Consider for instance the call: Ada.Characters.Latin_9.Handling.Is_Letter ('¦') It has pretty much to return True (that's an S-caron in Latin-9), but that's certainly surprising! This amounts to breaking the Character abstraction and interpreting characters as bytes/code points, which is likely to lead to confusion in an Ada program that would deal with character sets having different encodings. Another interesting example is mentioned in the minutes of the Bedford meeting (http://www.ada-auth.org/ai-files/minutes/min-0210.html#AI285): "Consider the enumeration identifier "ÿ" (latin small letter y diaeresis). 
E'Image(ÿ) = "ÿ" in Latin-1 (there is no upper case version), but "Ÿ" in Latin-9 (there is an upper case version). So we would need the identifier semantics to be changed depending on the character set. Pascal claims that this is important to reading French." After giving it more thought, I have come to the conclusion that the entire Latin-9 approach is misguided because: 1 - There is relatively little support in software out there for this encoding (heck, I am even reading that some mail gateways bounce back messages that use Latin-9 as their character encoding). Most of the editors that I have played with just go to Unicode when you type the Euro sign. That provides support for this new character without causing endless compatibility nightmares. 2 - I have gone through a similar "code point shuffle" mess at the beginning of the 80s: at the time we only had 7 bits per character (as you probably remember, the 8th bit was often used for parity) and some genius had invented to encode the French accented characters using the code points normally assigned to [, ], \, and the like. I have written thousands of lines of Pascal where an array indexing looked like ArrçIè (instead of Arr[I]) just because of this silliness. What was painful-but-tolerable 20 years ago is just not going to fly nowadays: I am ready to bet that the world will go Unicode before it goes Latin-9. Therefore, the latest version of AI 285 proposes to go to Unicode for the text representation of programs, relying on the categorization work done by the Unicode people so that we don't have to argue endlessly about which characters can appear in identifiers, etc. And it entirely ignores Latin-9, or any other Latin-N for that matter. ************************************************************* From: Robert I. Eachus Sent: Tuesday, March 18, 2003 2:15 AM Pascal Leroy wrote: > This issue was discussed at some length as part of AI 285/01 (of which I am > the editor). 
It is clear that adding support for Latin-9 in Ada.Characters > (and children) is relatively straightforward. However there is the much > nastier question of type Standard.Character, (which has pretty much to > remain Latin-1 if you don't want to introduce awful incompatibilities) and > of the interactions between what happens at compile-time and what happens at > run-time. I thought we had an AI on the subject, but searching for Latin in the title didn't find it. I see what happened is that the name of the AI was changed. (I don't want to make work for Randy, and this may be a rare occurrence or may not be. Perhaps a set of links to "old" names somewhere.) So I guess that the title of the original post is correct, because as I see it, the issue of Latin 9 support is completely separate from the issues with 16 and 32 bit character sets. Now to pull some magic by quoting from rev 1.4 of AI 285: An implementation is allowed to provide a library package named Ada.Characters.Latin_9. This package shall be identical to Ada.Characters.Latin_1, except for the following differences: - It doesn't declare the constants Currency_Sign, Broken_Bar, Diaeresis, Acute, Cedilla, Fraction_One_Quarter, Fraction_One_Half, and Fraction_Three_Quarters. - It declares the following constants: Euro_Sign : constant Character := '€'; -- Character'Val (164) UC_S_Caron : constant Character := 'Š'; -- Character'Val (166) LC_S_Caron : constant Character := 'š'; -- Character'Val (168) UC_Z_Caron : constant Character := 'Ž'; -- Character'Val (180) LC_Z_Caron : constant Character := 'ž'; -- Character'Val (184) UC_OE_Diphthong : constant Character := 'Œ'; -- Character'Val (188) LC_OE_Diphthong : constant Character := 'œ'; -- Character'Val (189) UC_Y_Diaeresis : constant Character := 'Ÿ'; -- Character'Val (190) In Netscape 7.01, with the encoding set to Latin-1, this displays (correctly) the Latin 9 representations! As does OpenOffice.org, Notepad and so on. 
Now let me abstract from Ada.Characters.Latin_1: Currency_Sign : constant Character := '¤'; --Character'Val(164) Broken_Bar : constant Character := '¦'; --Character'Val(166) Diaeresis : constant Character := '¨'; --Character'Val(168) Acute : constant Character := '´'; --Character'Val(180) Cedilla : constant Character := '¸'; --Character'Val(184) Fraction_One_Quarter : constant Character := '¼'; --Character'Val(188) Fraction_One_Half : constant Character := '½'; --Character'Val(189) Fraction_Three_Quarters : constant Character := '¾'; --Character'Val(190) How can this work? Easy, other standards, in particular ISO/IEC 2022, http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=22747 specify control characters and escape sequences which can be used with Latin 1--or any ISO 8859 character set--to access characters from other sets. > Consider for instance the call: > > Ada.Characters.Latin_9.Handling.Is_Letter ('¦') > > It has pretty much to return True (that's an S-caron in Latin-9), but that's > certainly surprising! This amounts to breaking the Character abstraction > and interpreting characters as bytes/code points, which is likely to lead to > confusion in an Ada program that would deal with character sets having > different encodings. Why would you expect this call to work? You could argue that a compiler "should" raise Program_Error or Constraint_Error, but I would expect any reasonable compiler to object at compile time to an invalid character literal. Remember, notationally Ada is written in Unicode/ISO 10646 BMP, however it is represented. In context, that call is illegal and Ada.Characters.Latin_9.Handling.Is_Letter ('Š') is legal and should return true. (Assuming we recommend having or allowing a package Ada.Characters.Latin_9.Handling.) But this discussion has done a lot to convince me that the best solution is to add a function Ada.Characters.Current_Set to Ada.Characters. 
The required work is trivial for compilers that want to stay in the Latin 1 only world, and compilers that do want to implement support for other 8-bit character sets really have to do most of the same work anyway. To repeat my proposal: Add an enumeration type Sets to Ada.Characters, or if you prefer Character_Sets. It should enumerate all the ISO 8859 character sets. (If you want to be clever, we could start with ISO646 so that Sets'Pos(N) = ISO 8859-N.) In any case we should allow implementations to extend the type. This would allow both for new ISO 8859 character sets, and for Unicode, EBCDIC, IBM code pages, and so on. Now add procedure Set_Default_Character_Set, and function Current_Character_Set to Ada.Characters.Handling. (Or if you prefer to Ada.Characters.) As far as I am concerned the only required behavior for Set_Default_Character_Set should be to accept an argument of Latin_1. We should probably also add a library pragma to change the default mapping of Character. (Compilers will probably accept command line setting of character mappings, but I think that a standard pragma would help standardization.) If my proposal is accepted: Ada.Characters.Handling.Is_Letter ('¦') should return false when Ada.Characters.Current_Set is Latin_1, and Ada.Characters.Handling.Is_Letter ('Š') should return true when Ada.Characters.Current_Set is Latin_9. The behavior of Ada.Characters.Handling.Is_Letter (Character'Val(166)) should depend on the current value of Ada.Characters.Current_Set. What happens in the other cases will at best be implementation defined. In other words, if your program contains a (Unicode/BMP or UTF8) character literal that is not in a supported character set, I expect Program_Error; if a character in a literal is not a legal literal for Character, it is an error, just like other misspellings of literals. 
> Another interesting example is mentioned in the minutes of the Bedford > meeting (http://www.ada-auth.org/ai-files/minutes/min-0210.html#AI285): > "Consider the enumeration identifier "ÿ" (latin small letter y diaeresis). > E'Image(ÿ) = "ÿ" in Latin-1 (there is no upper case version), but "Ÿ" in > Latin-9 (there is an upper case version). So we would need the identifier > semantics to be changed depending on the character set. Pascal claims that > this is important to reading French." Exactly why I think a way is needed for the programmer to be able to determine what the actual character set mapping is. Almost no burden for compilers that support Latin 1 only, and not that much additional for compilers that do support other 8-bit mappings. (Actually, I may be wrong but I think all currently validated compilers accept source in non-Latin 1 character sets.) > After giving it more thought, I have come to the conclusion that the entire > Latin-9 approach is misguided because: > > 1 - There is relatively little support in software out there for this > encoding (heck, I am even reading that some mail gateways bounce back > messages that use Latin-9 as their character encoding). Most of the editors > that I have played with just go to Unicode when you type the Euro sign. > That provides support for this new character without causing endless > compatibility nightmares. > > 2 - I have gone through a similar "code point shuffle" mess at the > beginning of the 80s: at the time we only had 7 bits per character (as you > probably remember, the 8th bit was often used for parity) and some genius > had invented to encode the French accented characters using the code points > normally assigned to [, ], \, and the like. I have written thousands of > lines of Pascal where an array indexing looked like ArrçIè (instead of > Arr[I]) just because of this silliness. 
What was painful-but-tolerable 20 > years ago is just not going to fly nowadays: I am ready to bet that the > world will go Unicode before it goes Latin-9. > > Therefore, the latest version of AI 285 proposes to go to Unicode for the > text representation of programs, relying on the categorization work done by > the Unicode people so that we don't have to argue endlessly about which > characters can appear in identifiers, etc. And it entirely ignores Latin-9, > or any other Latin-N for that matter. Couldn't agree more. The right solution is not to switch from Latin 1 to any other character set as a standard, but to supply a standard method for localization, and keep with the assumption of current Unicode/BMP for Wide_Character and for (notational) source. Does any implementor see a problem implementing the above recommendation? We could also go to the extreme of adding another optional annex dealing with character representation issues, but I think we all agree that the ARG should stay away from piecemeal character set bindings. On the other hand, I can see having a standard Wide_Character categorization, and allowing other characterizations to fall out from that. But let's keep that discussion in AI-285. ************************************************************* From: Randy Brukardt Sent: Tuesday, March 18, 2003 6:04 PM > (Actually, I may be > wrong but I think all currently validated compilers accept source in > non-Latin 1 character sets.) Since the only currently validated compilers are from Rational and DDC-I, that isn't saying much at all. You have to at least talk about widely-used compilers, but then you get into definitional problems. ************************************************************* From: Robert I. Eachus Sent: Tuesday, March 18, 2003 6:51 PM We are in the standards business. I think that this is an area where a small extension to the standard will be very helpful in providing portability. 
But we can't really worry about the cost of conformity for nonstandardized compilers. ;-) That is why I think that a definition which names the various character sets should be standardized: type Character_Sets is (ISO_646, Latin_1, Latin_2,...Latin_Greek...); This would help standardize the way that non-Latin 1 character sets are named for compatibility. But I think we should stay out of the business of defining which characters are which for Latin_Greek, etc. That is ISO/IEC JTC1/SC2's job, and I think they do it pretty well. Now suppose my proposal is accepted and, say, GNAT chooses to support the function Ada.Characters.Current_Set in a useful manner. However, ACT sees no demand for Ada.Characters.Set_Default_Character_Set to do anything useful, and therefore raises an exception if you try to change the value. (In other words Ada.Characters.Set_Default_Character_Set (Ada.Characters.Current_Set) does not raise an exception, but actually trying to change the value does.) Some other vendor may have a customer who requires Latin_Hebrew support, but could care less about Latin 9. Fine. Assigning Ada names to the various 8859 character sets is in our area of competence. Deciding which sets compiler vendors support should be left up to their customers. Is this useful progress towards standardization? Sure. Is arguing over whether there is demand for Linear_B support way out of the way of anything that the ARG wants to get involved in? Obviously. Or worse, whether a variable named with the Greek Alpha, should match a Latin A? Arggh! (If you think that is bad what about CJK unification? Do we want to get into political cat fights about whether or not a Japanese Kanji code point matches a (Korean) Hangul character with a different appearance? Please! Anything but that...) 
That is why I think we should be in the business of defining how to change character sets, but should stay well out of the politics of whether, say, compilers purchased by the Canadian government must support Latin 9. ************************************************************* From: Pascal Leroy Sent: Wednesday, March 19, 2003 3:54 AM > In Netscape 7.01, with the encoding set to Latin-1, this displays > (correctly) the Latin 9 representations! As does OpenOffice.org, > Notepad and so on. Now let me abstract from the Ada.Characters.Latin 1 In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as soon as you type a non-Latin-1 character. So I am not sure what your point is. (Didn't check the other software packages that you mention.) > Add an enumeration type Sets to Ada Characters, or if you prefer > Character_Sets. It should enumerate all the ISO 8859 character sets. > ... > Now add procedure Set_Default_Character_Set, and function > Current_Character_Set to Ada.Characters.Handling. > ... > We should probably also add a library pragma to change the default > mapping of Character. (Compilers will probably accept command line > setting of character mappings, but I think that a standard pragma would > help standardization.) I understand the usefulness of a pragma, but I don't really understand what sense it makes to change the default character set (whatever that is) at run-time. Consider the case where you compile a program in Latin-9 mode, and it has an enumeration literal with an S-caron in it. Then at run-time you switch to Latin-1. Would the 'Image attribute now return a string including a broken bar? That would be very strange. I can imagine why a program might want to juggle with different character encodings (by withing different Latin_N units) but it seems to me that the default character set has to be fixed at compilation time. 
Anyway none of this changes my opinion that the Latin-N sets are far too unimportant to spend precious ARG time on them. > Or worse, > whether a variable named with the Greek Alpha, should match a Latin A? > Arggh! (If you think that is bad what about CJK unification? Do we want > to get into political cat fights about whether or not a Japanese Kanji > code point matches a (Korean) Hangul character with a different > appearance? Please! Anything but that...) As a matter of fact, the current AI 285 does exactly that, and I don't see this as a political cat fight. The idea is to just follow what the Unicode folks are doing (and I suppose _they_ do quite a bit of political cat fight). So to answer your questions, a Latin A is not the same thing as a Greek Alpha or a Cyrillic A. And at this point the kanjis and hanguls are not letters, so they are not allowed in identifiers. When the Unicode people decide that ideograms are letters, we will update the definition in Ada. ************************************************************* From: Jean-Pierre Rosen Sent: Wednesday, March 19, 2003 4:19 AM > I understand the usefulness of a pragma, but I don't really understand > what sense it makes to change the default character set (whatever that > is) at run-time. Consider the case where you compile a program in > Latin-9 mode, and it has an enumeration literal with an S-caron in it. > Then at run-time you switch to Latin-1. Would the 'Image attribute now > return a string including a broken bar? That would be very strange. > And if you go that way, you may want different tasks to use different encodings.... Did I hear "can of worms" ? ************************************************************* From: Robert I. Eachus Sent: Wednesday, March 19, 2003 4:43 PM First, let me get this out of the way. I really like UTF-8, and for that matter UTF-16. I would also love to put real Unicode/BMP support into Chapter (Clause) 2 and elsewhere in the RM. 
I would like to see a (standard) Wide_Text_IO that supported UTF-1. But it is a lot of work. However, even if users do eventually migrate toward 16-bit and 32-bit character standards, we currently have an 8-bit character type in the standard. My reasons behind arguing for a minimal AI in this area is that I think that it would "clear the decks" forever in the 8-bit area, and let us concentrate on enhancing 16-bit support in the future. Pascal Leroy wrote: > In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as > soon as you type a non-Latin-1 character. So I am not sure what your > point is. (Didn't check the other software packages that you mention.) I guess you missed the point. Windows actually uses a superset of Latin 1 that contains all the Latin 9 characters with different code-points. Windows also has IANA-registered extended versions of some other Latin sets. (These are Windows-1291 et. seq.) See the MIME and HTML standards for more details. Notepad and other applications may switch to Unicode internally when you enter non-Latin 1 (or non-Windows 1291) characters. But if you cut-and-paste into a text document from one with a different mapping, most PC software seems to use ISO 2022 control characters to avoid having to reprocess the entire document. This can be done as long as you use at most three ISO 8859 (or Windows) font variants. > I understand the usefulness of a pragma, but I don't really understand > what sense it makes to change the default character set (whatever that > is) at run-time. > I can imagine why a program might want to juggle with different > character encodings (by withing different Latin_N units) but it seems to > me that the default character set has to be fixed at compilation time. You may be right which is why I gave that hypothetical GNAT example. 
I think it would be almost trivial for them to support a current character set enquiry function, but a procedure to change the character set at run-time might take a lot more work. Where you would want to be able to change the default character set at run-time would be for things like Character to UTF-8 encoders and decoders. > Consider the case where you compile a program in Latin-9 mode, and > it has an enumeration literal with an S-caron in it. Then at > run-time you switch to Latin-1. Would the 'Image attribute now > return a string including a broken bar? That would be very strange. Why? The character or string literal gets translated from Latin 9 to Character at compile time. Then you conceptually remap all Character and String values when you change the default character set at run-time. If you convert the literal from Latin 9 to UTF-8 or Unicode at compile time, then try to convert back with a default character set of Latin 1, you can and should expect a Constraint_Error. > Anyway none of this changes my opinion that the Latin-N sets are far too > unimportant to spend precious ARG time on them. In one sense, as I said I agree. But I think that since we do have compilers around that support remapping of Character, a standard way of querying that setting is needed for standardization. As I indicated, I can easily be convinced that a way of setting the default mapping at run-time is a bit too much. Certainly though, the same issues will come up with respect to Wide_Character if and when compilers support different Wide_Character mappings. In the Wide_Character case determining at run-time what the actual mapping is may be important, but I certainly agree that requiring support for changing the Wide_Character mapping at run-time (say from Shift-JIS to Unicode) would be extreme. Remember that all that my current proposal requires is that changing from Latin 1 to Latin 1 succeed. I agree that anything else should be left outside the scope of the (ISO) standard. 
I have no trouble with leaving the procedure to change the default character set out altogether, or making it optional. > As a matter of fact, the current AI 285 does exactly that, and I don't > see this as a political cat fight. The idea is to just follow what the > Unicode folks are doing (and I suppose _they_ do quite a bit of > political cat fight). So to answer your questions, a Latin A is not the > same thing as a Greek Alpha or a Cyrillic A. And at this point the > kanjis and hanguls are not letters, so they are not allowed in > identifiers. When the Unicode people decide that ideograms are letters, > we will update the definition in Ada. Exactly my point, except that I think we officially follow ISO 10646 not Unicode. So in theory we should update to Unicode 3.2 compatibility when DIS 10646(2003) is accepted. (Those battles come closer to vendettas than cat fights. The major battles are Japanese vs. Korean, Chinese vs. Japanese, Russian vs. Georgian, Greeks vs. Macedonians, and francophones vs. everybody. Did I miss anyone?) If any other ARG--or CRG--members really care about all this, you too can join the madness in Prague next week. (http://www.unicode.org/iuc/iuc23/ ;-) ************************************************************* From: Randy Brukardt Sent: Wednesday, March 19, 2003 7:25 PM > Pascal Leroy wrote: > > > In the case of Notepad, it just goes to Unicode (encoded as UTF-8) as > > soon as you type a non-Latin-1 character. So I am not sure what your > > point is. (Didn't check the other software packages that you mention.) > > I guess you missed the point. Windows actually uses a superset of Latin > 1 that contains all the Latin 9 characters with different code-points. > Windows also has IANA-registered extended versions of some other Latin > sets. (These are Windows-1291 et. seq.) See the MIME and HTML standards > for more details. 
Notepad and other applications may switch to Unicode > internally when you enter non-Latin 1 (or non-Windows 1291) characters. Humm, the messages you are sending are encoded as "Windows-1252", which is the standard Windows character set. That hardly proves anything at all (other than that Windows doesn't use Latin-1 itself). (I checked this out in the spam filter.) > But if you cut-and-paste into a text document from one with a > different mapping, most PC software seems to use ISO 2022 control > characters to avoid having to reprocess the entire document. This can > be done as long as you use at most three ISO 8859 (or Windows) font > variants. Nope, it doesn't change the text at all (if its in the standard Windows character set, which most everything is). And if you paste it into the DOS box (which uses the OEM character set - which is how I edit the AIs with my circa-1986 text editor), it just gets converted to the nearest equivalents. For instance, I get a capital Y for UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix to AI-285). Generalizations about Windows are almost always wrong. :-) ************************************************************* From: Robert I. Eachus Sent: Thursday, March 20, 2003 12:20 AM Randy Brukardt wrote: > Humm, the messages you are sending are encoded as "Windows-1252", which is > the standard Windows character set. That hardly proves anything at all > (other than that Windows doesn't use Latin-1 itself). (I checked this out in > the spam filter.) (Sorry 1291 et. seq. instead of 1251 et. seq. was a typo.) I guess I shouldn't be surprised that 1252 as succeeded 1251 as the "standard" Windows binding in the US, but I hadn't noticed. But that more clearly makes my point. Users might want to be able to use 8-bit bindings that the ARG as a group should have little or no interest in. 
But there is the IANA registry, and I think we can bind to a pointer to those names with little difficulty, and leave it to compiler vendors and others to do the "proper" binding to the character set they want to use. We should in no way require compilers to reject S or o (S-caron or the oe ligature) in a name. But we should fix that through references to the Unicode & ISO/IEC 10646 standards, and let compiler vendors support the 8-bit sets their users want to use. (Including 8-bit standards like Shift-JIS and UTF-8.) > Nope, it doesn't change the text at all (if its in the standard Windows > character set, which most everything is). Oh, there are those who would make you pay dearly for those comments, unless you meant Unicode as the "standard" Windows character set. But the reality is that there is NO standard 8-bit character set for Windows, versions for different countries use different character sets. > And if you paste it into the DOS box (which uses the OEM character set - > which is how I edit the AIs with my circa-1986 text editor), it just gets > converted to the nearest equivalents. For instance, I get a capital Y for > UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix > to AI-285). Ouch, does that mean I should write the proposal up as a new draft AI, so people can read it? > Generalizations about Windows are almost always wrong. :-) I have learned the hard way that generalizations about preferred character sets are ALWAYS wrong. ************************************************************* From: Randy Brukardt Sent: Thursday, March 20, 2003 5:51 PM > Randy Brukardt wrote: > > > Humm, the messages you are sending are encoded as "Windows-1252", which is > > the standard Windows character set. That hardly proves anything at all > > (other than that Windows doesn't use Latin-1 itself). (I checked this out in > > the spam filter.) > > (Sorry 1291 et. seq. instead of 1251 et. seq. was a typo.) 
> > I guess I shouldn't be surprised that 1252 as succeeded 1251 as the > "standard" Windows binding in the US, but I hadn't noticed. FYI, that's confused. 1251 is "Cyrillic", while 1252 is "Western European". ... > We should in no way require compilers to reject S or o (S-caron or the > oe ligature) in a name. But we should fix that through references to > the Unicode & ISO/IEC 10646 standards, and let compiler > vendors support the 8-bit sets their users want to use. (Including 8-bit > standards like Shift-JIS and UTF-8.) Which is exactly what Pascal has proposed. But it should be pointed out that this is a very pervasive change. It means that the representation for names at runtime (in things like the tables for 'Image, for 'External_Tag, for exception information) has to be changed (at the very least to UTF-8). For Janus/Ada, where most of the runtime code that deals with those things is written in assembler, such a change will be very expensive. And that will be true to some extent or other for all compilers. > > Nope, it doesn't change the text at all (if its in the standard Windows > > character set, which most everything is). > > Oh, there are those who would make you pay dearly for those comments, > unless you meant Unicode as the "standard" Windows character > set. But > the reality is that there is NO standard 8-bit character set for > Windows, versions for different countries use different > character sets. Of course. I should have said "standard US Windows character set"; didn't mean to imply that it is the same for everyone. > > And if you paste it into the DOS box (which uses the OEM character set - > > which is how I edit the AIs with my circa-1986 text editor), it just gets > > converted to the nearest equivalents. For instance, I get a capital Y for > > UC_Y_Diaeresis (which, BTW, is how your note will appear in the !appendix > > to AI-285). > > Ouch, does that mean I should write the proposal up as a new draft AI, > so people can read it? 
Nope, AIs go through the same text editor. Using non-7-bit characters in AIs is strongly discouraged. (If we wanted to start using HTML for AIs, then perhaps a little more flexibility could be allowed.) > > Generalizations about Windows are almost always wrong. :-) > > I have learned the hard way that generalizations about preferred > character sets are ALWAYS wrong. Correct. The less the standard says about character sets, the better. Your proposal seems to require a lot of additional verbiage and support to solve a problem that doesn't seem to actually exist. The Unicode/ISO 10646 problem does exist, but once we support that fully, compilers can support anything they want without us getting in the way. (It would be nice to have a way to convert to and from UTF-8 in Ada programs. But that's one of many things that are "easy enough to write yourself", so it's hard to say if it is worth adding anything for that.) ************************************************************* From: Robert Dewar Sent: Saturday, March 22, 2003 11:38 AM I find all this discussion of character sets going way off target. All we are talking about here is some predefined names for some of the characters, nothing more and nothing less. ************************************************************* From: Robert I. Eachus Sent: Sunday, March 23, 2003 11:43 AM I'm confused. I am certainly proposing using the IANA registry for the names of character sets, and a way for programmers to determine which set is in use. As I understand the 16-bit character set issues, in addition to character names, there is characterization in terms of 2.2 Lexical Elements for non-Latin 1 characters. In other words, which characters can be used in names and numeric literals. I suspect that what Robert is referring to is the fact that if someone uses a non-Latin 1 eight-bit set by a command-line argument, the names won't match the characters as displayed. 
If so, I am actually recommending the permission for implementors to 'fix' more than that. For example, I don't think we should require that implementations support the Windows 1252 character set, but it would be nice to allow implementations which choose to do so to get it right. Some of that will follow at compile time if implementations map Windows 1252 to the appropriate Unicode/BMP characters. But it would also be nice to allow the Ada.Characters hierarchy and Ada.Text_IO features to be used with non-Latin 1 character sets. The keyword in the previous paragraph is "allow." As I said, I think we can go a bit farther and provide a standard way to determine the current character set mapping. But there is no reason for us to say you must support these character sets (other than Latin 1!), but must not support these other sets. This really has to do with code points in the 00 to 1F and 80 to 9F ranges being printable characters instead of control characters. ************************************************************* !topic To_Ada conversion in case of wchar_t'Size > 16 !reference RM95 B.3(58), RM95 B.3(60) !from Vadim Godunko 2003-01-21 !discussion At least one C library implementation (glibc) uses 32-bit values for the wchar_t type. In this case the behavior of the conversion functions To_Ada is not determined. I propose that those functions must return the value Wide_Character'Val (16#FFFD#) (the replacement character) if the value of Item is outside of the BMP. ************************************************************* From: Kiyoshi Ishihata Sent: Monday, July 28, 2003 10:01 AM > The next meeting of Japanese SC22 is on July 18. After that, I will > send you a brief report about our thought, but please understand that > this may be a tentative position. Sorry for the delay. I summarized our discussion as follows. If our position is accepted, the AI should go through a major rewrite process. 
=============================================================================== (1) Do not refer to Unicode The current AI frequently refers to Unicode and the Web site of the Unicode Consortium. It is not appropriate in ISO/IEC context. Simply changing the word "Unicode" to "ISO/IEC 10646" is not enough, since two systems are much different than you might think. Characters of 10646 and Unicode are identical, or at least intended to be identical. Their code positions are the same. However, the following Unicode products mentioned in this AI do not exist in the 10646 world. Character categorization http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html Recommendation of character repertoire for identifiers http://www.unicode.org/unicode/reports/tr15/tr15-22.html Normalization Form KC http://www.unicode.org/unicode/reports/tr15/tr15-22.html Full case folding http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt Characters terminating lines http://www.unicode.org/reports/tr13 These specifications are defined by the Unicode Consortium, and should not be regarded as internationally agreed standards. We do not agree with the idea to define Ada rules based on these reports. Yes, some languages like Java and C# refer to these Unicode reports. But, they are "imported" languages in ISO/IEC standardization processes, different from Ada in this respect. (2) Recommendation of character repertoire for identifiers -- TR 10176 The following Technical Report has been published by JTC1. ISO/IEC TR 10176:2003 Information technology -- Guidelines for the preparation of programming language standards (Fourth edition) This report contains a table of characters which should be made usable in identifiers (Annex A). We believe that this document and ISO/IEC 10646 itself are the only possible references in the Ada standard. The table of characters for identifiers is supposed to be identical to the above mentioned Unicode report. 
In fact, the term "Unicode Character Database" does appear in Annex A of the TR. The TR has been frequently revised, probably following the requests from Unicode people. Therefore, changing the reference from Unicode to TR 10176 does not significantly change the definition of character repertoire. The TR defines the character repertoire by enumerating all allowed characters. In a formal sense, it does not depend upon the concept "character categorization". Although Annex A of the TR gives categorization of each allowed character, the categorization itself does not contribute to the definition. A possible demerit of referring to 10176 is the lack of timeliness of revisions. In the future, Unicode reports will be promptly revised. Compared to this, the revision process of 10176 would be slower. However, since 10176 is the only possible reference in the ISO/IEC world, we have no other options. Note that some languages including C++ and Cobol define characters for identifiers based on the recommendation of TR 10176. (3) No national variants of numeric literals We do not like to extend the character repertoire for numeric literals. Identifiers are used to denote entities, and often words or phrases of natural languages are used to compose identifiers. Therefore, it is quite beneficial to use one's mother tongue in spelling identifiers. Numeric literals are much more universal. They denote numeric values, and, unlike identifiers, do not have culture sensitive nuances in them. At least we Japanese are happy to write numeric literals only using ASCII characters. We understand that there is no Unicode report recommending to extend characters for numeric literals. Numeric literals denote values, and their values should be computed from the value denoted by each digit. If national variants are allowed, people not knowing other countries' characters cannot compute the values of literals. 
On the other hand, identifiers can be recognized by people of other countries through the process of pattern matching of characters. This is not easy, but anyway is possible. In summary, we believe that extending characters for numeric literals do more harm than the benefit gained. Of course, national variants of number representation may be useful in Input-Output. But, this is a different issue. (4) No normalization We cannot refer to "Normalization Form KC" which is a Unicode term. Neither 10646 nor 10176 provides substitute for this concept. Therefore we cannot introduce the character normalization process. This is not bad, we think. For example, the character "letter A with umlaut" is regarded different from the combination of two characters "letter A" and "umlaut". But, in the first place, it is not a good idea to have two representations of a single conceptual character. People would try to define their own canonical representation of characters. Regarding "A with umlaut" and "A"+"umlaut" pair as different would not be a severe burden for them. Implementations would be much easier, since they can resort to simple byte-to-byte comparison. You say > This is to ensure that identifiers which look visually the same are > considered as identical, even if they are composed of different characters. but this principle is not strongly enforced. The obvious example is Latin A and Greek Alpha. They look identical but are distinguished in identifiers. We think that they are inherently different characters and there are no reasons to consider them the same in identifiers. (5) Uppercase-lowercase correspondence In Ada, we must have one particular normalization process, which is the uppercase-lowercase correspondence. 10176 does not say anything on this topic, so we have to devise some feasible definition. One possible way of definition is to utilize character names defined in ISO/IEC 10646. 
We can see the obvious correspondence between "Latin capital letter A" and "Latin small letter A". We do not know whether this can easily be implemented in Ada compilers or not. We notice that there are cases not covered by this simple correspondence. For example, German "SS" corresponds to two lowercase sequences. One is the string "ss", and the other is the es-zett character. We feel that such complicated cases should be untouched in this time frame, waiting for the future standardization of appropriate ISO/IEC standards or technical reports. (6) Miscellaneous > "JTC 1/SC 22 believes that programming languages should offer the appropriate > support for ISO/IEC 10646, and the Unicode character set where appropriate." I like to have the reference information attached to this sentence. This is "Resolution 02-24: Recommendation on Coded Character Sets Support" of SC22 2002 plenary. ************************************************************* From: Pascal Leroy Sent: Wednesday, July 30, 2003 8:43 AM Thank you for the extensive feedback. I will obviously need to give more thought to your comments, and we will need to discuss them at a meeting. However, clearly the most contentious issue is that of eliminating references to Unicode. As I am sure you realize, Unicode has much more technical "meat" than 10646. So the good thing about relying on the Unicode database and similar documents is that we can just say "the Unicode folks did the work for us, we trust that they know what they are doing". After all, the Unicode consortium has invested numerous man-years in their recommendations, and we don't have the resources or the expertise to do similar work. As I see it, we have three options: 1 - Do nothing, keep the language as it is. 2 - Base support of 16- and 32-bit characters on Unicode. 3 - Base support of 16- and 32-bit characters on 10646. 
Evidently option #1 is easier, and frankly as a vendor I have not seen a lot of interest for the existing 16-bit character support, so adding a sizeable implementation complexity is quite hard to justify from an economic point of view. The problem with this option is that it might make SC22 unhappy. Option #2 is the simplest technically, as we can merely reference the Unicode documents, and avoid having to dig into the properties of each character. But as you point out, it is not kosher for an ISO standard to reference a non-ISO document. So politically it is probably not going to work. Option #3 is evidently ISO-compliant, but 10646 says very little regarding the properties of characters (other than their names and code points). I realize that 10176 has a list of allowed characters, but then it's a TR so it has relatively little teeth. Of course we could just do what 10176 does in its annex A, i.e. list all the characters that we allow (and the case-conversion tables, and possibly the normalization tables) but that would add 50 pages of gibberish to the RM. The problem with this option is that it would take a lot of work, and it would probably degenerate into a cat fight about how case conversion or normalization or whatever ought to work. At this point I am going to consult with Jim to see how he thinks we should proceed. If need be I'll refer the issue to WG9 to get guidance. ************************************************************* From: Pascal Leroy Sent: Wednesday, August 6, 2003 10:20 AM >(4) No normalization >We cannot refer to "Normalization Form KC" which is a Unicode term. >Neither 10646 nor 10176 provides substitute for this concept. Therefore >we cannot introduce the character normalization process. >This is not bad, we think. For example, the character "letter A with >umlaut" is regarded different from the combination of two characters >"letter A" and "umlaut". 
But, in the first place, it is not a good idea >to have two representations of a single conceptual character. People >would try to define their own canonical representation of characters. >Regarding "A with umlaut" and "A"+"umlaut" pair as different would not >be a severe burden for them. >Implementations would be much easier, since they can resort to simple >byte-to-byte comparison. I have given more thought to normalization, and I believe that it is important for practical use. Ignore for a moment the issue of referencing Unicode. Assume that we have no difficulties in describing normalization. The question is: is normalization good for users? The problem I see is that when using a Unicode editor you have generally no idea how it represents a character internally. When you type "letter A with umlaut" it may represent this with a single character or as "letter A" + "umlaut" and that's hidden from the user of the editor. That's true regardless of whether you typed one or two characters on your keyboard. Now imagine the situation where two people write distinct compilation units with different editors (or maybe with different settings in a single editor). You might end up with the situation where the declaration of an entity has (in the file stored on disk) "letter A" + "umlaut" and the usage has "letter A with umlaut" (or vice-versa). And that would be invisible to the user because both editors would merely display Ä. In this situation, in order to avoid utmost confusion and bewilderment, I think it is necessary to specify that the compiler treats the two sequences the same. That's the purpose of normalization. As Unicode editing is going to become more and more common in years to come, and editors will undoubtedly become more and more fancy, I think it's important to deal with usability issues like this one. Incidentally, it seems to me that this issue is particularly important for Korean Hangul. 
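The two-editor scenario above can be made concrete with a small sketch. It is written in Python rather than Ada, purely because Python's standard unicodedata module exposes the Unicode normalization forms. The helper names canonical_identifier and same_identifier are assumptions made for illustration, and applying Normalization Form KC followed by case folding is only one plausible reading of what a compiler could do, not the rule adopted by the AI.

```python
import unicodedata

# Two spellings that an editor would display identically.
PRECOMPOSED = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
DECOMPOSED = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS


def canonical_identifier(text: str) -> str:
    """Hypothetical helper: normalize to NFKC, then fold case.

    This is an illustrative assumption, not the wording of the AI.
    """
    return unicodedata.normalize("NFKC", text).casefold()


def same_identifier(left: str, right: str) -> bool:
    """Compare two identifier spellings after canonicalization."""
    return canonical_identifier(left) == canonical_identifier(right)


if __name__ == "__main__":
    # Raw code-point comparison treats the two spellings as different.
    print("raw comparison:       ", PRECOMPOSED == DECOMPOSED)
    # After normalization they compare equal.
    print("normalized comparison:", same_identifier(PRECOMPOSED, DECOMPOSED))
    # A precomposed Hangul syllable versus its conjoining jamo sequence.
    print("Hangul syllable/jamo: ", same_identifier("\uD55C", "\u1112\u1161\u11AB"))
    # Latin A and Greek Alpha look alike but stay distinct.
    print("Latin A / Greek Alpha:", same_identifier("A", "\u0391"))
```

Normalization only merges different encodings of the same character; it does not merge look-alike characters from different scripts, which is consistent with the Latin A versus Greek Alpha point raised earlier in this thread.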
************************************************************* From: Pascal Leroy Sent: Friday, August 8, 2003 4:39 AM Kiyoshi said: > In short, I do not agree. Fine. Let's get the discussion started, then. (I am not sending this on the ARG mailing list as I don't want to start an endless chatter about whether we should be doing this at all, etc. At some point I'll want Randy to record our discussion, though, to make sure it gets appended to the AI.) > (1) design of character code > > I believe that a single logical character should not have two > different representations. If 10646 or Unicode have two > representations for A with umlaut, it is the fault of the > character code system. It should be remedied. In an ideal world, you are evidently right. (Irrelevant comment here: in an ideal world, Latin-1, the character set for Western European languages, would be suitable for writing French.) The Unicode folks explain (and I agree with them) that the "right" representation is "letter A"+"umlaut". The reason is that you have many diacritical signs used by existing languages (mostly based on the Latin alphabet) and that assigning code points to all the combinations is impractical (code points are a scarce resource, especially if you want commonly used languages to remain in the BMP). Unicode currently has more than 110 diacritical signs. The Western European languages only use very few of these, and they mostly combine with vowels, but still that consumes most of the upper half of Latin-1. Greek and Vietnamese, among others, can combine two diacriticals, and that's a sizeable number of code points. Now the Unicode documents explain that there are marginal languages (they mention Navajo) which make complex use of diacriticals and would require many more code points, for a very small community of users. Using combining diacriticals is the right way to go for these languages. 
And of course there may be particular applications where people want to create unanticipated combinations of characters (when I was trying to learn Chinese many many years ago, my textbook had a diacritical on top of each ideogram to indicate the tone; that would seem like a perfect application of combining diacriticals). Finally, there is the issue of fonts: developing a font that contains all the combinations like "letter A with umlaut" is expensive, and the resulting font is bulky. Again, combining diacriticals are better. So why assign code points to characters like "letter A with umlaut" in the first place? I suppose that the answer is compatibility to some extent (Latin-1 existed before Unicode, and you have to support files coded with Latin-1 with as little perturbation as possible), and political catfight to some extent (if German has specific code points, why not do the same for Polish or Greek; if you do it for Greek, why not Macedonian? etc., etc., ad nauseam). There may also have been a concern, when Unicode started, that uniformly using combining diacriticals would require more complex text handling algorithms, which would have been too costly for the computers of the time. > (2) role of programming language > > Let's denote "A with umlaut" by A", and the sequence "A" and > umlaut by A+. If one likes to search the character "A with > umlaut", he must perform two search operations, one with A" > and then with A+. This is very tedious, and if the target > string contains many such special characters, the operation > is nearly impossible. I entirely agree, but I would view this as a bug. If you search for "letter A with umlaut", it should actually catch both representations (this is not hard to do, just normalize during search). I noticed that Internet Explorer behaves as you describe, and this is really a pain as the two sequences look exactly the same. > So, my opinion is that normalization is not a role of > programming languages or compilers. 
It should be performed > in some lower layers in order to maximize the convenience of > text file handling. Unfortunately, there are different forms of normalization, and Unicode recommends distinct normalization depending on whether the programming language is case-sensitive or not (I am not exactly sure why; I need to study this). So the file system cannot do the normalization, it has to be done by a programming language tool, and the only one for which we can impose a behavior is the compiler. ************************************************************* From: Pascal Leroy Sent: Tuesday, August 5, 2003 6:33 AM > At this point I am going to consult with Jim to see how he thinks we > should proceed. If need be I'll refer the issue to WG9 to get guidance. I have talked to Jim. He said that, from a procedural point of view, there is no intrinsic problem in referencing the Unicode standard in an ISO standard. All we need to do is some paperwork to justify the decision. Of course, one or several countries could always vote "no" on the amendment because they don't like references to Unicode, but at least there is no procedural impossibility. I have also talked to John Benito, the convener of WG14 (C language). The C folks are in the process of doing more-or-less what we are doing, only it's part of a technical report, not of an amendment. They have been running into the exact same issue, i.e. opposition at the SC22 level from a number of countries which don't want to see references to Unicode (Japan, Canada, Norway and Germany are the countries he named). John believes that it is nearly impossible to properly integrate 16- and 32-bit character in his standard without referencing Unicode. His plan is to try to convince the SC22 delegations of the aforementioned countries that this issue is moot because Unicode and 10646:2003 should be indistinguishable (and 10646:2003 references Unicode anyway). 
************************************************************* From: Christophe Grein Sent: Thursday, July 15, 2004 5:03 PM It appears to me that in the document 04-06-03 AI95-00285/07, the category identifier_letter is no longer defined. The section 2.1(4-14) were replaced with new wording that does not include identifier_letter, yet this category is used to define "identifier". I think identifier_extend ::= identifier_start | <------- mark_non_spacing | mark_spacing_combining | number_decimal_digit | other_format is meant. ************************************************************* From: Vincent Celier Sent: Wednesday, January 26, 2005 2:40 PM Note that the tables given by Pascal in November 2002 in the AI (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/AIs/AI-00285.TXT?rev=1.22) are no longer in conformance with the current Unicode database (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt). There are missing digits (LIMBU DIGITs and OSMANYA DIGITs) and characters of category "Number, Other" that should not be in Digits (for example SUPERSCRIPT TWO B2). The Letters table also has characters missing and characters that should not be there. ************************************************************* From: Pascal Leroy Sent: Thursday, January 27, 2005 1:23 AM Right, as I recall these tables were based on Unicode 3.2.0. There was a concern at the time that the tables might be gigantic. The purpose of the exercise was to show that they were not (and the move to Unicode 4.0.0 doesn't change this conclusion). As I recall it took me roughly half a day to produce them, so it cannot take Robert more than half an hour ;-) *************************************************************
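The kind of table check Vincent describes above can be sketched by querying the General Category directly instead of maintaining hand-built tables. The sketch is in Python, whose standard unicodedata module wraps a copy of the Unicode Character Database. The function names are hypothetical, and the answers depend on the Unicode version bundled with the interpreter, so treat this as an illustration rather than a conformance test.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter General Category of a single character."""
    return unicodedata.category(ch)


def is_decimal_digit(ch: str) -> bool:
    """Hypothetical helper: true only for 'Number, Decimal Digit' (Nd)."""
    return general_category(ch) == "Nd"


# Characters named in the message above; code points are from the UCD.
SAMPLES = [
    "\u1946",       # LIMBU DIGIT ZERO
    "\U000104A0",   # OSMANYA DIGIT ZERO
    "\u00B2",       # SUPERSCRIPT TWO ("Number, Other", not a decimal digit)
]

if __name__ == "__main__":
    for ch in SAMPLES:
        # Print code point, character name, category, and the digit verdict.
        print(
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            general_category(ch),
            is_decimal_digit(ch),
        )
```

Regenerating a Digits or Letters table then amounts to iterating over the code space and filtering on the category, which fits Pascal's remark that producing the tables is a matter of hours rather than days.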