!standard A.3.5 (0)                                  10-10-15  AI05-0185-1/03
!standard A.3.6 (0)
!class amendment 09-11-02
!status Amendment 2012 10-08-11
!status work item 10-10-18
!status ARG Approved 8-0-0  10-06-20
!status work item 09-11-02
!status received 09-11-02
!priority Medium
!difficulty Medium
!subject Wide_Character and Wide_Wide_Character classification and folding

!summary

Packages are added to provide support for the classification and case folding
of Wide_Character and Wide_Wide_Character values.

!problem

The package Ada.Characters.Handling provides functions to classify a Character
and to convert a Character to upper case and lower case. There are no such
capabilities for Wide_Character and Wide_Wide_Character.

Support for classification and case folding of the Wide_Character and
Wide_Wide_Character types should be added to the language.

!proposal

The current version of the GNAT compiler defines the following
implementation-defined packages:

   Ada.Wide_Characters.Unicode
   Ada.Wide_Wide_Characters.Unicode

While Ada.Wide_Characters and Ada.Wide_Wide_Characters are standard Ada 2005
packages, the Unicode child packages are non-standard.

This proposal, to create two standard packages, Ada.Wide_Characters.Handling
and Ada.Wide_Wide_Characters.Handling, is based on the GNAT Unicode packages,
but without the functions that accept Unicode Category parameters.

!wording

A.3.5 The Package Wide_Characters.Handling

The package Wide_Characters.Handling provides operations for classifying
Wide_Characters and case folding for Wide_Characters.
Static Semantics

The library package Wide_Characters.Handling has the following declaration:

   package Ada.Wide_Characters.Handling is

      function Is_Control (Item : Wide_Character) return Boolean;
      function Is_Letter (Item : Wide_Character) return Boolean;
      function Is_Lower (Item : Wide_Character) return Boolean;
      function Is_Upper (Item : Wide_Character) return Boolean;
      function Is_Digit (Item : Wide_Character) return Boolean;
      function Is_Decimal_Digit (Item : Wide_Character) return Boolean
         renames Is_Digit;
      function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
      function Is_Alphanumeric (Item : Wide_Character) return Boolean;
      function Is_Special (Item : Wide_Character) return Boolean;
      function Is_Line_Terminator (Item : Wide_Character) return Boolean;
      function Is_Mark (Item : Wide_Character) return Boolean;
      function Is_Other (Item : Wide_Character) return Boolean;
      function Is_Punctuation (Item : Wide_Character) return Boolean;
      function Is_Space (Item : Wide_Character) return Boolean;
      function Is_Graphic (Item : Wide_Character) return Boolean;

      function To_Lower (Item : Wide_Character) return Wide_Character;
      function To_Upper (Item : Wide_Character) return Wide_Character;

      function To_Lower (Item : Wide_String) return Wide_String;
      function To_Upper (Item : Wide_String) return Wide_String;

   end Ada.Wide_Characters.Handling;

The subprograms defined in Ada.Wide_Characters.Handling are locale independent.

function Is_Control (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   other_control, otherwise returns False.

function Is_Letter (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier,
   letter_other, or number_letter; otherwise returns False.
function Is_Lower (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   letter_lowercase, otherwise returns False.

function Is_Upper (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   letter_uppercase, otherwise returns False.

function Is_Digit (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   number_decimal, otherwise returns False.

function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   number_decimal, or is in the range 'A' .. 'F' or 'a' .. 'f', otherwise
   returns False.

function Is_Alphanumeric (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier,
   letter_other, number_letter, or number_decimal; otherwise returns False.

function Is_Special (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   graphic_character, but not categorized as letter_uppercase,
   letter_lowercase, letter_titlecase, letter_modifier, letter_other,
   number_letter, or number_decimal; otherwise returns False.

function Is_Line_Terminator (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   separator_line or separator_paragraph, or if Item is a conventional line
   terminator character (CR, LF, VT, or FF); otherwise returns False.

function Is_Mark (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   mark_non_spacing or mark_spacing_combining, otherwise returns False.

function Is_Other (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   other_format, otherwise returns False.
function Is_Punctuation (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   punctuation_connector, otherwise returns False.

function Is_Space (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   separator_space, otherwise returns False.

function Is_Graphic (Item : Wide_Character) return Boolean;

   Returns True if the Wide_Character designated by Item is categorized as
   graphic_character, otherwise returns False.

function To_Lower (Item : Wide_Character) return Wide_Character;

   Returns the Simple Lowercase Mapping of the Wide_Character designated by
   Item. If the Simple Lowercase Mapping does not exist for the Wide_Character
   designated by Item, then the value of Item is returned.

function To_Lower (Item : Wide_String) return Wide_String;

   Returns the result of applying the To_Lower Wide_Character to Wide_Character
   conversion to each element of the Wide_String designated by Item. The result
   is the null Wide_String if the value of the formal parameter is the null
   Wide_String.

function To_Upper (Item : Wide_Character) return Wide_Character;

   Returns the Simple Uppercase Mapping of the Wide_Character designated by
   Item. If the Simple Uppercase Mapping does not exist for the Wide_Character
   designated by Item, then the value of Item is returned.

function To_Upper (Item : Wide_String) return Wide_String;

   Returns the result of applying the To_Upper Wide_Character to Wide_Character
   conversion to each element of the Wide_String designated by Item. The result
   is the null Wide_String if the value of the formal parameter is the null
   Wide_String.

A.3.6 The Package Wide_Wide_Characters.Handling

The package Wide_Wide_Characters.Handling has the same contents as
Wide_Characters.Handling except that each occurrence of Wide_Character is
replaced by Wide_Wide_Character, and each occurrence of Wide_String is replaced
by Wide_Wide_String.
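
[Illustration only, not part of the proposed wording.] The following is a minimal sketch of how a client might use the package described above, assuming an implementation that provides Ada.Wide_Characters.Handling as specified. The procedure name Classify_Demo and the sample text are hypothetical and chosen only for the example. It counts letters and decimal digits with the classification functions and then applies the Wide_String form of To_Upper.

```ada
with Ada.Wide_Characters.Handling;
with Ada.Wide_Text_IO;

--  Hypothetical demonstration procedure; the name is not part of the proposal.
procedure Classify_Demo is
   use Ada.Wide_Characters.Handling;

   --  Sample text chosen only for illustration.
   Sample : constant Wide_String := "Ada 2012";

   Letter_Count : Natural := 0;
   Digit_Count  : Natural := 0;
begin
   --  Classify each element with the proposed predicates.
   for I in Sample'Range loop
      if Is_Letter (Sample (I)) then
         Letter_Count := Letter_Count + 1;
      elsif Is_Digit (Sample (I)) then
         Digit_Count := Digit_Count + 1;
      end if;
   end loop;

   Ada.Wide_Text_IO.Put_Line
     ("Letters:" & Natural'Wide_Image (Letter_Count));
   Ada.Wide_Text_IO.Put_Line
     ("Digits:" & Natural'Wide_Image (Digit_Count));

   --  Case folding is applied element by element to the whole string.
   Ada.Wide_Text_IO.Put_Line (To_Upper (Sample));
end Classify_Demo;
```

Under the A.3.6 wording, the same sketch should carry over to Wide_Wide_Characters.Handling by replacing Wide_String with Wide_Wide_String and using the corresponding Wide_Wide packages and attributes.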
!discussion

The GNAT Unicode packages define a Category type which maps to the Unicode
standard. Second forms of most of the classification routines exist that
operate on Category parameters instead of Wide_Character or
Wide_Wide_Character. These routines are claimed to be more efficient when
multiple classification tests are to be performed on a single Wide_Character
or Wide_Wide_Character value; otherwise, the form of the call that accepts
Wide_Character or Wide_Wide_Character is expected to be more efficient.

The Category type, however, would tie the package more closely to the Unicode
standard, whereas it is desirable to hide that abstraction. Furthermore,
adding these routines would likely mean having to define a package like
System.UTF_32, which is currently defined in GNAT. It seems that the
categorization routines are not necessary for the standard, and might be
better left as implementation-defined functionality.

The package Ada.Characters.Handling defines classification routines that are
not present in the GNAT packages Ada.Wide_Characters.Unicode and
Ada.Wide_Wide_Characters.Unicode. Specifically, Is_Control, Is_Lower, Is_Upper,
Is_Basic, Is_Decimal_Digit, Is_Graphic, Is_Hexadecimal_Digit, Is_Alphanumeric,
and Is_Special are absent. These should be provided to be consistent with
Ada.Characters.Handling.

The Non_Graphic routine was replaced with Graphic; otherwise, the remaining
functions were added, except for the Is_Basic function and the To_Basic
functions. It is not clear whether these functions have any meaning in
Wide_Character or Wide_Wide_Character contexts, as there do not appear to be
any Unicode functions for stripping off diacritical marks, and it is not clear
that doing so would result in a string that was meaningful.

Also, the ISO_646 related functions were not added. Since those deal with 8-bit
values, they were deemed not appropriate for Wide_Character and
Wide_Wide_Character contexts.
Another question is whether some of the new classification functions should be
added to Ada.Characters.Handling. The wording in the RM for that package
describes the classification in terms of character ranges rather than the
categories defined in 2.1. Should these be reworded in terms of these
categories? [That question is tangentially covered by AI05-0114-1 - Editor.]

!example

(See discussion.)

!corrigendum A.3.5(0)

@dinsc

The package Wide_Characters.Handling provides operations for classifying
Wide_Characters and case folding for Wide_Characters.

@s8<@i<Static Semantics>>

The library package Wide_Characters.Handling has the following declaration:

@xcode<@b<package> Ada.Wide_Characters.Handling @b<is>

   @b<function> Is_Control (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Letter (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Lower (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Upper (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Digit (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Decimal_Digit (Item : Wide_Character) @b<return> Boolean
      @b<renames> Is_Digit;
   @b<function> Is_Hexadecimal_Digit (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Alphanumeric (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Special (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Line_Terminator (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Mark (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Other (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Punctuation (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Space (Item : Wide_Character) @b<return> Boolean;
   @b<function> Is_Graphic (Item : Wide_Character) @b<return> Boolean;

   @b<function> To_Lower (Item : Wide_Character) @b<return> Wide_Character;
   @b<function> To_Upper (Item : Wide_Character) @b<return> Wide_Character;

   @b<function> To_Lower (Item : Wide_String) @b<return> Wide_String;
   @b<function> To_Upper (Item : Wide_String) @b<return> Wide_String;

@b<end> Ada.Wide_Characters.Handling;>

The subprograms defined in Ada.Wide_Characters.Handling are locale independent.
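@xcode<@b<function> Is_Control (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<other_control>, otherwise returns False.>

@xcode<@b<function> Is_Letter (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<letter_uppercase>, @fa<letter_lowercase>, @fa<letter_titlecase>,
@fa<letter_modifier>, @fa<letter_other>, or @fa<number_letter>; otherwise
returns False.>

@xcode<@b<function> Is_Lower (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<letter_lowercase>, otherwise returns False.>

@xcode<@b<function> Is_Upper (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<letter_uppercase>, otherwise returns False.>

@xcode<@b<function> Is_Digit (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<number_decimal>, otherwise returns False.>

@xcode<@b<function> Is_Hexadecimal_Digit (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<number_decimal>, or is in the range 'A' .. 'F' or 'a' .. 'f', otherwise
returns False.>

@xcode<@b<function> Is_Alphanumeric (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<letter_uppercase>, @fa<letter_lowercase>, @fa<letter_titlecase>,
@fa<letter_modifier>, @fa<letter_other>, @fa<number_letter>, or
@fa<number_decimal>; otherwise returns False.>

@xcode<@b<function> Is_Special (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<graphic_character>, but not categorized as @fa<letter_uppercase>,
@fa<letter_lowercase>, @fa<letter_titlecase>, @fa<letter_modifier>,
@fa<letter_other>, @fa<number_letter>, or number_decimal; otherwise returns
False.>

@xcode<@b<function> Is_Line_Terminator (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<separator_line> or @fa<separator_paragraph>, or if Item is a conventional
line terminator character (CR, LF, VT, or FF); otherwise returns False.>

@xcode<@b<function> Is_Mark (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<mark_non_spacing> or @fa<mark_spacing_combining>, otherwise returns False.>

@xcode<@b<function> Is_Other (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<other_format>, otherwise returns False.>

@xcode<@b<function> Is_Punctuation (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<punctuation_connector>, otherwise returns False.>

@xcode<@b<function> Is_Space (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<separator_space>, otherwise returns False.>

@xcode<@b<function> Is_Graphic (Item : Wide_Character) @b<return> Boolean;>

@xindent<Returns True if the Wide_Character designated by Item is categorized as
@fa<graphic_character>, otherwise returns False.>

@xcode<@b<function> To_Lower (Item : Wide_Character) @b<return> Wide_Character;>

@xindent<Returns the Simple Lowercase Mapping of the Wide_Character designated by
Item. If the Simple Lowercase Mapping does not exist for the Wide_Character
designated by Item, then the value of Item is returned.>

@xcode<@b<function> To_Lower (Item : Wide_String) @b<return> Wide_String;>

@xindent<Returns the result of applying the To_Lower Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by
Item. The result is the null Wide_String if the value of the formal parameter
is the null Wide_String.>

@xcode<@b<function> To_Upper (Item : Wide_Character) @b<return> Wide_Character;>

@xindent<Returns the Simple Uppercase Mapping of the Wide_Character designated by
Item. If the Simple Uppercase Mapping does not exist for the Wide_Character
designated by Item, then the value of Item is returned.>

@xcode<@b<function> To_Upper (Item : Wide_String) @b<return> Wide_String;>

@xindent<Returns the result of applying the To_Upper Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by
Item. The result is the null Wide_String if the value of the formal parameter
is the null Wide_String.>

!corrigendum A.3.6(0)

@dinsc

The package Wide_Wide_Characters.Handling has the same contents as Wide_Characters.Handling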
except that each occurrence of Wide_Character is replaced by Wide_Wide_Character,
and each occurrence of Wide_String is replaced by Wide_Wide_String.

!ACATS test

ACATS C-Tests should be constructed for these packages.

!appendix

From: Randy Brukardt
Sent: Wednesday, August 11, 2010  9:44 PM

The text in this AI says:

   function Is_Decimal_Digit (Item : Wide_Character) return Boolean;

   This function is a rename of Is_Digit.

We don't write this in English, we just do it when desired. That is, the
specification ought to be:

   function Is_Decimal_Digit (Item : Wide_Character) return Boolean
      renames Is_Digit;

and the text description removed.

****************************************************************

Editor's note: This AI was reopened to address the items mentioned above and
others raised during Editorial Review. Specifically:

I had previously asked that the names Is_Other and Is_Punctuation be changed to
Is_Other_Format and Is_Punctuation_Connector; the latter in particular is very
misleading (it is true for underscore, but not for period or comma).

I had also noted that the implementation advice in A.3.1 is now dubious; no one
commented on that. So I don't even know what to suggest there.

Robert noted that To_Lower and To_Upper don't define the bounds of the result.
(The lower bound should be 1, to be consistent with Ada.Characters.Handling.)

John would prefer that Is_Line_Terminator, .. Is_Graphic be added to
Ada.Characters.Handling.

Finally, we need to clarify the definition of Simple Lowercase Mapping and
Simple Uppercase Mapping. The first is a Unicode term, but we can't refer to
Unicode normatively in the Standard. The second doesn't exist anywhere.
Moreover, these are different than what identifiers use. Robert and I had an
e-mail meltdown on this back in July. And the identifier definition is
completely daft, as the "convert to uppercase" definition says use Unicode full
case folding -- but *that* is a conversion to *lower* case!
So we need to decide what we really want here.

****************************************************************