Version 1.3 of ai05s/ai05-0185-1.txt
!standard A.3.5 (0) 10-10-15 AI05-0185-1/03
!standard A.3.6 (0)
!class amendment 09-11-02
!status Amendment 2012 10-08-11
!status work item 10-10-18
!status ARG Approved 8-0-0 10-06-20
!status work item 09-11-02
!status received 09-11-02
!priority Medium
!difficulty Medium
!subject Wide_Character and Wide_Wide_Character classification and folding
!summary
Packages are added to provide support for the classification and case folding
of Wide_Character and Wide_Wide_Character values.
!problem
The package Ada.Characters.Handling provides functions to classify a Character,
and provides functions to convert a Character to upper case and lower case.
There are no such capabilities for Wide_Character and Wide_Wide_Character.
Support for classification and case folding of the Wide_Character and
Wide_Wide_Character types should be added to the language.
!proposal
The current version of the GNAT compiler defines the following
implementation-defined packages:
Ada.Wide_Characters.Unicode
Ada.Wide_Wide_Characters.Unicode
While Ada.Wide_Characters and Ada.Wide_Wide_Characters are standard Ada 2005
packages, the Unicode child packages are non-standard.
This proposal creates two standard packages,
Ada.Wide_Characters.Handling and
Ada.Wide_Wide_Characters.Handling,
based on the GNAT Unicode packages but without the functions that accept
Unicode Category parameters.
!wording
A.3.5 The Package Wide_Characters.Handling
The package Wide_Characters.Handling provides operations for classifying
Wide_Characters and case folding for Wide_Characters.
Static Semantics
The library package Wide_Characters.Handling has the following declaration:
package Ada.Wide_Characters.Handling is
   function Is_Control (Item : Wide_Character) return Boolean;
   function Is_Letter (Item : Wide_Character) return Boolean;
   function Is_Lower (Item : Wide_Character) return Boolean;
   function Is_Upper (Item : Wide_Character) return Boolean;
   function Is_Digit (Item : Wide_Character) return Boolean;
   function Is_Decimal_Digit (Item : Wide_Character) return Boolean
      renames Is_Digit;
   function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
   function Is_Alphanumeric (Item : Wide_Character) return Boolean;
   function Is_Special (Item : Wide_Character) return Boolean;
   function Is_Line_Terminator (Item : Wide_Character) return Boolean;
   function Is_Mark (Item : Wide_Character) return Boolean;
   function Is_Other (Item : Wide_Character) return Boolean;
   function Is_Punctuation (Item : Wide_Character) return Boolean;
   function Is_Space (Item : Wide_Character) return Boolean;
   function Is_Graphic (Item : Wide_Character) return Boolean;
   function To_Lower (Item : Wide_Character) return Wide_Character;
   function To_Upper (Item : Wide_Character) return Wide_Character;
   function To_Lower (Item : Wide_String) return Wide_String;
   function To_Upper (Item : Wide_String) return Wide_String;
end Ada.Wide_Characters.Handling;
The subprograms defined in Ada.Wide_Characters.Handling are locale independent.
function Is_Control (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_control, otherwise returns False.
function Is_Letter (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier,
letter_other, or number_letter; otherwise returns False.
function Is_Lower (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_lowercase, otherwise returns False.
function Is_Upper (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, otherwise returns False.
function Is_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, otherwise returns False.
function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, or is in the range 'A' .. 'F' or 'a' .. 'f', otherwise
returns False.
function Is_Alphanumeric (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier,
letter_other, number_letter, or number_decimal; otherwise returns False.
function Is_Special (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, but not categorized as letter_uppercase,
letter_lowercase, letter_titlecase, letter_modifier, letter_other,
number_letter, or number_decimal; otherwise returns False.
function Is_Line_Terminator (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_line or separator_paragraph, or if Item is a conventional line
terminator character (CR, LF, VT, or FF); otherwise returns False.
function Is_Mark (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
mark_non_spacing or mark_spacing_combining, otherwise returns False.
function Is_Other (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_format, otherwise returns False.
function Is_Punctuation (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
punctuation_connector, otherwise returns False.
function Is_Space (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_space, otherwise returns False.
function Is_Graphic (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, otherwise returns False.
function To_Lower (Item : Wide_Character) return Wide_Character;
Returns the Simple Lowercase Mapping of the Wide_Character designated by
Item. If the Simple Lowercase Mapping does not exist for the
Wide_Character designated by Item, then the value of Item is returned.
function To_Lower (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Lower Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by
Item. The result is the null Wide_String if the value of the formal
parameter is the null Wide_String.
function To_Upper (Item : Wide_Character) return Wide_Character;
Returns the Simple Uppercase Mapping of the Wide_Character designated by
Item. If the Simple Uppercase Mapping does not exist for the
Wide_Character designated by Item, then the value of Item is returned.
function To_Upper (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Upper Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by
Item. The result is the null Wide_String if the value of the formal
parameter is the null Wide_String.
A.3.6 The Package Wide_Wide_Characters.Handling
The package Wide_Wide_Characters.Handling has the same contents as
Wide_Characters.Handling except that each occurrence of Wide_Character is
replaced by Wide_Wide_Character, and each occurrence of Wide_String is replaced
by Wide_Wide_String.
!discussion
The GNAT Unicode packages define a Category type which maps to the Unicode
standard. Second forms of most of the classification routines exist that operate
on category type parameters instead of Wide_Character or Wide_Wide_Character.
These routines are claimed to be more efficient when multiple classification
tests are to be performed on a single Wide_Character or Wide_Wide_Character
value; when only one test is needed, the form of the call that accepts a
Wide_Character or Wide_Wide_Character is expected to be more efficient. The
category type, however, would tie the package more closely to the Unicode
standard, whereas it is desirable to hide that abstraction. Furthermore, adding
these routines would likely mean having to define a package like System.UTF_32
which is currently defined in GNAT. It seems that the categorization routines
are not necessary for the standard, and might be better left as
implementation-defined functionality.
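The trade-off described above can be illustrated with a minimal sketch. Every
name in it (Category, Get_Category, and the category-based Is_Letter and
Is_Digit) is hypothetical and heavily abridged; it is not the GNAT interface,
and the lookup covers only a few ASCII ranges. It only shows the shape of the
idea: look the category up once, then run several inexpensive tests on it.

```ada
with Ada.Text_IO;

procedure Category_Sketch is

   --  Hypothetical, abridged category type.  A real one would mirror
   --  every Unicode general category.
   type Category is
     (Letter_Uppercase, Letter_Lowercase, Number_Decimal, Other);

   --  Hypothetical lookup covering only a few ASCII ranges.  A real
   --  implementation would search a table of Unicode ranges, which is
   --  the cost that the category-based form pays only once.
   function Get_Category (Item : Wide_Character) return Category is
   begin
      case Item is
         when 'A' .. 'Z' => return Letter_Uppercase;
         when 'a' .. 'z' => return Letter_Lowercase;
         when '0' .. '9' => return Number_Decimal;
         when others     => return Other;
      end case;
   end Get_Category;

   --  Category-based tests are simple comparisons.
   function Is_Letter (C : Category) return Boolean is
   begin
      return C = Letter_Uppercase or else C = Letter_Lowercase;
   end Is_Letter;

   function Is_Digit (C : Category) return Boolean is
   begin
      return C = Number_Decimal;
   end Is_Digit;

   --  One lookup, reused by several tests below.
   C : constant Category := Get_Category ('q');

begin
   if Is_Letter (C) and then not Is_Digit (C) then
      Ada.Text_IO.Put_Line ("letter");
   end if;
end Category_Sketch;
```

With the character-based form, each test would repeat the lookup, which is
acceptable when only a single test is needed.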
The package Ada.Characters.Handling defines classification routines that are not
present in the GNAT Ada.Wide_Characters.Unicode and
Ada.Wide_Wide_Characters.Unicode packages.
Specifically, Is_Control, Is_Lower, Is_Upper, Is_Basic, Is_Decimal_Digit,
Is_Graphic, Is_Hexadecimal_Digit, Is_Alphanumeric, and Is_Special are absent.
These should be provided in Ada.Wide_Characters.Handling and
Ada.Wide_Wide_Characters.Handling to be consistent with Ada.Characters.Handling.
The Non_Graphic routine was replaced with Graphic; otherwise, the remaining
functions were added, except for the Is_Basic function and the To_Basic
functions. It is not clear whether these functions have any meaning in
Wide_Character or Wide_Wide_Character contexts, as there do not appear to be any
Unicode functions for stripping off diacritical marks, and it is not clear that
doing so would result in a string that was meaningful.
Also, the ISO_646 related functions were not added; since those deal with 8-bit
values, they were deemed not appropriate for Wide_Character and
Wide_Wide_Character contexts.
Another question is whether some of the new classification functions should be
added to Ada.Characters.Handling. The wording in the RM for that package
describes the classification in terms of character ranges rather than the
categories defined in 2.1. Should these be reworded in terms of these
categories? [That question is tangentially covered by AI05-0114-1 - Editor.]
!example
(See discussion.)
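A minimal sketch of how the proposed Ada.Wide_Characters.Handling package might
be used, assuming an implementation that provides it as declared in the wording
above. The procedure name, the Sample string, and the counters are illustrative
assumptions and are not part of the proposal. Only functions whose names are
not in question in the editorial review are used.

```ada
with Ada.Wide_Text_IO;
with Ada.Wide_Characters.Handling;

procedure Handling_Example is
   use Ada.Wide_Characters.Handling;

   --  Illustrative input only; not part of the proposal.
   Sample : constant Wide_String := "Ada 2012";

   Letter_Count : Natural := 0;
   Digit_Count  : Natural := 0;
   Space_Count  : Natural := 0;
begin
   --  Classify each element with the proposed predicates.
   for I in Sample'Range loop
      if Is_Letter (Sample (I)) then
         Letter_Count := Letter_Count + 1;
      elsif Is_Digit (Sample (I)) then
         Digit_Count := Digit_Count + 1;
      elsif Is_Space (Sample (I)) then
         Space_Count := Space_Count + 1;
      end if;
   end loop;

   --  The Wide_String forms apply the character mapping to each element.
   Ada.Wide_Text_IO.Put_Line (To_Upper (Sample));
   Ada.Wide_Text_IO.Put_Line (To_Lower (Sample));

   Ada.Wide_Text_IO.Put_Line
     ("Letters:" & Natural'Wide_Image (Letter_Count) &
      " Digits:" & Natural'Wide_Image (Digit_Count) &
      " Spaces:" & Natural'Wide_Image (Space_Count));
end Handling_Example;
```

The same code should work for Wide_Wide_Character by substituting
Ada.Wide_Wide_Characters.Handling, Ada.Wide_Wide_Text_IO, Wide_Wide_String,
and Natural'Wide_Wide_Image.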
!corrigendum A.3.5(0)
Insert new clause:
The package Wide_Characters.Handling provides operations for classifying
Wide_Characters and case folding for Wide_Characters.
Static Semantics
The library package Wide_Characters.Handling has the following declaration:
package Ada.Wide_Characters.Handling is
   function Is_Control (Item : Wide_Character) return Boolean;
   function Is_Letter (Item : Wide_Character) return Boolean;
   function Is_Lower (Item : Wide_Character) return Boolean;
   function Is_Upper (Item : Wide_Character) return Boolean;
   function Is_Digit (Item : Wide_Character) return Boolean;
   function Is_Decimal_Digit (Item : Wide_Character) return Boolean
      renames Is_Digit;
   function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
   function Is_Alphanumeric (Item : Wide_Character) return Boolean;
   function Is_Special (Item : Wide_Character) return Boolean;
   function Is_Line_Terminator (Item : Wide_Character) return Boolean;
   function Is_Mark (Item : Wide_Character) return Boolean;
   function Is_Other (Item : Wide_Character) return Boolean;
   function Is_Punctuation (Item : Wide_Character) return Boolean;
   function Is_Space (Item : Wide_Character) return Boolean;
   function Is_Graphic (Item : Wide_Character) return Boolean;
   function To_Lower (Item : Wide_Character) return Wide_Character;
   function To_Upper (Item : Wide_Character) return Wide_Character;
   function To_Lower (Item : Wide_String) return Wide_String;
   function To_Upper (Item : Wide_String) return Wide_String;
end Ada.Wide_Characters.Handling;
The subprograms defined in Ada.Wide_Characters.Handling are locale independent.
function Is_Control (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_control, otherwise returns False.
function Is_Letter (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase,
letter_modifier, letter_other, or number_letter; otherwise
returns False.
function Is_Lower (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_lowercase, otherwise returns False.
function Is_Upper (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, otherwise returns False.
function Is_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, otherwise returns False.
function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, or is in the range 'A' .. 'F' or 'a' .. 'f', otherwise
returns False.
function Is_Alphanumeric (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase,
letter_modifier, letter_other, number_letter, or
number_decimal; otherwise returns False.
function Is_Special (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, but not categorized as letter_uppercase,
letter_lowercase, letter_titlecase, letter_modifier,
letter_other, number_letter, or number_decimal; otherwise returns
False.
function Is_Line_Terminator (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_line or separator_paragraph, or if Item is a conventional
line terminator character (CR, LF, VT, or FF); otherwise returns False.
function Is_Mark (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
mark_non_spacing or mark_spacing_combining, otherwise returns False.
function Is_Other (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_format, otherwise returns False.
function Is_Punctuation (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
punctuation_connector, otherwise returns False.
function Is_Space (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_space, otherwise returns False.
function Is_Graphic (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, otherwise returns False.
function To_Lower (Item : Wide_Character) return Wide_Character;
Returns the Simple Lowercase Mapping of the Wide_Character designated
by Item. If the Simple Lowercase Mapping does not exist for the Wide_Character
designated by Item, then the value of Item is returned.
function To_Lower (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Lower Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by Item.
The result is the null Wide_String if the value of the formal parameter is the
null Wide_String.
function To_Upper (Item : Wide_Character) return Wide_Character;
Returns the Simple Uppercase Mapping of the Wide_Character designated
by Item. If the Simple Uppercase Mapping does not exist for the Wide_Character
designated by Item, then the value of Item is returned.
function To_Upper (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Upper Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by Item.
The result is the null Wide_String if the value of the formal parameter is the
null Wide_String.
!corrigendum A.3.6(0)
Insert new clause:
The package Wide_Wide_Characters.Handling has the same contents as
Wide_Characters.Handling except that each occurrence of Wide_Character is
replaced by Wide_Wide_Character, and each occurrence of Wide_String is replaced
by Wide_Wide_String.
!ACATS test
ACATS C-Tests should be constructed for these packages.
!appendix
From: Randy Brukardt
Sent: Wednesday, August 11, 2010 9:44 PM
The text in this AI says:
function Is_Decimal_Digit (Item : Wide_Character) return Boolean;
This function is a rename of Is_Digit.
We don't write this in English, we just do it when desired. That is, the
specification ought to be:
function Is_Decimal_Digit (Item : Wide_Character) return Boolean renames Is_Digit;
and the text description removed.
****************************************************************
Editor's note: This AI was reopened to address the items mentioned above and others
raised during Editorial Review. Specifically:
I had previously asked that the names Is_Other and Is_Punctuation be changed to
Is_Other_Format and Is_Punctuation_Connector; the latter in particular is very
misleading (it is true for underscore, but not for period or comma).
I had also noted that the implementation advice in A.3.1 is now dubious; no one
commented on that. So I don't even know what to suggest there.
Robert noted that To_Lower and To_Upper don't define the bounds of the result.
(The lower bound should be 1, to be consistent with Ada.Characters.Handling.)
John would prefer that Is_Line_Terminator, .. Is_Graphic be added to
Ada.Characters.Handling.
Finally we need to clarify the definition of Simple Lowercase Mapping and Simple
Uppercase Mapping. The first is a Unicode term; but we can't refer to Unicode
normatively in the Standard. The second doesn't exist anywhere.
Moreover, these are different than what identifiers use. Robert and I had an e-mail
meltdown on this back in July. And the identifier definition is completely daft,
as the "convert to uppercase" definition says use Unicode full case folding -- but
*that* is a conversion to *lower* case!
So we need to decide what we really want here.
****************************************************************