Version 1.5 of ai05s/ai05-0185-1.txt
!standard A.3.5 (0) 10-11-23 AI05-0185-1/04
!standard A.3.6 (0)
!standard A.3.1(7/2)
!standard A.3.2(4)
!standard A.3.2(32)
!class amendment 09-11-02
!status Amendment 2012 10-08-11
!status ARG Approved 5-0-3 10-10-31
!status work item 10-10-18
!status ARG Approved 8-0-0 10-06-20
!status work item 09-11-02
!status received 09-11-02
!priority Medium
!difficulty Medium
!subject Wide_Character and Wide_Wide_Character classification and folding
!summary
Packages are added to provide support for the classification and case folding
of Wide_Character and Wide_Wide_Character values.
!problem
The package Ada.Characters.Handling provides functions to classify a Character,
and provides functions to convert a Character to upper case and lower case.
There are no such capabilities for Wide_Character and Wide_Wide_Character.
Support for classification and case folding of the Wide_Character and
Wide_Wide_Character types should be added to the language.
!proposal
The current version of the GNAT compiler has defined the following
implementation-defined packages:
Ada.Wide_Characters.Unicode
Ada.Wide_Wide_Characters.Unicode
While Ada.Wide_Characters and Ada.Wide_Wide_Characters are standard Ada 2005
packages, the Unicode child packages are non-standard.
This proposal to create two standard packages:
Ada.Wide_Characters.Handling and
Ada.Wide_Wide_Characters.Handling
is based on the GNAT Unicode packages, but without the functions that accept
Unicode Category parameters.
!wording
Modify A.3.1(7/2):
If an implementation chooses to provide implementation-defined operations on
Wide_Character or Wide_String (such as [case mapping, classification, ]collating
and sorting, etc.) it should do so by providing child units of Wide_Characters.
Similarly if it chooses to provide implementation-defined operations on
Wide_Wide_Character or Wide_Wide_String it should do so by providing child units
of Wide_Wide_Characters.
Add to the end of A.3.2(4):
function Is_Line_Terminator (Item : in Character) return Boolean;
function Is_Mark (Item : in Character) return Boolean;
function Is_Other_Format (Item : in Character) return Boolean;
function Is_Punctuation_Connector (Item : in Character) return Boolean;
function Is_Space (Item : in Character) return Boolean;
Add following A.3.2(32):
Is_Line_Terminator
True if Item is a character with position 10 .. 13.
Is_Mark
Never True (no value of type Character has categories Mark, Non-Spacing
or Mark, Spacing Combining).
Is_Other_Format
True if Item is a character with position 173 (Soft_Hyphen).
Is_Punctuation_Connector
True if Item is a character with position 95 ('_', known as Low_Line
or Underscore).
Is_Space
True if Item is a character with position 32 (' ') or 160 (No_Break_Space).
A.3.5 The Package Wide_Characters.Handling
The package Wide_Characters.Handling provides operations for classifying
Wide_Characters and case folding for Wide_Characters.
Static Semantics
The library package Wide_Characters.Handling has the following declaration:
package Ada.Wide_Characters.Handling is
function Is_Control (Item : Wide_Character) return Boolean;
function Is_Letter (Item : Wide_Character) return Boolean;
function Is_Lower (Item : Wide_Character) return Boolean;
function Is_Upper (Item : Wide_Character) return Boolean;
function Is_Digit (Item : Wide_Character) return Boolean;
function Is_Decimal_Digit (Item : Wide_Character) return Boolean
renames Is_Digit;
function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
function Is_Alphanumeric (Item : Wide_Character) return Boolean;
function Is_Special (Item : Wide_Character) return Boolean;
function Is_Line_Terminator (Item : Wide_Character) return Boolean;
function Is_Mark (Item : Wide_Character) return Boolean;
function Is_Other_Format (Item : Wide_Character) return Boolean;
function Is_Punctuation_Connector (Item : Wide_Character) return Boolean;
function Is_Space (Item : Wide_Character) return Boolean;
function Is_Graphic (Item : Wide_Character) return Boolean;
function To_Lower (Item : Wide_Character) return Wide_Character;
function To_Upper (Item : Wide_Character) return Wide_Character;
function To_Lower (Item : Wide_String) return Wide_String;
function To_Upper (Item : Wide_String) return Wide_String;
end Ada.Wide_Characters.Handling;
The subprograms defined in Ada.Wide_Characters.Handling are locale independent.
function Is_Control (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_control, otherwise returns False.
function Is_Letter (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier,
letter_other, or number_letter; otherwise returns False.
function Is_Lower (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_lowercase, otherwise returns False.
function Is_Upper (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, otherwise returns False.
function Is_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, otherwise returns False.
function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, or is in the range 'A' .. 'F' or 'a' .. 'f', otherwise
returns False.
function Is_Alphanumeric (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase, letter_modifier,
letter_other, number_letter, or number_decimal; otherwise returns False.
function Is_Special (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, but not categorized as letter_uppercase,
letter_lowercase, letter_titlecase, letter_modifier, letter_other,
number_letter, or number_decimal; otherwise returns False.
function Is_Line_Terminator (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_line or separator_paragraph, or if Item is a conventional line
terminator character (CR, LF, VT, or FF); otherwise returns False.
function Is_Mark (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
mark_non_spacing or mark_spacing_combining, otherwise returns False.
function Is_Other_Format (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_format, otherwise returns False.
function Is_Punctuation_Connector (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
punctuation_connector, otherwise returns False.
function Is_Space (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_space, otherwise returns False.
function Is_Graphic (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, otherwise returns False.
function To_Lower (Item : Wide_Character) return Wide_Character;
Returns the Simple Lowercase Mapping as defined by documents referenced in
the note in section 1 of ISO/IEC 10646:2003 of the Wide_Character designated by
Item. If the Simple Lowercase Mapping does not exist for the
Wide_Character designated by Item, then the value of Item is returned.
function To_Lower (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Lower Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by
Item. The result is the null Wide_String if the value of the formal
parameter is the null Wide_String. The lower bound of the result Wide_String is 1.
function To_Upper (Item : Wide_Character) return Wide_Character;
Returns the Simple Uppercase Mapping as defined by documents referenced in
the note in section 1 of ISO/IEC 10646:2003 of the Wide_Character designated by
Item. If the Simple Uppercase Mapping does not exist for the
Wide_Character designated by Item, then the value of Item is returned.
function To_Upper (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Upper Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by
Item. The result is the null Wide_String if the value of the formal
parameter is the null Wide_String. The lower bound of the result Wide_String is 1.
A.3.6 The Package Wide_Wide_Characters.Handling
The package Wide_Wide_Characters.Handling has the same contents as
Wide_Characters.Handling except that each occurrence of Wide_Character is
replaced by Wide_Wide_Character, and each occurrence of Wide_String is replaced
by Wide_Wide_String.
!discussion
The GNAT Unicode packages define a Category type which maps to the Unicode
standard. Second forms of most of the classification routines exist that operate
on category type parameters instead of Wide_Character or Wide_Wide_Character.
The reason for these routines is that it is claimed they are more efficient if
multiple classification tests are to be performed on a Wide_Character or
Wide_Wide_Character value, otherwise the other form of the call that accepts
Wide_Character or Wide_Wide_Character is expected to be more efficient. The
category type however would tie the package more closely to the Unicode
standard, whereas it is desirable to hide that abstraction. Furthermore, adding
these routines would likely mean having to define a package like System.UTF_32
which is currently defined in GNAT. It seems that the categorization routines
are not necessary for the standard, and might be better left as
implementation-defined functionality.
The package Ada.Characters.Handling defines classification routines that are not
present in the GNAT Ada.Wide_Characters.Unicode and
Ada.Wide_Wide_Characters.Unicode packages.
Specifically, Is_Control, Is_Lower, Is_Upper, Is_Basic, Is_Decimal_Digit,
Is_Graphic, Is_Hexadecimal_Digit, Is_Alphanumeric, and Is_Special are absent.
These should be provided to be consistent with Ada.Characters.Handling.
The Is_Non_Graphic routine was replaced with Is_Graphic, and the remaining
functions were added, except for the Is_Basic and To_Basic
functions. It is not clear whether these functions have any meaning in
Wide_Character or Wide_Wide_Character contexts, as there do not appear to be any
Unicode functions for stripping off diacritical marks, and it is not clear that
doing so would result in a string that was meaningful.
Also, the ISO_646 related functions were not added; since those deal with 8-bit
values, they were deemed not appropriate for Wide_Character and
Wide_Wide_Character contexts.
Another question is whether some of the new classification functions should be
added to Ada.Characters.Handling. The wording in the RM for that package
describes the classification in terms of character ranges rather than the
categories defined in 2.1. Should these be reworded in terms of these
categories? [That question is tangentially covered by AI05-0114-1 - Editor.]
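As a rough illustration of the position-based style of definition mentioned
above, the following sketch restates the five Character classifications from the
proposed A.3.2(32) wording as Ada 2012 expression functions. The procedure name,
the local function bodies, and the Show helper are assumptions made for
illustration only; they are not the text of any implementation.

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;

procedure Character_Classes_Sketch is
   package L1 renames Ada.Characters.Latin_1;

   --  Local stand-ins (hypothetical) for the proposed functions, defined
   --  by character position as in the proposed wording for A.3.2(32).
   function Is_Line_Terminator (Item : Character) return Boolean is
     (Character'Pos (Item) in 10 .. 13);

   function Is_Mark (Item : Character) return Boolean is
     (False);  --  no value of type Character is categorized as a mark

   function Is_Other_Format (Item : Character) return Boolean is
     (Character'Pos (Item) = 173);  --  Soft_Hyphen

   function Is_Punctuation_Connector (Item : Character) return Boolean is
     (Character'Pos (Item) = 95);   --  Low_Line

   function Is_Space (Item : Character) return Boolean is
     (Character'Pos (Item) in 32 | 160);  --  ' ' or No_Break_Space

   --  Print every classification result for one character.
   procedure Show (Name : String; Item : Character) is
   begin
      Ada.Text_IO.Put_Line
        (Name
         & ": terminator=" & Boolean'Image (Is_Line_Terminator (Item))
         & " mark=" & Boolean'Image (Is_Mark (Item))
         & " format=" & Boolean'Image (Is_Other_Format (Item))
         & " connector=" & Boolean'Image (Is_Punctuation_Connector (Item))
         & " space=" & Boolean'Image (Is_Space (Item)));
   end Show;
begin
   Show ("LF", L1.LF);
   Show ("Soft_Hyphen", L1.Soft_Hyphen);
   Show ("Low_Line", L1.Low_Line);
   Show ("No_Break_Space", L1.No_Break_Space);
end Character_Classes_Sketch;
```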
!example
(See discussion.)
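A minimal sketch of how the proposed package might be used, assuming an Ada 2012
compiler that provides Ada.Wide_Characters.Handling as declared in the wording.
The procedure name Handling_Demo and the sample text are made up for
illustration. The program prints its results rather than assuming them, since
the bounds of the To_Upper result and the behavior of the case mappings for
characters such as dotless i are exactly the points raised in the appendix.

```ada
with Ada.Wide_Text_IO;
with Ada.Wide_Characters.Handling;

procedure Handling_Demo is
   use Ada.Wide_Text_IO;
   use Ada.Wide_Characters.Handling;

   Text : constant Wide_String := "hello_world 42";

   --  A slice keeps the bounds of its source (7 .. 11 here).
   Slice : constant Wide_String := Text (7 .. 11);   --  "world"

   --  The wording specifies a lower bound of 1 for this result.
   Upper : constant Wide_String := To_Upper (Slice);

   --  LATIN SMALL LETTER DOTLESS I, used to probe whether the case
   --  mappings round-trip on a given implementation.
   Dotless_I : constant Wide_Character := Wide_Character'Val (16#0131#);
begin
   --  Classify each character of the sample text.
   for I in Text'Range loop
      Put (Text (I));
      Put (":");
      if Is_Letter (Text (I)) then
         Put (" letter");
      end if;
      if Is_Digit (Text (I)) then
         Put (" digit");
      end if;
      if Is_Punctuation_Connector (Text (I)) then
         Put (" connector");
      end if;
      if Is_Space (Text (I)) then
         Put (" space");
      end if;
      if Is_Special (Text (I)) then
         Put (" special");
      end if;
      New_Line;
   end loop;

   --  Case mapping of a string, and the bounds before and after.
   Put_Line (Upper);
   Put_Line ("Slice'First =" & Integer'Wide_Image (Slice'First));
   Put_Line ("Upper'First =" & Integer'Wide_Image (Upper'First));

   --  To_Upper may be many-to-one, so this need not print TRUE.
   Put_Line
     ("round trip preserved: "
      & Boolean'Wide_Image (To_Lower (To_Upper (Dotless_I)) = Dotless_I));
end Handling_Demo;
```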
!corrigendum A.3.1(7/2)
Replace the paragraph:
If an implementation chooses to provide implementation-defined operations on
Wide_Character or Wide_String (such as case mapping, classification, collating
and sorting, etc.) it should do so by providing child units of Wide_Characters.
Similarly if it chooses to provide implementation-defined operations on
Wide_Wide_Character or Wide_Wide_String it should do so by providing child units
of Wide_Wide_Characters.
by:
If an implementation chooses to provide implementation-defined operations on
Wide_Character or Wide_String (such as collating
and sorting, etc.) it should do so by providing child units of Wide_Characters.
Similarly if it chooses to provide implementation-defined operations on
Wide_Wide_Character or Wide_Wide_String it should do so by providing child units
of Wide_Wide_Characters.
!corrigendum A.3.2(4)
Replace the paragraph:
function Is_Control (Item : in Character) return Boolean;
function Is_Graphic (Item : in Character) return Boolean;
function Is_Letter (Item : in Character) return Boolean;
function Is_Lower (Item : in Character) return Boolean;
function Is_Upper (Item : in Character) return Boolean;
function Is_Basic (Item : in Character) return Boolean;
function Is_Digit (Item : in Character) return Boolean;
function Is_Decimal_Digit (Item : in Character) return Boolean
renames Is_Digit;
function Is_Hexadecimal_Digit (Item : in Character) return Boolean;
function Is_Alphanumeric (Item : in Character) return Boolean;
function Is_Special (Item : in Character) return Boolean;
by:
function Is_Control (Item : in Character) return Boolean;
function Is_Graphic (Item : in Character) return Boolean;
function Is_Letter (Item : in Character) return Boolean;
function Is_Lower (Item : in Character) return Boolean;
function Is_Upper (Item : in Character) return Boolean;
function Is_Basic (Item : in Character) return Boolean;
function Is_Digit (Item : in Character) return Boolean;
function Is_Decimal_Digit (Item : in Character) return Boolean
renames Is_Digit;
function Is_Hexadecimal_Digit (Item : in Character) return Boolean;
function Is_Alphanumeric (Item : in Character) return Boolean;
function Is_Special (Item : in Character) return Boolean;
function Is_Line_Terminator (Item : in Character) return Boolean;
function Is_Mark (Item : in Character) return Boolean;
function Is_Other_Format (Item : in Character) return Boolean;
function Is_Punctuation_Connector (Item : in Character) return Boolean;
function Is_Space (Item : in Character) return Boolean;
!corrigendum A.3.2(32)
Insert after the paragraph:
- Is_Special
-
True if Item is a special graphic character. A special graphic character is a graphic
character that is not alphanumeric.
the new paragraphs:
- Is_Line_Terminator
-
True if Item is a character with position 10 .. 13.
- Is_Mark
-
Never True (no value of type Character has categories Mark, Non-Spacing
or Mark, Spacing Combining).
- Is_Other_Format
-
True if Item is a character with position 173 (Soft_Hyphen).
- Is_Punctuation_Connector
-
True if Item is a character with position 95 ('_', known as Low_Line
or Underscore).
- Is_Space
-
True if Item is a character with position 32 (' ') or 160 (No_Break_Space).
!corrigendum A.3.5(0)
Insert new clause:
The package Wide_Characters.Handling provides operations for classifying
Wide_Characters and case folding for Wide_Characters.
Static Semantics
The library package Wide_Characters.Handling has the following declaration:
package Ada.Wide_Characters.Handling is
function Is_Control (Item : Wide_Character) return Boolean;
function Is_Letter (Item : Wide_Character) return Boolean;
function Is_Lower (Item : Wide_Character) return Boolean;
function Is_Upper (Item : Wide_Character) return Boolean;
function Is_Digit (Item : Wide_Character) return Boolean;
function Is_Decimal_Digit (Item : Wide_Character) return Boolean
renames Is_Digit;
function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
function Is_Alphanumeric (Item : Wide_Character) return Boolean;
function Is_Special (Item : Wide_Character) return Boolean;
function Is_Line_Terminator (Item : Wide_Character) return Boolean;
function Is_Mark (Item : Wide_Character) return Boolean;
function Is_Other_Format (Item : Wide_Character) return Boolean;
function Is_Punctuation_Connector (Item : Wide_Character) return Boolean;
function Is_Space (Item : Wide_Character) return Boolean;
function Is_Graphic (Item : Wide_Character) return Boolean;
function To_Lower (Item : Wide_Character) return Wide_Character;
function To_Upper (Item : Wide_Character) return Wide_Character;
function To_Lower (Item : Wide_String) return Wide_String;
function To_Upper (Item : Wide_String) return Wide_String;
end Ada.Wide_Characters.Handling;
The subprograms defined in Ada.Wide_Characters.Handling are locale independent.
function Is_Control (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_control, otherwise returns False.
function Is_Letter (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase,
letter_modifier, letter_other, or number_letter; otherwise
returns False.
function Is_Lower (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_lowercase, otherwise returns False.
function Is_Upper (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, otherwise returns False.
function Is_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, otherwise returns False.
function Is_Hexadecimal_Digit (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
number_decimal, or is in the range 'A' .. 'F' or 'a' .. 'f', otherwise
returns False.
function Is_Alphanumeric (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
letter_uppercase, letter_lowercase, letter_titlecase,
letter_modifier, letter_other, number_letter, or
number_decimal; otherwise returns False.
function Is_Special (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, but not categorized as letter_uppercase,
letter_lowercase, letter_titlecase, letter_modifier,
letter_other, number_letter, or number_decimal; otherwise returns
False.
function Is_Line_Terminator (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_line or separator_paragraph, or if Item is a conventional
line terminator character (CR, LF, VT, or FF); otherwise returns False.
function Is_Mark (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
mark_non_spacing or mark_spacing_combining, otherwise returns False.
function Is_Other_Format (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
other_format, otherwise returns False.
function Is_Punctuation_Connector (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
punctuation_connector, otherwise returns False.
function Is_Space (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
separator_space, otherwise returns False.
function Is_Graphic (Item : Wide_Character) return Boolean;
Returns True if the Wide_Character designated by Item is categorized as
graphic_character, otherwise returns False.
function To_Lower (Item : Wide_Character) return Wide_Character;
Returns the Simple Lowercase Mapping as defined by documents referenced
in the note in section 1 of ISO/IEC 10646:2003 of the Wide_Character designated
by Item. If the Simple Lowercase Mapping does not exist for the Wide_Character
designated by Item, then the value of Item is returned.
function To_Lower (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Lower Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by Item.
The result is the null Wide_String if the value of the formal parameter is the
null Wide_String. The lower bound of the result Wide_String is 1.
function To_Upper (Item : Wide_Character) return Wide_Character;
Returns the Simple Uppercase Mapping as defined by documents referenced
in the note in section 1 of ISO/IEC 10646:2003 of the Wide_Character designated
by Item. If the Simple Uppercase Mapping does not exist for the Wide_Character
designated by Item, then the value of Item is returned.
function To_Upper (Item : Wide_String) return Wide_String;
Returns the result of applying the To_Upper Wide_Character to
Wide_Character conversion to each element of the Wide_String designated by Item.
The result is the null Wide_String if the value of the formal parameter is the
null Wide_String. The lower bound of the result Wide_String is 1.
!corrigendum A.3.6(0)
Insert new clause:
The package Wide_Wide_Characters.Handling has the same contents as
Wide_Characters.Handling except that each occurrence of Wide_Character is
replaced by Wide_Wide_Character, and each occurrence of Wide_String is replaced
by Wide_Wide_String.
!ACATS test
ACATS C-Tests should be constructed for these packages.
!appendix
From: Robert Dewar
Sent: Saturday, July 3, 2010 3:29 PM
We forgot to say what the bounds of the result are for To_Lower and To_Upper. I
suggest the same as the bounds of the input parameter (the alternative is always
1 as the low bound).
****************************************************************
From: Robert Dewar
Sent: Saturday, July 3, 2010 3:45 PM
The Inline pragma for Is_Graphic says Is_Non_Graphic
****************************************************************
From: Robert Dewar
Sent: Saturday, July 3, 2010 4:05 PM
I object to the pragma Inlines; it's up to the implementation what makes sense
to mark as inlined.
****************************************************************
From: Randy Brukardt
Sent: Saturday, July 3, 2010 5:28 PM
Robert, in the future, please indicate the AI and version (and Bob would like
the title as well) that you are looking at, because it can be hard to find
whatever is being referred to.
Anyway, once I figured out that you are talking about AI05-0185-1, the first
note in the yet-to-be-published minutes says: "Drop all of the pragma Inline."
It's not that helpful to review AIs between the end of a meeting and the
publishing of the minutes, because it is likely that you'll just comment on
stuff that has already been decided -- and that just adds to my workload without
any corresponding benefit. There will be an editorial review of all of the newly
completed AIs that will start shortly, and that is the appropriate time for
reviewing these.
****************************************************************
From: Robert Dewar
Sent: Saturday, July 3, 2010 6:40 PM
No problem, I am just making comments as I implement things and notice them, but
this stuff is hardly critical!
One thing that does concern me is To_Lower, I hope everyone realizes that
To_Upper and To_Lower are not easily reversible.
For instance in identifiers, you definitely want lower case i with a dot to be
equivalent to upper case I without a dot. Anything else would be a big surprise
to anyone who is not Turkish.
But there are four characters: lower case i with and without a dot, and upper case
I with and without a dot. The natural folding would be to keep the dot, but
that's obviously not what you want.
So my current implementation of To_Upper folds lower case i with a dot to upper
case I without a dot. But I am not sure what the To_Upper and To_Lower functions
in these packages in the AI are supposed to do.
Who has studied the To_Upper/To_Lower issue carefully for the purpose of this
AI? Someone I hope! Or were these routines just stuck in casually without
thinking about the difficult problems behind them (I suspect this is the case,
please tell me it isn't and that someone can tell me EXACTLY what they had in
mind).
I follow the locale independent case folding discussed in note 1 of ISO/IEC
10646:2003 for To_Upper_Case currently.
And now I can't even find this standard to look at it again :-(
UGH! Case folding was one of the hardest things to deal with, and here it is in
even greater glory in this package. Oh well I can always implement something or
other. The RM certainly does not say what it means (though what *is* the
reference to Simple_Lower_Case???)
****************************************************************
From: Robert Dewar
Sent: Saturday, July 3, 2010 6:44 PM
> For instance in identifiers, you definitely want lower case i with a
> dot to be equivalent to upper case I without a dot. Anything else
> would be a big surprise to anyone who is not Turkish.
To expand on this a bit, my current To_Upper function maps both lower case i
with dot and lower case i with no dot to upper case I with no dot. I am sure
this is what is wanted for identifier case equivalence (anything else would be
an incompatible disaster). But that means that To_Upper is a many-to-one
mapping, and thus is not reversible.
****************************************************************
From: Randy Brukardt
Sent: Sunday, July 4, 2010 6:07 PM
...
> One thing that does concern me is To_Lower, I hope everyone realizes
> that To_Upper and To_Lower are not easily reversible.
Those of us in the ARG who (sort of) understand the character stuff surely know
that. But it probably would be a good idea to make it clear to regular
end-users, so it would make sense to add a user note.
...
> So my current implementation of To_Upper folds lower case i with a dot
> to upper case I without a dot. But I am not sure what the To_Upper and
> To_Lower functions in these packages in the AI are supposed to do.
My understanding is that they are supposed to use the "Simple Uppercase Mapping"
(and "Simple Lowercase Mapping") as defined by 10646. If there is no such thing,
we have a problem! Probably the wording should make this clearer rather than
just using Titlecase for the terms. That is, say something like "Simple
Uppercase Mapping of ISO/IEC 10646:2003."
> Who has studied the To_Upper/To_Lower issue carefully for the purpose
> of this AI? Someone I hope! Or were these routines just stuck in
> casually without thinking about the difficult problems behind them (I
> suspect this is the case, please tell me it isn't and that someone can
> tell me EXACTLY what they had in mind).
>
> I follow the locale independent case folding discussed in note 1 of
> ISO/IEC 10646:2003 for To_Upper_Case currently.
>
> And now I can't even find this standard to look at it again :-(
I vaguely recall someone saying that this standard has free availability;
presuming that is true there should be no problem getting a copy. (That said, I
don't have a copy and should get one.)
> UGH! Case folding was one of the hardest things to deal with, and here
> it is in even greater glory in this package. Oh well I can always
> implement something or other. The RM certainly does not say what it
> means (though what *is* the reference to
> Simple_Lower_Case???)
It's "Simple Uppercase Mapping", and I presume there is something with that name
in 10646. If not, we don't have a defined functionality, and that *surely* would
be a problem.
I personally had thought that this was talking about the same mapping used for
Ada Identifiers, but having read the definition again, I'm not so sure anymore.
That's because To_Upper for strings is defined in terms of To_Upper for
characters, and that surely doesn't work for the full character set (how can
To_Upper for a character return the *three* characters needed in some extreme
cases??). So I suspect that you are right that there is a definitional problem
here.
****************************************************************
From: Robert Dewar
Sent: Sunday, July 4, 2010 6:30 PM
> My understanding is that they are supposed to use the "Simple
> Uppercase Mapping" (and "Simple Lowercase Mapping") as defined by
> 10646. If there is no such thing, we have a problem! Probably the
> wording should make this clearer rather than just using Titlecase for
> the terms. That is, say something like "Simple Uppercase Mapping of ISO/IEC
> 10646:2003."
I don't know what this refers to, can someone find a reference?
> I personally had thought that this was talking about the same mapping
> used for Ada Identifiers, but having read the definition again, I'm
> not so sure anymore. That's because To_Upper for strings is defined in
> terms of To_Upper for characters, and that surely doesn't work for the
> full character set (how can To_Upper for a character return the
> *three* characters needed in some extreme cases??). So I suspect that
> you are right that there is a definitional problem here.
To_Upper cannot return three characters for one, what are you talking about?
10646 has one code per point, we are not talking about UTF-8 strings here.
For source it's up to you how the characters are represented, but conceptually
identifiers are a sequence of wide_wide_characters.
[This thread is rapidly turning to talk about identifiers; as such
it continues in AI05-0227-1.]
****************************************************************
From: Randy Brukardt
Sent: Wednesday, August 11, 2010 9:44 PM
The text in this AI says:
function Is_Decimal_Digit (Item : Wide_Character) return Boolean;
This function is a rename of Is_Digit.
We don't write this in English, we just do it when desired. That is, the
specification ought to be:
function Is_Decimal_Digit (Item : Wide_Character) return Boolean renames Is_Digit;
and the text description removed.
****************************************************************
From: Randy Brukardt
Sent: Wednesday, August 11, 2010 9:55 PM
Should we change the Implementation Advice in A.3.1 since we are now providing
some form of case mapping and classification? It says:
If an implementation chooses to provide implementation-defined operations on
Wide_Character or Wide_String (such as case mapping, classification, collating
and sorting, etc.) it should do so by providing child units of Wide_Characters.
Similarly if it chooses to provide implementation-defined operations on
Wide_Wide_Character or Wide_Wide_String it should do so by providing child units
of Wide_Wide_Characters.
Arguably it is still correct, since one could easily imagine further
classification functions and "full case folding". But it seems a bit misleading,
especially as it originally was added because we were *not* adding
Wide_Characters.Handling in Ada 2005; now that we decided to do that, it is not
clear that it is as useful. (And it is a bit weird that it doesn't mention
String; why not make the same statement for it?)
Thoughts??
****************************************************************
From: Randy Brukardt
Sent: Wednesday, August 11, 2010 10:38 PM
Having looked at this a bit more I wonder if the names of the Is_Other and
Is_Punctuation routines are misleading.
Is_Other returns True for Other_Format characters, but not for other characters
classified as "other, something". I think this routine would be better called
Is_Other_Format.
I was going to ignore Is_Other, but then I saw that Is_Punctuation is very
misleading. This returns true for characters in category punctuation_connector
(that is, for underscore), but will return False for common punctuation like '.'
and ','. Punctuation_connector is the only category used in the Ada grammar (in
identifiers), so it is the only one our standard defines. As such, it probably
is the only one we really want to support here, but clearly we need a name that
isn't misleading. Is_Punctuation_Connector would be a much better name.
Thoughts??
****************************************************************
From: Robert Dewar
Sent: Wednesday, August 11, 2010 11:14 PM
No objections to these name changes, they seem minor and are easy enough to
adjust in existing code.
****************************************************************
Editor's note: This AI was reopened to address the items mentioned above and
others raised during Editorial Review. Specifically:
I had previously asked that the names Is_Other and Is_Punctuation be changed to
Is_Other_Format and Is_Punctuation_Connector; the latter in particular is very
misleading (it is true for underscore, but not for period or comma).
I had also noted that the implementation advice in A.3.1 is now dubious; no one
commented on that. So I don't even know what to suggest there.
Robert noted that To_Lower and To_Upper don't define the bounds of the result.
(It should be 1, to be consistent with Ada.Characters.Handling.)
John would prefer that Is_Line_Terminator, .. Is_Graphic be added to
Ada.Characters.Handling.
Finally we need to clarify the definition of Simple Lowercase Mapping and Simple
Uppercase Mapping. The first is a Unicode term; but we can't refer to Unicode
normatively in the Standard. The second doesn't exist anywhere.
Moreover, these are different than what identifiers use. Robert and I had an
e-mail meltdown on this back in July. And the identifier definition is
completely daft, as the "convert to uppercase" definition says use Unicode full
case folding -- but *that* is a conversion to *lower* case! See AI05-0227-1.
So we need to decide what we really want here.
****************************************************************