Version 1.2 of ai05s/ai05-0114-1.txt


!standard A.3.2(59)          10-08-12 AI05-0114-1/02
!standard A.4.6(7)
!class binding interpretation 08-10-06
!status Amendment 2012 10-08-12
!status ARG Approved 7-0-1 10-08-12
!status work item 08-10-06
!status received 08-06-13
!priority Low
!difficulty Easy
!qualifier Omission
!subject Conflicting definition of Letter
!summary
No change is made to Ada.Characters.Handling or Ada.Strings.Maps.Constants; however, we add a user note to point out the inconsistency.
!question
Ada 2005 has enhanced the set of characters allowed to compose identifiers. In particular, 2.3(2/2) specifies that an identifier is made up of items including identifier_start. Then 2.3(3/2) specifies that any letter_lowercase can be a component of identifier_start. Then 2.1(9/2) defines letter_lowercase to be "Any character whose General Category is defined to be "Letter, Lowercase"" by ISO/IEC 10646:2003.
The Unicode Data File lists each of the following as category Ll, meaning Letter, Lowercase:
    Code point U+00AA, named FEMININE ORDINAL INDICATOR;
    Code point U+00B5, named MICRO SIGN;
    Code point U+00BA, named MASCULINE ORDINAL INDICATOR.
Therefore, each of these three characters should be considered lowercase letters and allowed in identifiers according to the various Clause 2.1 and 2.3 paragraphs mentioned above.
Since Ada considers these three characters as letters suitable for being part of identifiers, the functions Is_Letter and Is_Lower in package Ada.Characters.Handling should now correspondingly return True for the characters with code points 170, 181, and 186. Should this change be made? (No.)
!recommendation
(See Summary.)
!wording
Add after A.3.2(59):
7 There are certain characters which are defined to be lower case letters by ISO 10646 and are therefore allowed in identifiers, but are not considered lower case letters by Ada.Characters.Handling.
AARM Reason: This is to maintain compatibility with the Ada 95 definitions of these functions.
Add after A.4.6(7):
NOTES 1 There are certain characters which are defined to be lower case letters by ISO 10646 and are therefore allowed in identifiers, but are not considered lower case letters by Ada.Strings.Maps.Constants.
AARM Reason: This is to maintain compatibility with the Ada 95 definitions of these constants.
!discussion
Changing the definition of Ada.Characters.Handling has the potential of breaking existing programs. Moreover, this is the worst kind of incompatibility: one where the behavior of a program silently changes.
The questioner also forgets that these definitions are used in other places: specifically, the constants of Ada.Strings.Maps and its relatives. This would spread the incompatibility to the majority of programs that use the Ada.Strings packages.
The questioner also seems to assume that there is some correlation between Ada.Characters.Handling and identifiers. But this has never been true; both concepts are defined separately.
While it is likely that many programs will not use any characters in the changed range, the potential incompatibility is so widespread that such a runtime change cannot be contemplated.
We could have resolved this difference by using the Ada 95 classification for Row 00 (that is, Latin-1). One way to do that would be to explicitly say that these three characters are not letters in Ada, even though they would qualify via Unicode. This is unlikely to be a major problem (all of the characters appear to have counterparts elsewhere in the Unicode set), but it would be unusual. And we would be saying that our good taste is more important than the carefully considered (we hope!) classifications of the character set standards.
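
The mismatch between the two definitions can be reproduced outside Ada. The following is only an illustrative sketch in Python, not part of any Ada implementation: the function names are hypothetical, `ada95_is_letter` is a hand-written model of the ranges that A.3.2 gives for Is_Letter, and `unicode_is_letter` relies on whatever Unicode data ships with the Python interpreter. Since newer Unicode data may list the two ordinal indicators under "Letter, Other" (Lo) rather than "Letter, Lowercase" (Ll), the sketch tests for any letter category rather than for Ll specifically.

```python
import unicodedata


def ada95_is_letter(code):
    """Model of Is_Letter: 'A'..'Z', 'a'..'z', or a position in
    192..214, 216..246, or 248..255 (assumed from A.3.2)."""
    return (
        65 <= code <= 90
        or 97 <= code <= 122
        or 192 <= code <= 214
        or 216 <= code <= 246
        or 248 <= code <= 255
    )


def unicode_is_letter(code):
    """True if the General Category of the code point is a letter (L*)."""
    return unicodedata.category(chr(code)).startswith("L")


def mismatches():
    """Latin-1 positions where the two classifications disagree."""
    return [c for c in range(256) if unicode_is_letter(c) != ada95_is_letter(c)]


if __name__ == "__main__":
    for code in mismatches():
        print(code, unicodedata.name(chr(code)))
```

Under these assumptions the only disagreements in Row 00 are positions 170, 181, and 186, which are the three characters named in the question.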
!corrigendum A.3.2(59)
Insert after the paragraph:
— Special graphic characters
the new paragraph:
7 There are certain characters which are defined to be lower case letters by ISO 10646 and are therefore allowed in identifiers, but are not considered lower case letters by Ada.Characters.Handling.
!corrigendum A.4.6(7)
Insert after the paragraph:
Each of these constants represents a correspondingly named set of characters or character mapping in Characters.Handling (see A.3.2).
the new paragraph:
NOTES
12 There are certain characters which are defined to be lower case letters by ISO 10646 and are therefore allowed in identifiers, but are not considered lower case letters by Ada.Strings.Maps.Constants.
!ACATS Test
Create ACATS C-Tests and (if we disallow the three additional characters in identifiers) B-Tests to check that whatever is decided is enforced.
!appendix

!topic Inconsistency in Ada 2005 definition of letter
!reference Ada 2005 A.3.2(24,25)
!from Howard W. Ludwig 08-06-26
!keywords identifier_start, letter_lowercase, Is_Letter, Is_Lower
!discussion

Ada 2005 has enhanced the set of characters allowed to compose identifiers. In particular,
2.3(2/2) specifies that an identifier is made up of items including identifier_start.
Then 2.3(3/2) specifies that any letter_lowercase can be a component of identifier_start.
Then 2.1(9/2) defines letter_lowercase to be "Any character whose General Category is
defined to be "Letter, Lowercase"" by ISO/IEC 10646:2003.

The Unicode Data File lists each of the following as category Ll, meaning Letter, Lowercase:
    Code point U+00AA, named FEMININE ORDINAL INDICATOR;
    Code point U+00B5, named MICRO SIGN;
    Code point U+00BA, named MASCULINE ORDINAL INDICATOR.

Therefore, each of these three characters should be considered lowercase letters and allowed
in identifiers according to the various Clause 2.1 and 2.3 paragraphs mentioned above.

This is contrary to Ada 95, in which 2.1(7..9) allows only characters in Row 00 of ISO 10646 BMP
whose name begins "Latin Capital Letter" or "Latin Small Letter". The MICRO SIGN and
two ORDINAL INDICATORs did not qualify as identifier characters under this Ada 95 rule but
do satisfy the Unicode lowercase letter categorization requirement for Ada 2005 identifiers.
Thus, Ada 2005 did not merely add code points beyond Row 00 of the BMP as allowed identifier
characters; it also changed the categorization of these three within Row 00.

Now that Ada considers these three characters as letters suitable for being part of identifiers,
the functions Is_Letter and Is_Lower in package Ada.Characters.Handling should now
correspondingly return True for the characters with code points 170, 181, and 186. I do not
have any strong opinion as to what Is_Basic should return as a value for these three
characters (first casual thought is True for 181 and still False for the other two). Now,
Is_Letter and Is_Lower do not have any relevance outside of type Character (that is,
beyond code point 255) for deciding what is acceptable (or not) for being part of an
identifier (though I think such functionality would be useful and should be included
for the broader category of characters, as Java does), but it should match for Row 00,
where both concepts meaningfully overlap.

I understand this would be a compatibility issue with respect to Ada 95 in that the same
program source code could yield different results under Ada 2005. However, the current letter
of the law is a conceptual incompatibility: in Ada 95, whether a character was regarded
as a suitable letter for an identifier and whether Is_Letter returned True always matched,
but they do not under the current Ada 2005 wording.

****************************************************************

From: Robert Dewar
Sent: Friday, July 9, 2010  4:28 PM

For the record, GNAT does not permit any of the codes AA, B5, BA in identifiers.
I have no intention of changing this unless someone thinks this incompatibility with
Ada 95 is *really* important, it's of course an acceptable incompatibility (something
that was illegal becoming legal), but to what purpose? None of these three symbols

(MICRO SIGN, FEMININE ORDINAL INDICATOR, MASCULINE ORDINAL INDICATOR)

are reasonable in identifiers.

Yes, once we stray outside the Latin-1, all sorts of bizarre characters are valid in
identifiers, but let's keep the basic 256 character set free of such oddity!!!

****************************************************************

From: Bob Duff
Sent: Friday, July 9, 2010  6:00 PM

> For the record, GNAT does not permit any of the codes AA, B5, BA in
> identifiers.
> I have no intention of changing this unless someone thinks this
> incompatibility with Ada 95 is *really* important, it's of course an
> acceptable incompatibility (something that was illegal becoming
> legal), but to what purpose?

Illegal-->legal is not called an "incompatibility";
it's called a "language extension".

I don't care one way or the other what we do, so long as you don't call it an
incompatibility.  ;-)

****************************************************************

From: Robert Dewar
Sent: Friday, July 9, 2010  6:07 PM

> I don't care one way or the other what we do, so long as you don't
> call it an incompatibility.  ;-)

Fair enough, but I like to regard "language extension"
as the implementation of something useful, not the accidental permitting of
something silly :-)

I would be embarrassed to call this a language extension

>> (MICRO SIGN, FEMININE ORDINAL INDICATOR, MASCULINE ORDINAL INDICATOR)
>>
>> are reasonable in identifiers.
>>
>> Yes, once we stray outside the Latin-1, all sorts of bizarre
>> characters are valid in identifiers, but let's keep the basic 256
>> character set free of such oddity!!!

Are you really neutral on this, do you think it is reasonable to allow these three
symbols (I can't bring myself to call them letters) in identifiers.

And wouldn't you find it a bit weird for a Latin-1 character that is classified
neither as a letter nor a digit by Ada.Characters.Handling was allowed in
identifiers???

****************************************************************

From: Randy Brukardt
Sent: Friday, July 9, 2010  6:49 PM

> Are you really neutral on this, do you think it is reasonable to allow
> these three symbols (I can't bring myself to call them letters) in
> identifiers.


The problem here is determining what is a letter. Either we can depend on the
definitions given by the character set standards (that is ISO/IEC 10646:2003) or
we can invent our own. Depending on the character set standard means that we're
going to have incompatibilities and extensions every time those standards change
(add new characters, for instance). While we can stick with a particular version
of a standard for a while, eventually we'll need to update the reference to it.

Inventing our own definition is fraught with danger, and in particular works
poorly when new characters are added to the character set standards. Neither
works very well in my view.

BTW, Unicode has a solution to prevent incompatibilities (but not
extensions!) in identifiers: it has a special classification for characters that
were once classified as letters and are not anymore. That classification is
recommended to be included in identifiers. (We didn't follow this advice in Ada
2005, as it was added after Unicode 4.0, and besides, you don't think we should
be referring to Unicode at all, so it's probably not possible -- the
classification isn't in 10646:2003.)

As far as the oddity itself goes, I find any characters outside of A-Z and
0-9 odd, so I don't really care (and it might come in handy for enumeration
literals).

> And wouldn't you find it a bit weird for a Latin-1 character that is
> classified neither as a letter nor a digit by Ada.Characters.Handling
> was allowed in identifiers???

Ada.Characters.Handling is based on an obsolete standard. We can't change it
because of the concern about silent changes in program behavior. This is the
same problem that we have with the Upper_Case_Map for Wide_Characters; it's
badly defined, but we don't want to change the behavior of existing programs.

Note that we'll have the same problem with Ada.Wide_Characters.Handling down the
road. It will be tied to some particular version of 10646, and we won't feel
able to change the results to some other version (which presumably will have new
kinds of letters with new upper and lower case mappings). Either that or we'll have
to accept behavior changes. (Neither sound great to me.)

It might be valuable to have a package whose behavior is defined to be exactly
that of Ada identifiers (for whatever version of the language is being used).
That package would intentionally change behaviors between versions as needed to
match changes in the character sets. (I don't think that is the main purpose of
Ada.Characters.Handling, and changing it to do that is nasty to programs that
are not managing identifiers.)

****************************************************************

From: Robert Dewar
Sent: Friday, July 9, 2010  7:00 PM

> As far as the oddity itself goes, I find any characters outside of A-Z and
> 0-9 odd, so I don't really care (and it might come in handy for
> enumeration literals).

Randy, that is an absurd and unhelpful position to take. It is obvious that you
have to allow common characters in European languages. MANY of our users take
advantage of this (e.g. acute accents). I find your position even worse than
Whitacker's O26 keypunch stuff :-) :-)

There is a heck of a difference between E-acute and these three characters.

> Ada.Characters.Handling is based on an obsolete standard. We can't
> change it because of the concern about silent changes in program
> behavior. This is the same problem that we have with the
> Upper_Case_Map for Wide_Characters; it's badly defined, but we don't
> want to change the behavior of existing programs.

Never mind the standards! Ada.Characters.Handling.Is_Letter returns entirely sensible
results. If there is no standard specifying this sensible result, too bad, we
definitely SHOULD invent our own rules for the first 256 characters.

And we should allow E acute as a letter, but not the MASCULINE and FEMININE
symbols!

> Note that we'll have the same problem with
> Ada.Wide_Characters.Handling down the road. It will be tied to some
> particular version of 10646, and we won't feel able to change the
> results to some other version (which presumably will have new kinds
> of letters with new upper and lower case mappings). Either that or we'll
> have to accept behavior changes. (Neither sound great to me.)

Easily handled at the compiler level with option flags (GNAT allows LOTS of
character sets in identifiers, e.g. all the Latin sets).

****************************************************************

From: Bob Duff
Sent: Friday, July 9, 2010  7:22 PM

> Fair enough, but I like to regard "language extension"
> as the implementation of something useful, not the accidental
> permitting of something silly :-)
>
> I would be embarrassed to call this a language extension

Then I suggest you use a term like "silly, stupid, useless, rubbage language
extension".  ;-)

I think it's important to reserve "incompatibility" for cases where we're
potentially breaking someone's previously-working code. That's a serious charge,
which I take seriously.

Otherwise the discussion gets confused.  (During Ada 9X there were some cases
where people used "incompatible" to mean "incompatible with my personal good
taste", and that confused the discussion!)

> >> None of these three symbols
> >>
> >> (MICRO SIGN, FEMININE ORDINAL INDICATOR, MASCULINE ORDINAL
> >> INDICATOR)
> >>
> >> are reasonable in identifiers.
> >>
> >> Yes, once we stray outside the Latin-1, all sorts of bizarre
> >> characters are valid in identifiers, but let's keep the basic 256
> >> character set free of such oddity!!!
>
> Are you really neutral on this, do you think it is reasonable to allow
> these three symbols (I can't bring myself to call them letters) in
> identifiers.

I don't know what they look like (though I can guess), nor how to type them in,
nor will I ever be likely to use them.  I'm neutral because I don't have enough
knowledge.  Like if you asked me what's the best restaurant in Moscow -- I've
never been there.  If there's some ISO character-set standard that says those
things ought to be "letters", or "identifier chars" or whatever, then who am I
to say nay?

> And wouldn't you find it a bit weird for a Latin-1 character that is
> classified neither as a letter nor a digit by Ada.Characters.Handling
> was allowed in identifiers???

Yes, I suppose so.  But then somebody will say "Ada.Characters.Handling is for
human-readable text and blah blah", and I'll reply "Yeah, OK with me."

****************************************************************

From: Robert Dewar
Sent: Friday, July 9, 2010  4:28 PM

>> I would be embarrassed to call this a language extension
>
> Then I suggest you use a term like "silly, stupid, useless, rubbage
> language extension".  ;-)

OK, henceforth, we shall call these SSURLE's, pronounced ssssurley "-)

> I think it's important to reserve "incompatibility" for cases where
> we're potentially breaking someone's previously-working code.
> That's a serious charge, which I take seriously.

yes of course

****************************************************************

