Version 1.2 of ai05s/ai05-0227-1.txt


!standard 1.1.4(14.2/2)          11-01-22 AI05-0227-1/02
!class binding interpretation 10-10-21
!status work item 10-10-21
!status received 10-07-03
!priority Medium
!difficulty Hard
!qualifier Error
!subject Identifier equivalence
!summary
Identifier equivalence is based on Unicode "locale-independent simple case folding".
The result of 'Wide_Wide_Image is based on the "simple upper case mapping" of the enumeration literals. An enumeration type is illegal if two literals have the same "simple upper case mapping".
!question
AARM 2.3(5.c/2) implies that Ada 2005 may be incompatible with Ada 95 in obscure cases. That's because some of the sequences which are neither identifiers nor reserved words in Ada 2005 were legal identifiers in Ada 95. Moreover, the expected change in identifier equivalence could also introduce incompatibilities or, in very unusual cases, inconsistencies. Neither of these was discussed or documented for Ada 2005.
However, this AARM note is wrong, because 1.1.4(14.2/2) says that "convert to upper case" means "locale-independent full case folding". Full case folding does not consider the character at 16#DF# (the German character that looks like a beta) to be the same as "ss". Thus, these incompatibilities don't arise. However, "full case folding" is a mapping to lower case!! Thus the definition in 1.1.4(14.2/2) is madness. Moreover, taken literally, it requires 'Image to produce lower case versions of enumeration literals, which would be completely inconsistent with Ada 95 and before. That cannot have been intended.
So what is the rule? (See Summary.)
!wording
Replace 1.1.4(14.2/2):
When this International Standard mentions the conversion of some character or sequence of characters to upper case, it means the character or sequence of characters obtained by using simple upper case mapping, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.
Replace 2.3(5/3):
Two identifiers are considered the same if they consist of the same sequence of characters after applying locale-independent simple case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.
Replace 2.3(5.3/3):
After applying simple case folding, an identifier shall not be identical to a reserved word.
AARM Discussion:
Simple case folding is a mapping to lower case, so this is a matching to the defining (lower case) version of a reserved word. We could have mentioned case folding of the reserved words, but as that is an identity function, it would have no effect.
Modify 2.3(5.b/3):
...after [converting to upper case]{applying case folding} so that the rules...
Replace 2.3(5.c/2):
The rules for reserved words differ in one way: they define case conversion on letters rather than sequences. This means that it is possible that there exist some unusual sequences that are neither identifiers nor reserved words. We are not aware of any such sequences so long as we use simple case folding (as opposed to full case folding), but we have defined the rules in case any are introduced in future character set standards. This originally was a problem when converting to upper case: “<i>f” and “acce<s>” have upper case conversions of “IF” and “ACCESS” respectively. We would not want these to be treated as reserved words. But neither of these cases exist when using simple case folding.
Replace the notes 2.3(6.a/2-6.i/2) by: (Turkish characters surrounded by <> here, <I> - dotted capital I, <i> - dotless lower-case I).
For instance, in most languages, the simple case folded equivalent of LATIN CAPITAL LETTER I (an upper case letter without a dot above) is LATIN SMALL LETTER I (a lower case letter with a dot above). In Turkish, though, LATIN CAPITAL LETTER I and LATIN CAPITAL LETTER I WITH DOT ABOVE are two distinct letters, so the case folded equivalent of LATIN CAPITAL LETTER I is LATIN SMALL LETTER DOTLESS I, and the case folded equivalent of LATIN CAPITAL LETTER I WITH DOT ABOVE is LATIN SMALL LETTER I. Take for instance the following identifier (which is the name of a city on the Tigris river in Eastern Anatolia):
D<I>YARBAKIR -- First I is dotted, second is not.
A Turkish speaker would expect that the original identifier is equivalent to:
diyarbak<i>r
However, case folding (and thus Ada) maps this to:
d<I>yarbakir
which is different from any of:
<the four values in 6.f>
including the "correct" matching identifier for Turkish. Upper case conversion (used in '[Wide_]Wide_Image) introduces additional problems.
An implementation targeting the Turkish market is allowed (in fact, expected) to provide a nonstandard mode where case folding is appropriate for Turkish.
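The difference between the default folding and a Turkish-tailored one can be sketched in Python. This is an illustration only, not part of the wording: `turkish_fold` and `TURKISH_FOLD` are hypothetical names introduced here, and Python's built-in `str.casefold` applies full (not simple) case folding, so its result for the dotted capital I differs in detail from the simple folding shown above. The point that survives is that neither default folding produces the identifier a Turkish speaker expects.

```python
DOTTED_CAP_I = "\u0130"     # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

city = "D" + DOTTED_CAP_I + "YARBAKIR"                # first I dotted, second not
turkish_expected = "diyarbak" + DOTLESS_SMALL_I + "r"  # what a Turkish speaker expects

# Hypothetical tailoring table: only the two I letters differ from the default.
TURKISH_FOLD = {"I": DOTLESS_SMALL_I, DOTTED_CAP_I: "i"}


def turkish_fold(text: str) -> str:
    """Fold each character, using the Turkish tailoring for the two I letters."""
    return "".join(TURKISH_FOLD.get(ch, ch.casefold()) for ch in text)


# Locale-independent folding, as provided by Python (full case folding).
default_folded = city.casefold()
```

With these definitions, `turkish_fold(city)` equals `turkish_expected`, while `default_folded` does not. A nonstandard Turkish mode would, in effect, substitute a tailored table of this kind.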
Replace 3.5.1(5):
The defining_identifiers in upper case Redundant[and the defining_character_literals] listed in an enumeration_type_definition shall be distinct.
AARM Reason: To ease implementation of the attribute Wide_Wide_Value, we require that all enumeration literals have distinct images.
!discussion
The original intent was to follow the Unicode recommendations in this area. Therefore, the author went back and read the Unicode recommendations.
Unicode says that case-insensitive identifier equivalence should be done by converting both strings using "locale-independent case folding", and comparing the results. But this is not a conversion to some case: it is solely intended for comparisons.
Unicode also provides a variety of mappings to convert strings from mixed case into all upper case or all lower case versions. The "full" versions are preferred (these may change the length of the strings); the "simple" versions leave the string lengths the same.
The important take-away here is that there is not a single operation that can be used both for case conversion and for case-insensitive comparisons. These are two different operations.
An important property of "full case folding" (and "simple case folding" as well) is that Unicode guarantees that it is stable. That is, it will always provide the same results for two strings (that contain only defined code-points) in any current or newer version of Unicode. Conversely, the various case mappings are not stable: each new version of Unicode may provide different results.
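As an illustration outside Ada, the distinction can be seen in Python, whose string methods implement the full variants of these Unicode operations. The helper name `same_ignoring_case` is an assumption made for this sketch.

```python
# Case folding (for comparison) and case mapping (for conversion) are
# separate operations. Python's str methods implement the *full* variants.
sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S (16#DF#)

folded = sharp_s.casefold()  # full case folding: a key for comparison only
upper = sharp_s.upper()      # full upper case mapping: may change the length
lower = sharp_s.lower()      # full lower case mapping


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless match: compare the case-folded forms, not a case conversion."""
    return a.casefold() == b.casefold()
```

Here `folded` is the two-character string "ss" and `upper` is "SS", while `lower` leaves the character unchanged. No one of the three can stand in for the others.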
Since "full case folding" is stable, it is appropriate for use in programming languages: identifiers will remain compatible with future versions of Unicode. Case mapping is not considered appropriate, because it will change with each new version of Unicode.
The Problem.
Ada 95 and Ada 2005 have confused the two concerns. "Convert to upper case" is used both for identifier equivalence (which is a comparison operation) and for the result of 'Image (which is a case conversion problem). To follow the Unicode recommendations, these have to be treated separately.
That means that the notion of identifier equivalence in 2.3(5.2/2) needs to be replaced by a direct reference to full case folding. Most of the other references in 2.3 and 2.9 to upper case would also have to be replaced by this same definition of equivalence.
At the same time, we also need to replace 1.1.4(14.2/2) by a reference to "full upper case mapping". This is what '[Wide_Wide_]Image would use. ('Image cannot use "full case folding" as this is not intended to be a case conversion mapping, and in any case it goes to lower case.)
This makes Ada 2012 as compatible with the Unicode recommendations as possible. However, there are problems.
Both "full case folding" and "full upper case mapping" can cause strings to change lengths. This adds implementation complexity.
"Locale-independent full case folding" maps the 16#DF# character (German sharp s, represented here by <S>) to "ss". This leads directly to the various incompatibilities noted in the question. In particular, all of the following identifiers would be equivalent in Ada 2012 if this were adopted:
Bass BASS BAss Ba<S> ba<S>
This is clearly incompatible with Ada 95 (where the last two are different than the first three). Moreover, this incompatibility could in fact lead to a Beaujolais-like inconsistency if there are nested identifiers that used to be considered different and now are considered the same.
We might be prepared to live with an incompatibility, but an inconsistency here is unconscionable. So we cannot use "locale-independent full case folding".
If we use "locale-independent simple case folding" instead, it then makes no sense to use the more complex "full upper case mapping" for '[Wide_Wide_]Image. We would have the same problem with 'Image that we had previously with identifier equivalence. For instance, the sharp s character maps to "SS" in full upper case mapping. Thus, 'Image (Bass) = "BASS" and 'Image (Ba<S>) = "BASS". This particular example is especially nasty because it is inconsistent with the Ada 95 handling of this identifier.
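The equivalence of the five spellings under full case folding can be checked with a short Python sketch (Python's `str.casefold` is full case folding; the variable names are chosen here for illustration).

```python
SHARP_S = "\u00DF"  # German sharp s, written <S> in the text

# The five spellings listed above.
candidates = ["Bass", "BASS", "BAss", "Ba" + SHARP_S, "ba" + SHARP_S]

# Under full case folding every spelling collapses to the same key.
folded_keys = {name.casefold() for name in candidates}
```

The set `folded_keys` contains the single element "bass", which is exactly the collapse that makes the last two spellings clash with the first three.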
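A minimal Python sketch of the collision, assuming 'Image were defined by full upper case mapping (which is what Python's `str.upper` implements). The variable names are illustrative only.

```python
SHARP_S = "\u00DF"  # German sharp s, written <S> in the text

# Two identifiers that simple case folding keeps distinct...
plain = "Bass"
with_sharp_s = "Ba" + SHARP_S

# ...get the same image under full upper case mapping.
image_plain = plain.upper()
image_sharp = with_sharp_s.upper()
```

Both images are "BASS", and the second is one character longer than the identifier it came from.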
Additionally, '[Wide_Wide_]Value would have to use the relatively complex full case folding to determine which identifier was provided. This would be more complex to do than the trivial conversions currently done; that would have a runtime cost, both in time and space.
Even using "simple case folding" and "simple upper case mapping", it is still possible for two different identifiers to have the same upper case mapping (the dotless i is likely to be such an example). In order to keep the implementation of '[Wide_Wide_]Value manageable, we have also adopted a rule that all of the literals of an enumeration type have distinct upper case mappings. This allows '[Wide_Wide_]Value to compare the upper case mappings of its parameter, rather than having to use case folding.
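The effect of the distinctness rule can be sketched in Python. Everything here is an assumption made for illustration: Python has no built-in simple upper case mapping, so `approx_simple_upper` approximates it by leaving a character unchanged whenever the full mapping would expand it (this is not exact for every character), and `build_value_table` and `value` are hypothetical helpers, not any compiler's actual implementation.

```python
def approx_simple_upper(text: str) -> str:
    """Length-preserving upper casing. An approximation of the simple upper
    case mapping: a character whose full mapping expands is kept unchanged."""
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def build_value_table(literals):
    """Map each literal's upper case image to its position; reject clashes,
    mirroring the rule that the images of the literals must be distinct."""
    table = {}
    for pos, lit in enumerate(literals):
        image = approx_simple_upper(lit)
        if image in table:
            raise ValueError(
                f"literals {literals[table[image]]!r} and {lit!r} share image {image!r}"
            )
        table[image] = pos
    return table


def value(table, text: str) -> int:
    """Sketch of a 'Value-style lookup: upper-case the input and look it up."""
    return table[approx_simple_upper(text.strip())]
```

For example, `value(build_value_table(["Red", "Green", "Blue"]), " green ")` yields position 1, while a type declaring both `intent` and the same word spelled with a dotless i is rejected when the table is built. Because the images are unique, the lookup never needs case folding.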
ALTERNATIVE SOLUTIONS
The bad effects above come about because case conversions and case equivalence are separate entities. Clearly, Unicode does not anticipate a function like 'Image in a programming language. Probably they would recommend that it return the original case of the identifier. But it's many decades too late to do that.
Thus, we also considered simpler changes where we leave the semantic rules as they are and just define "convert to upper case" meaningfully.
The obvious thing to do is to define "convert to upper case" in 1.1.4(14.2/2) to be full upper case mapping. Indeed, the AARM notes were constructed using the notion that this is what we were doing.
However, this has the same problem as using "locale-independent full case folding". The Sharp-S mapping is such that identifiers are incompatible with Ada 95. Thus this solution has to be rejected.
The next obvious thing to do is to define "convert to upper case" in 1.1.4(14.2/2) to be simple upper case mapping. This always goes to the same length string, so the problem given above does not occur. Indeed, we are not aware of any incompatibility with Ada 95 here.
Note however, that this problem has not gone away, it just has left Latin-1. For instance, the dotless i and normal i both map to 'I'. That means that "dotless-intent" and "intent" are still considered the same, and we still have sequences that are neither identifiers nor reserved words ("dotless-if" for instance).
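The dotless i case can be reproduced in Python. For the characters involved, Python's `str.upper` gives the same single-character results as the simple upper case mapping; the variable names are illustrative.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Both spellings have the same upper case conversion...
upper_dotless = (DOTLESS_I + "ntent").upper()
upper_plain = "intent".upper()

# ...and "dotless-if" converts to the upper case form of a reserved word.
upper_dotless_if = (DOTLESS_I + "f").upper()

# Case folding, by contrast, leaves the dotless i alone.
folded_dotless = DOTLESS_I.casefold()
```

Here `upper_dotless` and `upper_plain` are both "INTENT" and `upper_dotless_if` is "IF", whereas `folded_dotless` is still the dotless i, so folding keeps the two spellings apart.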
Since we already "solved" that problem in the Ada 2005 definition, this is not too terrible. At least it is true that Image is always reversible with Value, as anything that is considered the same would have the same upper case mapping.
Note however that the definition of "simple upper case mapping" is not stable, meaning that switching to a newer version of Unicode (presumably when 10646 is updated) almost certainly would introduce incompatibilities.
We could solve this by abandoning Unicode altogether. Since we've pretty much booted the Unicode recommendations at this point, it doesn't matter much if we simply ignore them altogether. We could do that by defining our own case mapping algorithm. (Robert Dewar has a suggestion of how we could do that in the
!appendix
wanted to avoid incompatibilities in the future. Thus simply depending on
character names/classifications is not enough, as a future character set standard
will surely change some of those.


!ACATS Test

Adjust the Unicode identifier tests to reflect this decision.

!appendix

From: Randy Brukardt
Sent: Sunday, July 4, 2010  7:09 PM

[Split from a thread about AI05-0185-1. See that AI for previous mail.]

> > I personally had thought that this was talking about the same
> > mapping used for Ada Identifiers, but having read the definition
> > again, I'm not so sure anymore. That's because To_Upper for strings
> > is defined in terms of To_Upper for characters, and that surely
> > doesn't work for the full character set (how can To_Upper for a
> > character return the
> > *three* characters needed in some extreme cases??). So I suspect
> > that you are right that there is a definitional problem here.
>
> To_Upper cannot return three characters for one, what are you talking
> about? 10646 has one code per point, we are not talking about UTF-8
> strings here.

The upper case mapping for Unicode characters can be 2 and supposedly 3
characters. The obvious example is the LC_German_Sharp_S (as it is named in
Ada.Characters.Latin_1): the upper case mapping is "SS". It is certainly
intended that Ada identifiers containing the LC_German_Sharp_S are considered
the same as those containing "SS" in the same position (if I ever get around to
creating ACATS tests for wide characters in identifiers, that will be one of the
first tests).

Thus it doesn't make much sense to define To_Upper for strings in terms of
To_Upper for characters, assuming that the same results as for identifiers is
intended. (It's certainly what I would expect to be intended, it would be
strange to get different results.)

> For source it's up to you how the characters are represented, but
> conceptually identifiers are a sequence of wide_wide_characters.

Right. And the decision as to whether two identifiers are the same is made using
the mapping defined in 2.1(5/2). (Ah-ha: this talks about "Simple Uppercase
Mapping"; apparently this is a Unicode construct, as the "note 1" reference is
just a way to get a veiled reference to Unicode into an ISO/IEC document. I
definitely think we need to make it clearer in the AI wording that this is what
is being talked about.)

****************************************************************

From: Robert Dewar
Sent: Sunday, July 4, 2010  8:18 PM

> The upper case mapping for Unicode characters can be 2 and supposely 3
> characters. The obvious example is the LC_German_Sharp_S (as it is
> named in
> Ada.Characters.Latin_1): the upper case mapping is "SS". It is
> certainly intended that Ada identifiers containing the
> LC_German_Sharp_S are considered the same as those containing "SS" in
> the same position (if I ever get around to creating ACATS tests for
> wide characters in identifiers, that will be one the first tests).

I find that absurd, and highly undesirable, this is case equivalence gone
totally berserk. I entirely refuse to implement this on the grounds that it is
highly undesirable to do so. And to have two totally different notions of upper
casing junk characters in the language is simply horrible.

I think the case mapping of 10646 with SMALL<-->CAPITAL is as far as we go.

And if you *DO* write such an ACATS test, I will consider it a last straw and
declare the ACATS suite junk :-) :-)

Seriously, this is a weird interpretation and needs to be discussed by the whole
ARG.

> Thus it doesn't make much sense to define To_Upper for strings in
> terms of To_Upper for characters, assuming that the same results as
> for identifiers is intended. (It's certainly what I would expect to be
> intended, it would be strange to get different results.)
>
>> For source it's up to you how the characters are represented, but
>> conceptually identifiers are a sequence of wide_wide_characters.
>
> Right. And the decision as to whether two identifiers are the same is
> made using the mapping defined in 2.1(5/2). (Ah-ha: this talks about
> "Simple Uppercase Mapping"; apparently this is a Unicode construct, as the "note 1"
> reference is just a way to get a veiled reference to Unicode into an
> ISO/IEC document. I definitely think we need to make it clearer in the
> AI wording that this is what is being talked about.)

Under no conditions can we tolerate two identifiers with different numbers of
10646 code points being considered identical in my opinion.

****************************************************************

From: Robert Dewar
Sent: Monday, July 5, 2010  6:40 AM

> The upper case mapping for Unicode characters can be 2 and supposely 3
> characters. The obvious example is the LC_German_Sharp_S (as it is
> named in
> Ada.Characters.Latin_1): the upper case mapping is "SS". It is
> certainly intended that Ada identifiers containing the
> LC_German_Sharp_S are considered the same as those containing "SS" in
> the same position (if I ever get around to creating ACATS tests for
> wide characters in identifiers, that will be one the first tests).

More on why this would be simply awful
Let's call LC_German_Sharp_S * in the below discussion

    If we have the identifier *, then Randy thinks SS and * should be
    case equivalent.

    But surely SS is equivalent to ss, so now do we break transitivity
    of case equivalence or is * also equivalent to ss?

This way lies complete madness in my opinion.

For example suppose we have the identifier SSS, is that equivalent to *s and
also to s*, and now are *s and s* equivalent? AARGH, if you have a whole row of
S's, there are a combinatorial number of possible equivalent identifiers
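The growth can be made concrete with a brute-force count in Python, assuming full case folding (Python's `str.casefold`) as the equivalence and restricting spellings to the three characters s, S and sharp s. The function name `equivalents` is chosen for this sketch.

```python
from itertools import product

SHARP_S = "\u00DF"  # written * in this message


def equivalents(target: str):
    """All strings over {s, S, sharp s} whose full case folding equals target."""
    found = []
    for length in range(1, len(target) + 1):
        for combo in product("sS" + SHARP_S, repeat=length):
            word = "".join(combo)
            if word.casefold() == target:
                found.append(word)
    return found


spellings_of_sss = equivalents("sss")
```

Under that assumption there are 12 spellings equivalent to "sss" (8 using only s and S, plus 4 that contain one sharp s), and 29 for "ssss"; both `*s` and `s*` are among the 12.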

I am not sure if trying to extend the world of case equivalence to other than
A-Z makes sense at all, for example is E equivalent to e-acute (very often in
French practice, accents are omitted from upper case, even though they should
not be, because of typewriters, I remember JDI strongly thought the answer was
yes -- the answer must be no in fact for similar reasons to the above).

But this is a done deal.

To me the only thing that makes sense for case equivalence is to regard the
identifier or other string as a series of code points in 10646.

Then if the names of two code points differ only in SMALL LETTER being replaced
by CAPITAL LETTER, then they are equivalent, otherwise they are not equivalent.

How does that work out for Randy's example?

The entry for the character in question in 10646 is

> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;

There is no entry whose name is

        LATIN CAPITAL LETTER SHARP S

Thus this character has no upper case equivalent.
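The name-based rule described in this message can be sketched with Python's `unicodedata` module. This is an illustration of the idea, not GNAT's implementation; `name_based_upper` is a hypothetical helper. Note that the result depends on the character database in use: Unicode versions from 5.1 onward do contain LATIN CAPITAL LETTER SHARP S (U+1E9E), so with a recent database the sharp s example above would find a capital partner.

```python
import unicodedata


def name_based_upper(ch: str) -> str:
    """Return the code point whose name is ch's name with SMALL LETTER
    replaced by CAPITAL LETTER, or ch itself if no such code point exists."""
    name = unicodedata.name(ch, "")
    if "SMALL LETTER" not in name:
        return ch
    capital_name = name.replace("SMALL LETTER", "CAPITAL LETTER")
    try:
        return unicodedata.lookup(capital_name)
    except KeyError:
        # No entry with the capital name: the character has no equivalent.
        return ch


def name_based_upper_string(text: str) -> str:
    """String version defined in terms of the character version."""
    return "".join(name_based_upper(ch) for ch in text)
```

Under this rule the mapping is one code point to one code point, so the string version never changes length, and a character such as LATIN SMALL LETTER DOTLESS I, which has no capital entry of the matching name, is returned unchanged.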

This is the ONLY interpretation that makes sense.

It is symmetrical, reversible (forget what I said about dotless I and dotted I,
they are separate characters from the normal Latin-I, and if Turkish folks want,
they can consistently use these separate characters), and well-defined.

The To_Lower and To_Upper functions in this package should be consistent with
this model, and yes Randy, with this model it is fine to define the string
version in terms of the character version (nothing else makes sense).

Randy, if you want to devise some other version with multi-character
replacements, feel free to write such a package, and even try to propose it as a
separate package for the standard, but do not contaminate case equivalence of
id.

The above interpretation is certainly what GNAT implements now, and that is not
about to change unless there are very good arguments. I see no such arguments in
sight!

****************************************************************

From: Robert Dewar
Sent: Monday, July 5, 2010  8:13 AM

For reference, here are the 10646 CAPITAL LETTER entries with no corresponding
SMALL LETTER entries:

>    --  LATIN CAPITAL LETTER I WITH DOT ABOVE
>    --  LATIN CAPITAL LETTER AFRICAN D
>    --  LATIN CAPITAL LETTER O WITH MIDDLE TILDE
>    --  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
>    --  LATIN CAPITAL LETTER L WITH SMALL LETTER J
>    --  LATIN CAPITAL LETTER N WITH SMALL LETTER J
>    --  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
>    --  LATIN CAPITAL LETTER HWAIR
>    --  LATIN CAPITAL LETTER WYNN
>    --  GREEK CAPITAL LETTER UPSILON HOOK
>    --  GREEK CAPITAL LETTER UPSILON HOOK TONOS
>    --  GREEK CAPITAL LETTER UPSILON HOOK DIAERESIS
>    --  GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
>    --  GREEK CAPITAL LETTER OMEGA WITH

Here are the 10646 SMALL LETTER entries with no matching CAPITAL LETTER entries,
note Randy's favorite at the start of the list. Note also the entries at the
end, I trust that does not inspire Randy to figure out how to allow parentheses
into identifiers (since I suppose he would consider the upper case equivalent of
parenthesized-small-letter-c to be (C) :-))

>    --  LATIN SMALL LETTER SHARP S
>    --  LATIN SMALL LETTER DOTLESS I
>    --  LATIN SMALL LETTER KRA
>    --  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
>    --  LATIN SMALL LETTER LONG S
>    --  LATIN SMALL LETTER B WITH STROKE
>    --  LATIN SMALL LETTER TURNED DELTA
>    --  LATIN SMALL LETTER HV
>    --  LATIN SMALL LETTER L WITH BAR
>    --  LATIN SMALL LETTER LAMBDA WITH STROKE
>    --  LATIN SMALL LETTER T WITH PALATAL HOOK
>    --  LATIN SMALL LETTER EZH WITH TAIL
>    --  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
>    --  LATIN CAPITAL LETTER L WITH SMALL LETTER J
>    --  LATIN CAPITAL LETTER N WITH SMALL LETTER J
>    --  LATIN SMALL LETTER TURNED E
>    --  LATIN SMALL LETTER J WITH CARON
>    --  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
>    --  LATIN SMALL LETTER D WITH CURL
>    --  LATIN SMALL LETTER L WITH CURL
>    --  LATIN SMALL LETTER N WITH CURL
>    --  LATIN SMALL LETTER T WITH CURL
>    --  LATIN SMALL LETTER TURNED A
>    --  LATIN SMALL LETTER ALPHA
>    --  LATIN SMALL LETTER TURNED ALPHA
>    --  LATIN SMALL LETTER C WITH CURL
>    --  LATIN SMALL LETTER D WITH TAIL
>    --  LATIN SMALL LETTER SCHWA WITH HOOK
>    --  LATIN SMALL LETTER REVERSED OPEN E
>    --  LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
>    --  LATIN SMALL LETTER CLOSED REVERSED OPEN E
>    --  LATIN SMALL LETTER DOTLESS J WITH STROKE
>    --  LATIN SMALL LETTER SCRIPT G
>    --  LATIN SMALL LETTER RAMS HORN
>    --  LATIN SMALL LETTER TURNED H
>    --  LATIN SMALL LETTER H WITH HOOK
>    --  LATIN SMALL LETTER HENG WITH HOOK
>    --  LATIN SMALL LETTER L WITH MIDDLE TILDE
>    --  LATIN SMALL LETTER L WITH BELT
>    --  LATIN SMALL LETTER L WITH RETROFLEX HOOK
>    --  LATIN SMALL LETTER LEZH
>    --  LATIN SMALL LETTER TURNED M WITH LONG LEG
>    --  LATIN SMALL LETTER M WITH HOOK
>    --  LATIN SMALL LETTER N WITH RETROFLEX HOOK
>    --  LATIN SMALL LETTER BARRED O
>    --  LATIN SMALL LETTER CLOSED OMEGA
>    --  LATIN SMALL LETTER PHI
>    --  LATIN SMALL LETTER TURNED R
>    --  LATIN SMALL LETTER TURNED R WITH LONG LEG
>    --  LATIN SMALL LETTER TURNED R WITH HOOK
>    --  LATIN SMALL LETTER R WITH LONG LEG
>    --  LATIN SMALL LETTER R WITH TAIL
>    --  LATIN SMALL LETTER R WITH FISHHOOK
>    --  LATIN SMALL LETTER REVERSED R WITH FISHHOOK
>    --  LATIN SMALL LETTER S WITH HOOK
>    --  LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
>    --  LATIN SMALL LETTER SQUAT REVERSED ESH
>    --  LATIN SMALL LETTER ESH WITH CURL
>    --  LATIN SMALL LETTER TURNED T
>    --  LATIN SMALL LETTER U BAR
>    --  LATIN SMALL LETTER TURNED V
>    --  LATIN SMALL LETTER TURNED W
>    --  LATIN SMALL LETTER TURNED Y
>    --  LATIN SMALL LETTER Z WITH RETROFLEX HOOK
>    --  LATIN SMALL LETTER Z WITH CURL
>    --  LATIN SMALL LETTER EZH WITH CURL
>    --  LATIN SMALL LETTER CLOSED OPEN E
>    --  LATIN SMALL LETTER J WITH CROSSED-TAIL
>    --  LATIN SMALL LETTER TURNED K
>    --  LATIN SMALL LETTER Q WITH HOOK
>    --  LATIN SMALL LETTER DZ DIGRAPH
>    --  LATIN SMALL LETTER DEZH DIGRAPH
>    --  LATIN SMALL LETTER DZ DIGRAPH WITH CURL
>    --  LATIN SMALL LETTER TS DIGRAPH
>    --  LATIN SMALL LETTER TESH DIGRAPH
>    --  LATIN SMALL LETTER TC DIGRAPH WITH CURL
>    --  LATIN SMALL LETTER FENG DIGRAPH
>    --  LATIN SMALL LETTER LS DIGRAPH
>    --  LATIN SMALL LETTER LZ DIGRAPH
>    --  LATIN SMALL LETTER TURNED H WITH FISHHOOK
>    --  LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
>    --  COMBINING LATIN SMALL LETTER A
>    --  COMBINING LATIN SMALL LETTER E
>    --  COMBINING LATIN SMALL LETTER I
>    --  COMBINING LATIN SMALL LETTER O
>    --  COMBINING LATIN SMALL LETTER U
>    --  COMBINING LATIN SMALL LETTER C
>    --  COMBINING LATIN SMALL LETTER D
>    --  COMBINING LATIN SMALL LETTER H
>    --  COMBINING LATIN SMALL LETTER M
>    --  COMBINING LATIN SMALL LETTER R
>    --  COMBINING LATIN SMALL LETTER T
>    --  COMBINING LATIN SMALL LETTER V
>    --  COMBINING LATIN SMALL LETTER X
>    --  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
>    --  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
>    --  GREEK SMALL LETTER FINAL SIGMA
>    --  GREEK SMALL LETTER CURLED BETA
>    --  GREEK SMALL LETTER SCRIPT THETA
>    --  GREEK SMALL LETTER SCRIPT PHI
>    --  GREEK SMALL LETTER OMEGA PI
>    --  GREEK SMALL LETTER ARCHAIC KOPPA
>    --  GREEK SMALL LETTER SCRIPT KAPPA
>    --  GREEK SMALL LETTER TAILED RHO
>    --  GREEK SMALL LETTER LUNATE SIGMA
>    --  GEORGIAN SMALL LETTER FI
>    --  LIMBU SMALL LETTER KA
>    --  LIMBU SMALL LETTER NGA
>    --  LIMBU SMALL LETTER ANUSVARA
>    --  LIMBU SMALL LETTER TA
>    --  LIMBU SMALL LETTER NA
>    --  LIMBU SMALL LETTER PA
>    --  LIMBU SMALL LETTER MA
>    --  LIMBU SMALL LETTER RA
>    --  LIMBU SMALL LETTER LA
>    --  LATIN SMALL LETTER TURNED AE
>    --  LATIN SMALL LETTER TURNED OPEN E
>    --  LATIN SMALL LETTER TURNED I
>    --  LATIN SMALL LETTER SIDEWAYS O
>    --  LATIN SMALL LETTER SIDEWAYS OPEN O
>    --  LATIN SMALL LETTER SIDEWAYS O WITH STROKE
>    --  LATIN SMALL LETTER TURNED OE
>    --  LATIN SMALL LETTER TOP HALF O
>    --  LATIN SMALL LETTER BOTTOM HALF O
>    --  LATIN SMALL LETTER SIDEWAYS U
>    --  LATIN SMALL LETTER SIDEWAYS DIAERESIZED U
>    --  LATIN SMALL LETTER SIDEWAYS TURNED M
>    --  LATIN SUBSCRIPT SMALL LETTER I
>    --  LATIN SUBSCRIPT SMALL LETTER R
>    --  LATIN SUBSCRIPT SMALL LETTER U
>    --  LATIN SUBSCRIPT SMALL LETTER V
>    --  GREEK SUBSCRIPT SMALL LETTER BETA
>    --  GREEK SUBSCRIPT SMALL LETTER GAMMA
>    --  GREEK SUBSCRIPT SMALL LETTER RHO
>    --  GREEK SUBSCRIPT SMALL LETTER PHI
>    --  GREEK SUBSCRIPT SMALL LETTER CHI
>    --  LATIN SMALL LETTER UE
>    --  LATIN SMALL LETTER H WITH LINE BELOW
>    --  LATIN SMALL LETTER T WITH DIAERESIS
>    --  LATIN SMALL LETTER W WITH RING ABOVE
>    --  LATIN SMALL LETTER Y WITH RING ABOVE
>    --  LATIN SMALL LETTER A WITH RIGHT HALF RING
>    --  LATIN SMALL LETTER LONG S WITH DOT ABOVE
>    --  GREEK SMALL LETTER UPSILON WITH PSILI
>    --  GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
>    --  GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
>    --  GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
>    --  GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ALPHA WITH PERISPOMENI
>    --  GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER ETA WITH PERISPOMENI
>    --  GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
>    --  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
>    --  GREEK SMALL LETTER IOTA WITH PERISPOMENI
>    --  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
>    --  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
>    --  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
>    --  GREEK SMALL LETTER RHO WITH PSILI
>    --  GREEK SMALL LETTER UPSILON WITH PERISPOMENI
>    --  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
>    --  GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
>    --  GREEK SMALL LETTER OMEGA WITH PERISPOMENI
>    --  GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
>    --  SUPERSCRIPT LATIN SMALL LETTER I
>    --  SUPERSCRIPT LATIN SMALL LETTER N
>    --  TURNED GREEK SMALL LETTER IOTA
>    --  PARENTHESIZED LATIN SMALL LETTER A
>    --  PARENTHESIZED LATIN SMALL LETTER B
>    --  PARENTHESIZED LATIN SMALL LETTER C
>    --  PARENTHESIZED LATIN SMALL LETTER D
>    --  PARENTHESIZED LATIN SMALL LETTER E
>    --  PARENTHESIZED LATIN SMALL LETTER F
>    --  PARENTHESIZED LATIN SMALL LETTER G
>    --  PARENTHESIZED LATIN SMALL LETTER H
>    --  PARENTHESIZED LATIN SMALL LETTER I
>    --  PARENTHESIZED LATIN SMALL LETTER J
>    --  PARENTHESIZED LATIN SMALL LETTER K
>    --  PARENTHESIZED LATIN SMALL LETTER L
>    --  PARENTHESIZED LATIN SMALL LETTER M
>    --  PARENTHESIZED LATIN SMALL LETTER N
>    --  PARENTHESIZED LATIN SMALL LETTER O
>    --  PARENTHESIZED LATIN SMALL LETTER P
>    --  PARENTHESIZED LATIN SMALL LETTER Q
>    --  PARENTHESIZED LATIN SMALL LETTER R
>    --  PARENTHESIZED LATIN SMALL LETTER S
>    --  PARENTHESIZED LATIN SMALL LETTER T
>    --  PARENTHESIZED LATIN SMALL LETTER U
>    --  PARENTHESIZED LATIN SMALL LETTER V
>    --  PARENTHESIZED LATIN SMALL LETTER W
>    --  PARENTHESIZED LATIN SMALL LETTER X
>    --  PARENTHESIZED LATIN SMALL LETTER Y
>    --  PARENTHESIZED LATIN SMALL LETTER Z

****************************************************************

From: Randy Brukardt
Sent: Tuesday, July 6, 2010  6:49 PM

...
> > The upper case mapping for Unicode characters can be 2 and supposedly
> > 3 characters. The obvious example is the LC_German_Sharp_S (as it is
> > named in
> > Ada.Characters.Latin_1): the upper case mapping is "SS". It is
> > certainly intended that Ada identifiers containing the
> > LC_German_Sharp_S are considered the same as those containing "SS"
> > in the same position (if I ever get around to creating ACATS tests
> > for wide characters in identifiers, that will be one of the first tests).
>
> I find that absurd, and highly undesirable, this is case equivalence
> gone totally berserk. I entirely refuse to implement this on the
> grounds that it is highly undesirable to do so. And to have two
> totally different notions of upper casing junk characters in the
> language is simply horrible,

OK, but that is what the language defines.

> I think the case mapping of 10646 with SMALL<-->CAPITAL is as far as
> we go.

I just went back and re-read the Ada 95 AIs that define this, and the important
result is that we want to follow the Unicode recommendations and not try to
invent our own character set rules. Thus we adopted the Unicode case folding
(see 1.1.4(14.2/2)), and as the ramification note 1.1.4(14.f/2) says, this is
applied on complete sequences, not single characters.

As I recall, Unicode documents were pretty clear that doing single character
conversions is a bad idea.

> And if you *DO* write such an ACATS test, I will consider it a last
> straw and declare the ACATS suite junk :-) :-)

Well, in this case, the ACATS would be reflecting the standard as written.
If you don't like that, you need to get the Standard changed.

> Seriously, this is a weird interpretation and needs to be discussed by
> the whole ARG.

It was discussed by the whole ARG when it was adopted. This was definitely an
intended choice. Whether everyone understood the ramifications, I can't say, but
they're clearly mentioned in the record and in the chosen wording. For instance,
we adopted a slightly different rule for reserved words than for identifiers,
and the ramification 2.3(5.c/2) discusses the fact that there are strings of
letters that are neither legal identifiers nor reserved words. I recall that it
took several iterations to settle on this intent.

The one thing that I see is missing is that I failed to document these as
"Incompatibilities with Ada 95", as there are some identifiers that would be
considered different for Ada 95 that would be considered the same for Ada 2005
(or even illegal). I don't know if that would have changed anyone's opinion, but
I doubt it.

...
> Under no conditions can we tolerate two identifiers with different
> numbers of 10646 code points being considered identical in my opinion.

Well, we decided differently back in 2005. The question is whether there is any
semantic problem with this. You later complain:

> Let's call LC_German_Sharp_S * in the below discussion. But surely SS
> is equivalent to ss, so now do we break transitivity
>   of case equivalence or is * also equivalent to ss?

The rule is that identifiers are converted to upper case, then compared for
equality. So of course "*" and "ss" and "SS" are all equivalent. Similarly,
"acce*" is equivalent to "access" (but the former is considered illegal, as
reserved words have to be written in their ASCII form).
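
As a rough illustration of "convert to upper case, then compare", here is a
minimal sketch in Python rather than Ada. It assumes Python's built-in
str.upper, which applies Unicode full upper-case mapping (so U+00DF expands to
"SS"); the helper name is hypothetical and this is not the normative rule.

```python
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S, written "*" in this thread

def same_identifier_full_upper(a, b):
    """Compare two identifiers after full upper-case mapping.

    The mapping may change the length of the string, which is exactly
    the property being argued about here.
    """
    return a.upper() == b.upper()

print(same_identifier_full_upper(SHARP_S, "ss"))               # True
print(same_identifier_full_upper("acce" + SHARP_S, "access"))  # True
print(len(SHARP_S), len(SHARP_S.upper()))                      # 1 2
```

Under this reading, equivalence is decided on whole strings, so two
identifiers with different numbers of code points can compare equal.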

> For example suppose we have the identifier SSS, is that equivalent to
> *s and also to s*, and now are *s and s* equivalent? AARGH, if you
> have a whole row of S's, there are a combinatorial number of possible
> equivalent identifiers

Yes, there are a lot of possible equivalent identifiers. So what? I don't see
any semantic problem with that; the only thing that matters as far as the
language is concerned is that we can tell whether two identifiers are
equivalent.
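
To put a number on "a lot", the following sketch counts the spellings of a run
of S's under full upper-case mapping. It is an illustration only: the function
name is hypothetical, and the alphabet is deliberately restricted to "s", "S"
and U+00DF (other characters that upper-case to "S" are ignored).

```python
from itertools import product

def spellings(target_upper, alphabet="sS\u00df"):
    """All strings over `alphabet` whose full upper-case mapping is target_upper."""
    found = []
    # A sharp s stands for two S's, so candidates may be shorter than the target.
    for length in range(1, len(target_upper) + 1):
        for combo in product(alphabet, repeat=length):
            candidate = "".join(combo)
            if candidate.upper() == target_upper:
                found.append(candidate)
    return found

for n in range(1, 5):
    print(n, len(spellings("S" * n)))  # counts are 2, 5, 12, 29
```

The count grows quickly, but deciding whether two given identifiers are
equivalent still needs only one mapping and one comparison.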

...
> To me the only thing that makes sense for case equivalence is to
> regard the identifier or other string as a series of code points in 10646.

Unicode strongly suggested that this sort of case equivalence is a very bad
idea. (I haven't gone back to check if that is still true, we'll need to do that
before we argue this for the ARG.)

In any case, I don't care all that much what we decide. Personally, I think
allowing characters outside of Latin-1 in Ada source is a mistake. But that
wasn't a choice that we were allowed to make. The second best choice seemed to
be to follow the recommendations of the character set experts, and what we
decided is what they recommended (in 2005). I'm very leery of claiming that Ada
experts are smarter than character set experts in this particular area.

****************************************************************

From: Robert Dewar
Sent: Tuesday, July 6, 2010  10:35 PM

> OK, but that is what the language defines.

No it doesn't; that would violate Robert's rule of reasonableness.

>> I think the case mapping of 10646 with SMALL<-->CAPITAL is as far as
>> we go.
>
> I just went back and re-read the Ada 95 AIs that define this, and the
> important result is that we want to follow the Unicode recommendations
> and not try to invent our own character set rules. Thus we adopted the
> Unicode case folding (see 1.1.4(14.2/2)), and as the ramification note
> 1.1.4(14.f/2) says, this is applied on complete sequences, not single characters.

Absurd in this case

> As I recall, Unicode documents were pretty clear that doing single
> character conversions is a bad idea.

Doing case equivalence beyond basic Latin-1 characters is a bad idea.

>> And if you *DO* write such an ACATS test, I will consider it a last
>> straw and declare the ACATS suite junk :-) :-)
>
> Well, in this case, the ACATS would be reflecting the standard as written.
> If you don't like that, you need to get the Standard changed.

The standard often has mistakes, doesn't mean you have to try to implement them.
Any attempt to implement this ends up with nonsense.

>> Seriously, this is a weird interpretation and needs to be discussed
>> by the whole ARG.
>
> It was discussed by the whole ARG when it was adopted. This was
> definitely an intended choice. Whether everyone understood the
> ramifications, I can't say, but they're clearly mentioned in the
> record and in the chosen wording.
> For instance, we adopted a slightly different rule for reserved words
> than for identifiers, and the ramification 2.3(5.c/2) discusses the
> fact that there are strings of letters that are neither legal
> identifiers nor reserved words. I recall that it took several
> iterations to settle on this intent.

Well it is absurd if case equivalence is not reversible and transitive. It
results in all kinds of anomalies, and if it is transitive, we get very peculiar
things.

For example, it seems entirely wrong that the identifier j*y be considered case
equivalent to jssy. Your "character code experts" would be quite surprised at
this suggestion.

> The one thing that I see is missing is that I failed to document these
> as "Incompatibilities with Ada 95", as there are some identifiers that
> would be considered different for Ada 95 that would be considered the
> same for Ada 2005 (or even illegal). I don't know if that would have
> changed anyone's opinion, but I doubt it.

I hope it would have! The idea of introducing this kind of incompatibility for
such an absurd small gain should not have been countenanced for a moment. If you
have a situation where two identifiers that are different in Ada 95 are the same
in Ada 2005, you have a VERY SERIOUS incompatibility. Any kind of
incompatibility needs to be justified on the grounds that it provides an
important functionality (e.g. for some). That justification is totally lacking
in this case.

> ...
>> Under no conditions can we tolerate two identifiers with different
>> numbers of 10646 code points being considered identical in my
>> opinion.
>
> Well, we decided differently back in 2005. The question is whether
> there is any semantic problem with this. You later complain:
>
>> Let's call LC_German_Sharp_S * in the below discussion. But surely SS
>> is equivalent to ss, so now do we break transitivity
>>   of case equivalence or is * also equivalent to ss?
>
> The rule is that identifiers are converted to upper case, then
> compared for equality. So of course "*" and "ss" and "SS" are all
> equivalent. Similarly, "acce*" is equivalent to "access" (but the
> former is considered illegal, as reserved words have to be written in
> their ASCII form).
>
>> For example suppose we have the identifier SSS, is that equivalent to
>> *s and also to s*, and now are *s and s* equivalent? AARGH, if you
>> have a whole row of S's, there are a combinatorial number of possible
>> equivalent identifiers
>
> Yes, there are a lot of possible equivalent identifiers. So what? I
> don't see any semantic problem with that; the only thing that matters
> as far as the language is concerned is that we can tell whether two
> identifiers are equivalent.

I regard this as complete nonsense. I am surprised people intended this kind of
nonsense. If the ARG insists on this interpretation, I VERY much doubt that
anyone would ever implement it.

>> To me the only thing that makes sense for case equivalence is to
>> regard the identifier or other string as a series of code points in 10646.
>
> Unicode strongly suggested that this sort of case equivalence is a
> very bad idea. (I haven't gone back to check if that is still true,
> we'll need to do that before we argue this for the ARG.)
>
> In any case, I don't care all that much what we decide. Personally, I
> think allowing characters outside of Latin-1 in Ada source is a
> mistake. But that wasn't a choice that we were allowed to make. The
> second best choice seemed to be to follow the recommendations of the
> character set experts, and what we decided is what they recommended
> (in 2005). I'm very leery of claiming that Ada experts are smarter
> than character set experts in this particular area.

Nope, what should have been done is to allow arbitrary letters in identifiers
but NOT extend case equivalence beyond Latin-1, THAT was the mistake, and you
cannot blame this on Unicode, after all most reasonable programming languages
don't have case equivalence of identifiers, so it is not on the Unicode radar
screen wrt programming languages. Your so-called "character set experts" are
talking about the general issue of case conversion, not the specific issue of
case equivalence in identifiers.

For one thing, there is no problem in converting the small parenthesized letters
to upper case for the general usage, e.g. small (c) becomes three characters
(C). But you can't go that way for identifiers. There are other such examples.

So you don't really mean to adopt all of the Unicode multi-character scheme,
because it won't work for identifiers. Given that you need a special set of
rules, different from the general Unicode rules, I think the rule I propose is
simple and good enough. Once again, the rule I propose is

   If two code points in 10646 have names that are the same except
   for CAPITAL LETTER <---> SMALL LETTER, then case conversion
   converts between them. Otherwise case conversion has no effect.

   This handles all the common reasonable cases, as well as some
   reasonable cases in other languages.
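
The name-based rule can be sketched as follows. This is a minimal
illustration in Python, not Ada; the function name is hypothetical, it relies
on the character names shipped in Python's unicodedata module, and its answers
therefore depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def swap_case_by_name(ch):
    """Map ch to the code point whose name differs only by
    CAPITAL LETTER <---> SMALL LETTER; return ch unchanged otherwise."""
    name = unicodedata.name(ch, "")
    if "SMALL LETTER" in name:
        other = name.replace("SMALL LETTER", "CAPITAL LETTER", 1)
    elif "CAPITAL LETTER" in name:
        other = name.replace("CAPITAL LETTER", "SMALL LETTER", 1)
    else:
        return ch  # not named as a cased letter: no effect
    try:
        partner = unicodedata.lookup(other)
    except KeyError:
        return ch  # no code point carries the swapped name: no effect
    # lookup may also resolve named sequences; keep single code points only.
    return partner if len(partner) == 1 else ch

print(swap_case_by_name("a"))                     # A
print(swap_case_by_name("\u00e9") == "\u00c9")    # True: e acute pairs with E acute
print(swap_case_by_name("\u0131") == "\u0131")    # True: dotless i has no partner
```

By construction the mapping is one code point to one code point, so it never
changes the length of an identifier.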

Another point is that in general case equivalence is locale dependent, consider
the situation with the letter I.

We have the following 10646 code points:

> 0049;LATIN CAPITAL LETTER I;Lu
> 0069;LATIN SMALL LETTER I;Ll
> 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu
> 0131;LATIN SMALL LETTER DOTLESS I;Ll

Now in locales which don't use the last two entries, like English, the case
mapping is 49<-->69, but in locales that do use the last two entries (Turkish,
if I remember correctly, is a case in point), the case mappings are 49<-->131
and 69<-->130.

That's fine for general use, but obviously equivalence of Ada identifiers can't
be locale dependent.

My rule looks at these four entries, and decides that the only mapping that
makes sense is 49<-->69 which is what you want in this case.

If people want to write identifiers with code points 130 and 131, they can, but
they should not expect case equivalence.
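
For comparison, this sketch prints what a locale-independent default mapping
does with the four code points. It assumes Python's str.upper and str.lower,
which ignore the locale and apply the Unicode default (full) mappings; the
helper names are hypothetical, and this is not the name-based rule proposed
above.

```python
import unicodedata

def hexes(text):
    """Render a string as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in text]

def default_mappings(code_point):
    """Return (upper, lower) code points under the locale-independent mappings."""
    ch = chr(code_point)
    return hexes(ch.upper()), hexes(ch.lower())

for cp in (0x0049, 0x0069, 0x0130, 0x0131):
    upper, lower = default_mappings(cp)
    print(f"{cp:04X} {unicodedata.name(chr(cp))}: upper={upper} lower={lower}")
```

In these default mappings, 0131 upper-cases to 0049, while 0130 lower-cases to
the two code points 0069 0307, so the four entries are not treated
symmetrically.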

Note that it is not at all terrible to forbid case equivalence, since good
practice is to always write identifiers with the same casing anyway.

I would be interested in what Randy has to say (or what he thinks the standard
says) about these four code points???

Note by the way that Ada compilers (GNAT in particular) have allowed wide
characters in identifiers forever, and that's because despite Randy's feeling,
lots of people in non-English countries find this very useful. We did not
however implement case equivalence for such identifiers until forced to do so,
and it was probably a mistake to do so.

For me, it is fundamental that To_Lower and To_Upper should be invertible, i.e.
if

  To_Lower (C) /= C

then

  To_Upper (To_Lower (C)) = C

I personally would not have introduced the To_Lower function, or extended it to
apply to strings, but that's apparently done now, and I think my interpretation
is the only one that makes sense.
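
The invertibility condition can be checked mechanically. The sketch below uses
Python's str.lower and str.upper as stand-ins for To_Lower and To_Upper; that
is an assumption for illustration only, since Python applies the Unicode
default full mappings, which are not the single-character functions under
discussion. The function name is hypothetical.

```python
def round_trips(ch):
    """True if ch satisfies: To_Lower (C) /= C implies To_Upper (To_Lower (C)) = C."""
    lowered = ch.lower()
    return lowered == ch or lowered.upper() == ch

KELVIN = "\u212a"        # KELVIN SIGN: lower-cases to "k", which upper-cases to "K"
I_DOT_ABOVE = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

print(round_trips("A"))          # True
print(round_trips(KELVIN))       # False
print(round_trips(I_DOT_ABOVE))  # False
```

Both failures come from mappings that are not one-to-one, which is what a
strictly paired rule would exclude.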

Note incidentally that the existing packages as proposed by the ARG do case
conversion on a single character to single character basis, and so for that
model, my rule is the kind of rule you need. The To_Upper function in
Ada.Wide_Characters.Handling simply does not allow for an input of one character
and an output of several characters.

Anyway, I don't particularly care what is decided, I don't think it will make
any difference at this stage.

P.S. I am surprised that no one resurrected Jean's insistence that
lower-case-e-acute be considered equivalent to upper-case-e-no-accent :-)

BTW, the original implementation of To_Upper in GNAT refers to note 1 in the ISO
10646 standard, I don't have a copy of the standard handy, does someone know
what note 1 says?

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  1:04 PM

(For the record, I had nothing significant to do with these choices; this area
was primarily Pascal's and Kiyoshi's. I'm just trying to explain what they
decided and what the entire ARG approved. Please do not attribute any of these
ideas to me!)

...
> >> And if you *DO* write such an ACATS test, I will consider it a last
> >> straw and declare the ACATS suite junk :-) :-)
> >
> > Well, in this case, the ACATS would be reflecting the standard as written.
> > If you don't like that, you need to get the Standard changed.
>
> The standard often has mistakes, doesn't mean you have to try to
> implement them. Any attempt to implement this ends up with nonsense.

So, since "coextensions" are clearly a mistake, no one has to implement them!
Yaa!! :-)

In this case, this isn't a mistake; it was a carefully considered decision. You
just don't like the decision (and based on the old mail, you didn't like it in 2002
and 2005, either).

> >> Seriously, this is a weird interpretation and needs to be discussed
> >> by the whole ARG.
> >
> > It was discussed by the whole ARG when it was adopted. This was
> > definitely an intended choice. Whether everyone understood the
> > ramifications, I can't say, but they're clearly mentioned in the
> > record and in the chosen wording.
> > For instance, we adopted a slightly different rule for reserved
> > words than for identifiers, and the ramification 2.3(5.c/2)
> > discusses the fact that there are strings of letters that are
> > neither legal identifiers nor reserved words. I recall that it took
> > several iterations to settle on this intent.
>
> Well it is absurd if case equivalence is not reversible and transitive.
> It results in all kinds of anomalies, and if it is transitive, we get
> very peculiar things.

I don't understand what you mean by "reversible" here. Even for Ada 83, there
are 2**N equivalent identifiers (where N is the length of the identifier). If
you have the all upper-case version of that identifier, there is no way to tell
which of the 2**N versions it came from.
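
The 2**N count is easy to confirm for an identifier made only of basic Latin
letters. A small sketch in Python (the function name is hypothetical; digits
and underscores have no case, so they would not double the count):

```python
from itertools import product

def case_variants(identifier):
    """All spellings of identifier obtained by choosing a case for each character."""
    choices = [(c.lower(), c.upper()) for c in identifier]
    return {"".join(p) for p in product(*choices)}

variants = case_variants("abc")
print(len(variants))                             # 8, i.e. 2**3
print(all(v.upper() == "ABC" for v in variants)) # True
```

All eight spellings collapse to the same upper-case form, and nothing in that
form records which spelling was the input.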

> For example, it seems entirely wrong that the identifier j*y be
> considered case equivalent to jssy. Your "character code experts"
> would be quite surprised at this suggestion.

Possibly. I'm not going to guess.

> > The one thing that I see is missing is that I failed to document
> > these as "Incompatibilities with Ada 95", as there are some
> > identifiers that would be considered different for Ada 95 that would
> > be considered the same for Ada 2005 (or even illegal). I don't know
> > if that would have changed
> > anyone's opinion, but I doubt it.
>
> I hope it would have! The idea of introducing this kind of
> incompatibility for such an absurd small gain should not have been
> countenanced for a moment. If you have a situation where two
> identifiers that are different in Ada 95 are the same in Ada 2005, you
> have a VERY SERIOUS incompatibility. Any kind of incompatibility needs
> to be justified on the grounds that it provides an important
> functionality (e.g. for some).
> That justification is totally lacking in this case.

I'm unconvinced that the incompatibility is that serious, in that you would get
a compile-time error from the problem in the very rare case when it occurs. If
you can find a case where that is *not* true, that would have been an important
data point in choosing a different rule (I would probably have pushed for
something compatible).

My biggest concern here is readability. But I've come to the conclusion that is
a red herring. That's because code that uses identifiers having characters
outside of the base 128 characters is never going to be readable to some subset
of programmers. The use of identifiers in a local language with "funny"
characters is probably only readable to speakers of that language. So "portable"
code will avoid such characters, and for the rest it is only important that it
is well-defined.

...
> > Yes, there are a lot of possible equivalent identifiers. So what? I
> > don't see any semantic problem with that; the only thing that
> > matters as far as the language is concerned is that we can tell
> > whether two identifiers are equivalent.
>
> I regard this as complete nonsense. I am surprised people intended
> this kind of nonsense. If the ARG insists on this interpretation, I
> VERY much doubt that anyone would ever implement it.

My understanding was that Pascal did a complete implementation for the IBM
Rational compiler. And given that he knew the intent, I would guess that he did
implement all of this (it's actually quite simple, anyway).

...
> Nope, what should have been done is to allow arbitrary letters in
> identifiers but NOT extend case equivalence beyond Latin-1, THAT was
> the mistake, and you cannot blame this on Unicode, after all most
> reasonable programming languages don't have case equivalence of
> identifiers, so it is not on the Unicode radar screen wrt programming
> languages. Your so-called "character set experts"
> are talking about the general issue of case conversion, not the
> specific issue of case equivalence in identifiers.

We followed the Unicode recommendations for programming language identifiers
quite closely. Pascal explained the differences in detail in AI95-00285.

Now, I should point out that those recommendations changed quite substantially
between Unicode 4.0 and 5.0, and we adopted Binding Interpretation AI05-0091-1
to reconcile these differences. But I don't recall any change in the case
equivalence rules (I'll have to go look up the current state of those
recommendations when I write this up as an AI).

> For one thing, there is no problem in converting the small
> parenthesized letters to upper case for the general usage, e.g. small
> (c) becomes three characters (C). But you can't go that way for
> identifiers. There are other such examples.

Huh? That is exactly the sort of thing that is intended here (for identifiers).
Not sure why you say "you can't go that way for identifiers". In any case, the
(c) isn't a letter, so it is irrelevant in the identifier context.

> So you don't really mean to adopt all of the Unicode multi-character
> scheme, because it won't work for identifiers.

Why not?

> Given that you need a special set of rules, different from the general
> Unicode rules, I think the rule I propose is simple and good enough.
> Once again, the rule I propose is
>
>    If two code points in 10646 have names that are the same except
>    for CAPITAL LETTER <---> SMALL LETTER, then case conversion
>    converts between them. Otherwise case conversion has no effect.
>
>    This handles all the common reasonable cases, as well as some
>    reasonable cases in other languages.
>
> Another point is that in general case equivalence is locale dependent,
> consider the situation with the letter I.

Unicode defines a locale-independent case folding algorithm. That is what the
Ada Standard requires using.

> We have the following 10646 code points:
>
> > 0049;LATIN CAPITAL LETTER I;Lu
> > 0069;LATIN SMALL LETTER I;Ll
> > 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu
> > 0131;LATIN SMALL LETTER DOTLESS I;Ll
>
> Now in locales which don't use the last two entries, like English, the
> case mapping is 49<-->69, but in locales that do use the last two
> entries (I think if I remember Turkish is a case in question), the
> case mappings are
> 49<-->131 and 69<-->130.
>
> That's fine for general use, but obviously equivalence of Ada
> identifiers can't be locale dependent.

There is an extensive discussion of this issue in the AI and in the AARM
(2.3(6.a-j)). The example of Turkish is given.

The URL of the intended case foldings is provided in the AARM (2.1(14.g)):

http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt

Note that 1.1.4(14.2/2) says that locale-independent full case folding is used,
meaning that the string lengths can change.

Presumably this mapping has changed in later versions of Unicode, and one
unanswered question is whether we intend to track these changes (which is likely
to be incompatible) or just freeze it at Unicode 4.0.

Looking at this table, I see that it is a mapping to *lower* case, so there is
in fact a very real problem with the definition of the language (the definition
of converting to upper case actually goes to lower case, which makes no sense).

The intent that we used in examples was that both 69 and 131 map to 49.
That's why there is a rule that non-standard reserved word spellings are illegal
(the string coded 131 66 maps to "IF", but we don't want to be able to spell
reserved words that way - so it is declared illegal). But this doesn't follow
from the defined mapping.
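
The gap between the two definitions can be seen directly. This is a minimal
sketch in Python, assuming str.upper as a stand-in for "upper case mapping"
and str.casefold as a stand-in for "locale-independent full case folding";
the helper names are hypothetical, and Python's tables follow whatever Unicode
version the interpreter ships with, not necessarily 4.0.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I
SHARP_S = "\u00df"    # LATIN SMALL LETTER SHARP S

def equivalent_by_upper(a, b):
    """Equivalence after upper-case mapping (the intent used in the examples)."""
    return a.upper() == b.upper()

def equivalent_by_casefold(a, b):
    """Equivalence after full case folding (what the wording actually names)."""
    return a.casefold() == b.casefold()

word = DOTLESS_I + "f"
print(equivalent_by_upper(word, "if"))     # True: both become "IF"
print(equivalent_by_casefold(word, "if"))  # False: folding leaves dotless i alone
print(SHARP_S.casefold())                  # ss: folding yields lower case and changes length
```

So the reserved-word example depends on the upper-case mapping; with case
folding alone, that spelling would not collide with "if".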

> My rule looks at these four entries, and decides that the only mapping
> that makes sense is 49<-->69 which is what you want in this case.
>
> If people want to write identifiers with code points 130 and 131, they
> can, but they should not expect case equivalence.
>
> Note that it is not at all terrible to forbid case equivalence, since
> good practice is to always write identifiers with the same casing
> anyway.
>
> I would be interested in what Randy has to say (or what he thinks the
> standard says) about these four code points???

See above. The AI discussion that inspired the AARM note gives the intent.
Although it is clear that the intent isn't described properly in the standard.
And I'm not going to make any comment on the desirability of that intent!

> Note by the way that Ada compilers (GNAT in particular) have allowed
> wide characters in identifiers forever, and that's because despite
> Randy's feeling, lots of people in non-English countries find this
> very useful. We did not however implement case equivalence for such
> identifiers until forced to do so, and it was probably a mistake to do
> so.
>
> For me, it is fundamental that To_Lower and To_Upper should be
> invertible, i.e. if
>
>   To_Lower (C) /= C
>
> then
>
>   To_Upper (To_Lower (C)) = C
>
> I personally would not have introduced the To_Lower function, or
> extended it to apply to strings, but that's apparently done now, and I
> think my interpretation is the only one that makes sense.

My understanding of Unicode suggests that there shouldn't be a To_Lower for
Wide_String handling, so the issue doesn't really come up. For Strings, the
rules are as they are for Ada 95 (else programs would change behavior when
moving to newer compilers, which would be very undesirable).

[Unfortunately, that understanding doesn't seem to be reflected by the case
folding tables, which appear to do something else altogether. Pascal et al.
didn't read carefully enough...]

> Note incidentally that the existing packages as proposed by the ARG do
> case conversion on a single character to single character basis, and
> so for that model, my rule is the kind of rule you need. The To_Upper
> function in Ada.Wide_Characters.Handling simply does not allow for an
> input of one character and an output of several characters.

Right, that definition is clearly broken based on the current language
definition. That was my point that started this discussion. For the intended
current language definition, I think there should only be a wide_string version
of To_Upper, and the other three functions are junk.

But it is OK with me to change the language definition: we just need to be
extremely clear about what we are doing and why. And as always, we need to have
good technical reasons for a change, as it is likely to be incompatible with
(partial) Ada 2005 implementations other than GNAT.

> Anyway, I don't particularly care what is decided, I don't think it
> will make any difference at this stage.

I don't care what is decided, either, but I do think it is important that there
are ACATS tests reflecting what is decided. The entire point of the ACATS is to
encourage uniformity between Ada implementations, and this seems like an
important area where uniformity is needed and may not occur naturally.

> P.S. I am surprised that no one resurrected Jean's insistence that
> lower-case-e-acute be considered equivalent to upper-case-e-no-accent
> :-)

We decided to do whatever Unicode recommended, specifically to avoid such
discussions. There is no value to arguing about particular characters -- it's
not our area of expertise.

> BTW, the original implementation of To_Upper in GNAT refers to note 1
> in the ISO 10646 standard, I don't have a copy of the standard handy,
> does someone know what note 1 says?

Note 1 is our way of saying that you get case folding from the Unicode
definition (it isn't in 10646). We were afraid to mention Unicode directly,
because there was some history of standards being rejected at high levels for
doing that. But since 10646 doesn't define case folding, we have to get it from
somewhere (we surely were not going to define our own), and Unicode seemed like
the appropriate place.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  1:32 PM

> (For the record, I had nothing significant to do with these choices;
> this area was primarily Pascal's and Kiyoshi's. I'm just trying to
> explain what they decided and what the entire ARG approved. Please do
> not attribute any of these ideas to me!)

Fair enough, this clearly needs rediscussing! I am unconvinced the ARG really
understood the issues, or really understood that they were introducing
upward-incompatible changes.

> So, since "coextensions" are clearly a mistake, no one has to
> implement them! Yaa!! :-)

I had more in mind the Ada 83 rule that made

    subtype X is integer range 1 .. 10;

be a non-static subtype.

> In this case, this isn't a mistake; it was a carefully considered decision.
> You just don't like the decision (and based on the old mail, you
> didn't like it in 2002 and 2005, either).

I am unconvinced this was carefully considered. In particular, if no one
understood that it was introducing an upward incompatibility, then the
discussion was seriously flawed. I am also quite unconvinced that people
understood the other ramifications. In my experience the ARG does not care to
think deeply about character issues (your comments about deferring to outside
experts are telling in this regard!)

To me the failure to explicitly worry about the compatibility issue is a fatal
flaw in the discussions. There are plenty of people on the ARG who couldn't care
less about wide character issues, but who are VERY concerned about introducing
gratuitous incompatibilities.

> I don't understand what you mean by "reversible" here. Even for Ada
> 83, there are 2**N equivalent identifiers (where N is the length of
> the identifier). If you have the all upper-case version of that
> identifier, there is no way to tell which of the 2**N versions it came from.
>
>> For example, it seems entirely wrong that the identifier j*y be
>> considered case equivalent to jssy. Your "character code experts"
>> would be quite surprised at this suggestion.
>
> Possibly. I'm not going to guess.

But you need to KNOW the answer to this before you decide to agree.

> I'm unconvinced that the incompatibility is that serious, in that you
> would get a compile-time error from the problem in the very rare case
> when it occurs. If you can find a case where that is *not* true, that
> would have been an important data point in choosing a different rule
> (I would probably have pushed for something compatible).

Language designers always seem to operate in a mode of "well it's easy to fix
the sources", they often seem totally unaware of the impact of incompatible
changes. For example, the change on return of limited types has VERY severely
impeded the uptake of Ada 2005.

> My biggest concern here is readability. But I've come to the
> conclusion that is a red herring. That's because code that uses
> identifiers having characters outside of the base 128 characters is
> never going to be readable to some subset of programmers. The use of
> identifiers in a local language with "funny" characters is probably
> only readable to speakers of that language. So "portable" code will
> avoid such characters, and for the rest it is only important that it is well-defined.

Actually I have a different view. I think that general good practice is to
always spell identifiers the same throughout a program, so Turkish programmers
are just fine, they can use any of the four i's in identifiers, and just spell
consistently throughout. Yes, the compiler won't detect some undesirable cases
of identifiers that should not be allowed to coexist, but auxiliary tools can
take care of this.

For instance, there are probably French programmers who think it is a bad idea
to have two identifiers that differ only by an acute accent over an E, but the
language won't help them, the same kind of tools can help them.

>> I regard this as complete nonsense. I am surprised people intended
>> this kind of nonsense. If the ARG insists on this interpretation, I
>> VERY much doubt that anyone would ever implement it.
>
> My understanding was that Pascal did a complete implementation for the
> IBM Rational compiler. And given that he knew the intent, I would
> guess that he did implement all of this (it's actually quite simple, anyway).

> We followed the Unicode recommendations for programming language
> identifiers quite closely. Pascal explained the differences in detail in AI95-00285.
>
> Now, I should point out that those recommendations changed quite
> substantially between Unicode 4.0 and 5.0, and we adopted Binding
> Interpretation AI05-0091-1 to reconcile these differences. But I don't
> recall any change in the case equivalence rules (I'll have to go look
> up the current state of those recommendations when I write this up as an AI).
>
>> For one thing, there is no problem in converting the small
>> parenthesized letters to upper case for the general usage, e.g. small
>> (c) becomes three characters (C). But you can't go that way for
>> identifiers. There are other such examples.
>
> Huh? That is exactly the sort of thing that is intended here (for
> identifiers). Not sure why you say "you can't go that way for identifiers".
> In any case, the (c) isn't a letter, so it is irrelevant in the
> identifier context.

Sorry, you don't know what you are talking about, of COURSE (c) is a letter, why
would I have mentioned it otherwise? :-) Here is the 10646 entry for it:

249E;PARENTHESIZED LATIN SMALL LETTER C;So;0;L

There are 25 more entries like that, probably you got confused with the
copyright symbol:

00A9;COPYRIGHT SIGN;So

That's something completely different, and is not a letter (it does not have
parens, it has a circle around the C).
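
The fields in the two entries quoted above can be cross-checked against any
Unicode character database. A minimal sketch, assuming Python's standard
unicodedata module (its data follows a later Unicode version than the
10646:2003 tables discussed here, so treat it as a cross-check, not as the
authoritative reading):

```python
import unicodedata

def describe(ch):
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )

# The two characters named in the message above.
paren_c = describe("\u249E")
copyright_sign = describe("\u00A9")

print(paren_c)
print(copyright_sign)
```

In each tuple the third item is the general category field and the fourth is
the bidirectional class field, in the same order as the semicolon-separated
entries quoted above.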

>> So you don't really mean to adopt all of the Unicode multi-character
>> scheme, because it won't work for identifiers.
>
> Why not?

see above

By the way, I formally object to Ada depending in any way directly on Unicode,
this is quite improper.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  1:33 PM

> (For the record, I had nothing significant to do with these choices;
> this area was primarily Pascal's and Kiyoshi's. I'm just trying to
> explain what they decided and what the entire ARG approved. Please do
> not attribute any of these ideas to me!)

Well you are arguing for the position, so I really think it is fair to attribute
at least agreement to you. If you disagree, please make this clear and explain
why.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  1:37 PM

By the way, with regard to ACATS tests, you can't have it both ways.

Either you regard it as reasonable and common to have the * character in
identifiers, in which case the upwards incompatibility inadvertently introduced
by the Ada 2005 change is serious.

Or you think it is obscure usage, obscure enough not to worry about the
incompatibility. In which case tests for obscure features do not belong in the
ACATS tests.

Anyway, I don't see any substantial resources being available for development of
new ACATS tests in any case, and I don't think that's a terrible thing, since
their utility would be minimal.

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010  1:44 PM

> By the way, I formally object to Ada depending in any way directly on
> Unicode, this is quite improper.

Why is it improper?

(Note that I am one of those you referred to with: "There are plenty of people
on the ARG who couldn't care less about wide character issues, but who are VERY
concerned about introducing gratuitous incompatibilities."  Well, I guess I care
enough to be curious why it's improper to depend on Unicode.  ;-))

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  2:20 PM

The last I knew, Unicode was not an ISO standard, but perhaps that has
changed????

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  2:22 PM

> (Note that I am one of those you referred to with: "There are plenty
> of people on the ARG who couldn't care less about wide character
> issues, but who are VERY concerned about introducing gratuitous incompatibilities."

So that people understand, here is one example of an incompatibility, there may
be others

Again * is the German beta standing for two lower case s's

In Ada 95, the following program is legal:

     package X is
        Y*  : Integer;
        Yss : Integer;
     end X;

In Ada 2005 this program becomes illegal because both identifiers map to upper
case YSS.
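
The collision can be reproduced outside an Ada compiler. A minimal sketch,
assuming Python, whose str.upper applies the Unicode full upper case mapping;
the sharp s is written as the escape \u00DF because this archive shows it
as *:

```python
# U+00DF is LATIN SMALL LETTER SHARP S, shown as "*" in the text above.
first = "Y\u00DF"
second = "Yss"

# Python's str.upper() uses the full mapping, which may change the length.
print(first.upper(), second.upper())

# True when the two spellings land on the same upper case string.
collide = first.upper() == second.upper()

# The original spellings do not even have the same number of characters.
same_length = len(first) == len(second)
print(collide, same_length)
```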

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  2:32 PM

> > (For the record, I had nothing significant to do with these choices;
> > this area was primarily Pascal's and Kiyoshi's. I'm just trying to
> > explain what they decided and what the entire ARG approved. Please do
> > not attribute any of these ideas to me!)
>
> Well you are arguing for the position, so I really think it is fair to
> attribute at least agreement to you. If you disagree, please make this
> clear and explain why.

I tend to be a strict constructionist, and tend to argue in support of whatever the
Standard says unless there is a clear technical reason that it is incorrect.
(Tucker knows this well vis-a-vis pragma Pack.) In this case, there was a clear
intent that I described in the AARM notes written at the time (and reviewed by
Pascal and some others as well), and I am just explaining what that intent is. I
personally made no attempt to determine whether or not that intent is a good
thing, because honestly, I don't care what happens with program text outside of
the first 128 characters beyond the rules being well-defined.

In this case, of course, I've discovered that there is something bizarre about
the rules as written (the defined upper case conversion actually goes to lower
case), and that alone provides a reason to reopen the discussion.

There is also the undocumented incompatibility, although I think I just missed
that when writing the AARM. Clearly, there is an incompatibility described in
the AARM note of 2.3(5.c/2), and that example was discussed in the ARG. But I
agree that there isn't any clear discussion of that in either of the AIs
(AI95-00285 and AI95-00395), so it's not clear that the ARG properly considered
the other incompatibilities caused by case-collisions changing.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  2:43 PM

> Again * is the German beta standing for two lower case s's
>
> In Ada 95, the following program is legal:
>
>      package X is
>         Y*  : Integer;
>         Yss : Integer;
>      end X;
>
> In Ada 2005 this program becomes illegal because both identifiers map
> to upper case YSS.

Right. Similarly,

    Acce* : Integer;

(* means as above) is illegal in Ada 2005, while it is OK in Ada 95. I'm pretty
certain this incompatibility was known, because it was discussed at a meeting
and is mentioned in the AARM (although not in the incompatibilities section).

My concern with this today is whether there is some case where this change could
cause a Beaujolais-type effect. (If there are, then the incompatibility is much
more dangerous than the one discussed and approved.) I don't think so because
use-clause cancelation would make any problem cases that I can think of illegal.
Humm, maybe local objects could do something nasty:

    procedure Y is
      Y* : Integer := 0;
    begin
       Y* := Some_Function (...);
       declare
           YSS : Integer := Some_Other_Function (...);
       begin
           P (Y*); -- Ouch!!
       end;
    end Y;

In Ada 2005, YSS would hide Y*, while in Ada 95, both would be visible. So a
different object would be used in the call to P (Ada 95 would use Y*, Ada 2005
would use YSS), without any compile-time error. Of course, this would be
exceedingly unlikely (you need similar objects in nested scopes, they have to
be of the same type, etc.), but it does seem scarier than the pure compile-time
incompatibility in rare cases.
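
The hiding effect can be modelled with a toy name lookup. This is only a
sketch, assuming Python: the lookup function, the scope dictionaries and the
two key functions are hypothetical stand-ins, not how any Ada compiler
resolves names. full_upper uses Python's full upper case mapping, and
one_to_one_upper stands in for a rule under which sharp s has no upper case
form:

```python
def lookup(name, scopes, key):
    """Search scopes innermost-first; names match when their keys are equal."""
    for scope in scopes:
        for declared, value in scope.items():
            if key(declared) == key(name):
                return declared, value
    raise KeyError(name)

def full_upper(s):
    # Python's str.upper(): full mapping, so sharp s becomes "SS".
    return s.upper()

def one_to_one_upper(s):
    # Keep any character whose full upper case form is longer than one character.
    out = []
    for c in s:
        u = c.upper()
        out.append(u if len(u) == 1 else c)
    return "".join(out)

outer = {"Y\u00DF": "outer object"}  # Y* declared in procedure Y
inner = {"YSS": "inner object"}      # YSS declared in the nested block
scopes = [inner, outer]              # innermost scope first

with_full = lookup("Y\u00DF", scopes, full_upper)
with_one_to_one = lookup("Y\u00DF", scopes, one_to_one_upper)
print(with_full)
print(with_one_to_one)
```

With the full mapping the reference resolves to the inner declaration; with
the length-preserving mapping it resolves to the outer one, and neither
lookup reports an error.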

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  2:47 PM

> the last I knew Unicode was not an ISO standard, but perhaps that has
> changed????

It's not. 10646 is the ISO version of Unicode, but the 2003 version was heavily
simplified and had no case conversion information. (At least that is what we
were told, I personally have never looked at 10646.) Thus we had to reference
something else to get that information (which is critical to Ada - it is a case
insensitive language).

****************************************************************

From: Tucker Taft
Sent: Wednesday, July 7, 2010  2:51 PM

In 1991 they effectively unified Unicode and ISO 10646:

    http://unicode.org/faq/unicode_iso.html

and apparently ISO 10646:2003 references Unicode with regard to case conversion
(at least that is what is implied by AARM 2.1(14.f/2)).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  3:05 PM

> conversion (at least that is what is implied by AARM 2.1(14.f/2)).

My understanding is that the referenced Note 1 of 10646 says essentially that if
you want case conversion information, go see Unicode. Which is what we did.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  3:04 PM

> In 1991 they effectively unified Unicode and ISO 10646:
>
>     http://unicode.org/faq/unicode_iso.html
>
> and apparently ISO 10646:2003 references Unicode with regard to case
> conversion (at least that is what is implied by AARM 2.1(14.f/2)).

Someone needs to verify this, I can't find my copy of the 10646 standard (only
the tables that I extracted from it :-))

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  2:56 PM

> In Ada 2005, YSS would hide Y*, while in Ada 95, both would be
> visible. So a different object would be used in the call to P (Ada 95
> would use Y*, Ada
> 2005 would use YSS), without any compile-time error. Of course, this
> would be exceedingly unlikely (you need similar objects in nested
> scopes, they have to be of the same type, etc.), but it does seem
> scarier than the pure compile-time incompatibility in rare cases.

Indeed a nasty upwards incompatibility.

The trouble is that if you end up having to say

"we are pretty much compatible, but there are cases where the meaning of a
program changes, and the behavior is different in Ada 95 and Ada 2005, but don't
worry, it is very unlikely that this will happen in practice."

Then we worry people who are maintaining millions of lines of legacy code. How
do they know whether they will hit these rare cases? Answer: they don't.

So even if the cases are obscure, it is better to be able to say absolutely:

"There are only a very small number of cases of upward incompatibility in going
from Ada 95 to Ada 2005 (for example, the introduction of the new keyword
INTERFACE). But in every case, if you hit one of those cases, the compiler will
signal the incompatibility as an illegality. So there is no possibility of
silent change of behavior due to one of these incompatible changes."

Do we have other cases in the nasty category of silent changes in behavior? Or
is this the only one?

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  3:03 PM

> It's not. 10646 is the ISO version of Unicode

That's not really quite right, 10646 was developed independently, coordinated
with Unicode, but it is not right to call 10646 the ISO version of Unicode.

> but the 2003 version was
> heavily simplified and had no case conversion information. (At least
> that is what we were told, I personally have never looked at 10646.)
> Thus we had to reference something else to get that information (which
> is critical to Ada - it is a case insensitive language).

We had no business referencing the unicode standard. Instead we should have
spelled out the exact rules in the Ada RM in detail without appealing to
improper outside authority.

If this results in a chaotic pile of junk, all the more reason to change to the
simple rule I propose. Let me state it again:

Two identifiers are considered case equivalent if and only if

a) they are the same length

b) for each character in one identifier, and the corresponding character
    in the other identifier, one of the following is true:

    o They are the same character
    o The two characters have names defined in 10646 which differ only
      in the replacement of CAPITAL LETTER by SMALL LETTER or vice versa

(No outside reference required).
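
The rule as stated can be sketched in a few lines. This assumes Python's
unicodedata module as the source of character names (on the understanding
that its names match those assigned in 10646); the function names are
hypothetical:

```python
import unicodedata

def _normalized_name(ch):
    """Character name with CAPITAL LETTER rewritten to SMALL LETTER, or None."""
    name = unicodedata.name(ch, None)
    if name is None:
        return None
    return name.replace("CAPITAL LETTER", "SMALL LETTER")

def chars_equivalent(a, b):
    # Same character, or names that differ only in CAPITAL vs SMALL LETTER.
    if a == b:
        return True
    na, nb = _normalized_name(a), _normalized_name(b)
    return na is not None and na == nb

def identifiers_equivalent(x, y):
    # Condition (a): same length.  Condition (b): pairwise equivalent characters.
    return len(x) == len(y) and all(chars_equivalent(a, b) for a, b in zip(x, y))

print(identifiers_equivalent("Essen", "ESSEN"))
print(identifiers_equivalent("E\u00DFen", "ESSEN"))
print(identifiers_equivalent("\u00E9t\u00E9", "\u00C9T\u00C9"))
```

Because the comparison is per character and requires equal lengths, a
spelling with sharp s (U+00DF) can never be equivalent to one that uses "ss".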

I really think this is the right rule:

a) it introduces no upwards incompatibilities with Ada 95
b) it captures all the common Latin-1 cases we are used to
c) it does a reasonable job on other alphabets
d) it captures the intent of 10646 in assigning these names

BTW, in the 10 years between Ada 95 and Ada 2005, does anyone remember some
German who was distressed that E*en and ESSEN were not considered case
equivalent? :-)

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010  3:23 PM

> Do we have other cases in the nasty category of silent changes in
> behavior? Or is this the only one?

I think there are about a dozen.  You can find them all by searching for
"Inconsistencies With Ada 95" in the AARM.  Well, all the ones except the ones
that we forgot to document.  ;-)

Here's one:

    use Ada.Calendar;
    ...
    Put_Line (Year_Number'Image(Year_Number'Last));

prints a different answer in Ada 95 versus Ada 2005.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  3:45 PM

> I think there are about a dozen.  You can find them all by searching
> for "Inconsistencies With Ada 95" in the AARM.
> Well, all the ones except the ones that we forgot to document.  ;-)

Better than searching, use the index in the AARM. That was started by the Ada 95
standard (probably by you, didn't you do the initial index?), and it is so
useful that I've continued it to the present day.

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010  4:07 PM

Yes, I'm primarily responsible for the Ada 95 Index; I'm kind of proud of that.
I put in everything I could think of that was even remotely useful.  For
example, terms from languages like C, pointing to the Ada equivalent (e.g.
"cast" --> "unchecked conversion").  And a couple of silly jokes, although I
think I made sure those appear only in the AARM. ;-)

Thanks for continuing to maintain the Index!  One of my pet peeves is technical
documents with a bad (or no) index.  My other pet peeve is books that have
separate indices for different sorts of things (index of predefined functions,
index of built-in operators, etc -- if I'm using the index, how am I supposed to
know whether so-and-so is a function or a gizmo or a whatnot?!)

The original Ada 95 index was badly alphabetized, because of a bug in Scribe
that I didn't notice.  You fixed that -- that's when I noticed!

I'm also kind of proud of the fact that the index points to the exact paragraph
number, rather than (as in most books) the page number. And the fact that the
"see blah" entries don't make you go look up "blah".

But I still search the AARM more often than I use the index.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  4:13 PM

> Yes, I'm primarily responsible for the Ada 95 Index; I'm kind of proud
> of that.  I put in everything I could think of that was even remotely
> useful.  For example, terms from languages like C, pointing to the Ada
> equivalent (e.g.  "cast" --> "unchecked conversion").

For the record "cast" is an Algol-68 term meaning something closer to type
conversion than unchecked conversion :-)

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  4:22 PM

> I'm also kind of proud of the fact that the index points to the exact
> paragraph number, rather than (as in most books) the page number.

For the HTML version, I spent a lot of effort so that clicking on the paragraph
number (or just the section in the ISO version) actually takes you directly to
the reference -- even if it is in the middle of a large paragraph. That's what I
used to look at all of them quickly in response to Robert's question. The links
for the index, the cross-references, and for the syntax terms are the big
advantage of the HTML version compared to the other versions (and the primary
reason I use it mostly -- but I still refer to the paper one when I know what I
want to look up, as it is faster to pick up and open to the right place than any
on-line version, especially as my well-worn version automatically opens up to
important sections that are commonly referenced [a feature of binding wear, I
think. I've noticed the same effect with paper Rand-McNally road atlases; during
a trip through some state, the map tends to open easily to the right page for
that state after a dozen or so references.]).

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010  4:38 PM

> For the record "cast" is an Algol-68 term meaning something closer to
> type conversion than unchecked conversion :-)

Interesting.

In C, it sometimes sort of means "unchecked conversion" and sometimes sort of
"type conversion", so my index entry is slightly misleading. Oh, well.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010  4:45 PM

> In C, it sometimes sort of means "unchecked conversion" and sometimes
> sort of "type conversion", so my index entry is slightly misleading.

The Algol-68 image is melt down the value, and then pour it into a cast
corresponding to the result MODE (type).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010  6:06 PM

...
> So even if the cases are obscure, it is better to be able to say
> absolutely:
>
> "There are only a very small number of cases of upward incompatibility
> in going from Ada 95 to Ada 2005 (for example, the introduction of the
> new keyword INTERFACE). But in every case, if you hit one of those
> cases, the compiler will signal the incompatibility as an illegality.
> So there is no possibility of silent change of behavior due to one of
> these incompatible changes."
>
> Do we have other cases in the nasty category of silent changes in
> behavior? Or is this the only one?

I'm sure we can't say the above. All of the runtime inconsistencies are
documented as "Inconsistencies with Ada 95". (Well, except this one, which I
wasn't aware of.) They're all listed in the Index of the AARM (there are 15
listed there, the first one is the definition of the header). A lot of them are
really very obscure or even fixing clear bugs. For instance, 3.3.1(34.f/2) says
that unconstrained aliased objects are no longer constrained by their initial
value, so an Ada 95 program that raises Constraint_Error might not do that in
Ada 2005.

Feel free to look at them and decide for yourself if any are unacceptable.
(That is why I've done the work of documenting these things, after all.)

Note that there currently are 11 of these listed for Ada 2012. They are all bug
fixes except for the composition of untagged record "=" change, which we all
agreed was preferable to trying to figure out some way for the old semantics to
hold. It almost always would fix a bug rather than introduce one anyway.

****************************************************************

From: Randy Brukardt
Sent: Sunday, January 23, 2011 12:30 AM

Attached is a new version of AI05-0227-1, the case equivalence AI.

This generally reflects the discussion of the ARG, and includes full wording,
but there are two major differences.

First, I used "locale-independent simple case folding", rather than
the full case folding that we had previously discussed. This happened
because I noticed the following line in the case folding description
when I was looking at the dotless I issues:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

This says that for "full case folding", Sharp S is the same as "ss".

I obviously missed that when I looked at this table in the past. That's a
non-starter for the compatibility reasons previously noted. I revised the
discussion based on that.
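
The difference between the two foldings can be seen by reading the status
field of such lines. A minimal sketch, assuming Python and a three-line
excerpt: the 00DF line is the one quoted above, while the two "C" lines are
recalled from the same table and should be checked against the real file.
The helper names are hypothetical. Simple folding is taken to use the lines
with status C or S, full folding those with status C or F:

```python
EXCERPT = """\
0053; C; 0073; # LATIN CAPITAL LETTER S
0059; C; 0079; # LATIN CAPITAL LETTER Y
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
"""

def load_folding(text, statuses):
    """Build a character -> folded string table from lines with the given statuses."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(c, 16)) for c in mapping.split()
            )
    return table

def fold(s, table):
    # Characters without an entry fold to themselves.
    return "".join(table.get(ch, ch) for ch in s)

simple = load_folding(EXCERPT, {"C", "S"})
full = load_folding(EXCERPT, {"C", "F"})

print(fold("Y\u00DF", full) == fold("Yss", full))
print(fold("Y\u00DF", simple) == fold("Yss", simple))
```

Under the full table the two spellings fold to the same string; under the
simple table the sharp s is left alone, so they stay distinct.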

Second, I did not add any rule about declarations in the same scope having
distinct upper-case mappings. There are many reasons for this.
  * The simple case folding has many fewer equivalences, so the problem
    is much less likely.
  * There is nothing in the standard that requires the default External_Tag
    values to be in a particular case, so there is no issue with those
    (so long as a canonical representation is used).
  * Wide_Wide_Exception_Name does have such a requirement, but given that
    the number of exceptions is typically low, and this is only debugging
    information (there is no routine that works like 'Value), it is hard
    to be concerned about what is obviously a pathology.
  * This check is likely to be expensive; either it requires storing an
    extra representation of each identifier in upper case, only used for
    this check (wasteful of space) or doing the conversion on the fly
    (wasteful of time). Either way, the number of comparisons needed for
    each declaration is proportional to the number of declarations in
    the scope, meaning that the check is quadratic in the total number
    of declarations. It is not unusual to have thousands of declarations
    in a package (especially in automatically generated code), so this
    could be a problem. While the problem still could occur for enumeration
    type declarations, there are many fewer of those.
    In any case, if we are going to do this generally, why are the identifiers
    distinct in the first place?? It would be easier to just make them
    map to the same identifier value, then no check would be needed.
  * Upper case mappings are not considered stable in Unicode. That is,
    future versions may be different. That could introduce incompatibilities
    into Ada where previously legal programs are now illegal because of
    upper case mappings. Again, this could happen for enumeration types,
    but that clearly is a much more limited incompatibility than "anywhere".
For all of these reasons, I only made a requirement for distinct upper case
mappings for enumeration literals.
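
A check of that kind for one enumeration type might look like the sketch
below. It assumes Python, and approx_simple_upper is only an approximation of
the simple upper case mapping (it keeps a character unchanged whenever
Python's full mapping would expand it); both function names are hypothetical.
The example uses the dotless i (U+0131), which folds to itself but has I as
its upper case form:

```python
from collections import defaultdict

def approx_simple_upper(s):
    # Approximation of the simple mapping: one character in, one character out.
    out = []
    for c in s:
        u = c.upper()
        out.append(u if len(u) == 1 else c)
    return "".join(out)

def find_collisions(literals):
    """Group literals by mapped form and return the groups with more than one member."""
    groups = defaultdict(list)
    for lit in literals:
        groups[approx_simple_upper(lit)].append(lit)
    return {k: v for k, v in groups.items() if len(v) > 1}

# "s\u0131ra" uses LATIN SMALL LETTER DOTLESS I; "sira" uses the ordinary i.
literals = ["s\u0131ra", "sira", "other"]
collisions = find_collisions(literals)
print(collisions)
```

The table keyed on the mapped form is the kind of extra per-identifier
representation mentioned in the list above; for a single enumeration type it
holds only that type's literals.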

Please read the AI (especially the wording and discussion) and make any
comments needed.

****************************************************************

