CVS difference for ai05s/ai05-0227-1.txt
--- ai05s/ai05-0227-1.txt 2010/10/22 04:40:01 1.1
+++ ai05s/ai05-0227-1.txt 2011/01/23 06:45:36 1.2
@@ -1,4 +1,4 @@
-!standard 1.1.4(14.2/2) 10-10-21 AI05-0227-1/01
+!standard 1.1.4(14.2/2) 11-01-22 AI05-0227-1/02
!class binding interpretation 10-10-21
!status work item 10-10-21
!status received 10-07-03
@@ -8,8 +8,12 @@
!subject Identifier equivalence
!summary
-** No clue **
+Identifier equivalence is based on Unicode "locale-independent simple case folding".
+The result of 'Wide_Wide_Image is based on the "simple upper case mapping" of
+the enumeration literals. An enumeration type is illegal if two literals have
+the same "simple upper case mapping".
+
!question
AARM 2.3(5.c/2) implies that Ada 2005 may be incompatible with Ada 95 in obscure
@@ -23,18 +27,94 @@
upper case" means "locale-independent full case folding". Full case folding does
not consider the character at 16#DF# (the German character that looks like a
beta) to be the same as "ss". Thus, these incompatibilities don't arise. However,
-"full case folding" is a mapping to *lower case*!! Thus the definition on
+"full case folding" is a mapping to *lower case*!! Thus the definition in
1.1.4(14.2/2) is madness. Moreover, taken literally, it requires 'Image to
produce *lower case* versions of enumeration literals, which would be completely
inconsistent with Ada 95 and before. That cannot have been intended.
-So what is the rule? (Beats me.)
+So what is the rule? (See Summary.)
!wording
+
+Replace 1.1.4(14.2/2):
+
+When this International Standard mentions the conversion of some character or sequence
+of characters to upper case, it means the character or sequence of characters obtained
+by using simple upper case mapping, as defined by documents referenced in the note
+in section 1 of ISO/IEC 10646:2003.
+
+Replace 2.3(5/3):
+
+Two identifiers are considered the same if they consist of the same sequence of characters
+after applying locale-independent simple case folding, as defined by documents referenced
+in the note in section 1 of ISO/IEC 10646:2003.
+
+Replace 2.3(5.3/3):
+
+After applying simple case folding, an identifier shall not be identical
+to a reserved word.
+
+AARM Discussion:
+
+Simple case folding is a mapping to lower case, so this is a matching to the defining
+(lower case) version of a reserved word. We could have mentioned case folding of the
+reserved words, but as that is an identity function, it would have no effect.
+
+Modify 2.3(5.b/3):
+
+...after [converting to upper case]{applying case folding} so that the rules...
+
+Replace 2.3(5.c/2):
+
+The rules for reserved words differ in one way: they define case conversion on letters rather than
+sequences. This means that it is possible that there exist some unusual sequences that are neither
+identifiers nor reserved words. We are not aware of any such sequences so long as we use
+simple case folding (as opposed to full case folding), but we have defined the rules in case any are
+introduced in future character set standards. This originally was a problem when converting to upper
+case: “<i>f” and “acce<s>” have upper case conversions of “IF” and “ACCESS” respectively. We would
+not want these to be treated as reserved words. But neither of these cases exists when using simple
+case folding.
+
+
+Replace the notes 2.3(6.a/2-6.i/2) by: (Turkish characters surrounded by <> here, <I> - dotted capital I,
+<i> - dotless lower-case I).
+
+For instance, in most languages, the simple case folded equivalent of LATIN CAPITAL LETTER I (an
+upper case letter without a dot above) is LATIN SMALL LETTER I (a lower case letter with a dot
+above). In Turkish, though, LATIN CAPITAL LETTER I and LATIN CAPITAL LETTER I WITH DOT ABOVE are two
+distinct letters, so the case folded equivalent of LATIN CAPITAL LETTER I is LATIN SMALL LETTER DOTLESS I,
+and the case folded equivalent of LATIN CAPITAL LETTER I WITH DOT ABOVE is LATIN SMALL LETTER I.
+Take for instance the following identifier (which is the name of a city on the Tigris river in
+Eastern Anatolia):
+
+ D<I>YARBAKIR -- First I is dotted, second is not.
+
+A Turkish speaker would expect that the original identifier is equivalent to:
+
+ diyarbak<i>r
+
+However, case folding (and thus Ada) maps this to:
+
+ d<I>yarbakir
+
+which is different from any of:
+
+ <the four values in 6.f>
+
+including the "correct" matching identifier for Turkish. Upper case conversion
+(used in '[Wide_]Wide_Image) introduces additional problems.
+
+An implementation targeting the Turkish market is allowed (in fact, expected) to
+provide a nonstandard mode where case folding is appropriate for Turkish.
+
+
+Replace 3.5.1(5):
-Fix 1.1.4(14.2/2):
+The defining_identifiers in upper case Redundant[and the defining_character_literals]
+listed in an enumeration_type_definition shall be distinct.
-<<Somehow>>
+AARM Reason: To ease implementation of the attribute Wide_Wide_Value, we require
+that all enumeration literals have distinct images.
!discussion
@@ -56,9 +136,10 @@
both be used for case conversion and case insensitive comparisons. These are
two different operations.
-An important property of "full case folding" is that Unicode guarantees that it
-is stable. That is, it will always provide the same results for two strings in
-any current or newer version of Unicode. Conversely, the various case mapping
+An important property of "full case folding" (and "simple case folding" as well)
+is that Unicode guarantees that it is stable. That is, it will always provide
+the same results for two strings (that contain only defined code-points) in
+any current or newer version of Unicode. Conversely, the various case mappings
are not stable: each new version of Unicode may provide different results.
Since "full case folding" is stable, it is appropriate for use in programming
@@ -74,35 +155,59 @@
the result of 'Image (which is a case conversion problem). To follow the Unicode
recommendations, these have to be treated separately.
-That means that notion of identifier equivalence in 2.3(5.2/2) would have to be
+That means that the notion of identifier equivalence in 2.3(5.2/2) needs to be
replaced by a direct reference to full case folding. Most of the other
references in 2.3 and 2.9 to upper case would also have to be replaced by this
same definition of equivalence.
-At the same time, we also would want to replace 1.1.4(14.2/2) by a reference to
-either "full upper case mapping" or "simple upper case mapping". This is what
-'Image would use. ('Image cannot use "full case mapping" as this is neither
-intended to be a case conversion mapping, and in any case it goes to lower
-case.)
-
-This would make Ada 2012 as compatible with the Unicode recommendations as
-possible, but there would be two side-effects:
-
-(1) It is possible that two different identifiers (by full case folding) would
-have the same upper case mapping. For instance, the 16#DF# (German sharp s,
-represented here by <S>) character maps to "SS" in full upper case mapping.
-Thus, 'Image (Bass) = "BASS" 'Image (Ba<S>) = "BASS" This problem could be
-minimized by using the Simple upper case mapping (which does not do this
-particular conversion), but it seems likely that there are other such cases (the
-dotless i comes to mind).
-
-This issue would usually not be significant, but it means that there would be
-small possibility that 'Image would not be reversible. That does not seem good.
-
-(2) 'Value would need to use full case folding to determine which identifier was
-provided. This would be more complex to do than the trivial conversions
-currently done; that would have a runtime cost, both in time and space.
+At the same time, we also need to replace 1.1.4(14.2/2) by a reference to
+"full upper case mapping". This is what '[Wide_Wide_]Image would use. ('Image
+cannot use "full case folding" as this is not intended to be a case
+conversion mapping, and in any case it goes to lower case.)
+
+This makes Ada 2012 as compatible with the Unicode recommendations as
+possible. However, there are problems.
+
+Both "full case folding" and "full upper case mapping" can cause strings to
+change lengths. This adds implementation complexity.
+
+"Locale-independent full case folding" maps the 16#DF# character (German sharp s,
+represented here by <S>) to "ss". This leads directly to the various
+incompatibilities noted in the question. In particular all of the following
+identifiers would be equivalent in Ada 2012 if this was adopted:
+ Bass BASS BAss Ba<S> ba<S>
+
+This is clearly incompatible with Ada 95 (where the last two are different than
+the first three). Moreover, this incompatibility could in fact lead to a
+Beaujolais-like inconsistency if there are nested identifiers that used to be
+considered different and now are considered the same.
+
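The equivalence classes described above can be reproduced with a minimal sketch.
It assumes Python, whose str.casefold() applies Unicode full case folding. The
helper simple_fold is hypothetical and only approximates simple case folding by
rejecting any per-character mapping that changes the length; a real
implementation would take the simple mappings from the Unicode data files.

```python
# Illustrative sketch only, not part of the AI wording.
# str.casefold() is Python's full case folding; simple_fold() is a hypothetical
# helper that approximates locale-independent simple case folding.

def simple_fold(identifier: str) -> str:
    """Approximate simple case folding, one character at a time."""
    folded = []
    for ch in identifier:
        candidate = ch.casefold()
        if len(candidate) != 1:
            # Full folding expanded the character (for example sharp s -> "ss").
            # Fall back to the lower case mapping if it keeps one character,
            # otherwise leave the character unchanged.
            candidate = ch.lower() if len(ch.lower()) == 1 else ch
        folded.append(candidate)
    return "".join(folded)


SHARP_S = "\u00DF"
names = ["Bass", "BASS", "BAss", "Ba" + SHARP_S, "ba" + SHARP_S]

# One class under full case folding: all five names are the same identifier.
full_classes = {name.casefold() for name in names}

# Two classes under the simple approximation: the sharp s names stay distinct.
simple_classes = {simple_fold(name) for name in names}

print(sorted(full_classes))
print(sorted(simple_classes))
```

Under these assumptions the five names collapse to a single identifier with full
case folding, while the length-preserving approximation keeps the two sharp s
spellings apart from the other three, as in Ada 95.
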
+We might be prepared to live with an incompatibility, but an inconsistency here
+is unconscionable. So we cannot use "locale-independent full case folding".
+
+If we use "locale-independent simple case folding" instead, it then makes no
+sense to use the more complex "full upper case mapping" for '[Wide_Wide_]Image.
+We would have the same problem with 'Image that we had previously with identifier
+equivalence. For instance, the sharp s character maps to "SS" in full upper case
+mapping. Thus, 'Image (Bass) = "BASS" and 'Image (Ba<S>) = "BASS". This particular
+example is especially nasty because it is inconsistent with the Ada 95 handling of
+this identifier.
+
+Additionally, '[Wide_Wide_]Value would have to use the relatively complex full case
+folding to determine which identifier was provided. This would be more complex
+to do than the trivial conversions currently done; that would have a runtime cost,
+both in time and space.
+
+Even using "simple case folding" and "simple upper case mapping", it is still
+possible for two different identifiers to have the same upper case mapping (the
+dotless i is likely to be such an example). In order to keep the implementation of
+'[Wide_Wide_]Value manageable, we have also adopted a rule that all of the literals
+of an enumeration type have distinct upper case mappings. This allows
+'[Wide_Wide_]Value to compare the upper case mappings of its parameter, rather than
+having to use case folding.
+
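The distinct-images rule for enumeration literals can be sketched as a grouping
check. This is an assumption-laden illustration in Python, not the method any
compiler is known to use: simple_upper and colliding_literals are hypothetical
names, and simple_upper only approximates simple upper case mapping by leaving
unchanged any character whose upper case form would be longer than one character.

```python
# Illustrative sketch only; the helper names are hypothetical.
from collections import defaultdict


def simple_upper(identifier: str) -> str:
    """Approximate simple upper case mapping, one character at a time."""
    return "".join(
        # Reject expansions such as sharp s -> "SS" so the length never changes.
        ch.upper() if len(ch.upper()) == 1 else ch
        for ch in identifier
    )


def colliding_literals(literals):
    """Group literals by upper case image; return only groups with a clash."""
    groups = defaultdict(list)
    for literal in literals:
        groups[simple_upper(literal)].append(literal)
    return [group for group in groups.values() if len(group) > 1]


DOTLESS_I = "\u0131"
SHARP_S = "\u00DF"

# Dotless i and ordinary i share the upper case image "KIT".
print(colliding_literals(["kit", "k" + DOTLESS_I + "t", "cat"]))

# With the length-preserving mapping, the sharp s literal keeps its own image.
print(colliding_literals(["Bass", "Ba" + SHARP_S]))

# Python's built-in upper() is a full mapping, so it does expand sharp s.
print(("Ba" + SHARP_S).upper())
```

A type whose literals produce a non-empty result here would be rejected under
the proposed rule, which is what lets a 'Value-style lookup compare upper case
images directly.
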
+
ALTERNATIVE SOLUTIONS
The bad effects above come about because case conversions and case equivalence
@@ -110,29 +215,19 @@
'Image in a programming language. Probably they would recommend that it return
the original case of the identifier. But it's many decades too late to do that.
-Thus, we probably also ought to consider simpler changes where we leave the
-semantic rules as they are and just define "convert to upper case" meaningfully.
+Thus, we also considered simpler changes where we leave the semantic rules as
+they are and just define "convert to upper case" meaningfully.
The obvious thing to do is to define "convert to upper case" in 1.1.4(14.2/2) to
-be full upper case folding. Indeed, the AARM notes were constructed using the
+be full upper case mapping. Indeed, the AARM notes were constructed using the
notion that this is what we were doing.
-
-However, this leads directly to the various incompatibilities noted in the
-question. In particular all of the following identifiers would be equivalent in
-Ada 2012 if this was adopted:
-
- Bass BASS BAss Ba<S> ba<S>
-This is clearly incompatible with Ada 95 (where the last two are different than
-the first three). Moreover, this incompatible could in fact lead to a
-beaujolias-like inconsistency if there are nested identifiers that used to be
-considered different and now are considered the same.
+However, this has the same problem as using "locale-independent full case folding".
+The Sharp-S mapping is such that identifiers are incompatible with Ada 95.
+Thus this solution has to be rejected.
-We might be prepared to live with an incompatibility, but an inconsistency here
-is unconsionable. So this solution has to be rejected.
-
The next obvious thing to do is to define "convert to upper case" in
-1.1.4(14.2/2) to be simple upper case folding. This always goes to the same
+1.1.4(14.2/2) to be simple upper case mapping. This always goes to the same
length string, so the problem given above does not occur. Indeed, we are not
aware of any incompatibility with Ada 95 here.
@@ -156,12 +251,13 @@
algorithm. (Robert Dewar has a suggestion of how we could do that in the
!appendix.) However, if we were to do that, it would have to be because we
wanted to avoid incompatibilities in the future. Thus simply depending on
-character names/classifications is not enough, as a future standard will surely
-change some of those.
+character names/classifications is not enough, as a future character set standard
+will surely change some of those.
!ACATS Test
+Adjust the Unicode identifier tests to reflect this decision.
!appendix
@@ -1762,6 +1858,64 @@
fixes except for the composition of untagged record "=" change, which we all
agreed was preferable to trying to figure out some way for the old semantics to
hold. It almost always would fix a bug rather than introduce one anyway.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Sunday, January 23, 2011 12:30 AM
+
+Attached is a new version of AI05-0227-1, the case equivalence AI.
+
+This generally reflects the discussion of the ARG, and includes full wording,
+but there are two major differences.
+
+First, I used "locale-independent simple case folding", rather than
+the full case folding that we had previously discussed. This happened
+because I noticed the following line in the case folding description
+when I was looking at the dotless I issues:
+
+00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
+
+This says that for "full case folding", Sharp S is the same as "ss".
+
+I obviously missed that when I looked at this table in the past. That's a
+non-starter for the compatibility reasons previously noted. I revised the
+discussion based on that.
+
+Second, I did not add any rule about declarations in the same scope having
+distinct upper-case mappings. There are many reasons for this.
+ * The simple case folding has many fewer equivalences, so the problem
+ is much less likely.
+ * There is nothing in the standard that requires the default External_Tag
+ values to be in a particular case, so there is no issue with those
+ (so long as a canonical representation is used).
+ * Wide_Wide_Exception_Name does have such a requirement, but given that
+ the number of exceptions is typically low, and this is only debugging
+ information (there is no routine that works like 'Value), it is hard
+ to be concerned about what is obviously a pathology.
+ * This check is likely to be expensive; either it requires storing an
+ extra representation of each identifier in upper case, only used for
+ this check (wasteful of space) or doing the conversion on the fly
+ (wasteful of time). Either way, the number of comparisons needed for
+ each declaration is proportional to the number of declarations in
+ the scope, meaning that the check is quadratic in the total number
+ of declarations. It is not unusual to have thousands of declarations
+ in a package (especially in automatically generated code), so this
+ could be a problem. While the problem still could occur for enumeration
+ type declarations, there are many fewer of those.
+ In any case, if we are going to do this generally, why are the identifiers
+ distinct in the first place?? It would be easier to just make them
+ map to the same identifier value, then no check would be needed.
+ * Upper case mappings are not considered stable in Unicode. That is,
+ future versions may be different. That could introduce incompatibilities
+ into Ada where previously legal programs are now illegal because of
+ upper case mappings. Again, this could happen for enumeration types,
+ but that clearly is a much more limited incompatibility than "anywhere".
+For all of these reasons, I only made a requirement for distinct upper case
+mappings for enumeration literals.
+
+Please read the AI (especially the wording and discussion) and make any
+comments needed.
****************************************************************