!standard 1.1.4(14.2/2) 11-05-05 AI05-0227-1/05
!standard 2.3(5/2)
!standard 2.3(5.3/2)
!standard 3.5.1(5)
!class binding interpretation 10-10-21
!status Amendment 2012 11-03-11
!status ARG Approved 6-0-2 11-02-20
!status work item 10-10-21
!status received 10-07-03
!priority Medium
!difficulty Hard
!qualifier Error
!subject Identifier equivalence

!summary

Identifier equivalence is based on Unicode "locale-independent simple case folding". The result of 'Wide_Wide_Image is based on the "simple upper case mapping" of the enumeration literals. An enumeration type is illegal if two literals have the same "simple upper case mapping".

!question

AARM 2.3(5.c/2) implies that Ada 2005 may be incompatible with Ada 95 in obscure cases. That's because some of the sequences which are neither identifiers nor reserved words in Ada 2005 were legal identifiers in Ada 95. Moreover, the expected change in identifier equivalence also could introduce incompatibilities or in very unusual cases, inconsistencies. Neither of these were discussed nor documented for Ada 2005.

However, this AARM note is wrong, because 1.1.4(14.2/2) says that "convert to upper case" means "locale-independent full case folding". Full case folding does not consider the character at 16#DF# (the German character that looks like a beta) to be the same as "ss". Thus, these incompatibilities don't arise.

However, "full case folding" is a mapping to *lower case*!! Thus the definition in 1.1.4(14.2/2) is madness. Moreover, taken literally, it requires 'Image to produce *lower case* versions of enumeration literals, which would be completely inconsistent with Ada 95 and before. That cannot have been intended.

So what is the rule? (See Summary.)

!wording

Replace 1.1.4(14.2/2):

When this International Standard mentions the conversion of some character or sequence of characters to upper case, it means the character or sequence of characters obtained by using simple upper case mapping, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.

Replace 2.3(5/3):

Two identifiers are considered the same if they consist of the same sequence of characters after applying locale-independent simple case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.

Replace 2.3(5.3/3):

After applying simple case folding, an identifier shall not be identical to a reserved word.

AARM Discussion: Simple case folding is a mapping to lower case, so this is matching the defining (lower case) version of a reserved word. We could have mentioned case folding of the reserved words, but as that is an identity function, it would have no effect.

Modify 2.3(5.b/3):

...after [converting to upper case]{applying case folding} so that the rules...

Replace 2.3(5.c/2):

The rules for reserved words differ in one way: they define case conversion on letters rather than sequences. This means that it is possible that there exist some unusual sequences that are neither identifiers nor reserved words. We are not aware of any such sequences so long as we use simple case folding (as opposed to full case folding), but we have defined the rules in case any are introduced in future character set standards. This originally was a problem when converting to upper case: “ıf” and “acceß” have upper case conversions of “IF” and “ACCESS” respectively. We would not want these to be treated as reserved words. But neither of these cases exists when using simple case folding.

Replace the notes 2.3(6.a/2-6.i/2) by:

(Turkish characters surrounded by <> here, <İ> - dotted capital I, <ı> - dotless lower-case I).

For instance, in most languages, the simple case folded equivalent of LATIN CAPITAL LETTER I (an upper case letter without a dot above) is LATIN SMALL LETTER I (a lower case letter with a dot above). In Turkish, though, LATIN CAPITAL LETTER I and LATIN CAPITAL LETTER I WITH DOT ABOVE are two distinct letters, so the case folded equivalent of LATIN CAPITAL LETTER I is LATIN SMALL LETTER DOTLESS I, and the case folded equivalent of LATIN CAPITAL LETTER I WITH DOT ABOVE is LATIN SMALL LETTER I. Take for instance the following identifier (which is the name of a city on the Tigris river in Eastern Anatolia):

D<İ>YARBAKIR -- First I is dotted, second is not.

A Turkish reader would expect that the original identifier is equivalent to:

diyarbak<ı>r

However, locale-independent simple case folding (and thus Ada) maps this to:

d<İ>yarbakir

which is different from any of the following identifiers:

including the "correct" matching identifier for Turkish. Upper case conversion (used in '[Wide_]Wide_Image) introduces additional problems.

An implementation targeting the Turkish market is allowed (in fact, expected) to provide a nonstandard mode where case folding is appropriate for Turkish.

Replace 3.5.1(5):

The defining_identifiers in upper case Redundant[and the defining_character_literals] listed in an enumeration_type_definition shall be distinct.

AARM Reason: To ease implementation of the attribute Wide_Wide_Value, we require that all enumeration literals have distinct images.

!discussion

The original intent was to follow the Unicode recommendations in this area. Therefore, the author went back and read the Unicode recommendations.

Unicode says that case-insensitive identifier equivalence should be done by converting both strings using "locale-independent case folding", and comparing the results. But this is *not* a conversion to some case: it is solely intended for comparisons.

Unicode also provides a variety of mappings to convert strings from mixed case into all upper case or all lower case versions. The "full" versions are preferred (these may change the length of the strings); the "simple" versions leave the string lengths the same.

The important take-away here is that there is no single operation that can be used both for case conversion and for case insensitive comparisons. These are two different operations.

An important property of "full case folding" (and "simple case folding" as well) is that Unicode guarantees that it is stable. That is, it will always provide the same results for two strings (that contain only defined code-points) in any current or newer version of Unicode. Conversely, the various case mappings are not stable: each new version of Unicode may provide different results.

Since "full case folding" is stable, it is appropriate for use in programming languages: identifiers will remain compatible with future versions of Unicode. Case mapping is not considered appropriate, because it may change with each new version of Unicode.

The Problem.

Ada 95 and Ada 2005 have confused the two concerns. "Convert to upper case" is used both for identifier equivalence (which is a comparison operation) and for the result of 'Image (which is a case conversion problem). To follow the Unicode recommendations, these have to be treated separately.

That means that the notion of identifier equivalence in 2.3(5.2/2) needs to be replaced by a direct reference to full case folding. Most of the other references in 2.3 and 2.9 to upper case would also have to be replaced by this same definition of equivalence.

At the same time, we also need to replace 1.1.4(14.2/2) by a reference to "full upper case mapping". This is what '[Wide_Wide_]Image would use. ('Image cannot use "full case folding", as this is not intended to be a case conversion mapping, and in any case it goes to lower case.)

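The separation between comparison and conversion described above can be sketched in a few lines of Python. This is only an illustration and not part of the proposed wording. It assumes Python's built-in `str.casefold` and `str.upper`, which apply the *full* Unicode case folding and upper case mapping for whatever Unicode version the interpreter ships with. The helper names `same_identifier` and `image` are invented for this sketch.

```python
def same_identifier(a: str, b: str) -> bool:
    """Comparison: fold both strings, then compare the folded forms."""
    return a.casefold() == b.casefold()


def image(literal: str) -> str:
    """Conversion in the style of 'Image: map the literal to upper case."""
    return literal.upper()


if __name__ == "__main__":
    # Folding is used only to decide equivalence.
    print(same_identifier("Red", "RED"))
    # Upper case mapping produces the text that 'Image would return.
    print(image("Red"))
    # The folded form is lower case, so it cannot double as 'Image.
    print("Red".casefold())
```

Keeping the two helpers separate mirrors the point made above: the folded string is an internal comparison key and is never shown to the user, while the upper case string is a visible result.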
This makes Ada 2012 as compatible with the Unicode recommendations as possible.

However, there are problems. Both "full case folding" and "full upper case mapping" can cause strings to change lengths. This adds implementation complexity.

"Locale-independent full case folding" maps the 16#DF# character (German sharp s, represented here by ß) to "ss". This leads directly to the various incompatibilities noted in the question. In particular all of the following identifiers would be equivalent in Ada 2012 if this was adopted:

Bass BASS BAss Baß baß

This is clearly incompatible with Ada 95 (where the last two are different than the first three). Moreover, this incompatibility could in fact lead to a Beaujolais-like inconsistency if there are nested identifiers that used to be considered different and now are considered the same. We might be prepared to live with an incompatibility, but an inconsistency here is unconscionable.

So we cannot use "locale-independent full case folding". If we use "locale-independent simple case folding" instead, it then makes no sense to use the more complex "full upper case mapping" for '[Wide_Wide_]Image. We would have the same problem with 'Image that we had previously with identifier equivalence. For instance, the sharp s character maps to "SS" in full upper case mapping. Thus,

'Image (Bass) = "BASS"
'Image (Baß) = "BASS".

This particular example is especially nasty because it is inconsistent with the Ada 95 handling of this identifier.

Additionally, '[Wide_Wide_]Value would have to use the relatively complex full case folding to determine which identifier was provided. This would be more complex to do than the trivial conversions currently done; that would have a runtime cost, both in time and space.

Even using "simple case folding" and "simple upper case mapping", it is still possible for two different identifiers to have the same upper case mapping (identifiers including the dotless lower case i are such an example).

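The two hazards described above (a length-changing mapping for sharp s, and distinct identifiers that share an upper case form) can be reproduced with Python's built-in string methods. This is a sketch under stated assumptions: `str.casefold` and `str.upper` apply the *full* Unicode mappings, which differ from the simple mappings for sharp s but, as far as we know, coincide with them for dotless i. The helper `upper_case_collisions` is a hypothetical name introduced only for this example.

```python
from collections import defaultdict

SHARP_S = "\u00DF"     # LATIN SMALL LETTER SHARP S
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def upper_case_collisions(identifiers):
    """Group identifiers that share the same upper case mapping."""
    groups = defaultdict(list)
    for name in identifiers:
        groups[name.upper()].append(name)
    # Keep only the groups in which two or more identifiers collide.
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Full case folding changes the length: sharp s folds to "ss".
    print(len(("Ba" + SHARP_S).casefold()))
    # Full upper case mapping gives both identifiers the same image.
    print(("Ba" + SHARP_S).upper() == "Bass".upper())
    # Dotless i and ordinary i stay distinct after folding,
    # yet both map to the same upper case letter.
    print(upper_case_collisions(["intent", DOTLESS_I + "ntent", "other"]))
```

A check along the lines of `upper_case_collisions` is what a compiler would need in order to reject an enumeration type whose literals are distinct identifiers but have identical images.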
In order to keep the implementation of '[Wide_Wide_]Value manageable, we have also adopted a rule that all of the literals of an enumeration type have distinct upper case mappings. This allows '[Wide_Wide_]Value to compare the upper case mappings of its parameter, rather than having to use case folding.

ALTERNATIVE SOLUTIONS

The bad effects above come about because case conversions and case equivalence are separate entities. Clearly, Unicode does not anticipate a function like 'Image in a programming language. Probably they would recommend that it return the original case of the identifier. But it's many decades too late to do that.

Thus, we also considered simpler changes where we leave the semantic rules as they are and just define "convert to upper case" meaningfully.

The obvious thing to do is to define "convert to upper case" in 1.1.4(14.2/2) to be full upper case mapping. Indeed, the AARM notes were constructed using the notion that this is what we were doing. However, this has the same problem as using "locale-independent full case folding". The Sharp-S mapping is such that identifiers are incompatible with Ada 95. Thus this solution has to be rejected.

The next obvious thing to do is to define "convert to upper case" in 1.1.4(14.2/2) to be simple upper case mapping. This always goes to the same length string, so the problem given above does not occur. Indeed, we are not aware of any incompatibility with Ada 95 here.

Note however, that this problem has not gone away, it just has left Latin-1. For instance, the dotless i and normal i both map to 'I'. That means that "dotless-intent" and "intent" are still considered the same, and we still have sequences that are neither identifiers nor reserved words ("dotless-if" for instance). Since we already "solved" that problem in the Ada 2005 definition, this is not too terrible.
At least it is true that Image is always reversible with Value, as anything that is considered the same would have the same upper case mapping.

Note however that the definition of "simple upper case mapping" is not stable, meaning that switching to a newer version of Unicode (presumably when 10646 is updated) almost certainly would introduce incompatibilities.

We could solve this by abandoning Unicode altogether. We could do that by defining our own case mapping algorithm. (Robert Dewar has a suggestion of how we could do that in the !appendix.) However, if we were to do that, it would have to be because we wanted to avoid incompatibilities in the future. Thus simply depending on character names/classifications is not enough, as a future character set standard will surely change some of those. Thus we rejected this option.

!corrigendum 1.1.4(14.2/2)

@drepl
When this International Standard mentions the conversion of some character or sequence of characters to upper case, it means the character or sequence of characters obtained by using locale-independent full case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.
@dby
When this International Standard mentions the conversion of some character or sequence of characters to upper case, it means the character or sequence of characters obtained by using simple upper case mapping, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.

!corrigendum 2.3(5/2)

@drepl
Two @fa<identifier>s are considered the same if they consist of the same sequence of characters after applying the following transformations (in this order):
@dby
Two @fa<identifier>s are considered the same if they consist of the same sequence of characters after applying locale-independent simple case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.

!corrigendum 2.3(5.3/2)

@drepl
After applying these transformations, an @fa<identifier> shall not be identical to a reserved word (in upper case).
@dby
After applying simple case folding, an @fa<identifier> shall not be identical to a reserved word.

!corrigendum 3.5.1(5)

@drepl
The @fa<defining_identifier>s and @fa<defining_character_literal>s listed in an @fa<enumeration_type_definition> shall be distinct.
@dby
The @fa<defining_identifier>s in upper case and the @fa<defining_character_literal>s listed in an @fa<enumeration_type_definition> shall be distinct.

!ACATS Test

Adjust the Unicode identifier tests to reflect this decision.

!appendix

From: Randy Brukardt
Sent: Sunday, July 4, 2010 7:09 PM

[Split from a thread about AI05-0185-1. See that AI for previous mail.]

> > I personally had thought that this was talking about the same
> > mapping used for Ada Identifiers, but having read the definition
> > again, I'm not so sure anymore. That's because To_Upper for strings
> > is defined in terms of To_Upper for characters, and that surely
> > doesn't work for the full character set (how can To_Upper for a
> > character return the
> > *three* characters needed in some extreme cases??). So I suspect
> > that you are right that there is a definitional problem here.
>
> To_Upper cannot return three characters for one, what are you talking
> about? 10646 has one code per point, we are not talking about UTF-8
> strings here.

The upper case mapping for Unicode characters can be 2 and supposely 3 characters. The obvious example is the LC_German_Sharp_S (as it is named in Ada.Characters.Latin_1): the upper case mapping is "SS". It is certainly intended that Ada identifiers containing the LC_German_Sharp_S are considered the same as those containing "SS" in the same position (if I ever get around to creating ACATS tests for wide characters in identifiers, that will be one the first tests).

Thus it doesn't make much sense to define To_Upper for strings in terms of To_Upper for characters, assuming that the same results as for identifiers is intended. (It's certainly what I would expect to be intended, it would be strange to get different results.)

> For source it's up to you how the characters are represented, but
> conceptually identifiers are a sequence of wide_wide_characters.

Right. And the decision as to whether two identifiers are the same is made using the mapping defined in 2.1(5/2). (Ah-ha: this talks about "Simple Uppercase Mapping"; apparently this is a Unicode construct, as the "note 1" reference is just a way to get a veiled reference to Unicode into an ISO/IEC document. I definitely think we need to make it clearer in the AI wording that this is what is being talked about.)

****************************************************************

From: Robert Dewar
Sent: Sunday, July 4, 2010 8:18 PM

> The upper case mapping for Unicode characters can be 2 and supposely 3
> characters. The obvious example is the LC_German_Sharp_S (as it is
> named in
> Ada.Characters.Latin_1): the upper case mapping is "SS". It is
> certainly intended that Ada identifiers containing the
> LC_German_Sharp_S are considered the same as those containing "SS" in
> the same position (if I ever get around to creating ACATS tests for
> wide characters in identifiers, that will be one the first tests).

I find that absurd, and highly undesirable, this is case equivalence gone totally berserk. I entirely refuse to implement this on the grounds that it is highly undesirable to do so. And to have two totally different notions of upper casing junk characters in the language is simply horrible, I think the case mapping of 10646 with SMALL<-->CAPITAL is as far as we go.

And if you *DO* right such an ACATS test, I will consider it a last straw and declare the ACATS suite junk :-) :-)

Seriously, this is a weird interpretation and needs to be discussed by the whole ARG.

> Thus it doesn't make much sense to define To_Upper for strings in
> terms of To_Upper for characters, assuming that the same results as
> for identifiers is intended. (It's certainly what I would expect to be
> intended, it would be strange to get different results.)
>
>> For source it's up to you how the characters are represented, but
>> conceptually identifiers are a sequence of wide_wide_characters.
>
> Right. And the decision as to whether two identifiers are the same is
> made using the mapping defined in 2.1(5/2). (Ah-ha: this talks about
> "Simple Uppercase Mapping"; apparently this is a Unicode construct, as the "note 1"
> reference is just a way to get a veiled reference to Unicode into an
> ISO/IEC document. I definitely think we need to make it clearer in the
> AI wording that this is what is being talked about.)

Under no conditions can we tolerate two identifiers with different numbers of 10646 code points being considered identical in my opinion.

****************************************************************

From: Robert Dewar
Sent: Monday, July 5, 2010 6:40 AM

> The upper case mapping for Unicode characters can be 2 and supposely 3
> characters. The obvious example is the LC_German_Sharp_S (as it is
> named in
> Ada.Characters.Latin_1): the upper case mapping is "SS". It is
> certainly intended that Ada identifiers containing the
> LC_German_Sharp_S are considered the same as those containing "SS" in
> the same position (if I ever get around to creating ACATS tests for
> wide characters in identifiers, that will be one the first tests).

More on why this would be simply awful

Let's call LC_German_Sharp_S * in the below discussion

If we have the identifier *, then Randy things SS and * should be case equivalent.

But surely SS is equivalent to ss, so now do we break transititivity of case equivalence or is * also equivalent to ss?

This way lies complete madness in my opinion.

For example suppose we have the identifier SSS, is that equivalent to *s and also to s*, and now are *s and s* equivalent?

AARGH, if you have a whole row of S's, there are a combinatorial number of possible equivalent identifiers

I am not sure if trying to extend the world of case equivalence to other than A-Z makes sense at all, for example is E equivalent to e-acute (very often in french practice, accents are omitted from upper case, even though they should not be, because of type writers, I remember JDI strongly thought the answer was yes -- the answer must be no in fact for similar reasons to the above). But this is a done deal.

To me the only thing that makes sense for case equivalence is to regard the identifier or other string as a series of code points in 10646. Then if the names of two code points differ only in SMALL LETTER being replaced by CAPITAL LETTER, then they are equivalent, otherwise they are not equivalent.

How does that work out for Randy's example?

The entry for the character in question in 10646 is

> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;

There is no entry whose name is

LATIN CAPITAL LETTER SHARP S

Thus this character has no upper case equivalent.

This is the ONLY interpretation that makes sense. It is symmetrical, reversible (forget what I said about dotless I and dotted I, they are separate characters from the normal Latin-I, and if turkish folks want, they can consistently use these separate characters), and well-defined.

The To_Lower and To_Upper functions in this package should be consistent with this model, and yes Randy, with this model it is fine to define the string version in terms of the character version (nothing else makes sense).

Randy, if you want to devise some other version with multi-character replacements, feel free to write such a package, and even try to propose it as a separate package for the standard, but do not contaminate case equivalence of id.

The above intepretation is certainly what GNAT implements now, and that is not about to change unless there are very good arguments. I see no such arguments in sight!

**************************************************************** From: Robert Dewar Sent: Monday, July 5, 2010 8:13 AM For reference, here are the 10646 CAPITAL LETTER entries with no corresponding SMALL LETTER entries: > -- LATIN CAPITAL LETTER I WITH DOT ABOVE > -- LATIN CAPITAL LETTER AFRICAN D > -- LATIN CAPITAL LETTER O WITH MIDDLE TILDE > -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON > -- LATIN CAPITAL LETTER L WITH SMALL LETTER J > -- LATIN CAPITAL LETTER N WITH SMALL LETTER J > -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z > -- LATIN CAPITAL LETTER HWAIR > -- LATIN CAPITAL LETTER WYNN > -- GREEK CAPITAL LETTER UPSILON HOOK > -- GREEK CAPITAL LETTER UPSILON HOOK TONOS > -- GREEK CAPITAL LETTER UPSILON HOOK DIAERESIS > -- GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI > -- GREEK 
CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI > -- GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI > -- GREEK CAPITAL LETTER OMEGA WITH Here are the 10646 SMALL LETTER entries with no matching CAPITAL LETTER entries, note Randy's favortie at the start of the list. Note also the entries at the end, I trust that does not inspire Randy to figure out how to allow parentheses into identifiers (since I suppose he would consider the upper case equivalent of parenthesized-small-letter-c to be (C) :-)) > -- LATIN SMALL LETTER SHARP S > -- LATIN SMALL LETTER DOTLESS I > -- LATIN SMALL LETTER KRA > -- LATIN SMALL LETTER N PRECEDED BY APOSTROPHE > -- LATIN SMALL LETTER LONG S > -- LATIN SMALL LETTER B WITH STROKE > -- LATIN SMALL LETTER TURNED DELTA > -- LATIN SMALL LETTER HV > -- LATIN SMALL LETTER L WITH BAR > -- LATIN SMALL LETTER LAMBDA WITH STROKE > -- LATIN SMALL LETTER T WITH PALATAL HOOK > -- LATIN SMALL LETTER EZH WITH TAIL > -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON > -- LATIN CAPITAL LETTER L WITH SMALL LETTER J > -- LATIN CAPITAL LETTER N WITH SMALL LETTER J > -- LATIN SMALL LETTER TURNED E > -- LATIN SMALL LETTER J WITH CARON > -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z > -- LATIN SMALL LETTER D WITH CURL > -- LATIN SMALL LETTER L WITH CURL > -- LATIN SMALL LETTER N WITH CURL > -- LATIN SMALL LETTER T WITH CURL > -- LATIN SMALL LETTER TURNED A > -- LATIN SMALL LETTER ALPHA > -- LATIN SMALL LETTER TURNED ALPHA > -- LATIN SMALL LETTER C WITH CURL > -- LATIN SMALL LETTER D WITH TAIL > -- LATIN SMALL LETTER SCHWA WITH HOOK > 
-- LATIN SMALL LETTER REVERSED OPEN E > -- LATIN SMALL LETTER REVERSED OPEN E WITH HOOK > -- LATIN SMALL LETTER CLOSED REVERSED OPEN E > -- LATIN SMALL LETTER DOTLESS J WITH STROKE > -- LATIN SMALL LETTER SCRIPT G > -- LATIN SMALL LETTER RAMS HORN > -- LATIN SMALL LETTER TURNED H > -- LATIN SMALL LETTER H WITH HOOK > -- LATIN SMALL LETTER HENG WITH HOOK > -- LATIN SMALL LETTER L WITH MIDDLE TILDE > -- LATIN SMALL LETTER L WITH BELT > -- LATIN SMALL LETTER L WITH RETROFLEX HOOK > -- LATIN SMALL LETTER LEZH > -- LATIN SMALL LETTER TURNED M WITH LONG LEG > -- LATIN SMALL LETTER M WITH HOOK > -- LATIN SMALL LETTER N WITH RETROFLEX HOOK > -- LATIN SMALL LETTER BARRED O > -- LATIN SMALL LETTER CLOSED OMEGA > -- LATIN SMALL LETTER PHI > -- LATIN SMALL LETTER TURNED R > -- LATIN SMALL LETTER TURNED R WITH LONG LEG > -- LATIN SMALL LETTER TURNED R WITH HOOK > -- LATIN SMALL LETTER R WITH LONG LEG > -- LATIN SMALL LETTER R WITH TAIL > -- LATIN SMALL LETTER R WITH FISHHOOK > -- LATIN SMALL LETTER REVERSED R WITH FISHHOOK > -- LATIN SMALL LETTER S WITH HOOK > -- LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK > -- LATIN SMALL LETTER SQUAT REVERSED ESH > -- LATIN SMALL LETTER ESH WITH CURL > -- LATIN SMALL LETTER TURNED T > -- LATIN SMALL LETTER U BAR > -- LATIN SMALL LETTER TURNED V > -- LATIN SMALL LETTER TURNED W > -- LATIN SMALL LETTER TURNED Y > -- LATIN SMALL LETTER Z WITH RETROFLEX HOOK > -- LATIN SMALL LETTER Z WITH CURL > -- LATIN SMALL LETTER EZH WITH CURL > -- LATIN SMALL LETTER CLOSED OPEN E > -- LATIN SMALL LETTER J WITH CROSSED-TAIL > -- LATIN SMALL LETTER TURNED K > -- LATIN SMALL LETTER Q WITH HOOK > -- LATIN SMALL LETTER DZ DIGRAPH > -- LATIN SMALL LETTER DEZH DIGRAPH > -- LATIN SMALL LETTER DZ DIGRAPH WITH CURL > -- LATIN SMALL LETTER TS DIGRAPH > -- LATIN SMALL LETTER TESH DIGRAPH > -- LATIN SMALL LETTER TC DIGRAPH WITH CURL > -- LATIN SMALL LETTER FENG DIGRAPH > -- LATIN SMALL LETTER LS DIGRAPH > -- LATIN SMALL LETTER LZ DIGRAPH > -- LATIN SMALL LETTER 
TURNED H WITH FISHHOOK > -- LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL > -- COMBINING LATIN SMALL LETTER A > -- COMBINING LATIN SMALL LETTER E > -- COMBINING LATIN SMALL LETTER I > -- COMBINING LATIN SMALL LETTER O > -- COMBINING LATIN SMALL LETTER U > -- COMBINING LATIN SMALL LETTER C > -- COMBINING LATIN SMALL LETTER D > -- COMBINING LATIN SMALL LETTER H > -- COMBINING LATIN SMALL LETTER M > -- COMBINING LATIN SMALL LETTER R > -- COMBINING LATIN SMALL LETTER T > -- COMBINING LATIN SMALL LETTER V > -- COMBINING LATIN SMALL LETTER X > -- GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS > -- GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS > -- GREEK SMALL LETTER FINAL SIGMA > -- GREEK SMALL LETTER CURLED BETA > -- GREEK SMALL LETTER SCRIPT THETA > -- GREEK SMALL LETTER SCRIPT PHI > -- GREEK SMALL LETTER OMEGA PI > -- GREEK SMALL LETTER ARCHAIC KOPPA > -- GREEK SMALL LETTER SCRIPT KAPPA > -- GREEK SMALL LETTER TAILED RHO > -- GREEK SMALL LETTER LUNATE SIGMA > -- GEORGIAN SMALL LETTER FI > -- LIMBU SMALL LETTER KA > -- LIMBU SMALL LETTER NGA > -- LIMBU SMALL LETTER ANUSVARA > -- LIMBU SMALL LETTER TA > -- LIMBU SMALL LETTER NA > -- LIMBU SMALL LETTER PA > -- LIMBU SMALL LETTER MA > -- LIMBU SMALL LETTER RA > -- LIMBU SMALL LETTER LA > -- LATIN SMALL LETTER TURNED AE > -- LATIN SMALL LETTER TURNED OPEN E > -- LATIN SMALL LETTER TURNED I > -- LATIN SMALL LETTER SIDEWAYS O > -- LATIN SMALL LETTER SIDEWAYS OPEN O > -- LATIN SMALL LETTER SIDEWAYS O WITH STROKE > -- LATIN SMALL LETTER TURNED OE > -- LATIN SMALL LETTER TOP HALF O > -- LATIN SMALL LETTER BOTTOM HALF O > -- LATIN SMALL LETTER SIDEWAYS U > -- LATIN SMALL LETTER SIDEWAYS DIAERESIZED U > -- LATIN SMALL LETTER SIDEWAYS TURNED M > -- LATIN SUBSCRIPT SMALL LETTER I > -- LATIN SUBSCRIPT SMALL LETTER R > -- LATIN SUBSCRIPT SMALL LETTER U > -- LATIN SUBSCRIPT SMALL LETTER V > -- GREEK SUBSCRIPT SMALL LETTER BETA > -- GREEK SUBSCRIPT SMALL LETTER GAMMA > -- GREEK SUBSCRIPT SMALL LETTER RHO > -- GREEK 
SUBSCRIPT SMALL LETTER PHI > -- GREEK SUBSCRIPT SMALL LETTER CHI > -- LATIN SMALL LETTER UE > -- LATIN SMALL LETTER H WITH LINE BELOW > -- LATIN SMALL LETTER T WITH DIAERESIS > -- LATIN SMALL LETTER W WITH RING ABOVE > -- LATIN SMALL LETTER Y WITH RING ABOVE > -- LATIN SMALL LETTER A WITH RIGHT HALF RING > -- LATIN SMALL LETTER LONG S WITH DOT ABOVE > -- GREEK SMALL LETTER UPSILON WITH PSILI > -- GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA > -- GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA > -- GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI > -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH DASIA AND 
OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ALPHA WITH PERISPOMENI > -- GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER ETA WITH PERISPOMENI > -- GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI > -- GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA > -- GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA > -- GREEK SMALL LETTER IOTA WITH PERISPOMENI > -- GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI > -- GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA > -- GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA > -- GREEK SMALL LETTER RHO WITH PSILI > -- GREEK SMALL LETTER UPSILON WITH PERISPOMENI > -- GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI > -- GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI > -- GREEK SMALL LETTER OMEGA WITH PERISPOMENI > -- GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI > -- SUPERSCRIPT LATIN SMALL LETTER I > -- SUPERSCRIPT LATIN SMALL LETTER N > -- TURNED GREEK SMALL LETTER IOTA > -- PARENTHESIZED LATIN SMALL LETTER A > -- PARENTHESIZED LATIN SMALL LETTER B > -- PARENTHESIZED LATIN SMALL LETTER C > -- PARENTHESIZED LATIN SMALL LETTER D > -- PARENTHESIZED LATIN SMALL LETTER E > -- PARENTHESIZED LATIN SMALL LETTER F > -- PARENTHESIZED LATIN SMALL LETTER G > -- PARENTHESIZED LATIN SMALL LETTER H > -- PARENTHESIZED LATIN SMALL LETTER I > -- PARENTHESIZED LATIN SMALL LETTER J > -- 
PARENTHESIZED LATIN SMALL LETTER K > -- PARENTHESIZED LATIN SMALL LETTER L > -- PARENTHESIZED LATIN SMALL LETTER M > -- PARENTHESIZED LATIN SMALL LETTER N > -- PARENTHESIZED LATIN SMALL LETTER O > -- PARENTHESIZED LATIN SMALL LETTER P > -- PARENTHESIZED LATIN SMALL LETTER Q > -- PARENTHESIZED LATIN SMALL LETTER R > -- PARENTHESIZED LATIN SMALL LETTER S > -- PARENTHESIZED LATIN SMALL LETTER T > -- PARENTHESIZED LATIN SMALL LETTER U > -- PARENTHESIZED LATIN SMALL LETTER V > -- PARENTHESIZED LATIN SMALL LETTER W > -- PARENTHESIZED LATIN SMALL LETTER X > -- PARENTHESIZED LATIN SMALL LETTER Y > -- PARENTHESIZED LATIN SMALL LETTER Z **************************************************************** From: Randy Brukardt Sent: Tuesday, July 6, 2010 6:49 PM ... > > The upper case mapping for Unicode characters can be 2 and supposely > > 3 characters. The obvious example is the LC_German_Sharp_S (as it is > > named in > > Ada.Characters.Latin_1): the upper case mapping is "SS". It is > > certainly intended that Ada identifiers containing the > > LC_German_Sharp_S are considered the same as those containing "SS" > > in the same position (if I ever get around to creating ACATS tests > > for wide characters in identifiers, that will be one the first tests). > > I find that absurd, and highly undesirable, this is case equivalence > gone totally berserk. I entirely refuse to implement this on the > grounds that it is highly undesirable to do so. And to have two > totally different notions of upper casing junk characters in the > language is simply horrible, OK, but that is what the language defines. > I think the case mapping of 10646 with SMALL<-->CAPITAL is as far as > we go. I just went back and re-read the Ada 95 AIs that define this, and the important result is that we want to follow the Unicode recommendations and not try to invent our own character set rules. 
Thus we adopted the Unicode case folding (see 1.1.4(14.2/2)), and as the ramification note 1.1.4(14.f/2) says, this is applied on complete sequences, not single characters. As I recall, Unicode documents were pretty clear that doing single character conversions is a bad idea. > And if you *DO* right such an ACATS test, I will consider it a last > straw and declare the ACATS suite junk :-) :-) Well, in this case, the ACATS would be reflecting the standard as written. If you don't like that, you need to get the Standard changed. > Seriously, this is a weird interpretation and needs to be discussed by > the whole ARG. It was discussed by the whole ARG when it was adopted. This was definitely an intended choice. Whether everyone understood the ramifications, I can't say, but they're clearly mentioned in the record and in the chosen wording. For instance, we adopted a slightly different rule for reserved words than for identifiers, and the ramification 2.3(5.c/2) discusses the fact that there are strings of letters that are neither legal identifiers nor reserved words. I recall that it took several iterations to settle on this intent. The one thing that I see is missing is that I failed to document these as "Incompatibilities with Ada 95", as there are some identifiers that would be considered different for Ada 95 that would be considered the same for Ada 2005 (or even illegal). I don't know if that would have changed anyone's opinion, but I doubt it. ... > Under no conditions can we tolerate two identifiers with different > numbers of 10646 code points being considered identical in my opinion. Well, we decided differently back in 2005. The question is whether there is any semantic problem with this. You later complain: > Let's call LC_German_Sharp_S * in the below discussion But surely SS > is equivalent to ss, so now do we break transititivity > of case equivalence or is * also equivalent to ss? 
The rule is that identifiers are converted to upper case, then compared for equality. So of course "*" and "ss" and "SS" are all equivalent. Similarly, "acce*" is equivalent to "access" (but the former is considered illegal, as reserved words have to be written in their ascii form). > For example suppose we have the identifier SSS, is that equivalent to > *s and also to s*, and now are *s and s* equivalent? AARGH, if you > have a whole row of S's, there are a combinatorial number of possible > equivalent identifiers Yes, there are a lot of possible equivalent identifiers. So what? I don't see any semantic problem with that; the only thing that matters as far as the language is concerned is that we can tell whether two identifiers are equivalent. ... > To me the only thing that makes sense for case equivalence is to > regard the identifier or other string as a series of code points in 10646. Unicode strongly suggested that this sort of case equivalence is a very bad idea. (I haven't gone back to check if that is still true, we'll need to do that before we argue this for the ARG.) In any case, I don't care all that much what we decide. Personally, I think allowing characters outside of Latin-1 in Ada source is a mistake. But that wasn't a choice that we were allowed to make. The second best choice seemed to be to follow the recommendations of the character set experts, and what we decided is what they recommended (in 2005). I'm very leery of claiming that Ada experts are smarter than character set experts in this particular area. **************************************************************** From: Robert Dewar Sent: Tuesday, July 6, 2010 10:35 PM > OK, but that is what the language defines. No it doesn't that would violate Robert's rule of reasonableness >> I think the case mapping of 10646 with SMALL<-->CAPITAL is as far as >> we go. 
>
> I just went back and re-read the Ada 95 AIs that define this, and the
> important result is that we want to follow the Unicode recommendations
> and not try to invent our own character set rules. Thus we adopted the
> Unicode case folding (see 1.1.4(14.2/2)), and as the ramification note
> 1.1.4(14.f/2) says, this is applied on complete sequences, not single
> characters.

Absurd in this case.

> As I recall, Unicode documents were pretty clear that doing single
> character conversions is a bad idea.

Doing case equivalence beyond basic Latin-1 characters is a bad

>> And if you *DO* write such an ACATS test, I will consider it a last
>> straw and declare the ACATS suite junk :-) :-)
>
> Well, in this case, the ACATS would be reflecting the standard as written.
> If you don't like that, you need to get the Standard changed.

The standard often has mistakes; that doesn't mean you have to try to implement them. Any attempt to implement this ends up with nonsense.

>> Seriously, this is a weird interpretation and needs to be discussed
>> by the whole ARG.
>
> It was discussed by the whole ARG when it was adopted. This was
> definitely an intended choice. Whether everyone understood the
> ramifications, I can't say, but they're clearly mentioned in the record
> and in the chosen wording.
> For instance, we adopted a slightly different rule for reserved words
> than for identifiers, and the ramification 2.3(5.c/2) discusses the
> fact that there are strings of letters that are neither legal
> identifiers nor reserved words. I recall that it took several
> iterations to settle on this intent.

Well, it is absurd if case equivalence is not reversible and transitive. It results in all kinds of anomalies, and if it is transitive, we get very peculiar things.

For example, it seems entirely wrong that the identifier j*y be considered case equivalent to jssy. Your "character code experts" would be quite surprised at this suggestion.
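To make the disputed equivalence concrete, here is a small Python sketch. Python is used only because its `str.upper()` applies the Unicode full upper-case mapping; the helper names `same_identifier` and `equivalent_spellings` are hypothetical and do not correspond to any Ada implementation. It applies the "convert to upper case, then compare" rule quoted in this thread, with the sharp s written out where the messages use `*`.

```python
from itertools import product

SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S, written "*" in the messages


def same_identifier(a: str, b: str) -> bool:
    """Upper-case both spellings (full mapping), then compare."""
    return a.upper() == b.upper()


def equivalent_spellings(n: int) -> set:
    """All strings over {s, S, sharp s} that upper-case to n copies of 'S'."""
    found = set()
    for length in range(1, n + 1):
        for combo in product(("s", "S", SHARP_S), repeat=length):
            word = "".join(combo)
            if word.upper() == "S" * n:
                found.add(word)
    return found


# The j*y / jssy case raised in the thread.
print(same_identifier("j" + SHARP_S + "y", "jssy"))
# The *s versus s* case.
print(same_identifier(SHARP_S + "s", "s" + SHARP_S))
# Number of distinct spellings that collapse to SSS under this rule.
print(len(equivalent_spellings(3)))
```

Under this rule the relation stays transitive, because every spelling is reduced to one canonical upper-case string; the cost is that strings of different lengths land in the same equivalence class, which is the objection being raised.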
> The one thing that I see is missing is that I failed to document these
> as "Incompatibilities with Ada 95", as there are some identifiers that
> would be considered different for Ada 95 that would be considered the
> same for Ada 2005 (or even illegal). I don't know if that would have
> changed anyone's opinion, but I doubt it.

I hope it would have! The idea of introducing this kind of incompatibility for such an absurdly small gain should not have been countenanced for a moment. If you have a situation where two identifiers that are different in Ada 95 are the same in Ada 2005, you have a VERY SERIOUS incompatibility. Any kind of incompatibility needs to be justified on the grounds that it provides an important functionality (e.g. for some). That justification is totally lacking in this case.

> ...
>> Under no conditions can we tolerate two identifiers with different
>> numbers of 10646 code points being considered identical in my
>> opinion.
>
> Well, we decided differently back in 2005. The question is whether
> there is any semantic problem with this. You later complain:
>
>> Let's call LC_German_Sharp_S * in the below discussion. But surely SS
>> is equivalent to ss, so now do we break transitivity
>> of case equivalence or is * also equivalent to ss?
>
> The rule is that identifiers are converted to upper case, then
> compared for equality. So of course "*" and "ss" and "SS" are all
> equivalent. Similarly, "acce*" is equivalent to "access" (but the
> former is considered illegal, as reserved words have to be written in
> their ASCII form).
>
>> For example suppose we have the identifier SSS, is that equivalent to
>> *s and also to s*, and now are *s and s* equivalent? AARGH, if you
>> have a whole row of S's, there are a combinatorial number of possible
>> equivalent identifiers
>
> Yes, there are a lot of possible equivalent identifiers. So what?
I > don't see any semantic problem with that; the only thing that matters > as far as the language is concerned is that we can tell whether two > identifiers are equivalent. I regard this as complete nonsense. I am surprised people intended this kind of nonsense. If the ARG insists on this interpretation, I VERY much doubt that anyone would ever implement it. >> To me the only thing that makes sense for case equivalence is to >> regard the identifier or other string as a series of code points in 10646. > > Unicode strongly suggested that this sort of case equivalence is a > very bad idea. (I haven't gone back to check if that is still true, > we'll need to do that before we argue this for the ARG.) > > In any case, I don't care all that much what we decide. Personally, I > think allowing characters outside of Latin-1 in Ada source is a > mistake. But that wasn't a choice that we were allowed to make. The > second best choice seemed to be to follow the recommendations of the > character set experts, and what we decided is what they recommended > (in 2005). I'm very leery of claiming that Ada experts are smarter > than character set experts in this particular area. Nope, what should have been done is to allow arbitrary letters in identifiers but NOT extend case equivalence beyond Latin-1, THAT was the mistake, and you cannot blame this on Unicode, after all most reasonable programming languages don't have case equivalence of identifiers, so it is not on the Unicode radar screen wrt programming languages. Your so-called "character set experts" are talking about the general issue of case conversion, not the specific issue of case equivalence in identifiers. For one thing, there is no problem in converting the small parenthesized letters to upper case for the general usage, e.g. small (c) becomes three characters (C). But you can't go that way for identifiers. There are other such examples. 
So you don't really mean to adopt all of the Unicode multi-character scheme, because it won't work for identifiers.

Given that you need a special set of rules, different from the general Unicode rules, I think the rule I propose is simple and good enough. Once again, the rule I propose is

   If two code points in 10646 have names that are the same except
   for CAPITAL LETTER <---> SMALL LETTER, then case conversion
   converts between them. Otherwise case conversion has no effect.

This handles all the common reasonable cases, as well as some reasonable cases in other languages.

Another point is that in general case equivalence is locale dependent; consider the situation with the letter I. We have the following 10646 code points:

> 0049;LATIN CAPITAL LETTER I;Lu
> 0069;LATIN SMALL LETTER I;Ll
> 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu
> 0131;LATIN SMALL LETTER DOTLESS I;Ll

Now in locales which don't use the last two entries, like English, the case mapping is 49<-->69, but in locales that do use the last two entries (I think, if I remember, Turkish is a case in question), the case mappings are 49<-->131 and 69<-->130.

That's fine for general use, but obviously equivalence of Ada identifiers can't be locale dependent.

My rule looks at these four entries, and decides that the only mapping that makes sense is 49<-->69, which is what you want in this case.

If people want to write identifiers with code points 130 and 131, they can, but they should not expect case equivalence.

Note that it is not at all terrible to forbid case equivalence, since good practice is to always write identifiers with the same casing anyway.

I would be interested in what Randy has to say (or what he thinks the standard says) about these four code points???

Note by the way that Ada compilers (GNAT in particular) have allowed wide characters in identifiers forever, and that's because, despite Randy's feeling, lots of people in non-English countries find this very useful.
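As an illustration only, the name-based rule proposed above can be sketched in Python with the standard `unicodedata` module. The function name `name_based_swap` is hypothetical, and the results depend on the Unicode tables bundled with the interpreter, which are newer than the ones available in 2010, so individual characters may behave differently from what the participants had in mind.

```python
import unicodedata


def name_based_swap(ch: str) -> str:
    """Swap case only when a code point exists whose name differs solely
    by CAPITAL LETTER <-> SMALL LETTER; otherwise return ch unchanged."""
    name = unicodedata.name(ch, "")
    if "CAPITAL LETTER" in name:
        other = name.replace("CAPITAL LETTER", "SMALL LETTER", 1)
    elif "SMALL LETTER" in name:
        other = name.replace("SMALL LETTER", "CAPITAL LETTER", 1)
    else:
        return ch
    try:
        found = unicodedata.lookup(other)
    except KeyError:
        return ch  # no counterpart by name: case conversion has no effect
    # lookup() can also resolve named sequences; accept single code points only.
    return found if len(found) == 1 else ch


# The four "I" code points listed in the message.
for cp in (0x0049, 0x0069, 0x0130, 0x0131):
    ch = chr(cp)
    swapped = name_based_swap(ch)
    print(f"{cp:04X} -> {ord(swapped):04X}  {unicodedata.name(ch)}")
```

The rule is locale independent by construction, since it consults only character names. Whether a name-based rule is stable across Unicode versions is a separate question that this sketch does not settle.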
We did not however implement case equivalence for such identifiers until forced to do so, and it was probably a mistake to do so.

For me, it is fundamental that To_Lower and To_Upper should be invertible, i.e. if

   To_Lower (C) /= C

then

   To_Upper (To_Lower (C)) = C

I personally would not have introduced the To_Lower function, or extended it to apply to strings, but that's apparently done now, and I think my interpretation is the only one that makes sense.

Note incidentally that the existing packages as proposed by the ARG do case conversion on a single character to single character basis, and so for that model, my rule is the kind of rule you need. The To_Upper function in Ada.Wide_Characters.Handling simply does not allow for an input of one character and an output of several characters.

Anyway, I don't particularly care what is decided; I don't think it will make any difference at this stage.

P.S. I am surprised that no one resurrected Jean's insistence that lower-case-e-acute be considered equivalent to upper-case-e-no-accent :-)

BTW, the original implementation of To_Upper in GNAT refers to note 1 in the ISO 10646 standard. I don't have a copy of the standard handy; does someone know what note 1 says?

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 1:04 PM

(For the record, I had nothing significant to do with these choices; this area was primarily Pascal's and Kiyoshi's. I'm just trying to explain what they decided and what the entire ARG approved. Please do not attribute any of these ideas to me!)

...
>>> And if you *DO* write such an ACATS test, I will consider it a last
>>> straw and declare the ACATS suite junk :-) :-)
>>
>> Well, in this case, the ACATS would be reflecting the standard as written.
>> If you don't like that, you need to get the Standard changed.
>
> The standard often has mistakes, doesn't mean you have to try to
> implement them.
Any attempt to implement this ends up with nonsense. So, since "coextensions" are clearly a mistake, no one has to implement them! Yaa!! :-) In this case, this isn't a mistake; it was a carefully considered decision. You just don't like the decision (and based on the old mail, you didn't like in 2002 and 2005, either). > >> Seriously, this is a weird interpretation and needs to be discussed > >> by the whole ARG. > > > > It was discussed by the whole ARG when it was adopted. This was > > definitely an intended choice. Whether everyone understood the > > ramifications, I can't say, but they're clearly mentioned in the > > record and in the chosen wording. > > For instance, we adopted a slightly different rule for reserved > > words than for identifiers, and the ramification 2.3(5.c/2) > > discusses the fact that there are strings of letters that are > > neither legal identifiers nor reserved words. I recall that it took > > several iterations to settle on this intent. > > Well it is absurd if case equivalence is not reversible and transitive. > It results in all kinds of anomolies, and if it is transitive, we get > very peculiar things. I don't understand what you mean by "reversible" here. Even for Ada 83, there are 2**N equivalent identifiers (where N is the length of the identifier). If you have the all upper-case version of that identifier, there is no way to tell which of the 2**N versions it came from. > For example, it seems entirely wrong that the identifier j*y be > considered case equivalent to jssy. Your "character code experts" > would be quite suprised at this suggestion. Possibly. I'm not going to guess. > > The one thing that I see is missing is that I failed to document > > these as "Incompatibilities with Ada 95", as there are some > > identifiers that would be considered different for Ada 95 that would > > be considered the same for Ada 2005 (or even illegal). I don't know > > if that would have changed > > anyone's opinion, but I doubt it. 
> > I hope it would have! The idea of introducing this kind of > incompatibily for such an absurd small gain should not have been > countenanced for a moment. If you have a situation where two > identifiers that are different in Ada 95 are the same in Ada 2005, you > have a VERY SERIOUS incompatibility. Any kind of incompatibiltiy needs > to be justified on the grounds that it provides an important > functionality (e.g. for some). > That justification is totally lacking in this case. I'm unconvinced that the incompatibility is that serious, in that you would get a compile-time error from the problem in the very rare case when it occurs. If you can find a case where that is *not* true, that would have been an important data point in choosing a different rule (I would probably have pushed for something compatible). My biggest concern here is readability. But I've come to the conclusion that is a red herring. That's because code that uses identifiers having characters outside of the base 128 characters are never going to be readable to some subset of programmers. The use of identifiers in a local language with "funny" characters is probably only readable to speakers of that language. So "portable" code will avoid such characters, and for the rest it is only important that it is well-defined. ... > > Yes, there are a lot of possible equivalent identifiers. So what? I > > don't see any semantic problem with that; the only thing that > > matters as far as the language is concerned is that we can tell > > whether two identifiers are equivalent. > > I regard this as complete nonsense. I am surprised people intended > this kind of nonsense. If the ARG insists on this interpretation, I > VERY much doubt that anyone would ever implement it. My understanding was that Pascal did a complete implementation for the IBM Rational compiler. And given that he knew the intent, I would guess that he did implement all of this (it's actually quite simple, anyway). ... 
> Nope, what should have been done is to allow arbitrary letters in > identifiers but NOT extend case equivalence beyond Latin-1, THAT was > the mistake, and you cannot blame this on Unicode, after all most > reasonable programming languages don't have case equivalence of > identifiers, so it is not on the Unicode radar screen wrt programming > languages. Your so-called "character set experts" > are talking about the general issue of case conversion, not the > specific issue of case equivalence in identifiers. We followed the Unicode recommendations for programming language identifiers quite closely. Pascal explained the differences in detail in AI95-00285. Now, I should point out that those recommendations changed quite substantially between Unicode 4.0 and 5.0, and we adopted Binding Interpretation AI05-0091-1 to reconcile these differences. But I don't recall any change in the case equivalence rules (I'll have to go look up the current state of those recommendations when I write this up as an AI). > For one thing, there is no problem in converting the small > parenthesized letters to upper case for the general usage, e.g. small > (c) becomes three characters (C). But you can't go that way for > identifiers. There are other such examples. Huh? That is exactly the sort of thing that is intended here (for identifiers). Not sure why you say "you can't go that way for identifiers". In any case, the (c) isn't a letter, so it is irrelevant in the identifier context. > So you don't really mean to adopt all of the Unicode multi-character > scheme, because it won't work for identifiers. Why not? > Given that you need a special set of rules, different from the general > Unicode rules, I think the rule I propose is simple and good enough. > Once again, the rule I propose is > > If two code points in 10646 have names that are the same except > for CAPITAL LETTER <---> SMALL LETTER, then case conversion > converts between them. Otherwise case conversion has no effect. 
> > This handles all the common reasonable cases, as well as some > reasonable cases in other languages. > > Another point is that in general case equivalence is locale dependent, > consider the situation with the letter I. Unicode defines a locale-independent case folding algorithm. That is what the Ada Standard requires using. > We have the following 10646 code points: > > > 0049;LATIN CAPITAL LETTER I;Lu > > 0069;LATIN SMALL LETTER I;Ll > > 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu 0131;LATIN SMALL > > LETTER DOTLESS I;Ll > > Now in locales which don't use the last two entries, like english, the > case mapping is 49<-->69, but in locales that do use the last two > entries (I think if I remember Turkish is a case in question), the > case mappings are > 49<-->131 and 69<-->130. > > That's fine for general use, but obviously equivalence of Ada > identifiers can't be locale dependent. There is an extensive discussion of this issue in the AI and in the AARM (2.3(6.a-j)). The example of Turkish is given. The URL of the intended case foldings is provided in the AARM (2.1(14.g)): http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt Note that 1.1.4(14.2/2) says that locale-independent full case folding is used, meaning that the string lengths can change. Presumably this mapping has changed in later versions of Unicode, and one unanswered question is whether we intend to track these changes (which is likely to be incompatible) or just freeze it at Unicode 4.0. Looking at this table, I see that it is a mapping to *lower* case, so there is in fact a very real problem with the definition of the language (the definition of converting to upper case actually goes to lower case, which makes no sense). The intent that we used in examples was that both 69 and 131 map to 49. 
That's why there is a rule that non-standard reserved word spellings are illegal (the string coded 131 66 maps to "IF", but we don't want to be able to spell reserved words that way, so it is declared illegal). But this doesn't follow from the defined mapping.

> My rule looks at these four entries, and decides that the only mapping
> that makes sense is 49<-->69 which is what you want in this case.
>
> If people want to write identifiers with code points 130 and 131, they
> can, but they should not expect case equivalence.
>
> Note that it is not at all terrible to forbid case equivalence, since
> good practice is to always write identifiers with the same casing
> anyway.
>
> I would be interested in what Randy has to say (or what he thinks the
> standard says) about these four code points???

See above. The AI discussion that inspired the AARM note gives the intent, although it is clear that the intent isn't described properly in the standard. And I'm not going to make any comment on the desirability of that intent!

> Note by the way that Ada compilers (GNAT in particular) have allowed
> wide characters in identifiers forever, and that's because despite
> Randy's feeling, lots of people in non-English countries find this
> very useful. We did not however implement case equivalence for such
> identifiers until forced to do so, and it was probably a mistake to do
> so.
>
> For me, it is fundamental that To_Lower and To_Upper should be
> invertible, i.e. if
>
>    To_Lower (C) /= C
>
> then
>
>    To_Upper (To_Lower (C)) = C
>
> I personally would not have introduced the To_Lower function, or
> extended it to apply to strings, but that's apparently done now, and I
> think my interpretation is the only one that makes sense.

My understanding of Unicode suggests that there shouldn't be a To_Lower for Wide_String handling, so the issue doesn't really come up.
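The invertibility property quoted above can be checked mechanically against the Unicode mappings. The following Python sketch is a rough illustration, not a statement about any Ada implementation: `breaks_round_trip` is a hypothetical helper, Python's `str.lower()` and `str.upper()` apply full mappings (so the sketch keeps only one-character results), and the list it produces depends on the Unicode version bundled with the interpreter.

```python
def breaks_round_trip(ch: str) -> bool:
    """True if To_Lower(ch) /= ch yet To_Upper(To_Lower(ch)) /= ch,
    looking only at one-character-to-one-character mappings."""
    low = ch.lower()
    if low == ch or len(low) != 1:
        return False
    up = low.upper()
    return len(up) == 1 and up != ch


KELVIN = "\u212a"  # KELVIN SIGN lower-cases to "k", which upper-cases to "K"
print(breaks_round_trip(KELVIN))
print(breaks_round_trip("A"))

# Scan the Basic Multilingual Plane for every code point that fails the property.
offenders = [c for c in map(chr, range(0x10000)) if breaks_round_trip(c)]
print(len(offenders), [f"{ord(c):04X}" for c in offenders[:5]])
```

KELVIN SIGN is one known case where lower-casing followed by upper-casing does not return the original code point, so the property can only hold if such characters are excluded from case conversion or mapped by a different table.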
For Strings, the rules are as they are for Ada 95 (else programs would change behavior when moving to newer compilers, which would be very undesirable). [Unfortunately, that understanding doesn't seem to be reflected by the case folding tables, which appear to do something else altogether. Pascal, et. al. didn't read carefully enough...] > Note incidentally that the existing packages as proposed by the ARG do > case conversion on a single character to single character basis, and > so for that model, my rule is the kind of rule you need. The To_Upper > function in Ada.Wide_Characters.Handling simply does not allow for an > input of one character and an output of several characters. Right, that definition is clearly broken based on the current language definition. That was my point that started this discussion. For the intended current language definition, I think there should only be a wide_string version of To_Upper, and the other three functions are junk. But it is OK with me to change the language definition: we just need to be extremely clear about what we are doing and why. And as always, we need to have good technical reasons for a change, as it is likely to be incompatible with (partial) Ada 2005 implementations other than GNAT. > Anyway, I don't particularly care what is decided, I don't think it > will make any difference at this stage. I don't care what is decided, either, but I do think it is important that there are ACATS tests reflecting what is decided. The entire point of the ACATS is to encourage uniformity between Ada implementations, and this seems like an important area where uniformity is needed and may not occur naturally. > P.S. I am surprised that no one resurrected Jean's insistence that > lower-case-e-acute be considered equivalent to upper-case-e-no-accent > :-) We decided to do whatever Unicode recommended, specifically to avoid such discussions. There is no value to arguing about particular characters -- it's not our area of expertise. 
> BTW, the original implementation of To_Upper in GNAT refers to note 1
> in the ISO 10646 standard, I don't have a copy of the standard handy,
> does someone know what note 1 says?

Note 1 is our way of saying that you get case folding from the Unicode definition (it isn't in 10646). We were afraid to mention Unicode directly, because there was some history of standards being rejected at high levels for doing that. But since 10646 doesn't define case folding, we have to get it from somewhere (we surely were not going to define our own), and Unicode seemed like the appropriate place.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 1:32 PM

> (For the record, I had nothing significant to do with these choices;
> this area was primarily Pascal's and Kiyoshi's. I'm just trying to
> explain what they decided and what the entire ARG approved. Please do
> not attribute any of these ideas to me!)

Fair enough, this clearly needs rediscussing! I am unconvinced the ARG really understood the issues, or really understood that they were introducing upward-incompatible changes.

> So, since "coextensions" are clearly a mistake, no one has to
> implement them! Yaa!! :-)

I had more in mind the Ada 83 rule that made

   subtype X is integer range 1 .. 10;

be a non-static subtype.

> In this case, this isn't a mistake; it was a carefully considered decision.
> You just don't like the decision (and based on the old mail, you
> didn't like it in 2002 and 2005, either).

I am unconvinced this was carefully considered. In particular, if no one understood that it was introducing non-upwards compatibility, then the discussion was seriously flawed. I am also quite unconvinced that people understood the other ramifications. In my experience the ARG does not care to think deeply about character issues (your comments about deferring to outside experts are telling in this regard!)
To me the failure to explicitly worry about the compatibility issue is a fatal flaw in the discussions. There are plenty of people on the ARG who couldn't care less about wide character issues, but who are VERY concerned about introducing gratuitous incompatibilities.

> I don't understand what you mean by "reversible" here. Even for Ada
> 83, there are 2**N equivalent identifiers (where N is the length of
> the identifier). If you have the all upper-case version of that
> identifier, there is no way to tell which of the 2**N versions it
> came from.
>
>> For example, it seems entirely wrong that the identifier j*y be
>> considered case equivalent to jssy. Your "character code experts"
>> would be quite surprised at this suggestion.
>
> Possibly. I'm not going to guess.

But you need to KNOW the answer to this before you decide to agree.

> I'm unconvinced that the incompatibility is that serious, in that you
> would get a compile-time error from the problem in the very rare case
> when it occurs. If you can find a case where that is *not* true, that
> would have been an important data point in choosing a different rule
> (I would probably have pushed for something compatible).

Language designers always seem to operate in a mode of "well it's easy to fix the sources"; they often seem totally unaware of the impact of incompatible changes. For example, the change on return of limited types has VERY severely impeded the uptake of Ada 2005.

> My biggest concern here is readability. But I've come to the
> conclusion that is a red herring. That's because code that uses
> identifiers having characters outside of the base 128 characters are
> never going to be readable to some subset of programmers. The use of
> identifiers in a local language with "funny" characters is probably
> only readable to speakers of that language. So "portable" code will
> avoid such characters, and for the rest it is only important that it
> is well-defined.

Actually I have a different view.
I think that general good practice is to always spell identifiers the same throughout a program, so Turkish programmers are just fine, they can use any of the four i's in identifiers, and just spell consistently throughout. Yes, the compiler won't detect some undesirable cases of identifiers that should not be allowed to coexist, but auxiliary tools can take care of this. For instance, there are probably French programmers who think it is a bad idea to have two identifiers that differ only by an acute accent over an E, but the language won't help them, the same kind of tools can help them.

>> I regard this as complete nonsense. I am surprised people intended
>> this kind of nonsense. If the ARG insists on this interpretation, I
>> VERY much doubt that anyone would ever implement it.
>
> My understanding was that Pascal did a complete implementation for the
> IBM Rational compiler. And given that he knew the intent, I would
> guess that he did implement all of this (it's actually quite simple, anyway).
> We followed the Unicode recommendations for programming language
> identifiers quite closely. Pascal explained the differences in detail in AI95-00285.
>
> Now, I should point out that those recommendations changed quite
> substantially between Unicode 4.0 and 5.0, and we adopted Binding
> Interpretation AI05-0091-1 to reconcile these differences. But I don't
> recall any change in the case equivalence rules (I'll have to go look
> up the current state of those recommendations when I write this up as an AI).
>
>> For one thing, there is no problem in converting the small
>> parenthesized letters to upper case for the general usage, e.g. small
>> (c) becomes three characters (C). But you can't go that way for
>> identifiers. There are other such examples.
>
> Huh? That is exactly the sort of thing that is intended here (for
> identifiers). Not sure why you say "you can't go that way for identifiers".
> In any case, the (c) isn't a letter, so it is irrelevant in the
> identifier context.

Sorry, you don't know what you are talking about, of COURSE (c) is a letter, why would I have mentioned it otherwise? :-)

Here is the 10646 entry for it:

249E;PARENTHESIZED LATIN SMALL LETTER C;So;0;L

There are 25 more entries like that, probably you got confused with the copyright symbol:

00A9;COPYRIGHT SIGN;So

That's something completely different, and is not a letter (it does not have parens, it has a circle around the C).

>> So you don't really mean to adopt all of the Unicode multi-character
>> scheme, because it won't work for identifiers.
>
> Why not?

see above

By the way, I formally object to Ada depending in any way directly on Unicode, this is quite improper.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 1:33 PM

> (For the record, I had nothing significant to do with these choices;
> this area was primarily Pascal's and Kiyoshi's. I'm just trying to
> explain what they decided and what the entire ARG approved. Please do
> not attribute any of these ideas to me!)

Well you are arguing for the position, so I really think it is fair to attribute at least agreement to you. If you disagree, please make this clear and explain why.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 1:37 PM

By the way, with regard to ACATS tests, you can't have it both ways. Either you regard it as reasonable and common to have the * character in identifiers, in which case the upwards incompatibility inadvertently introduced by the Ada 2005 change is serious. Or you think it is obscure usage, obscure enough not to worry about the incompatibility. In which case tests for obscure features do not belong in the ACATS tests.
Anyway, I don't see any substantial resources being available for development of new ACATS tests in any case, and I don't think that's a terrible thing, since their utility would be minimal.

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010 1:44 PM

> By the way, I formally object to Ada depending in any way directly on
> Unicode, this is quite improper.

Why is it improper?

(Note that I am one of those you referred to with: "There are plenty of people on the ARG who couldn't care less about wide character issues, but who are VERY concerned about introducing gratuitous incompatibilities." Well, I guess I care enough to be curious why it's improper to depend on Unicode. ;-))

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 2:20 PM

the last I knew Unicode was not an ISO standard, but perhaps that has changed????

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 2:22 PM

> (Note that I am one of those you referred to with: "There are plenty
> of people on the ARG who couldn't care less about wide character
> issues, but who are VERY concerned about introducing gratuitous incompatibilities."

So that people understand, here is one example of an incompatibility, there may be others.

Again * is the German beta standing for two lower case s's.

In Ada 95, the following program is legal:

package X is
   Y* : Integer;
   Yss : Integer;
end X;

In Ada 2005 this program becomes illegal because both identifiers map to upper case YSS.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 2:32 PM

> > (For the record, I had nothing significant to do with these choices;
> > this area was primarily Pascal's and Kiyoshi's. I'm just trying to
> > explain what they decided and what the entire ARG approved. Please do
> > not attribute any of these ideas to me!)
>
> Well you are arguing for the position, so I really think it is fair to
> attribute at least agreement to you. If you disagree, please make this
> clear and explain why.

I tend to be a strict constructionist, and tend to argue in support of whatever the Standard says unless there is a clear technical reason that it is incorrect. (Tucker knows this well vis-a-vis pragma Pack.)

In this case, there was a clear intent that I described in the AARM notes written at the time (and reviewed by Pascal and some others as well), and I am just explaining what that intent is. I personally made no attempt to determine whether or not that intent is a good thing, because honestly, I don't care what happens with program text outside of the first 128 characters beyond the rules being well-defined.

In this case, of course, I've discovered that there is something bizarre about the rules as written (the defined upper case conversion actually goes to lower case), and that alone provides a reason to reopen the discussion.

There is also the undocumented incompatibility, although I think I just missed that when writing the AARM. Clearly, there is an incompatibility described in the AARM note of 2.3(5.c/2), and that example was discussed in the ARG. But I agree that there isn't any clear discussion of that in either of the AIs (AI95-00285 and AI95-00395), so it's not clear that the ARG properly considered the other incompatibilities caused by case-collisions changing.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 2:43 PM

> Again * is the german beta standing for two lower case s's
>
> In Ada 95, the following program is legal:
>
> package X is
>    Y* : Integer;
>    Yss : Integer;
> end X;
>
> In Ada 2005 this program becomes illegal because both identifiers map
> to upper case YSS.

Right.
Similarly,

   Acce* : Integer; (* means as above)

is illegal in Ada 2005, while it is OK in Ada 95. I'm pretty certain this incompatibility was known, because it was discussed at a meeting and is mentioned in the AARM (although not in the incompatibilities section).

My concern with this today is whether there is some case where this change could cause a Beaujolais-type effect. (If there are, then the incompatibility is much more dangerous than the one discussed and approved.) I don't think so because use-clause cancelation would make any problem cases that I can think of illegal.

Hmm, maybe local objects could do something nasty:

procedure Y is
   Y* : Integer := 0;
begin
   Y* := Some_Function (...);
   declare
      YSS : Integer := Some_Other_Function (...);
   begin
      P (Y*); -- Ouch!!
   end;
end Y;

In Ada 2005, YSS would hide Y*, while in Ada 95, both would be visible. So a different object would be used in the call to P (Ada 95 would use Y*, Ada 2005 would use YSS), without any compile-time error. Of course, this would be exceedingly unlikely (you need similar objects in nested scopes, they have to be of the same type, etc.), but it does seem scarier than the pure compile-time incompatibility in rare cases.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 2:47 PM

> the last I knew Unicode was not an ISO standard, but perhaps that has
> changed????

It's not. 10646 is the ISO version of Unicode, but the 2003 version was heavily simplified and had no case conversion information. (At least that is what we were told, I personally have never looked at 10646.) Thus we had to reference something else to get that information (which is critical to Ada - it is a case insensitive language).
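
The upper-case collisions described in the preceding messages (Y* against Yss, and Acce* against the reserved word access) can be reproduced outside Ada. The following is a minimal Python sketch, not Ada and not part of any message in this thread. It assumes only that Python's `str.upper` applies Unicode's full upper-case mapping, under which sharp s (written * in this thread) becomes "SS"; the helper name `same_after_upper` is made up for illustration.

```python
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S, written "*" in the thread


def same_after_upper(a: str, b: str) -> bool:
    """True if two spellings become identical under full upper-casing."""
    return a.upper() == b.upper()


# The two declarations in package X collide once both are upper-cased.
print(same_after_upper("Y" + SHARP_S, "Yss"))

# Acce* upper-cases to the same spelling as the reserved word "access".
print(("Acce" + SHARP_S).upper() == "access".upper())
```

Both checks print True, which is the collision the messages above are worried about. Whether Ada's wording actually specifies this mapping is a separate question that the thread itself goes on to examine.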
****************************************************************

From: Tucker Taft
Sent: Wednesday, July 7, 2010 2:51 PM

In 1991 they effectively unified Unicode and ISO 10646:

http://unicode.org/faq/unicode_iso.html

and apparently ISO 10646:2003 references Unicode with regard to case conversion (at least that is what is implied by AARM 2.1(14.f/2)).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 3:05 PM

> conversion (at least that is what is implied by AARM 2.1(14.f/2)).

My understanding is that the referenced Note 1 of 10646 says essentially that if you want case conversion information, go see Unicode. Which is what we did.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 3:04 PM

> In 1991 they effectively unified Unicode and ISO 10646:
>
> http://unicode.org/faq/unicode_iso.html
>
> and apparently ISO 10646:2003 references Unicode with regard to case
> conversion (at least that is what is implied by AARM 2.1(14.f/2)).

Someone needs to verify this, I can't find my copy of the 10646 standard (only the tables that I extracted from it :-))

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 2:56 PM

> In Ada 2005, YSS would hide Y*, while in Ada 95, both would be
> visible. So a different object would be used in the call to P (Ada 95
> would use Y*, Ada 2005 would use YSS), without any compile-time error.
> Of course, this would be exceedingly unlikely (you need similar objects
> in nested scopes, they have to be of the same type, etc.), but it does
> seem scarier than the pure compile-time incompatibility in rare cases.

Indeed a nasty upwards incompatibility.
The trouble is that if you end up having to say "we are pretty much compatible, but there are cases where the meaning of a program changes, and the behavior is different in Ada 95 and Ada 2005, but don't worry, it is very unlikely that this will happen in practice." Then we worry people who are maintaining millions of lines of legacy code. How do they know whether they will hit these rare cases? Answer: they don't.

So even if the cases are obscure, it is better to be able to say absolutely:

"There are only a very small number of cases of upward incompatibility in going from Ada 95 to Ada 2005 (for example, the introduction of the new keyword INTERFACE). But in every case, if you hit one of those cases, the compiler will signal the incompatibility as an illegality. So there is no possibility of silent change of behavior due to one of these incompatible changes."

Do we have other cases in the nasty category of silent changes in behavior? Or is this the only one?

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 3:03 PM

> It's not. 10646 is the ISO version of Unicode

That's not really quite right, 10646 was developed independently, coordinated with Unicode, but it is not right to call 10646 the ISO version of Unicode.

> but the 2003 version was
> heavily simplified and had no case conversion information. (At least
> that is what we were told, I personally have never looked at 10646.)
> Thus we had to reference something else to get that information (which
> is critical to Ada - it is a case insensitive language).

We had no business referencing the Unicode standard. Instead we should have spelled out the exact rules in the Ada RM in detail without appealing to improper outside authority. If this results in a chaotic pile of junk, all the more reason to change to the simple rule I propose.
Let me state it again:

Two identifiers are considered case equivalent if and only if

a) they are the same length

b) for each character in one identifier, and the corresponding character in the other identifier, one of the following is true:

   o They are the same character

   o The two characters have names defined in 10646 which differ only in the replacement of CAPITAL LETTER by SMALL LETTER or vice versa

(No outside reference required).

I really think this is the right rule:

a) it introduces no upwards incompatibilities with Ada 95

b) it captures all the common Latin-1 cases we are used to

c) it does a reasonable job on other alphabets

d) it captures the intent of 10646 in assigning these names

BTW, in the 10 years between Ada 95 and Ada 2005, does anyone remember some German who was distressed that E*en and ESSEN were not considered case equivalent? :-)

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010 3:23 PM

> Do we have other cases in the nasty category of silent changes in
> behavior? Or is this the only one?

I think there are about a dozen. You can find them all by searching for "Inconsistencies With Ada 95" in the AARM. Well, all the ones except the ones that we forgot to document. ;-)

Here's one:

   use Ada.Calendar;
   ...
   Put_Line (Year_Number'Image(Year_Number'Last));

prints a different answer in Ada 95 versus Ada 2005.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 3:45 PM

> I think there are about a dozen. You can find them all by searching
> for "Inconsistencies With Ada 95" in the AARM.
> Well, all the ones except the ones that we forgot to document. ;-)

Better than searching, use the index in the AARM. That was started by the Ada 95 standard (probably by you, didn't you do the initial index?), and it is so useful that I've continued it to the present day.
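
Returning to the name-based equivalence rule stated earlier in this thread (same length, and each pair of characters either identical or differing only by CAPITAL LETTER versus SMALL LETTER in their 10646 names): it is simple enough to sketch in a few lines. The following Python fragment is an illustration only and not part of any message here. It assumes that the character names returned by Python's `unicodedata.name` match the 10646 names, and the function names are hypothetical.

```python
import unicodedata


def chars_equivalent(a: str, b: str) -> bool:
    """Same character, or names differing only in CAPITAL LETTER / SMALL LETTER."""
    if a == b:
        return True
    try:
        name_a = unicodedata.name(a)
        name_b = unicodedata.name(b)
    except ValueError:
        # At least one character has no name in the database.
        return False
    # Map both names to the SMALL LETTER spelling before comparing.
    canon_a = name_a.replace("CAPITAL LETTER", "SMALL LETTER")
    canon_b = name_b.replace("CAPITAL LETTER", "SMALL LETTER")
    return canon_a == canon_b


def identifiers_equivalent(x: str, y: str) -> bool:
    """Rule (a): same length; rule (b): pairwise equivalent characters."""
    return len(x) == len(y) and all(
        chars_equivalent(a, b) for a, b in zip(x, y)
    )


print(identifiers_equivalent("Essen", "ESSEN"))      # True
print(identifiers_equivalent("E\u00dfen", "ESSEN"))  # False: lengths differ
```

Because the rule compares character by character and requires equal lengths, a sharp s can never be equivalent to "ss", which is the compatibility property claimed for it in point a).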
****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010 4:07 PM

Yes, I'm primarily responsible for the Ada 95 Index; I'm kind of proud of that. I put in everything I could think of that was even remotely useful. For example, terms from languages like C, pointing to the Ada equivalent (e.g. "cast" --> "unchecked conversion"). And a couple of silly jokes, although I think I made sure those appear only in the AARM. ;-)

Thanks for continuing to maintain the Index!

One of my pet peeves is technical documents with a bad (or no) index. My other pet peeve is books that have separate indices for different sorts of things (index of predefined functions, index of built-in operators, etc -- if I'm using the index, how am I supposed to know whether so-and-so is a function or a gizmo or a whatnot?!)

The original Ada 95 index was badly alphabetized, because of a bug in Scribe that I didn't notice. You fixed that -- that's when I noticed!

I'm also kind of proud of the fact that the index points to the exact paragraph number, rather than (as in most books) the page number. And the fact that the "see blah" entries don't make you go look up "blah".

But I still search the AARM more often than I use the index.

****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 4:13 PM

> Yes, I'm primarily responsible for the Ada 95 Index; I'm kind of proud
> of that. I put in everything I could think of that was even remotely
> useful. For example, terms from languages like C, pointing to the Ada
> equivalent (e.g. "cast" --> "unchecked conversion").
For the record "cast" is an Algol-68 term meaning something closer to type conversion than unchecked conversion :-)

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 4:22 PM

> I'm also kind of proud of the fact that the index points to the exact
> paragraph number, rather than (as in most books) the page number.

For the HTML version, I spent a lot of effort so that clicking on the paragraph number (or just the section in the ISO version) actually takes you directly to the reference -- even if it is in the middle of a large paragraph. That's what I used to look at all of them quickly in response to Robert's question.

The links for the index, the cross-references, and for the syntax terms are the big advantage of the HTML version compared to the other versions (and the primary reason I use it mostly -- but I still refer to the paper one when I know what I want to look up, as it is faster to pick up and open to the right place than any on-line version, especially as my well-worn version automatically opens up to important sections that are commonly referenced [a feature of binding wear, I think. I've noticed the same effect with paper Rand-McNally road atlases; during a trip through some state, the map tends to open easily to the right page for that state after a dozen or so references.]).

****************************************************************

From: Bob Duff
Sent: Wednesday, July 7, 2010 4:38 PM

> For the record "cast" is an Algol-68 term meaning something closer to
> type conversion than unchecked conversion :-)

Interesting.

In C, it sometimes sort of means "unchecked conversion" and sometimes sort of "type conversion", so my index entry is slightly misleading. Oh, well.
****************************************************************

From: Robert Dewar
Sent: Wednesday, July 7, 2010 4:45 PM

> In C, it sometimes sort of means "unchecked conversion" and sometimes
> sort of "type conversion", so my index entry is slightly misleading.

The Algol-68 image is melt down the value, and then pour it into a cast corresponding to the result MODE (type).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 7, 2010 6:06 PM

...
> So even if the cases are obscure, it is better to be able to say
> absolutely:
>
> "There are only a very small number of cases of upward incompatibility
> in going from Ada 95 to Ada 2005 (for example, the introduction of the
> new keyword INTERFACE). But in every case, if you hit one of those
> cases, the compiler will signal the incompatibility as an illegality.
> So there is no possibility of silent change of behavior due to one of
> these incompatible changes."
>
> Do we have other cases in the nasty category of silent changes in
> behavior? Or is this the only one?

I'm sure we can't say the above. All of the runtime inconsistencies are documented as "Inconsistencies with Ada 95". (Well, except this one, which I wasn't aware of.) They're all listed in the Index of the AARM (there are 15 listed there, the first one is the definition of the header).

A lot of them are really very obscure or even fixing clear bugs. For instance, 3.3.1(34.f/2) says that unconstrained aliased objects are no longer constrained by their initial value, so an Ada 95 program that raises Constraint_Error might not do that in Ada 2005. Feel free to look at them and decide for yourself if any are unacceptable. (That is why I've done the work of documenting these things, after all.)

Note that there currently are 11 of these listed for Ada 2012.
They are all bug fixes except for the composition of untagged record "=" change, which we all agreed was preferable to trying to figure out some way for the old semantics to hold. It almost always would fix a bug rather than introduce one anyway.

****************************************************************

From: Christoph Grein
Sent: Sunday, December 19, 2010 6:05 PM

Robert Dewar gave this example (* the German sharp s):

package X is
   Y* : Integer;
   Yss : Integer;
end X;

and later E*en and ESSEN (but there is no word E*en, of course there is no German complaining about a non-existing word).

Just to give a more appropriate example:

Bu*e and Busse are two different words in German (pronounced differently), the first meaning a fine or penitence, the second the plural buses.

But in Switzerland, they don't write the sharp s *, so the two words as written are indistinguishable.

BTW: There has been defined and entered a capital form of * meanwhile (don't know the code point).

****************************************************************

From: Georg Bauhaus
Sent: Monday, December 20, 2010 8:05 AM

> BTW: There has been defined and entered a capital form of * meanwhile
> (don't know the code point).

http://unicode.org/versions/Unicode5.1.0/#Tailored_Casing_Operations

Uppercasing U+00DF ( ß ) LATIN SMALL LETTER SHARP S to the new U+1E9E LATIN CAPITAL LETTER SHARP S

****************************************************************

From: Georg Bauhaus
Sent: Monday, December 20, 2010 8:37 AM

> Just to give a more appropriate example:
>
> Bu*e and Busse are two different words in German (pronounced
> differently), the first meaning a fine or penitence, the second the
> plural buses.
>
> But in Switzerland, they don't write the sharp s *, so the two words
> as written are indistinguishable.
Some more technical examples of, I'd guess, possible relevance in actual programs (assuming absence of * in Swiss German means ss), and a question:

- Ma* (measure, noun)
  Swiss German: Mass

- Ma*e (plural of Ma*)
  Swiss German: Masse

- Masse (German for mass)
  Swiss German: Masse
  (Notice how both Masse and Masse mean two very different, yet related things in Swiss German.)

- ma* (past simple of messen, verb (to measure))
  Swiss German: mass
  MASS: upper case spelling

Ada 95 has introduced Latin_1 such that ss and * are allowed to generate different identifiers. This seems consistent with the assumption that identifiers don't have an "=" operation of their own, really. Rather, comparison of identifiers just delegates to comparing components, presuming an identifier is nothing more than a subtype of some String type, and not a distinct Identifier type (with whatever representation).

Does ISO 10646, by referring to Unicode casing options and algorithms, suggest, then, that Ada should start considering identifiers as objects of a distinct type, one that is conceptually independent of its constituent parts?

****************************************************************

From: Adam Beneschan
Sent: Monday, December 20, 2010 10:03 AM

Question: When I studied German (in school, and partly from my father who is a native [Donauschwaben] German speaker), I learned that using ß was optional--i.e. you could always write "ss" instead of ß, but not the other way around (there are cases where ß is inappropriate). Is this correct? Also, I was under the impression that the rules for when ß could be used changed in recent years.

It may not be entirely an academic question, if it relates to an issue of what algorithm Ada should use for determining whether identifiers are identical.
****************************************************************

From: Randy Brukardt
Sent: Monday, December 20, 2010 12:22 PM

The entire set of examples using sharp-s was based on a mistaken understanding of the Unicode case folding mechanism. Ada 2005 mistakenly defines it as "convert to upper case", but the actual mapping is to *lower case*. Thus Ada 2005 is extremely confused on this. Sharp-s is always different than any other character in Unicode.

So any examples using sharp-s are irrelevant, at least as far as identifier equivalence is concerned. If you want to give examples using the Turkish dotless I, then there might be something to talk about...

****************************************************************

From: Randy Brukardt
Sent: Sunday, January 23, 2011 12:30 AM

Attached is a new version of AI05-0227-1, the case equivalence AI. This generally reflects the discussion of the ARG, and includes full wording, but there are two major differences.

First, I used "locale-independent simple case folding", rather than the full case folding that we had previously discussed. This happened because I noticed the following line in the case folding description when I was looking at the dotless I issues:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

This says that for "full case folding", Sharp S is the same as "ss". I obviously missed that when I looked at this table in the past. That's a non-starter for the compatibility reasons previously noted. I revised the discussion based on that.

Second, I did not add any rule about declarations in the same scope having distinct upper-case mappings. There are many reasons for this.

* The simple case folding has many fewer equivalences, so the problem is much less likely.

* There is nothing in the standard that requires the default External_Tag values to be in a particular case, so there is no issue with those (so long as a canonical representation is used).

* Wide_Wide_Exception_Name does have such a requirement, but given that the number of exceptions is typically low, and this is only debugging information (there is no routine that works like 'Value), it is hard to be concerned about what is obviously a pathology.

* This check is likely to be expensive; either it requires storing an extra representation of each identifier in upper case, only used for this check (wasteful of space) or doing the conversion on the fly (wasteful of time). Either way, the number of comparisons needed for each declaration is proportional to the number of declarations in the scope, meaning that the check is quadratic in the total number of declarations. It is not unusual to have thousands of declarations in a package (especially in automatically generated code), so this could be a problem. While the problem still could occur for enumeration type declarations, there are many fewer of those. In any case, if we are going to do this generally, why are the identifiers distinct in the first place?? It would be easier to just make them map to the same identifier value, then no check would be needed.

* Upper case mappings are not considered stable in Unicode. That is, future versions may be different. That could introduce incompatibilities into Ada where previously legal programs are now illegal because of upper case mappings. Again, this could happen for enumeration types, but that clearly is a much more limited incompatibility than "anywhere".

For all of these reasons, I only made a requirement for distinct upper case mappings for enumeration literals.

Please read the AI (especially the wording and discussion) and make any comments needed.

****************************************************************
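
The difference between full and simple case folding that the last message turns on (the F entry mapping 00DF to 0073 0073) can be checked with a short Python sketch. This is an illustration only, not Ada and not part of the AI. Python's `str.casefold` implements full case folding; Python's standard library has no simple case folding, so `approx_simple_fold` below is a hypothetical per-character approximation (keep a one-character fold, else a one-character lower-case form, else the character itself) and should not be read as the normative Unicode table.

```python
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S


def approx_simple_fold(text: str) -> str:
    """Approximate simple case folding: never change the length of the text.

    Assumption: a one-character full fold is also the simple fold; when the
    full fold expands, fall back to a one-character lower-case form, or to
    the character unchanged.
    """
    out = []
    for ch in text:
        folded = ch.casefold()
        if len(folded) == 1:
            out.append(folded)
        else:
            lowered = ch.lower()
            out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


# Full case folding: sharp s expands to "ss", so the two spellings collide.
print(("Y" + SHARP_S).casefold() == "Yss".casefold())

# Length-preserving folding: sharp s stays distinct from "ss".
print(approx_simple_fold("Y" + SHARP_S) == approx_simple_fold("Yss"))

# Ordinary case differences are still folded together.
print(approx_simple_fold("YSS") == approx_simple_fold("yss"))
```

The three lines print True, False and True. The middle result is the compatibility property that motivated choosing simple rather than full case folding.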