CVS difference for ais/ai-00285.txt

Differences between 1.12 and version 1.13

--- ais/ai-00285.txt	2003/08/04 23:50:30	1.12
+++ ais/ai-00285.txt	2003/09/19 01:42:27	1.13
@@ -3745,3 +3745,379 @@
+From: Kiyoshi Ishihata
+Sent: Monday, July 28, 2003  10:01 AM
+> The next meeting of Japanese SC22 is on July 18.  After that, I will
+> send you a brief report about our thought, but please understand that
+> this may be a tentative position.
+Sorry for the delay.  I summarized our discussion as follows.
+If our position is accepted, the AI should go through a major
+rewrite.
+(1) Do not refer to Unicode
+The current AI frequently refers to Unicode and the Web site of the
+Unicode Consortium.  It is not appropriate in ISO/IEC context.  Simply
+changing the word "Unicode" to "ISO/IEC 10646" is not enough, since
+the two systems differ more than you might think.
+Characters of 10646 and Unicode are identical, or at least intended to
+be identical.  Their code positions are the same.  However, the following
+Unicode products mentioned in this AI do not exist in the 10646 world.
+   Character categorization
+   Recommendation of character repertoire for identifiers
+   Normalization Form KC
+   Full case folding
+   Characters terminating lines
+These specifications are defined by the Unicode Consortium, and should
+not be regarded as internationally agreed standards.  We do not agree
+with the idea to define Ada rules based on these reports.
+Yes, some languages like Java and C# refer to these Unicode reports.
+But, they are "imported" languages in ISO/IEC standardization processes,
+different from Ada in this respect.
+(2) Recommendation of character repertoire for identifiers -- TR 10176
+The following Technical Report has been published by JTC1.
+   ISO/IEC TR 10176:2003
+   Information technology -- Guidelines for the preparation of
+   programming language standards (Fourth edition)
+This report contains a table of characters which should be made usable
+in identifiers (Annex A).  We believe that this document and ISO/IEC 10646
+itself are the only possible references in the Ada standard.
+The table of characters for identifiers is supposed to be identical to
+the above mentioned Unicode report.  In fact, the term "Unicode Character
+Database" does appear in Annex A of the TR.  The TR has been frequently
+revised, probably following the requests from Unicode people.
+Therefore, changing the reference from Unicode to TR 10176 does not
+significantly change the definition of character repertoire.
+The TR defines the character repertoire by enumerating all allowed
+characters.  In a formal sense, it does not depend upon the concept
+"character categorization".  Although Annex A of the TR gives categorization
+of each allowed character, the categorization itself does not contribute
+to the definition.
+A possible drawback of referring to 10176 is the lack of timeliness of
+revisions.  In the future, Unicode reports will be promptly revised.
+Compared to this, the revision process of 10176 would be slower.
+However, since 10176 is the only possible reference in the ISO/IEC world,
+we have no other options.
+Note that some languages including C++ and Cobol define characters
+for identifiers based on the recommendation of TR 10176.
+(3) No national variants of numeric literals
+We would not like to extend the character repertoire for numeric literals.
+Identifiers are used to denote entities, and often words or phrases of
+natural languages are used to compose identifiers.  Therefore, it is quite
+beneficial to use one's mother tongue in spelling identifiers.
+Numeric literals are much more universal.  They denote numeric values,
+and, unlike identifiers, do not have culture sensitive nuances in them.
+At least we Japanese are happy to write numeric literals only using
+ASCII characters.
+We understand that there is no Unicode report that recommends extending
+the characters allowed in numeric literals.
+Numeric literals denote values, and their values should be computed
+from the value denoted by each digit.  If national variants are allowed,
+people not knowing other countries' characters cannot compute the values
+of literals.  On the other hand, identifiers can be recognized by people
+of other countries through the process of pattern matching of characters.
+This is not easy, but anyway is possible.
+In summary, we believe that extending the characters for numeric literals
+does more harm than good.
+Of course, national variants of number representation may be useful in
+Input-Output.  But, this is a different issue.
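The argument in (3), that a literal's value is computed from the value of each digit, can be made concrete with a small sketch. It is written in Python purely for illustration (the thread itself contains no code), and it assumes the digit values recorded in the Unicode Character Database as exposed by Python's `unicodedata` module. The function name `literal_value` is hypothetical.

```python
import unicodedata


def literal_value(text: str) -> int:
    """Compute the value of a decimal literal from the value of each digit."""
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for a character that is
        # not a decimal digit in the Unicode Character Database.
        value = value * 10 + unicodedata.decimal(ch)
    return value


ascii_literal = "123"
arabic_indic_literal = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGIT ONE, TWO, THREE
```

Both spellings denote the same value, yet a reader who does not know the second script has no way to work that out by inspection, which is the concern raised above.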
+(4) No normalization
+We cannot refer to "Normalization Form KC" which is a Unicode term.
+Neither 10646 nor 10176 provides a substitute for this concept.  Therefore
+we cannot introduce the character normalization process.
+This is not bad, we think.  For example, the character "letter A with
+umlaut" is regarded as different from the combination of two characters
+"letter A" and "umlaut".  But, in the first place, it is not a good idea
+to have two representations of a single conceptual character.  People
+would try to define their own canonical representation of characters.
+Regarding "A with umlaut" and "A"+"umlaut" pair as different would not
+be a severe burden for them.
+Implementations would be much easier, since they can resort to simple
+byte-to-byte comparison.
+You say
+> This is to ensure that identifiers which look visually the same are
+> considered as identical, even if they are composed of different characters.
+but this principle is not strongly enforced.  The obvious example is
+Latin A and Greek Alpha.  They look identical but are distinguished
+in identifiers.  We think that they are inherently different characters
+and there are no reasons to consider them the same in identifiers.
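The two comparisons discussed in (4) can be sketched as follows. This is an illustrative Python fragment, not part of the AI. It assumes Normalization Form KC as implemented by Python's `unicodedata.normalize`, and the helper names are hypothetical.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
latin_a = "A"            # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"   # GREEK CAPITAL LETTER ALPHA


def same_without_normalization(a: str, b: str) -> bool:
    """Plain code-point comparison, as proposed in (4)."""
    return a == b


def same_under_nfkc(a: str, b: str) -> bool:
    """Comparison after applying Normalization Form KC to both operands."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

Without normalization the two spellings of "A with umlaut" are different identifiers; under NFKC they are the same. Latin A and Greek Alpha remain distinct in both cases, so normalization does not make every pair of look-alike characters equal.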
+(5) Uppercase-lowercase correspondence
+In Ada, we must have one particular normalization process, which is
+the uppercase-lowercase correspondence.  10176 does not say anything
+on this topic, so we have to devise some feasible definition.
+One possible way of definition is to utilize character names defined
+in ISO/IEC 10646.  We can see the obvious correspondence between
+"Latin capital letter A" and "Latin small letter A".  We do not know
+whether this can easily be implemented in Ada compilers or not.
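As a rough feasibility check of the name-based idea, the following Python sketch derives the lowercase counterpart of a character purely from its name. It assumes that the names returned by Python's `unicodedata` module (taken from the Unicode Character Database) coincide with the ISO/IEC 10646 character names. The function `lowercase_by_name` is hypothetical and says nothing about how an Ada compiler would implement the rule.

```python
import unicodedata
from typing import Optional


def lowercase_by_name(ch: str) -> Optional[str]:
    """Map 'X CAPITAL LETTER Y' to 'X SMALL LETTER Y' using names only."""
    name = unicodedata.name(ch, "")
    if " CAPITAL LETTER " not in name:
        return None  # not a capital letter according to its name
    small_name = name.replace(" CAPITAL LETTER ", " SMALL LETTER ", 1)
    try:
        return unicodedata.lookup(small_name)
    except KeyError:
        return None  # no character carries the derived name
```

For example, `lowercase_by_name("A")` yields `"a"` and `lowercase_by_name("\u0391")` yields the Greek small letter alpha, while a digit or an already-lowercase letter yields `None`.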
+We notice that there are cases not covered by this simple correspondence.
+For example, German "SS" corresponds to two lowercase sequences.  One
+is the string "ss", and the other is the es-zett character.  We feel that
+such complicated cases should be left untouched in this time frame, waiting for
+the future standardization of appropriate ISO/IEC standards or technical
+reports.
+(6) Miscellaneous
+> "JTC 1/SC 22 believes that programming languages should offer the appropriate
+> support for ISO/IEC 10646, and the Unicode character set where appropriate."
+I would like to have the reference information attached to this sentence.
+This is "Resolution 02-24:  Recommendation on Coded Character Sets Support"
+of SC22 2002 plenary.
+From: Pascal Leroy
+Sent: Wednesday, July 30, 2003  8:43 AM
+Thank you for the extensive feedback.  I will obviously need to give more
+thought to your comments, and we will need to discuss them at a meeting.
+However, clearly the most contentious issue is that of eliminating references
+to Unicode.  As I am sure you realize, Unicode has much more technical "meat"
+than 10646.  So the good thing about relying on the Unicode database and
+similar documents is that we can just say "the Unicode folks did the work for
+us, we trust that they know what they are doing".  After all, the Unicode
+consortium has invested numerous man-years in their recommendations, and we
+don't have the resources or the expertise to do similar work.
+As I see it, we have three options:
+1 - Do nothing, keep the language as it is.
+2 - Base support of 16- and 32-bit characters on Unicode.
+3 - Base support of 16- and 32-bit characters on 10646.
+Evidently option #1 is easier, and frankly as a vendor I have not seen a lot of
+interest for the existing 16-bit character support, so adding a sizeable
+implementation complexity is quite hard to justify from an economical point of
+view.  The problem with this option is that it might make SC22 unhappy.
+Option #2 is the simplest technically, as we can merely reference the Unicode
+documents, and avoid having to dig into the properties of each character.  But
+as you point out, it is not kosher for an ISO standard to reference a non-ISO
+document.  So politically it is probably not going to work.
+Option #3 is evidently ISO-compliant, but 10646 says very little regarding the
+properties of characters (other than their names and code points).  I realize
+that 10176 has a list of allowed characters, but then it's a TR so it has
+relatively little teeth.  Of course we could just do what 10176 does in its
+annex A, i.e. list all the characters that we allow (and the case-conversion
+tables, and possibly the normalization tables) but that would add 50 pages of
+gibberish to the RM.  The problem with this option is that it would take a lot
+of work, and it would probably degenerate into a cat fight about how case
+conversion or normalization or whatever ought to work.
+At this point I am going to consult with Jim to see how he thinks we should
+proceed.  If need be I'll refer the issue to WG9 to get guidance.
+From: Pascal Leroy
+Sent: Wednesday, August 6, 2003  10:20 AM
+>(4) No normalization
+>We cannot refer to "Normalization Form KC" which is a Unicode term.
+>Neither 10646 nor 10176 provides substitute for this concept.  Therefore
+>we cannot introduce the character normalization process.
+>This is not bad, we think.  For example, the character "letter A with
+>umlaut" is regarded different from the combination of two characters
+>"letter A" and "umlaut".  But, in the first place, it is not a good idea
+>to have two representations of a single conceptual character.  People
+>would try to define their own canonical representation of characters.
+>Regarding "A with umlaut" and "A"+"umlaut" pair as different would not
+>be a severe burden for them.
+>Implementations would be much easier, since they can resort to simple
+>byte-to-byte comparison.
+I have given more thought to normalization, and I believe that it is important
+for practical use.
+Ignore for a moment the issue of referencing Unicode.  Assume that we have no
+difficulties in describing normalization.  The question is: is normalization
+good for users?
+The problem I see is that when using a Unicode editor you have generally no
+idea how it represents a character internally.  When you type "letter A with
+umlaut" it may represent this with a single character or as "letter A" +
+"umlaut" and that's hidden from the user of the editor.  That's true regardless
+of whether you typed one or two characters on your keyboard.
+Now imagine the situation where two people write distinct compilation units
+with different editors (or maybe with different settings in a single editor).
+You might end up with the situation where the declaration of an entity has (in
+the file stored on disk) "letter A" + "umlaut" and the usage has "letter A with
+umlaut" (or vice-versa).  And that would be invisible to the user because both
+editors would merely display Ä.  In this situation, in order to avoid utmost
+confusion and bewilderment, I think it is necessary to specify that the
+compiler treats the two sequences the same.  That's the purpose of
+normalization.
+As Unicode editing is going to become more and more common in years to come,
+and editors will undoubtedly become more and more fancy, I think it's important
+to deal with usability issues like this one.  Incidentally, it seems to me that
+this issue is particularly important for Korean Hangul.
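The scenario above, where a declaration and a use are stored with different code point sequences, can be sketched as a symbol table that normalizes identifiers before storing or looking them up. This is a minimal Python illustration that assumes Normalization Form KC via `unicodedata.normalize`. The `SymbolTable` class and the sample identifiers are hypothetical, and case folding is deliberately left out.

```python
import unicodedata


class SymbolTable:
    """Toy symbol table whose keys are NFKC-normalized identifiers."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(identifier: str) -> str:
        # Different code point sequences for the same text map to one key.
        return unicodedata.normalize("NFKC", identifier)

    def declare(self, identifier: str, entity: str) -> None:
        self._entries[self._key(identifier)] = entity

    def lookup(self, identifier: str):
        return self._entries.get(self._key(identifier))


table = SymbolTable()
# Declared by an editor that stores "A" followed by a combining diaeresis.
table.declare("A\u0308nderung", "declared with A + combining diaeresis")
# Declared as three conjoining Hangul jamo rather than one syllable.
table.declare("\u1112\u1161\u11ab", "declared with three conjoining jamo")
```

A use site written with the precomposed letter (`"\u00c4nderung"`) or with the single precomposed Hangul syllable (`"\ud55c"`) then resolves to the same entity as the declaration, which is the behaviour argued for above.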
+From: Pascal Leroy
+Sent: Friday, August 8, 2003  4:39 AM
+Kiyoshi said:
+> In short, I do not agree.
+Fine.  Let's get the discussion started, then.  (I am not sending this on the
+ARG mailing list as I don't want to start an endless chatter about whether we
+should be doing this at all, etc.  At some point I'll want Randy to record our
+discussion, though, to make sure it gets appended to the AI.)
+> (1) design of character code
+> I believe that a single logical character should not have two
+> different representations.  If 10646 or Unicode have two
+> representations for A with umlaut, it is the fault of the
+> character code system.  It should be remedied.
+In an ideal world, you are evidently right.
+(Irrelevant comment here: in an ideal world, Latin-1, the character set for
+Western European languages, would be suitable for writing French.)
+The Unicode folks explain (and I agree with them) that the "right"
+representation is "letter A"+"umlaut".  The reason is that you have many
+diacritical signs used by existing languages (mostly based on the Latin
+alphabet) and that assigning code points to all the combinations is impractical
+(code points are a scarce resource, especially if you want commonly used
+languages to remain in the BMP).  Unicode currently has more than 110
+diacritical signs.  The Western European languages only use very few of these,
+and they mostly combine with vowels, but still that consumes most of the upper
+half of Latin-1.  Greek and Vietnamese, among others, can combine two
+diacriticals, and that's a sizeable number of code points.
+Now the Unicode documents explain that there are marginal languages (they
+mention Navajo) which make complex use of diacriticals and would require many
+more code points, for a very small community of users.  Using combining
+diacriticals is the right way to go for these languages.  And of course there
+may be particular applications where people want to create unanticipated
+combinations of characters (when I was trying to learn Chinese many many years
+ago, my textbook had a diacritical on top of each ideogram to indicate the
+tone; that would seem like a perfect application of combining diacriticals).
+Finally, there is the issue of fonts: developing a font that contains all the
+combinations like "letter A with umlaut" is expensive, and the resulting font
+is bulky.  Again, combining diacriticals are better.
+So why assign code points to characters like "letter A with umlaut" in the
+first place?  I suppose that the answer is compatibility to some extent
+(Latin-1 existed before Unicode, and you have to support files coded with
+Latin-1 with as little perturbation as possible), and political catfight to
+some extent (if German has specific code points, why not do the same for Polish
+or Greek; if you do it for Greek, why not Macedonian? etc., etc., ad nauseam).
+There may also have been a concern, when Unicode started, that uniformly using
+combining diacriticals would require more complex text handling algorithms,
+which would have been too costly for the computers of the time.
+> (2) role of programming language
+> Let's denote "A with umlaut" by A", and the sequence "A" and
+> umlaut by A+.  If one likes to search the character "A with
+> umlaut", he must perform two search operations, one with A"
+> and then with A+. This is very tedious, and if the target
+> string contains many such special characters, the operation
+> is nearly impossible.
+I entirely agree, but I would view this as a bug.  If you search for "letter A
+with umlaut", it should actually catch both representations (this is not hard to
+do, just normalize during search).  I noticed that Internet Explorer behaves as
+you describe, and this is really a pain as the two sequences look exactly the
+same.
+> So, my opinion is that normalization is not a role of
+> programming languages or compilers.  It should be performed
+> in some lower layers in order to maximize the convenience of
+> text file handling.
+Unfortunately, there are different forms of normalization, and Unicode
+recommends distinct normalization depending on whether the programming language
+is case-sensitive or not (I am not exactly sure why; I need to study this).  So
+the file system cannot do the normalization, it has to be done by a programming
+language tool, and the only one for which we can impose a behavior is the
+compiler.
+From: Pascal Leroy
+Sent: Tuesday, August 5, 2003  6:33 AM
+> At this point I am going to consult with Jim to see how he thinks we
+> should proceed.  If need be I'll refer the issue to WG9 to get guidance.
+I have talked to Jim.  He said that, from a procedural point of view, there is
+no intrinsic problem in referencing the Unicode standard in an ISO standard.
+All we need to do is some paperwork to justify the decision.  Of course, one or
+several countries could always vote "no" on the amendment because they don't
+like references to Unicode, but at least there is no procedural impossibility.
+I have also talked to John Benito, the convener of WG14 (C language).  The C
+folks are in the process of doing more-or-less what we are doing, only it's
+part of a technical report, not of an amendment.  They have been running into
+the exact same issue, i.e. opposition at the SC22 level from a number of
+countries which don't want to see references to Unicode (Japan, Canada, Norway
+and Germany are the countries he named).  John believes that it is nearly
+impossible to properly integrate 16- and 32-bit characters in his standard
+without referencing Unicode.  His plan is to try to convince the SC22
+delegations of the aforementioned countries that this issue is moot because
+Unicode and 10646:2003 should be indistinguishable (and 10646:2003 references
+Unicode anyway).
