!standard 2.1(14.2/2)          05-02-07 AI95-00395/04
!standard 1.1.4(14.1/2)
!standard 2.3(1.1/2)
!standard 2.3(5.2/2)
!standard 2.9(2)
!standard 3.5.2(3.2/2)
!standard 4.1.4(3)
!standard 4.1.4(5)
!standard A.4.8(1)
!class amendment 05-01-25
!status work item 05-01-25
!status received 05-01-25
!priority High
!difficulty Easy
!subject Various clarifications regarding 16- and 32-bit characters
!summary
(See proposal.)
!problem
1 - The characters in category other_format are generally not displayed. The syntax rule for identifier would make it possible to have an identifier that includes two underlines separated by an other_format character, which would visually look like two consecutive underlines. Similarly for trailing underlines, or for identifiers that would look like reserved words. Is this intended? (No.)
2 - The character at position 16#AD#, SOFT HYPHEN, is in category other_format. It was allowed in Ada 95 in literals, but the current wording means that it's no longer allowed, which introduces an incompatibility. Is this intended? (No.)
3 - Many places in normative text talk about "upper case" without qualification. This is somewhat ambiguous in the Unicode world.
4 - The definition of the image of non-graphic wide characters results in long strings like "Character_12345678". This increases the Width attribute for Wide_String and Wide_Wide_String for no good reason.
5 - AI-302-3 defines Ada.Strings.Wide_Hash and Ada.Strings.Wide_Unbounded.Wide_Hash; there should be Wide_Wide versions of these as well. Similarly, the addition of AI-362 to A.4.7 needs to be made in A.4.8.
6 - Ada.Strings.Wide_Maps.Wide_Constants and Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants define Upper_Case_Map and Lower_Case_Map. What is their effect?
7 - Wide_Wide_Character has 2**31 values. Therefore its size is 31. Because Wide_Wide_String is a packed array, each of its components should occupy 31 bits. Is this intended? (No.)
8 - ISO/IEC 10646:2003 reserves positions 16#FFFE# and 16#FFFF# of each plane, but the AARM only mentions the BMP. Is this intended? (No.)
9 - Ada compilers will have a mechanism for locale-independent case folding and character classification. It seems wrong to not allow Ada users to use these facilities.
!proposal
1 - After removing the other_format characters, an identifier must not violate the "usual" rules about underlines. It must not be a reserved word, either. Also, other_format characters are allowed (but ignored) in reserved words, and in "special" attribute designators.
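For illustration only (not part of the proposal), here are two tokens that the revised rules exclude as identifiers; the ["xx"] bracket notation from the appendix is used to make the invisible character visible (16#AD#, SOFT HYPHEN, is in category other_format):

   X_["AD"]_Y       -- illegal: after eliminating the soft hyphen, two
                    -- consecutive punctuation_connector characters remain
   pro["AD"]tected  -- after eliminating the soft hyphen, this is the reserved
                    -- word "protected", so it cannot be used as an identifier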
2 - The incompatibility doesn't seem justified. While Unicode recommends that other_format characters be ignored in identifiers, it doesn't say anything about other constructs. ECMA C#, which we used as a guideline in resolving some of the character issues, allows them in string literals. Hopefully decent program editors will provide a way to display these characters. Note that some languages allow any character in string literals. We do not want to go that far; in particular we do not want to allow control characters. They have been disallowed for 20 years, and there is no indication that users have had any problem with that. We are just avoiding an incompatibility. We must also specify the effect of other_format characters in operator symbols.
3 - We are not going to fix all these places. Currently we only have a rule in 2.3, but it surely doesn't cover all the occurrences of "upper case", so it would be better to have a blanket statement somewhere in section 1.
4 - Change the language-defined names to keep the current value of Width (which is 12): the string "Chr_" followed by 8 hexadecimal digits is 12 characters long, whereas "Character_" followed by 8 digits is 18.
5 - Add Ada.Strings.Hash and Ada.Strings.Unbounded.Hash to A.4.8's list of packages. Add A.4.7(46.1/2) to A.4.8.
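For illustration only (not wording), a minimal sketch of how the new functions would typically be used, assuming Strings.Wide_Wide_Hash has a profile parallel to Strings.Hash (a Wide_Wide_String parameter returning Containers.Hash_Type):

   with Ada.Containers.Indefinite_Hashed_Maps;
   with Ada.Strings.Wide_Wide_Hash;
   package Hash_Demo is
      --  A map keyed by Wide_Wide_String, hashed with the proposed function.
      package Name_Maps is new Ada.Containers.Indefinite_Hashed_Maps
        (Key_Type        => Wide_Wide_String,
         Element_Type    => Integer,
         Hash            => Ada.Strings.Wide_Wide_Hash,
         Equivalent_Keys => "=");
   end Hash_Demo;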
6 - Ada.Strings.Wide_Maps.Wide_Constants defines Upper_Case_Map and Lower_Case_Map in terms of Ada.Strings.Maps.Constants. A.4.7(48) makes it clear that this is intended. Changing their definition would be inconsistent with Ada 95 - programs would behave differently with no compile-time indication.
Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants is defined in terms of Ada.Strings.Wide_Maps.Wide_Constants, but the note A.4.7(48) was not carried over. So it is unclear whether these are just copies (which is what the normative wording implies) or whether they cover the full range.
Covering the full range seems inconsistent with Wide_Maps. Case folding is not necessarily 1-to-1; therefore, these mappings are inappropriate for 32-bit characters anyway. Therefore, we stay consistent with Wide_Constants and add the Note to A.4.8.
[Editor's note: Robert Dewar has suggested that these notes need to be normative now, because we are now defining case conversion rules for the entire character set. I tend to agree.]
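For illustration only (not wording), a sketch of the behavior implied by the note added to A.4.8 below, assuming that Wide_Wide_Fixed.Translate and Wide_Wide_Constants.Upper_Case_Map have profiles parallel to their Wide_ counterparts:

   with Ada.Strings.Wide_Wide_Fixed;
   with Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants;
   procedure Map_Demo is
      use Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants;
      E_Acute : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#E9#);
      Alpha   : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#3B1#);
      --  The first character (within the Character portion) is mapped to its
      --  upper case form; the second (GREEK SMALL LETTER ALPHA, outside that
      --  portion) is returned unchanged, the mapping being the identity there.
      Result  : constant Wide_Wide_String :=
        Ada.Strings.Wide_Wide_Fixed.Translate ((E_Acute, Alpha), Upper_Case_Map);
   begin
      null;
   end Map_Demo;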
7 - There are two ways to fix this issue: add a size clause for 32 bits; or add another 2**31 literals to Wide_Character. The former has the drawback that some operations involving low-level programming (Unchecked_Conversion, C interfacing) may become erroneous. The latter has the drawback that Wide_Wide_Character does not properly model the 10646 character set, and therefore programmers who care about internationalization have to deal with the 2**31 extra values; in particular, a signed integer type on a 32-bit machine cannot hold the Pos of a Wide_Wide_Character. This AI was written for the first option.
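For illustration only, a minimal sketch of the kind of low-level code that the first option keeps straightforward (the names here are illustrative, not taken from the AI):

   with Interfaces;
   with Ada.Unchecked_Conversion;
   procedure Size_Demo is
      --  With Wide_Wide_Character'Size = 32, source and target have the same
      --  size, so this common interfacing idiom raises no size-mismatch
      --  concerns; with a 31-bit size it would be questionable.
      function To_Unsigned is new Ada.Unchecked_Conversion
        (Source => Wide_Wide_Character, Target => Interfaces.Unsigned_32);
      Code : constant Interfaces.Unsigned_32 :=
        To_Unsigned (Wide_Wide_Character'Val (16#1D11E#));
   begin
      null;
   end Size_Demo;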
8 - Add wording to cover positions 16#FFFE# and 16#FFFF# of each plane. Note that while the characters at positions 16#0000FFFE# and 16#0000FFFF# have the language-defined names FFFE and FFFF, the other "forbidden" characters don't have special language-defined names.
9 - Full case mapping and wide character categorization require hefty run-time tables, so it would be inappropriate to add them to Ada.Characters.Handling. Therefore, we would need to add Ada.Wide_Characters.Handling and Ada.Wide_Wide_Characters.Handling. It seems too late to do this. [Editor's note: I don't feel that way; packages are relatively easy to define, and errors are unlikely to cause trouble elsewhere.]
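For illustration of the preceding point only: a hypothetical sketch of what such a package might look like, modeled on Ada.Characters.Handling and assuming a parent package Ada.Wide_Wide_Characters; the specific declarations are assumptions, not proposed wording.

   package Ada.Wide_Wide_Characters.Handling is
      pragma Preelaborate (Handling);

      function Is_Letter (Item : Wide_Wide_Character) return Boolean;
      function Is_Digit  (Item : Wide_Wide_Character) return Boolean;
      function Is_Upper  (Item : Wide_Wide_Character) return Boolean;
      function Is_Lower  (Item : Wide_Wide_Character) return Boolean;

      function To_Upper (Item : Wide_Wide_Character) return Wide_Wide_Character;
      function To_Lower (Item : Wide_Wide_Character) return Wide_Wide_Character;

      function To_Upper (Item : Wide_Wide_String) return Wide_Wide_String;
      function To_Lower (Item : Wide_Wide_String) return Wide_Wide_String;
   end Ada.Wide_Wide_Characters.Handling;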
!wording
1 -
Change 2.3(1.1/2-4) to read:
identifier ::= identifier_start {identifier_start | identifier_extend}

identifier_start ::= letter_uppercase |
                     letter_lowercase |
                     letter_titlecase |
                     letter_modifier |
                     letter_other |
                     number_letter

identifier_extend ::= mark_non_spacing |
                      mark_spacing_combining |
                      number_decimal_digit |
                      punctuation_connector |
                      other_format
After eliminating the characters in category other_format, an identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category.
Add after 2.3(5.2/2):
After applying these transformations, an identifier shall not be identical to [the upper case version of] a reserved word.
Replace 2.9(2/2) with:
reserved_word ::= identifier_start {identifier_start | other_format}
After eliminating the characters in category other_format and converting the remaining sequence of characters to upper case, a reserved word shall be identical to the upper case version of one of the following words:
Replace 4.1.4(3) with:
attribute_designator ::= identifier [(static_expression)] | reserved_word
Add after 4.1.4(5):
A reserved word used as an attribute_designator shall be one of Access, Delta, or Digits.
2 -
Change 2.1(14/2) to read:
graphic_character
Any character which is not in the categories other_control,
other_private_use, other_surrogate, format_effector, and whose code position is neither 16#FFFE# nor 16#FFFF#.
Change 6.1(10) to read:
The sequence of characters in an operator_symbol shall be identical, after conversion to upper case, to the upper case version of the sequence of characters for one of the six classes of operators defined in 4.5 (spaces and characters in category other_format are not allowed).
3 -
Add after 1.1.4(14.1/2):
When this International Standard mentions the upper case version of some character or sequence of characters, it means the character or sequence of characters obtained by using locale-independent full case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2003.
AARM Note: For sequences of characters, case folding is applied to the sequence, not to individual characters; this can make a difference. For example, the upper case version of LATIN SMALL LETTER SHARP S (16#DF#) is the two-character sequence "SS", which cannot be obtained by converting the characters one at a time.
Change 2.3(5.2/2) to read:
o The remaining sequence of characters is converted to upper case.
4 -
In 3.5.2(3.2/2), replace:
... the string "Character_" ...
by:
... the string "Chr_" ...
5 -
In A.4.8(1/2), add the Strings.Wide_Wide_Hash and Strings.Wide_Wide_Unbounded.Wide_Wide_Hash functions.
In A.4.8(28/2), add the Strings.Hash and Strings.Unbounded.Hash functions.
Add after A.4.8(45/2): Pragma Pure is replaced by pragma Preelaborate in Strings.Wide_Wide_Maps.Wide_Wide_Constants.
6 -
Add a note to A.4.8:
Each Wide_Wide_Character_Set constant in the package Strings.Wide_Wide_Maps.Wide_Wide_Constants contains no values outside the Character portion of Wide_Wide_Character. Similarly, each Wide_Wide_Character_Mapping constant in this package is the identity mapping when applied to any element outside the Character portion of Wide_Wide_Character. Use Ada.Characters.Wide_Wide_Handling to do more general case mapping.
7 -
Add in A.1, after the declaration of Wide_Wide_Character:
for Wide_Wide_Character'Size use 32;
8 -
Change 2.1(1/2) to read:
The character repertoire for the text of an Ada program consists of the collection of characters described by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. This collection is organized in planes, each plane comprising 65536 characters.
Change 2.1(3.1/2) to read:
A character is any character defined within ISO/IEC 10646:2003 other than those whose code position is 16#FFFE# or 16#FFFF# in each plane.
Change the second sentence of 2.1(4/2) to read:
The characters whose code position is 16#FFFE# or 16#FFFF# in each plane are not allowed anywhere in the text of a program.
Change 2.1(14/2) to read (that's in addition to the changes for this paragraph above):
graphic_character
Any character which is not in the categories other_control,
other_private_use, other_surrogate, format_effector, and whose code position is neither 16#FFFE# nor 16#FFFF# in any plane.
9 -
We're not defining anything for this question.
!discussion
(See proposal.)
!example
--!corrigendum
!ACATS test
ACATS C-Test(s) should be created to test these rules.
!appendix

From: Robert Dewar
Sent: Sunday, January 23, 2005  12:24 PM

The grammar as it is now allows identifiers to contain the sequence

   underline  other-format-character  underline

Now the normal way of handling other-format-character internally would be
to simply ignore it, but then we end up internally with an identifier
with two underscores in it. That's a real pain, since we assume that
two underscores is reserved. I really think this is undesirable for
other reasons, since other-format-character often corresponds to
something not visible, such as formatting information, and you end
up with an identifier that has two visible underscores in a row.

I would recommend we modify the grammar in AI195 to eliminate this
unpleasant possibility.

Note that the current rules also allow an identifier to effectively
end with an underscore (by ending with the sequence
underscore other-format-character) but not to begin with an underscore.

I know the standard is written in terms of how to compare identifiers,
but in fact I think many compilers will work as GNAT does, by
canonicalizing identifiers as they are scanned.

P.S. for those who don't want to go rummaging in the AI, other-format
characters include stuff like:

   invisible separator
   soft hyphen
   zero width non-joiner
   zero width no-break space
   tag space
   language tag

****************************************************************

From: Pascal Leroy
Sent: Monday, January 24, 2005  4:56 AM

(I suppose you mean AI 285, not AI 195, btw.)

I fully agree, I didn't realize that the syntax as written did allow for
two (visibly) consecutive underscores, or for trailing underscores.  It
was never my intent to allow that.  The other_format characters need to be
integrated in the BNF for identifier so that they don't interrupt an
identifier, but being typically invisible they should not be usable to
circumvent the presentation rules that we know and love.

It might be possible to fix the BNF to account for this rule, but I think
it would be clearer to add a syntax rule in English like:

"After eliminating the characters in category other_format, an identifier
shall not contain two consecutive characters in category
punctuation_connector, or end with a character in that category."

****************************************************************

From: Robert Dewar
Sent: Monday, January 25, 2005  7:29 AM

> (I suppose you mean AI 285, not AI 195, btw.)

Yes indeed, sorry about that misprint

> It might be possible to fix the BNF to account for this rule, but I think
> it would be clearer to add a syntax rule in English like:
>
> "After eliminating the characters in category other_format, an identifier
> shall not contain two consecutive characters in category
> punctuation_connector, or end with a character in that category."

I agree, it is a bit tricky (not impossible, but messy) to do this in BNF.
Note that once this sentence is added, you can simplify the grammar to:

    identifier ::= identifier_start {identifier_start | identifier_extend}
    identifier_start ::= letter_uppercase |
                         letter_lowercase |
                         letter_titlecase |
                         letter_modifier |
                         letter_other |
                         number_letter
    identifier_extend ::= mark_non_spacing |
                          mark_spacing_combining |
                          number_decimal_digit |
                          punctuation_connector |
                          other_format

which is exactly the grammar that annex 7 of UAX #15 recommends. So that's
nice. We adopt exactly the Unicode recommendation, with an extra sentence
giving the restriction that we decide to add.

****************************************************************

From: Pascal Leroy
Sent: Monday, January 24, 2005  7:45 AM

Excellent point!  So this is even better, we don't look like we add our
own inventions on top of Unicode.

****************************************************************

From: Dan Eilers
Sent: Monday, January 24, 2005  1:14 PM

AI 285 says:
> The characters in the category other_format are effectively ignored in most
> lexical elements, with the exception that they are illegal in string_literals
> and character_literals.

Is the intent that other-format characters will be allowed in other lexical
elements, such as reserved words, numeric literals, and compound delimiters?

It seems a little strange to be gumming up the works of lexical analyzers
by allowing certain formatting characters inside certain lexemes.

****************************************************************

From: Robert Dewar
Sent: Monday, January 24, 2005  1:22 PM

> Is the intent that other-format characters will be allowed in other lexical
> elements, such as reserved words, numeric literals, and compound delimiters?

I don't know the intent, but the rules are clear, other-format characters are
allowed ONLY in identifiers.
>
> It seems a little strange to be gumming up the works of lexical analyzers
> by allowing certain formatting characters inside certain lexemes.

Well it's not that hard to implement, but it does seem odd.

****************************************************************

From: Dan Eilers
Sent: Monday, January 24, 2005  2:00 PM

Then the erroneous wording in AI 285 needs to be changed from:

  The characters in the category other_format are effectively ignored in most
  lexical elements, with the exception that they are illegal in string_literals
  and character_literals.

to:

  The characters in the category other_format are illegal in all lexical
  elements except identifiers (and maybe comments).


> > It seems a little strange to be gumming up the works of lexical analyzers
> > by allowing certain formatting characters inside certain lexemes.
>
> Well it's not that hard to implement, but it does seem odd.

Our lexical analyzer processes reserved words and identifiers together,
so it will have more of an impact.

Are there any users chomping at the bit to put formatting characters
in their identifiers?  If not, it seems unwise to slow down lexical
processing for everybody else on the off chance that someone might
eventually find some use for this.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005   2:23 PM

> Then the erroneous wording in AI 285 needs to be changed from:
>
>   The characters in the category other_format are effectively ignored in most
>   lexical elements, with the exception that they are illegal in string_literals
>   and character_literals.

Well I see this wording, but I don't see anything else in the AI to
back up this position. I really don't want to have to allow these
junk characters in the middle of :=

>   The characters in the category other_format are illegal in all lexical
>   elements except identifiers (and maybe comments)

Let's add reserved words, and for sure absolutely anything should be allowed
in a comment except an end of line, which terminates the comment.

> Our lexical analyzer processes reserved words and identifiers together,
> so it will have more of an impact.

Ah ha, you are right, my current implementation is ignoring these other
format characters in reserved words, and it would be a huge pain to fix
this. It also is bizarre to allow these in identifiers and not in reserved
words. I also think it would be horrible (really an extension of my
double underline point) to allow identifiers that are visually identical
to reserved words, differing only in invisible format characters. I can
even see programmers misusing this when they really really want to use
a reserved word as an identifier, UGH!

> Are there any users chomping at the bit to put formatting characters
> in their identifiers?  If not, it seems unwise to slow down lexical
> processing for everybody else on the off chance that someone might
> eventually find some use for this.

Well there is merit in following the recommendations of the standard.

****************************************************************

From: Randy Brukardt
Sent: Monday, January 24, 2005  2:22 PM

> Are there any users chomping at the bit to put formatting characters
> in their identifiers?  If not, it seems unwise to slow down lexical
> processing for everybody else on the off chance that someone might
> eventually find some use for this.

My understanding of the intent is that we are trying to match (within
reason) the Unicode recommendations for program identifiers. The presumption
is that the Unicode people know more about character sets than we ever will,
so it is best to follow their lead.

I personally don't think that there are many who are "chomping at the bit"
to use any Unicode characters in identifiers. So it would be impossible to
predict what the users that do want such identifiers will want. Simply
allowing Unicode characters in identifiers is going to slow down the lexing
(as with Dan's implementation, Janus/Ada processes ids and reserved words
together), and I doubt that the particulars of the allowed characters will
make much difference.

****************************************************************

From: Robert Dewar
Sent: Monday, January 24, 2005  3:12 PM

I don't think allowing unicode characters in identifiers slows things
down significantly in practice, at least not with the approach we take
(which you can look at if you like :-)

I do think making a distinction between identifiers and keywords is a
huge menace and we should fix this. This has nothing to do with the
standard really, it is perfectly appropriate to apply the unicode
recommendations for identifiers to keywords, regarding the notion
in unicode of identifier to be more general and subsume keywords.

It's really so much easier to simply ignore the format effectors as
you store the identifier in the first place.

****************************************************************

From: Randy Brukardt
Sent: Monday, January 24, 2005  4:08 PM

> I don't think allowing unicode characters in identifiers slows things
> down significantly in practice, at least not with the approach we take
> (which you can look at if you like :-)

I think that the table lookups (which can't be pure array indexing like it
is now) will slow things down somewhat. But I don't think it will be a major
issue.

> I do think making a distinction between identifiers and keywords is a
> huge menace and we should fix this. This has nothing to do with the
> standard really, it is perfectly appropriate to apply the unicode
> recommendations for identifiers to keywords, regarding the notion
> in unicode of identifier to be more general and subsume keywords.

I certainly agree with you here, and didn't mean to give the impression that
I didn't.

> It's really so much easier to simply ignore the format effectors as
> you store the identifier in the first place.

Yes, if they don't change equality, they certainly would be ignored. In
which case, they need to be allowed in reserved words. Especially because we
don't want the abuse someone suggested about sticking invisible characters
into a keyword to make it an identifier. I'm not quite sure what wording
change is needed, however.

****************************************************************

From: Robert Dewar
Sent: Monday, January 24, 2005  9:35 PM

Randy Brukardt wrote:

> I think that the table lookups (which can't be pure array indexing like it
> is now) will slow things down somewhat. But I don't think it will be a major
> issue.

But you only look up in the tables if you have a wide character, so you
can't say that this slows things down. It is true that having to check
for letters etc is slower than the approach GNAT took before which was
to allow any wide characters in identifiers, but neither approach slows
things down for identifiers not containing wide characters.

...
>>It's really so much easier to simply ignore the format effectors as
>>you store the identifier in the first place.
>
> Yes, if they don't change equality, they certainly would be ignored. In
> which case, they need to be allowed in reserved words. Especially because we
> don't want the abuse someone suggested about sticking invisible characters
> into a keyword to make it an identifier. I'm not quite sure what wording
> change is needed, however.

The main thing is to agree that there will be no ACATS tests that test for
this anomaly :-)

****************************************************************

From: Randy Brukardt
Sent: Monday, January 24, 2005  10:25 PM

You have to figure out the character class of every character somehow; that
certainly includes the Latin-1 characters. You can test for wide characters
first, then do different lookups for wide and non-wide, or you can think of
the lookup as a single operation, in which case the lookup is complicated by
handling wide characters.
The code is probably essentially the same either way, and it's clearly slower
for handling wide characters outside of literals.

****************************************************************

From: Robert Dewar
Sent: Monday, January 24, 2005  11:37 PM

> You have to figure out the character class of every character somehow; that
> certainly includes the Latin-1 characters. You can test for wide characters
> first, then do different lookups for wide and non-wide, or you can think of
> the lookup as a single operation, in which case the lookup is complicated by
> handling wide characters.

That's a really bad idea to do it as a single lookup

> The code is probably essentially the same either way, and it's clearly slower
> for handling wide characters outside of literals.

In GNAT, there really is zero penalty here. The way things are done is to
have an identifier table of valid identifier characters. If wide characters
are allowed, then depending on the encoding, all upper half characters are
not in this table, triggering an exit from identifier scanning, at which
point you do the appropriate tests for wide characters. But in practice
in the real world, 99.9% of all identifiers are in the lower half of
ASCII anyway.

Programs with characters in the upper half are either UTF-8 encoded or
they are not. If they are not, then the only triggering characters are
ESC (e.g. for Shift-JIS) or '[' for brackets, but those are not valid
identifier characters in any case. If such programs are UTF-8 coded,
then you have to decode anyway.

I really don't see *any* penalty *at all* here. I invite you to look at
the GNAT code, and explain why there is any penalty whatever.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 25, 2005  12:15 AM

Interesting. Sounds to me like you traded off maintainable code for
performance (certainly a justifiable trade-off in some cases, and this quite
possibly is one of them). Given the number of places that would have to do
special processing (not just identifiers, but white space, literals, and
comments), it seems like a nightmare. In fact, AI-285 *is* a nightmare, any
way you slice it. It affects *everything*, and little of it in simple ways.
Sigh.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  6:54 AM

Randy Brukardt wrote:

> Interesting. Sounds to me like you traded off maintainable code for
> performance (certainly a justifiable trade-off in some cases, and this quite
> possibly is one of them).

Well I find the code very nicely maintainable, because it is mostly
table driven (see csets in the GNAT sources for the tables for all
the many character sets for identifiers supported by GNAT).

Lexical analyzers are such a trivial part of a
compiler anyway, and yes, it is very much worthwhile worrying
about speed here :-)

> Given the number of places that would have to do
> special processing (not just identifiers, but white space, literals, and
> comments), it seems like a nightmare. In fact, AI-285 *is* a nightmare, any
> way you slice it. It affects *everything*, and little of it in simple ways.
> Sigh.

Nightmare seems a bit strong, I have taken about six days to do everything
except the pretty mechanical Wide_Wide packages. True, that is longer than
most other 2005 features :-)

I do agree that in terms of effort-to-value ratio, this one is
vanishingly small. Particularly because of so many edge cases.
Quite a chunk of the time was taken in dealing with the very
annoying case of

Width/Wide_Width/Wide_Wide_Width applied to dynamic subtypes of
String/Wide_String/Wide_Wide_String (nine cases now instead of
only four before).

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  12:03 AM

OK, here is another puzzle.

What is the status of Soft Hyphen?
The database entry is

00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;;

Meaning that this is Other, Format, and therefore not a graphic
character. So is this character *really* excluded from string
literals? That seems like quite a surprising incompatibility,
and indeed causes failure of an ACATS test:

with Ada.Characters.Latin_1;
package C250002_["C1"] is

   type Enum is ( Item, 'A', '["AD"]', AE_["C6"]["E6"]_ae,
                  '["2D"]', '["FF"]' );

   task type C2_["C2"] is
     entry C2_["C3"];
   end C2_["C2"];

end C250002_["C1"];

So ????

****************************************************************

From: Pascal Leroy
Sent: Tuesday, January 25, 2005  3:34 AM

My reply to Dan and Robert's comments.

Dan:
> Is the intent that other-format characters will be allowed in
> other lexical elements, such as reserved words, numeric
> literals, and compound delimiters?

Hmm, that's interesting. I originally wrote the AI by stating that
other_format are first stripped out of the program text, and that the rest
of section 2 applied to the "clean" text.  However, this caused an
incompatibility which was thought to be unpleasant: such a character
appearing in a string literal would be OK in Ada 95, but would silently
disappear in Ada 2005.  So we decided to make other_format illegal in
character and string literals, to detect the problem at compile time.  But
obviously I did a lousy job of fixing the rest of the AI.

Robert:
> I don't know the intent, but the rules are clear,
> other-format characters are allowed ONLY in identifiers.

And comments (I think that's covered by the current wording).  This could
be changed (to allow them pretty much everywhere) but I am reluctant to do
this at this point, as it would seem to hair up both the RM and the
implementations.  Plus, this AI is approved by SC22 (yes, I mean SC22, not
WG9) so we should only fix serious problems, not do cosmetic changes.
Finally, the main use of other_format is to control the presentation of
text, so it's only in comments and identifiers that they make any sense.

Robert:
> Well I see this wording, but I don't see anything else in the
> AI to back up this position. I really don't want to have to
> allow these junk characters in the middle of :=

Agreed.  This is non-normative text in the AI anyway, so we can safely
ignore it ;-)  Note that the Unicode recommendations are not entirely
clear as to what should be done with other_format in programming languages
(except for identifiers, where we strictly follow the recommendations).

Robert:
> It also is bizarre to allow these
> in identifiers and not in reserved words. I also think it
> would be horrible (really an extension of my double underline
> point) to allow identifiers that are visually identical to
> reserved words, differing only in invisible format
> characters. I can even see programmers misusing this when
> they really really want to use a reserved word as an identifier, UGH!

For sure we don't want this.  I'll add wording to make sure that this is
taken care of.

Robert:
> So this is annoying, we have introduced a non upwards
> compatibility by forcing compilers to go to a lot of effort
> to forbid a curious set of wide characters in string
> literals, just to cause people trouble who run into this
> silly rule in existing programs.

Wake up, this AI does create a myriad little incompatibilities, just
because many characters that were classified as graphic characters are not
anymore.  This was well understood by the ARG during the discussion of the
AI, and we tried to make it so that incompatibilities would be detected at
compile-time most of the time.

Robert:
> That's just drawn from the database, but I am a little bit
> unsure of this table. What is the category of codes which
> simply have no definition at all in the table. I assume they
> are not excluded, since otherwise why are FFFE and FFFF
> specially treated.

Right, this was done on purpose: graphic_character is defined by exclusion
(any character not in category...) so that the characters which are not
classified yet (by Unicode) are considered graphic characters, and can
therefore be used in string literals.  On the other hand, they are neither
letters nor digits, so they cannot be used in identifiers.  This is
essentially what Ada 95 did with respect to the wide characters.

Robert:
> What is the status of Soft Hyphen?
> The database entry is
>
> 00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;;
>
> Meaning that this is Other, Format, and therefore not a
> graphic character. So is this character *really* excluded
> from string literals? That seems like quite a surprising
> incompatibility...

Yes, it is excluded from string and character literals, and yes, this is
an incompatibility, but as I explained above, one which is detected at
compile time.  Again, we were aware of this, and sorry, this is not the
worst incompatibility that comes with Ada 2005.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  7:13 AM

Yes, but we introduce incompatibilities if there is a good reason
to do so. Here there is no good reason at all that I can see except
appeal to some notion of uniformity that has nothing to do with Ada.
I don't find this worth implementing, so this is a place where GNAT
will quite deliberately not conform. If anyone ever wants to validate
(not clear that this will ever happen), we can put this under control
of some silly pedantic switch. In fact there is an argument for putting
the entire graphic-in-string stuff under such a switch.

Perhaps it would be nice to have a collected list of all
incompatibilities especially since some of them are considered
bad by Pascal (not sure what he is referring to, since in general
we have had little trouble in that area). The A5 case is the first
time I have seen tests fail so far.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  7:18 AM

oops, I mean AD case :-)

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  7:31 AM

Let me be a little clearer on why I think it is such a bad
mistake to exclude AD from string literals.

In practice, Ada programs are run in many different
environments where the graphics associated with the
upper half have nothing to do with international standards
or with anything in the Ada standard (e.g. various windows
character sets).

Ada programs work just fine in such environments, provided
that the compiler and rules do not get in the way.

Yes, 10646 thinks AD is a soft hyphen, but in my XP
environment, it comes out as an upside down exclamation
point.

It really seems annoying to tell an Ada programmer working
on XP that you can freely deal with all the upper half
graphics in the range A0-FE, except for AD.

I don't mind so much the changes in wide character stuff,
since no one uses this anyway (we know because of bug
reports that show that no one ran into things which
were pretty fundamental for many years).

But Ada programs working with 8-bit chars in various
character sets are all over the place.

Now a counter argument in the XP case is that 80-9F
are also graphic characters in windows. True, but
this is a (somewhat annoying) restriction that Ada
programmers are used to and have worked around, but
the AD exclusion is new and annoying, and simply
makes no sense whatever in many environments.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  7:38 AM

Actually, now that I think of it, once you get into the
switch business, you might as well allow 80-9F in string
and character literals when not in pedantic mode. This
would make working under windows much easier.

****************************************************************

From: Pascal Leroy
Sent: Tuesday, January 25, 2005  7:39 AM

> Yes, but we introduce incompatibilities if there is a good
> reason to do so. Here there is no good reason at all that I
> can see except...

The reason is that we have a *mandate* from SC22 to support Unicode, er, I
mean, ISO/IEC 10646:2003.  There is no way that we could get the Amendment
past SC22 without this.

> Perhaps it would be nice to have a collected list of all
> incompatibilities especially since some of them are
> considered bad by Pascal (not sure what he is referring to,
> since in general we have had little trouble in that area).
> The A5 case is the first time I have seen tests fail so far.

The AARM that is being prepared for Ada 2005 has a fairly extensive list
of incompatibilities, much like the AARM for Ada 95.  It turns out that
this one is not mentioned, and I agree that it should, but I still think
it's a rather unimportant incompatibility.  At any rate, an implementation
is free to have a nonstandard mode where it deviates from the syntax rules
spelled out in the RM.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  8:17 AM

It's fine to support unicode. How can a mandate to support
unicode be interpreted as a mandate to NOT support something.
We support all valid unicode stuff, where does it say in the
unicode standard that we are required to reject AD in string
literals, I don't see it.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 25, 2005 12:32 PM

> Perhaps it would be nice to have a collected list of all
> incompatibilities especially since some of them are considered
> bad by Pascal (not sure what he is referring to, since in general
> we have had little trouble in that area). The A5 case is the first
> time I have seen tests fail so far.

I've tried to identify all incompatibilities and inconsistencies in the
AARM, in the same way that it was done for Ada 95. It would be possible to
extract those (mechanically or otherwise) to provide a short document.

There is a similar list of "extensions" to Ada 95.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 25, 2005 12:45 PM

> It really seems annoying to tell an Ada programmer working
> on XP that you can freely deal with all the upper half
> graphics in the range A0-FE, except for AD.

Well, I sympathize, but can't get too excited about this. But I'm more
concerned about the basic idea: why can't soft hyphen be used in string
literals? It's commonly used (the AARM is full of them) and it generally has
a display representation (else you couldn't edit it). A program that
generated AARM text could have many soft hyphens in strings and character
literals; it seems like another case of Nanny Ada:

     "Wide_[AD]Wide_[AD]Text_IO"

is a whole lot clearer than

     "Wide_" & Character'Val(16#AD#) & "Wide_" Character'Val(16#AD#) &
"Text_IO"

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  1:44 PM

>      "Wide_[AD]Wide_[AD]Text_IO"
>
> is a whole lot clearer than
>
>      "Wide_" & Character'Val(16#AD#) & "Wide_" Character'Val(16#AD#) &
> "Text_IO"

That comparison is not quite fair, it should be

 >      "Wide_["AD"]Wide_["AD"]Text_IO"
 >
 > compared to:
 >
 >      "Wide_" & SH & "Wide_" SH & "Text_IO"

And indeed I rather prefer the second one here I must say.

But I think it should be the programmer's choice.

The argument in favor of not allowing soft hyphens is
presumably that if you type in Unicode (whatever that
means), and display in unicode, then the soft hyphens
will be invisible in a program listing, which seems
a worry. Of course if you use brackets notation, all
is well (another reason for not being so down on
brackets notation :-)

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 25, 2005  2:51 PM

Well, that assumes that you use a use clause for package Latin_1; I would
not do that personally because it isn't a package I use frequently enough.
And the first one is more complex than it would be in practice: you'd just
insert the proper character in your editor. I wrote it with the brackets
notation only so I could send it in e-mail.

> But I think it should be the programmer's choice.
>
> The argument in favor of not allowing soft hyphens is
> presumably that if you type in Unicode (whatever that
> means), and display in unicode, then the soft hyphens
> will be invisible in a program listing, which seems
> a worry. Of course if you use brackets notation, all
> is well (another reason for not being so down on
> brackets notation :-)

I know, but that seems to me to be saying that we want the language to work
with the crappiest possible tools. Any Unicode programming editor that
didn't provide a way to show "hidden" characters would be pretty worthless.
(Word does that - which is hardly a programming editor - and I generally
leave the hidden characters displayed there.)

These rules made sense in 1980, when everything was in 7-bit ASCII (if you
were lucky); it's 25 years later now, and everything is done graphically
with rich fonts. Prohibiting tabs and soft hyphens simply because some
ancient editors can't display them is silly.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  3:56 PM

I couldn't agree more! For my taste, I would allow AD in string literals
and characters, and also allow all wide characters in literals. It seems
to me that 10646 is about supporting use of wide characters, not making
it hard by introducing unnecessary restrictions.

If there are good and sufficient reasons to avoid some character in
some particular environments, then please let's allow the programmer
to make this decision and not try to second guess requirements.

****************************************************************

From: Robert A. Duff
Sent: Tuesday, January 25, 2005  3:40 PM

> Prohibiting tabs and soft hyphens simply because some
> ancient editors can't display them is silly.

Well, I think tabs are an abomination that should never have been
invented.  I don't even think they should be allowed in *whitespace*
in Ada programs, much less string literals!  But...

But I tend to agree with Randy's sentiment, here.  If some character has
a reasonable use, as suggested by Randy at least for soft hyphens, it
seems like a shame to forbid it in the language definition.  If you
don't like tabs or soft hyphens or whatever, make it a project-wide
coding convention, and enforce it using a script as part of your CM
system or something like that.

****************************************************************

From: Robert Dewar
Sent: Tuesday, January 25, 2005  5:30 PM

I agree with the Robert Duff who wrote the third, permissive, paragraph, and
I disagree with the Robert Duff who wrote the second, non-permissive para :-)

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 25, 2005  5:47 PM

> Well, I think tabs are an abomination that should never have been
> invented.  I don't even think they should be allowed in *whitespace*
> in Ada programs, much less string literals!  But...

The people who designed HTML agreed with you; they left out tabs. Now, try
to get free-form text to line up properly (Ada syntax productions come to
mind). Luckily for us, we make printed copies of the AARM from PDF derived
from RTF, which has no such restrictions. Programming certainly needs tabs
(especially when it is using a readable, non-fixed width font). Now the
implementation of tabs often sucks...

Anyway, back to your regularly scheduled language feature debate, already in
progress. :-)

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, January 26, 2005  2:32 AM

> If there are good and sufficient reasons to avoid some character in
> some particular environments, then please let's allow the programmer
> to make this decision and not try to second guess requirements.
>
Which seems to beg for a configuration pragma Restriction
(Basic_Character_Set_Only) ...

****************************************************************

From: Pascal Leroy
Sent: Wednesday, January 26, 2005  4:30 AM

In reply to Bob, Randy and Robert:

First, a political comment.  Irrespective of the technical issues, the
topic of character set is a very delicate one politically.  Jim and I were
very concerned that it could cause a catfight at the SC22 level that would
derail the Amendment process, with potentially devastating consequences.
So at the Palma WG9 meeting we decided to send to SC22 a summary of AI 285
to get a stamp of approval well in advance of the vote on the entire
Amendment, so as to avoid ending up in a quagmire.

Thanks to the support of Kiyoshi and Steve M., our proposal was approved
by SC22, so we are on pretty firm ground now.  However, by following this
process, we have pretty much committed to not making substantial changes
to AI 285.  Otherwise there will be someone in SC22 who will think that we
are cheating, and the gates of hell will open.

I wished Bob, Randy, Robert and others had read the AI at that time,
because we should really have had this discussion before sending the AI to
SC22.  Note that I am *not* trying to use this argument to quench the
discussion, but I think we should only be doing minimal changes to the AI
at this point.  It's OK to say "we discovered an unintended consequence of
the write-up, we are fixing it"; it's another kettle of fish to say "well,
we really changed our mind on this entire business".

Now for the technical discussion.  I do not feel very strongly about
other_format in literals, but my intuition is to be conservative because
we have so little experience with programming in Unicode.  On the other
hand, I am noticing that Java and C# allow anything (including control
characters) in string literals.  On the third hand I'm not sure these
languages are models that we want to follow.

Specific comments below.

Robert:
> That comparison is not quite fair, it should be
>
>      "Wide_["AD"]Wide_["AD"]Text_IO"
>
> compared to:
>
>      "Wide_" & SH & "Wide_" SH & "Text_IO"
>
> And indeed I rather prefer the second one here I must say.
>
> The argument in favor of not allowing soft hyphens is
> presumably that if you type in Unicode (whatever that means),
> and display in unicode, then the soft hyphens will be
> invisible in a program listing, which seems a worry.

Well surely the second idiom is preferable because the bracket notation is
not defined by the language, and is therefore not a portable syntax ;-)
The first piece of program text doesn't even parse with a compiler (like
ours) that doesn't support the bracket notation, regardless of the soft
hyphen issue.

And yes, the rationale for not allowing other_format characters in
literals is that they may or may not print (or be displayed), and they may
alter the presentation of the literals in surprising ways (details below).

Randy:
> I know, but that seems to me to be saying that we want the
> language to work with the crappiest possible tools. Any
> Unicode programming editor that didn't provide a way to show
> "hidden" characters would be pretty worthless. (Word does
> that - which is hardly a programming editor - and I generally
> leave the hidden characters displayed there.)
>
> These rules made sense in 1980, when everything was in 7-bit
> ASCII (if you were lucky); it's 25 years later now, and
> everything is done graphically with rich fonts. Prohibiting
> tabs and soft hyphens simply because some ancient editors
> can't display them is silly.

You cannot have it both ways, Randy.  A few days ago you agreed with
Robert that we should disallow other_format characters that would be used
to write an identifier that looks like a reserved word (e.g., pro-tected
where the hyphen is a soft hyphen).  Now you say that anything goes
because surely people will be able to display all these funny characters.
But then you should not be bothered by pro-tected.

None of us has much experience with Unicode editors (whatever that means),
so I think we should err on the side of caution. It is not a simple matter
of displaying the characters, by the way.  These characters typically have
some semantics for displaying text.  For instance a soft-hyphen indicates
a place where the editor can fold the line.  I for one would be annoyed if
my editor folded the line in the middle of a string literal.

Another case that gave me headaches is this: among the other_format
characters are some that change the display direction.  Even if you have
an editor that displays these characters, it's unclear how you would
interpret what you see on the glass.  Compare for instance:

	"a" & Right2Left & "bc" & Left2Right & "d" -- unambiguous, good
old Ada

	"a[Right2Left]bc[Left2Right]d" -- Is it bc or cb in the string?
Depends on whether the formatting characters are interpreted by the
editor.

I have read enough of the Unicode standard to realize that this is an
extremely complicated area, and again, I'd rather be conservative.  (Just
out of curiosity, has anyone other than Robert looked at Unicode?)

Bob:
> But I tend to agree with Randy's sentiment, here.  If some
> character has a reasonable use, as suggested by Randy at
> least for soft hyphens, it seems like a shame to forbid it in
> the language definition.  If you don't like tabs or soft
> hyphens or whatever, make it a project-wide coding
> convention, and enforce it using a script as part of your CM
> system or something like that.

This is the Bob I totally disagree with.  Let's make the language very
lax, and people will implement coding conventions on top of it if they
like.

This is really a flexibility vs. safety tradeoff.  However, I for one have
a hard time evaluating the safety impact, i.e., the confusion that may
stem from having literals that are not wysiwyg.  Perhaps I am overstating
the problem.  But I don't think you can just ignore it.

****************************************************************

From: Robert A. Duff
Sent: Wednesday, January 26, 2005  1:10 PM

In reply to Pascal:

> In reply to Bob, Randy and Robert:
>
> First, a political comment.  Irrespective of the technical issues, the
> topic of character set is a very delicate one politically.

I'll defer to you on the political issues.  If you say we should leave
it as is to avoid rocking the boat, that's fine with me.

You ought to think about whether people motivated by political concerns
can discern the difference between minor and major changes.  For all I
know, no change at all is politically acceptable.

> I wished Bob, Randy, Robert and others had read the AI at that time,
> because we should really have had this discussion before sending the AI to
> SC22.

Well, I did look at it at the time, but my eyes glazed over, and my
review was therefore useless.  ;-)  I'm only picking up on Robert's
comments, and Robert apparently didn't notice these issues until he
started to implement the thing.

>...(Just
> out of curiosity, has anyone other than Robert looked at Unicode?)

A little bit, but again, my eyes glaze over.

> Bob:
> > But I tend to agree with Randy's sentiment, here.  If some
> > character has a reasonable use, as suggested by Randy at
> > least for soft hyphens, it seems like a shame to forbid it in
> > the language definition.  If you don't like tabs or soft
> > hyphens or whatever, make it a project-wide coding
> > convention, and enforce it using a script as part of your CM
> > system or something like that.
>
> This is the Bob I totally disagree with.  Let's make the language very
> lax, and people will implement coding conventions on top of it if they
> like.

I didn't state that as a general principle -- it just seems reasonable
in this case.  But I'd be happy either way (I mainly stick to 7-bit
ASCII for my own code!).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 26, 2005  3:40 PM

> I wished Bob, Randy, Robert and others had read the AI at that time,
> because we should really have had this discussion before sending the AI to
> SC22.  Note that I am *not* trying to use this argument to quench the
> discussion, but I think we should only be doing minimal changes to the AI
> at this point.  It's OK to say "we discovered an unintended consequence of
> the write-up, we are fixing it"; it's another kettle of fish to say "well,
> we really changed our mind on this entire business".

I doubt very much that SC22 cares what characters are allowed vs. not
allowed in string literals. That was never the point of the political
discussion. In any case, the AI did not point out the incompatibility, and I
for one didn't think of it (as you point out, it's not documented in the
AARM).

We should always look at incompatibilities carefully to see if they are
justified. This one, IMHO, does not seem to be.

...
> Randy:
> > I know, but that seems to me to be saying that we want the
> > language to work with the crappiest possible tools. Any
> > Unicode programming editor that didn't provide a way to show
> > "hidden" characters would be pretty worthless. (Word does
> > that - which is hardly a programming editor - and I generally
> > leave the hidden characters displayed there.)
> >
> > These rules made sense in 1980, when everything was in 7-bit
> > ASCII (if you were lucky); it's 25 years later now, and
> > everything is done graphically with rich fonts. Prohibiting
> > tabs and soft hyphens simply because some ancient editors
> > can't display them is silly.
>
> You cannot have it both ways, Randy.  A few days ago you agreed with
> Robert that we should disallow other_format characters that would be used
> to write an identifier that looks like a reserved word (e.g., pro-tected
> where the hyphen is a soft hyphen).

I don't remember ever agreeing with any such thing. My understanding of
Robert's position is that we have to allow other-format in reserved words,
because otherwise the processing of identifiers (of which reserved words are
a subset) is substantially complicated. That's what I agreed with; you seem
to be taking the opposite approach.

> Now you say that anything goes
> because surely people will be able to display all these funny characters.
> But then you should not be bothered by pro-tected.

I'm not, and never was.

> None of us has much experience with Unicode editors (whatever that means),
> so I think we should err on the side of caution. It is not a simple matter
> of displaying the characters, by the way.  These characters typically have
> some semantics for displaying text.  For instance a soft-hyphen indicates
> a place where the editor can fold the line.  I for one would be annoyed if
> my editor folded the line in the middle of a string literal.

A Unicode programming editor clearly will not do such things in hidden-text
mode. A general purpose word processor is not appropriate for editing
programs now, and I very much doubt that will change.

One of my objections to AI-388 is that it essentially forces implementations
to create Unicode programming editors (since an editor that can't display
the predefined packages is junk), and that is certainly a non-trivial
task -- and one for which off-the-shelf support is quite scanty. Probably
most would simply support a subset of Unicode (graphic characters and a few
well-used other-formats, and little else).

...
> I have read enough of the Unicode standard to realize that this is an
> extremely complicated area, and again, I'd rather be conservative.  (Just
> out of curiosity, has anyone other than Robert looked at Unicode?)

I don't doubt it. I'd be happy to simply except Soft-Hyphen and Tab from the
existing rules, and stop there. I'm less concerned about other ones.

Another option would be to allow them, and give an implementation-permission
to not allow problematic other-formats, private-use, etc. in strings. We
generally don't talk about source code formats, and it is there that there
is a problem, not in the language definition.

Worrying about what editors might or might not do is purely a function of
the source formats and the tools, and that is way out of bounds for the
language. Having restrictions on strings because some editor somewhere might
not work right is pretty silly.

Even if we went all the way and specified that canonical Ada source be given
in UTF-8, we could hardly force editors and tools to be able to handle all
possible source. So the problem is the assumption that every tool can handle
every possible Ada program. Once you realize that is impractical in a
Unicode world, there really remains no important reason for restrictions *in
the language*. The restrictions (if they need to exist) are *in the tools*.
The standard needs to recognize that there must be the possibility of
character restrictions in the tools; once it does so, there is no need to
restrict character or string literals or comments *in the standard*.
(Identifiers are a whole different kettle of fish, of course, and that was
the area that was so contentious in SC22.)

****************************************************************

From: Pascal Leroy
Sent: Thursday, January 27, 2005  3:02 AM

> We should always look at incompatibilities carefully to see
> if they are justified. This one, IMHO, does not seem to be.

Fine.  At any rate I'll write an AI to discuss this in Paris.

> I don't remember ever agreeing with any such thing. My
> understanding of Robert's position is that we have to allow
> other-format in reserved words, because otherwise the
> processing of identifiers (of which reserved words are a
> subset) is substantially complicated. That's what I agreed
> with; you seem to be taking the opposite approach.

I think we are actually in agreement.  My view is that when the tokenizer
reads a "word" it first removes all the other_format characters, and then
checks to see if it's a reserved word (after conversion to upper case) or
an identifier (in which case it checks for double underscores and the
like).  So pro-tected would be a reserved word, not an identifier.
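
To make that order of operations concrete, here is a rough sketch (all the
names below, the one-character other_format test and the one-entry
reserved-word table are made up for illustration only; this is not RM
wording, nor anybody's compiler code):

   procedure Classify_Demo is

      type Token_Kind is (Reserved_Word, Identifier, Illegal);

      Soft_Hyphen : constant Wide_Wide_Character :=
        Wide_Wide_Character'Val (16#00AD#);

      --  Stand-in for the full other_format test (SOFT HYPHEN only).
      function Is_Other_Format (C : Wide_Wide_Character) return Boolean is
      begin
         return C = Soft_Hyphen;
      end Is_Other_Format;

      --  Stand-in for proper case folding (ASCII letters only).
      function To_Upper (C : Wide_Wide_Character) return Wide_Wide_Character is
         P : constant Natural := Wide_Wide_Character'Pos (C);
      begin
         if P in 16#61# .. 16#7A# then                   --  'a' .. 'z'
            return Wide_Wide_Character'Val (P - 16#20#);
         else
            return C;
         end if;
      end To_Upper;

      function Classify (Word : Wide_Wide_String) return Token_Kind is
         Stripped : Wide_Wide_String (1 .. Word'Length);
         Last     : Natural := 0;
      begin
         --  Step 1: remove every other_format character, folding case.
         for I in Word'Range loop
            if not Is_Other_Format (Word (I)) then
               Last := Last + 1;
               Stripped (Last) := To_Upper (Word (I));
            end if;
         end loop;

         --  Step 2: reserved word?  (One-entry table for the demo.)
         if Stripped (1 .. Last) = "PROTECTED" then
            return Reserved_Word;
         end if;

         --  Step 3: identifier rules, applied to the stripped spelling:
         --  no leading, trailing or doubled underlines.
         if Last = 0
           or else Stripped (1) = '_'
           or else Stripped (Last) = '_'
         then
            return Illegal;
         end if;

         for I in 2 .. Last loop
            if Stripped (I) = '_' and then Stripped (I - 1) = '_' then
               return Illegal;
            end if;
         end loop;

         return Identifier;
      end Classify;

   begin
      --  "pro<soft hyphen>tected" comes out as the reserved word, and
      --  "A_<soft hyphen>_B" is still illegal (it strips to "A__B").
      if Classify ("pro" & Soft_Hyphen & "tected") /= Reserved_Word
        or else Classify ("A_" & Soft_Hyphen & "_B") /= Illegal
      then
         raise Program_Error;
      end if;
   end Classify_Demo;

The point is simply that both the reserved-word test and the underline
checks look at the stripped spelling.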

I feel quite strongly about this, btw, because this seems to align with
the Unicode recommendations, and to match what other languages are doing.

> I don't doubt it. I'd be happy to simply except Soft-Hyphen
> and Tab from the existing rules, and stop there. I'm less
> concerned about other ones.

I would be very much opposed to picking characters one by one.  We should
take entire categories of characters, as defined by Unicode, if only
because we don't have a good understanding of the purpose of all these
weird 16- and 32-bit characters.  We trust that the Unicode folks got the
categorization right.

I could live with other_format in literals.  I'd rather not include tabs
(which would effectively mean allow everything except format_effectors)
because we have all been bitten by tabs at one point or another in our
life.  Remember, other_format is the only category that creates an
incompatibility.

****************************************************************

From: Robert Dewar
Sent: Thursday, January 27, 2005  3:58 PM

Robert A Duff wrote:

> Well, I did look at it at the time, but my eyes glazed over, and my
> review was therefore useless.  ;-)  I'm only picking up on Robert's
> comments, and Robert apparently didn't notice these issues until he
> started to implement the thing.

Exactly, you don't really dig into the details till you look at
them.

I think it is essential that we fix things to have the same basic
syntax for keywords and identifiers.

I think it would be nice to be more permissive in string and
character literals, given that

   a) this makes the language far more convenient to use
   b) other languages faced with the same decision have gone
         in that direction
   c) it avoids a completely unnecessary non-upwards compatibility.

If I had my way, I would also do away with the case equivalence (I have
fully implemented it, so this is not to ease the implementation
burden in GNAT :-).

The reason is that proper case equivalence processing is unavoidably
locale dependent.

It is simply too peculiar that a Turkish Ada programmer finds that
dotted i is folded incorrectly to capital I without a dot. This
means that of the identifiers

      Capital I with dot
      Lower case I with dot
      Lower case I without dot
      Capital I without dot

the first is distinct from the last three, which is just weird. I
am sure that there are other locale dependent weirdnesses like this.
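
For reference, the code points involved, and what locale-independent
simple upper-casing does with them (my summary of the Unicode tables,
so double-check it before relying on it):

      16#0130#  LATIN CAPITAL LETTER I WITH DOT ABOVE  -- maps to itself
      16#0069#  LATIN SMALL LETTER I                   -- maps to 16#0049#
      16#0131#  LATIN SMALL LETTER DOTLESS I           -- maps to 16#0049#
      16#0049#  LATIN CAPITAL LETTER I                 -- maps to itself

Turkish usage would instead pair 16#0069# with 16#0130#, and 16#0131#
with 16#0049#.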

But we can live with this if necessary. Good Ada style is never to
take advantage of the case equivalence in any case.

P.S. Pascal's tables in the AI for letters and numbers are significantly
wrong. If people are interested, I am happy to post GNAT's understanding
of the Unicode categorizations :-)

****************************************************************

From: Robert Dewar
Sent: Thursday, January 27, 2005  4:15 PM

>>You cannot have it both ways, Randy.  A few days ago you agreed with
>>Robert that we should disallow other_format characters that would be used
>>to write an identifier that looks like a reserved word (e.g., pro-tected
>>where the hyphen is a soft hyphen).

I never said anything of the kind. It is fine to allow other_format
stuff in identifiers, following the recommendations.

What is not OK is allowing

underline (some soft junk) underline

I think the prohibition against two underlines (more generally,
against two punctuation_connector class characters) should apply
AFTER the soft junk is stripped, not before.

****************************************************************

From: Robert Dewar
Sent: Thursday, January 27, 2005  4:20 PM

> I think we are actually in agreement.  My view is that when the tokenizer
> reads a "word" it first removes all the other_format characters, and then
> checks to see if it's a reserved word (after conversion to upper case) or
> an identifier (in which case it checks for double underscores and the
> like).  So pro-tected would be a reserved word, not an identifier.

Yes, that's right, I agree also.
>
> I feel quite strongly about this, btw, because this seems to align with
> the Unicode recommendations, and to match what other languages are doing.

Not sure about the "strongly", since I think this is non-critical,
but certainly I agree.

>>I don't doubt it. I'd be happy to simply except Soft-Hyphen
>>and Tab from the existing rules, and stop there. I'm less
>>concerned about other ones.

That's because you have not looked through them, and/or you
simply don't know what they are. Here is the list:

    UTF_32_Other_Format : constant UTF_32_Ranges := (
      (16#000AD#, 16#000AD#),  -- SOFT HYPHEN .. SOFT HYPHEN
      (16#00600#, 16#00603#),  -- ARABIC NUMBER SIGN .. ARABIC SIGN SAFHA
      (16#006DD#, 16#006DD#),  -- ARABIC END OF AYAH .. ARABIC END OF AYAH
      (16#0070F#, 16#0070F#),  -- SYRIAC ABBREVIATION MARK .. SYRIAC ABBREVIATION MARK
      (16#017B4#, 16#017B5#),  -- KHMER VOWEL INHERENT AQ .. KHMER VOWEL INHERENT AA
      (16#0200C#, 16#0200F#),  -- ZERO WIDTH NON-JOINER .. RIGHT-TO-LEFT MARK
      (16#0202A#, 16#0202E#),  -- LEFT-TO-RIGHT EMBEDDING .. RIGHT-TO-LEFT OVERRIDE
      (16#02060#, 16#02063#),  -- WORD JOINER .. INVISIBLE SEPARATOR
      (16#0206A#, 16#0206F#),  -- INHIBIT SYMMETRIC SWAPPING .. NOMINAL DIGIT SHAPES
      (16#0FEFF#, 16#0FEFF#),  -- ZERO WIDTH NO-BREAK SPACE .. ZERO WIDTH NO-BREAK SPACE
      (16#0FFF9#, 16#0FFFB#),  -- INTERLINEAR ANNOTATION ANCHOR .. INTERLINEAR ANNOTATION TERMINATOR
      (16#1D173#, 16#1D17A#),  -- MUSICAL SYMBOL BEGIN BEAM .. MUSICAL SYMBOL END PHRASE
      (16#E0001#, 16#E0001#),  -- LANGUAGE TAG .. LANGUAGE TAG
      (16#E0020#, 16#E007F#)); -- TAG SPACE .. CANCEL TAG

Why on earth would you suggest treating zero width no-break space or
invisible separator in a manner different from Soft Hyphen?  (Not sure
what tab has to do with this; it is not an other_format character.)

> I would be very much opposed to picking characters one by one.  We should
> take entire categories of characters, as defined by Unicode, if only
> because we don't have a good understanding of the purpose of all these
> weird 16- and 32-bit characters.  We trust that the Unicode folks got the
> categorization right.

Exactly, I agree

> I could live with other_format in literals.  I'd rather not include tabs
> (which would effectively mean allow everything except format_effectors)
> because we have all been bitten by tabs at one point or another in our
> life.  Remember, other_format is the only category that creates an
> incompatibility.

Right, I would allow everything except format effectors; that's an old
Ada tradition, and I think it is fine to extend it to separator_line
and separator_paragraph.
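
(For the record: as far as I can tell from the Unicode tables, those two
categories each contain exactly one character,

      (16#02028#, 16#02028#),  -- LINE SEPARATOR .. LINE SEPARATOR
      (16#02029#, 16#02029#)   -- PARAGRAPH SEPARATOR .. PARAGRAPH SEPARATOR

in the notation of the table above.)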

Think for a moment of an environment where Line Separator is used
routinely to end lines; you really do NOT want to allow this in
string literals as well.

****************************************************************

From: Pascal Leroy
Sent: Friday, January 28, 2005  3:29 AM

> But we can live with this if necessary. Good Ada style is
> never to take advantage of the case equivalence in any case.

I noticed that issue while reviewing the AARM, and added an IA to say
roughly "if you target a culture where some locale-dependent case folding
rule is more appropriate, by all means, provide a nonstandard mode that
supports this case folding".  Being an IA, it doesn't impose anything on
implementations, but it draws their attention to the fact that other case
folding rules exist, and it also tells users that it's a legitimate request
that they may want to bring up with their vendor (provided that they are
willing to put a laaarge number of Turkish Lira on the table ;-)

As far as I know only Turkish and Lithuanian have that problem (well,
ancient Greek too, but I don't expect many people to program in ancient
Greek).

> P.S. Pascal's tables in the AI for letters and numbers are
> significantly wrong. If people are interested, I am happy to
> post GNAT's understanding of the unicode categorizations :-)

I know, I know, but please don't post your tables.  If I read any piece
of code from GNAT my brain gets polluted by public domain software, and
IBM won't let me work on Apex anymore (it does sound silly, but it's
absolutely true).

****************************************************************

From: Robert Dewar
Sent: Friday, January 28, 2005  4:21 AM

> know, but please don't post your tables.  If I read any piece of code from
> GNAT my brain gets polluted by public domain software, and IBM won't let
> me work on Apex anymore (it does sound silly, but it's absolutely true).

That is indeed complete nonsense, given that this is under the GMGPL, and you
can perfectly well incorporate it into Apex with no legal problems
whatever.

I really think it would be better to have an agreed-on set of tables
that we all use. Uniformity of implementations is more important than
the junk rules we have agreed to implement!

****************************************************************

From: Robert Dewar
Sent: Friday, January 28, 2005  4:23 AM

By the way, NO PART of GNAT is in the public domain; please do not spread
this seriously incorrect notion. All our software is copyrighted,
and we object to people trying to dilute our copyright by claiming that
our software is in the public domain. Thanks for being careful on this
point in the future.

For sure, you have to check licensing conditions. In this case, the GNAT
code is under the GMGPL precisely so that other proprietary implementations
can share the tables if they wish.

****************************************************************

From: Pascal Leroy
Sent: Friday, January 28, 2005  5:12 AM

I realize that, and "public domain" was just a convenient shorthand,
although I should have been more precise/careful.   Sorry about that.

> For sure, you have to check licensing conditions. In this
> case, the GNAT code is under the GMGPL precisely so that
> other proprietary implementations can share the tables if they wish.

I understand that, but I was not actually kidding: we had to remove
libraries covered by the LGPL from some of our products because the lawyers
objected to them.

Anyway, this is getting totally off-topic...

****************************************************************

