CVS difference for ai05s/ai05-0079-1.txt
--- ai05s/ai05-0079-1.txt 2007/12/08 06:20:27 1.1
+++ ai05s/ai05-0079-1.txt 2007/12/13 04:39:38 1.2
@@ -14,7 +14,7 @@
!question
-A standard convention is to start a file with a zero width
+(1) A standard convention is to start a file with a zero width
non-breaking space character, 16#0000_FEFF#. By looking at the
first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE)
or UTF-32 (LE/BE) encoded. That's the convention that Windows XP
@@ -31,46 +31,96 @@
be any problem with allowing these more generally. Should these
characters be allowed? (Yes.)
+(2) Recent versions of the Unicode technical report on identifiers
+(TR31: http://www.unicode.org/reports/tr31/) say that characters
+in class other_format should not be allowed in programming language
+identifiers for security reasons. (This changed since the Amendment
+started standardization.)
+
+The problem is basically that these characters can be used
+to write an identifier that looks like Foo_Bar but is actually
+Fo<Right-To-Left>aB_o<Left-To-Right>r. Stripping other_format results
+in FoaB_or, which is what the compiler sees, and lo and behold, I have
+introduced a vulnerability in a source file that looks perfectly
+kosher on your screen.
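
The spoofing scenario described above can be reproduced in a few lines. This is only an illustrative sketch: it assumes that <Right-To-Left> and <Left-To-Right> stand for the override characters U+202E and U+202D (the question does not name the exact code points), and the helper name strip_other_format is hypothetical. It uses the Unicode general category Cf as the stand-in for other_format.

```python
import unicodedata

# Assumption: the bidirectional controls in the question are the override
# characters; both have Unicode general category Cf (other_format).
RIGHT_TO_LEFT = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LEFT_TO_RIGHT = "\u202d"  # LEFT-TO-RIGHT OVERRIDE


def strip_other_format(identifier: str) -> str:
    """Return the identifier with every Cf (other_format) character removed."""
    return "".join(ch for ch in identifier if unicodedata.category(ch) != "Cf")


# The text as stored in the source file; a bidi-aware display may render
# the middle run reversed, so that it looks like Foo_Bar.
spoofed = "Fo" + RIGHT_TO_LEFT + "aB_o" + LEFT_TO_RIGHT + "r"

# What remains once other_format characters are ignored.
seen_by_compiler = strip_other_format(spoofed)
```

Under these assumptions, seen_by_compiler holds the six letters and underscore in storage order, not in the order a reader would see on screen.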
+
+Should the definition of Ada be changed? (No.)
+
!recommendation
(See Summary.)
!wording
+Add at the end of 2.1(4/2):
+
+The only characters allowed outside of comments are those in categories
+other_format, format_effector and graphic_character.
+
Add a new paragraph after 2.2(7):
One or more other_format characters are allowed anywhere that a separator
-is allowed[; any such characters have no effect on the meaning of an Ada program].
+is[; any such characters have no effect on the meaning of an Ada program].
[Editor's note: These brackets indicate text that would be marked as redundant in
the AARM.]
+AARM Ramification: It is not possible to have other_format characters immediately
+following an identifier: any other_format characters appearing at that place
+are part of the identifier. Such characters do not alter the meaning of the
+identifier anyway.
+
+
Add an AARM implementation note after 2.1(18.a):
In order to process the ACATS, an implementation will have to have the ability to
process Latin-1 and UTF-8 formatted files. UTF-8 files by convention start with
-the character zero width no-break space (16#0000FEFF#); Latin-1 Ada source files
-do not start with these characters (the ZWNBS is encoded as 16#EF# 16#BB# 16#BF#;
-the last two characters are not legal in Ada programs outside of comments). That
-means it is possible for a compiler to determine which of these file formats
-are used without operator intervention.
+the character zero width no-break space (16#0000FEFF#), also known as byte order
+mark (BOM); Latin-1 Ada source files do not start with these characters (the BOM
+is encoded as 16#EF# 16#BB# 16#BF# for UTF-8; the last two characters are not legal
+in Ada programs outside of comments). That means it is possible for a compiler to
+determine which of these file formats are used without operator intervention.
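
The detection described in this note might be sketched as follows. This is a minimal illustration, not a description of any actual compiler: the function names are hypothetical, only the two formats named in the note are considered, and dropping the BOM after detection is a choice made for this sketch.

```python
# UTF-8 encoding of the byte order mark, as given in the note above.
UTF8_BOM = b"\xef\xbb\xbf"


def guess_source_encoding(data: bytes) -> str:
    """Guess between UTF-8 and Latin-1 by looking only at the leading bytes."""
    return "utf-8" if data.startswith(UTF8_BOM) else "latin-1"


def decode_source(data: bytes) -> str:
    """Decode source bytes; the BOM itself is dropped (a choice of this sketch)."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):].decode("utf-8")
    return data.decode("latin-1")
```

A UTF-8 file written without a BOM would be read as Latin-1 by this sketch; the note only covers files that follow the convention.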
!discussion
-
-There is one semi-weird side-effect of this wording. If [ZNBS] represents zero width
-no-break space, then if it appears in a compound delimiter:
- /[ZNBS]=
+We need wording to say that characters not explicitly allowed are prohibited in
+programs. This doesn't have that much force, since such characters can be defined
+as part of the source representation. But the goal is that a file encoded in UTF-8
+can be mapped directly to the language rules, without any source representation games.
+
+There is one semi-weird side-effect of this wording. If [ZWS] represents zero width
+space, then if it appears in a compound delimiter:
+
+ /[ZWS]=
+
+the program ought to be illegal, as other_format characters are not allowed between
+the parts of a compound delimiter (this is not before or after a lexical element).
+
+However, the language rules as amended seem to require this to be treated as
+two single delimiters (as no separator is required between delimiters). That seems
+bad, as an editor may show this such that it looks like a single compound delimiter
+(a zero-width space is unlikely to be very obvious).
-this would be interpreted as two (simple) delimiters, as other_format characters
-are not allowed between the parts of a compound delimiter. However, an editor may
-very show this such that it looks like a single delimiter (a zero-width space is
-unlikely to be very obvious).
-
We could have fixed this oddity by allowing other_format characters between the
characters of a compound delimiter, but this does not seem very important for the
extra descriptive requirements. (And it would allow characters like soft hyphen
-in compound delimiters, which could cause weird-looking programs).
+in compound delimiters, which could cause weird-looking programs). We could also
+add a special rule to specifically make this case illegal, but that seems klunky at
+best. The net effect is that an other_format character can act as a separator.
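
The net effect can be illustrated with a toy scanner. This is a rough sketch under stated assumptions, not the language rule itself: it handles delimiter text only, treats Unicode general category Cf as other_format, and the names COMPOUND_DELIMITERS and scan_delimiters are hypothetical.

```python
import unicodedata

# Ada compound delimiters (RM 2.2).
COMPOUND_DELIMITERS = {"=>", "..", "**", ":=", "/=", ">=", "<=", "<<", ">>", "<>"}


def scan_delimiters(text: str) -> list:
    """Split delimiter-only text into tokens, skipping separators.

    White space and other_format (Cf) characters are both skipped, so an
    other_format character ends up acting as a separator.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace() or unicodedata.category(ch) == "Cf":
            i += 1  # separator: contributes no token
        elif text[i:i + 2] in COMPOUND_DELIMITERS:
            tokens.append(text[i:i + 2])  # two adjacent characters form one token
            i += 2
        else:
            tokens.append(ch)  # single delimiter
            i += 1
    return tokens


ZWS = "\u200b"  # zero width space
```

With this sketch, "/=" scans as one compound delimiter, while "/" + ZWS + "=" scans as two single delimiters even though an editor may display both the same way.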
+
+On question 2, we choose to do nothing now. Ada 2005 is based on ISO/IEC 10646:2003,
+which corresponds to Unicode 4.0, and we followed the recommendations of that version
+of Unicode. Moreover, changing identifier syntax now would introduce an
+incompatibility (probably fairly slight). One presumes a future version of Ada will
+update the Unicode reference, and that will have the effect of changing the characters
+allowed in identifiers slightly. (TR31 also has changed the character classifications
+that are allowed in identifiers.) That will also be mildly incompatible, and we
+think that would be a better time to make the change for other_format characters.
+
+It should be noted that the TR31 pronouncement is not absolute; they suggest that perhaps
+some characters should be allowed. Essentially, they don't appear to be very sure what
+the right answer is. Perhaps they'll change their mind again in the future, so it seems
+silly to be chasing their whims on this issue.
!corrigendum 2.2(7)
@@ -80,8 +130,8 @@
required between an @fa<identifier>, a reserved word, or a @fa<numeric_literal>
and an adjacent @fa<identifier>, reserved word, or @fa<numeric_literal>.
@dinst
-One of more @fa<other_format> characters are allowed anywhere that a separator
-is allowed; any such characters have no effect on the meaning of an Ada program.
+One or more other_format characters are allowed anywhere that a separator
+is[; any such characters have no effect on the meaning of an Ada program].
!ACATS Test
@@ -167,6 +217,80 @@
Right, I am just suggesting that the standard *allow* this other-format
character to appear as the first character of the program (and be
ignored), not that it be required.
+
+****************************************************************
+
+Summary of private e-mail on this topic between Randy Brukardt and Pascal Leroy,
+December 2007.
+
+Randy: Since the RM specifically notes the existence
+of the character class, it also has an obligation to say where it is
+allowed. There is explicit wording to allow it in identifiers, character
+literals, string literals, and comments (the last because of negative
+wording: it is not in the disallowed list of characters). There is no
+general permission to allow random characters in various places, so one has to
+presume the intent is only to allow them at these specific positions.
+
+Pascal: Hmm, I am quite sure that there was such a permission in my mind when
+I wrote AI 285. The !proposal section of this AI has: "The characters
+in the category other_format are effectively ignored in most lexical
+elements, with the exception that they are illegal in string_literals
+and character_literals". OK, a separator is not a lexical element, so
+perhaps that bit is missing.
+
+Randy: Right, but that was dropped somewhere along the way. And the e-mail
+in AI-395 makes it clear that we noticed this and *decided not to fix it*;
+we only changed character literals and string literals. (Yes, I re-read much
+of the e-mail on Friday night.)
+
+Pascal: Anyway, I agree that something is broken, because my reading would
+seem to imply that a control character in the middle of program text
+is OK. Ah, I see, we've lost the following sentence from RM95 2.1(1):
+"The only characters allowed outside of comments are the
+graphic_characters and format_effectors".
+
+So I agree that a BI is needed. I would rather phrase it in terms of
+lexical elements, than in terms of separators, though. Something like
+the following, after 2.2(7):
+
+"Characters in category other_format are allowed before or after any
+lexical element. [These characters have no effect on the meaning of an
+Ada program.]"
+
+AARM Note: It is not possible to have other_format characters after an
+identifier: any other_format characters appearing at this place are
+really part of the identifier. They do not alter the meaning of the
+identifier anyway.
+
+Randy: I don't like this note as written, because "after" covers a lot
+of ground, and surely they're allowed after a separator following an
+identifier -- and a separator is not a "lexical element".
+
+Perhaps "immediately following" instead of "after"?
+
+Pascal: No, that doesn't work either, because you want to allow an
+other_format in the middle of whitespace, and that is not immediately
+after any lexical element. So I think your solution based on
+separators, although a bit unpalatable, is the best we can do.
+
+---
+
+Pascal: Maybe we need a sentence like the following, before the last
+sentence of 2.1(4/2): "The only characters allowed outside of
+comments are those in categories other_format, format_effector and
+graphic_character".
+
+Randy: I'm not sure we need both. You can't put random characters into the text
+unless they're explicitly allowed. And this is a slight maintenance hazard.
+
+Pascal: I insist that I think this is important to lift ambiguities. Your
+statement that you cannot put random characters in the text is, at the
+moment, not supported by RM wording. I might as well argue that you
+can put any character you like unless they are explicitly forbidden by
+a syntax rule (and that would allow control characters in whitespace).
+It's better to be explicit. And I don't care too much about the
+maintenance problem, because it's not like the character categories
+that are relevant to the language are changing every day.
****************************************************************