!standard 2.2(7) 07-12-07 AI05-0079-1/01
!class binding interpretation 07-12-07
!status work item 07-12-07
!status received 07-10-25
!priority Low
!difficulty Easy
!qualifier Omission
!subject Other_Format characters should be allowed wherever separators are allowed

!summary

Other_format characters should be allowed wherever the language allows separators. These characters have no meaning to an Ada program.

!question

(1) A standard convention is to start a file with a zero width non-breaking space character, 16#0000_FEFF#. By looking at the first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE) or UTF-32 (LE/BE) encoded. That's the convention that Windows XP and Vista use, among others.

The definition of Ada programs does not allow this character (classified as Other, Format) outside of identifiers, character literals, string literals, and comments. Thus an Ada compiler has to play games with source representation in order to follow this convention.

More generally, it seems weird that a character that acts like a space is not allowed where spaces are. There doesn't seem to be any problem with allowing these more generally. Should these characters be allowed? (Yes.)

(2) Recent versions of the Unicode technical report on identifiers (TR31: http://www.unicode.org/reports/tr31/) say that characters in class other_format should not be allowed in programming language identifiers for security reasons. (This changed since the Amendment started standardization.)

The problem is basically that these characters can be used to write an identifier that looks like Foo_Bar but is actually FoaB_or. Stripping other_format results in FoaB_or, which is what the compiler sees, and lo and behold, I have introduced a vulnerability in a source file that looks perfectly kosher on your screen.

Should the definition of Ada be changed? (No.)

!recommendation

(See Summary.)
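[Editor's illustration, not part of the AI: the encoding-detection convention described in question (1) can be sketched as follows. This is a minimal sketch in Python, not how any particular Ada compiler does it. The function name sniff_encoding and the table _BOMS are hypothetical; the BOM constants come from Python's standard codecs module. A file with no BOM yields None, and the caller would then fall back to some default such as Latin-1.]

```python
import codecs
from typing import Optional

# Check the 4-byte marks before the 2-byte ones: the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading 16#FEFF# mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

[A UTF-16 LE file whose first character after the mark is NUL would be misread as UTF-32 LE by this sketch; that cannot happen for a legal Ada source text, but a general-purpose tool would need a further check.]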
!wording

Add at the end of 2.1(4/2):

The only characters allowed outside of comments are those in categories other_format, format_effector and graphic_character.

Add a new paragraph after 2.2(7):

One or more other_format characters are allowed anywhere that a separator is[; any such characters have no effect on the meaning of an Ada program].

[Editor's note: These brackets indicate text that would be marked as redundant in the AARM.]

AARM Ramification: It is not possible to have other_format characters immediately following an identifier: any other_format characters appearing at that place are part of the identifier. Such characters do not alter the meaning of the identifier anyway.

Add an AARM implementation note after 2.1(18.a):

In order to process the ACATS, an implementation will have to have the ability to process Latin-1 and UTF-8 formatted files. UTF-8 files by convention start with the character zero width no-break space (16#0000FEFF#), also known as byte order mark (BOM); Latin-1 Ada source files do not start with these characters (the BOM is encoded as 16#EF# 16#BB# 16#BF# for UTF-8; the last two characters are not legal in Ada programs outside of comments). That means it is possible for a compiler to determine which of these file formats is used without operator intervention.

!discussion

We need wording to say that characters not explicitly allowed are prohibited in programs. This doesn't have that much force, since such characters can be defined as part of the source representation. But the goal is that a file encoded in UTF-8 can be mapped directly to the language rules, without any source representation games.

There is one semi-weird side-effect of this wording. If [ZWS] represents zero width space, then if it appears in a compound delimiter:

    /[ZWS]=

the program ought to be illegal, as other_format characters are not allowed between the parts of a compound delimiter (this is not before or after a lexical element).
However, the language rules as amended seem to require this to be treated as two single delimiters (as no separator is required between delimiters). That seems bad, as an editor may show this such that it looks like a single compound delimiter (a zero-width space is unlikely to be very obvious).

We could have fixed this oddity by allowing other_format characters between the characters of a compound delimiter, but this does not seem important enough to justify the extra descriptive requirements. (And it would allow characters like soft hyphen in compound delimiters, which could cause weird-looking programs.) We could also add a special rule to specifically make this case illegal, but that seems klunky at best.

The net effect is that an other_format character can act as a separator.

On question 2, we choose to do nothing now. Ada 2005 is based on ISO/IEC 10646:2003, which corresponds to Unicode 4.0, and we followed the recommendations of that version of Unicode. Moreover, changing identifier syntax now would introduce an incompatibility (probably fairly slight). One presumes a future version of Ada will update the Unicode reference, and that will have the effect of changing the characters allowed in identifiers slightly. (TR31 also has changed the character classifications that are allowed in identifiers.) That will also be mildly incompatible, and we think that would be a better time to make the change for other_format characters.

It should be noted that the TR31 pronouncement is not absolute; they suggest that perhaps some characters should be allowed. Essentially, they don't appear to be very sure what the right answer is. Perhaps they'll change their mind again in the future, so it seems silly to be chasing their whims on this issue.

!corrigendum 2.2(7)

@dinsa
One or more separators are allowed between any two adjacent lexical elements, before the first of each @fa<compilation>, or after the last.
At least one separator is required between an @fa<identifier>, a reserved word, or a @fa<numeric_literal> and an adjacent @fa<identifier>, reserved word, or @fa<numeric_literal>.
@dinst
One or more other_format characters are allowed anywhere that a separator is[; any such characters have no effect on the meaning of an Ada program].

!ACATS Test

An ACATS C-Test could be tried.

!appendix

From: Robert Dewar
Sent: Thursday, October 25, 2007 1:15 PM

A standard convention is to start a file with a non-breaking zero width space character, 16#0000_FEFF#. By looking at the first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE) or UTF-32 (LE/BE) encoded. That's the convention that for example Windows XP and Vista use.

It is certainly nice for Ada compilers to recognize this. Now you can always regard the BOM as part of your source representation and recognize it as a kind of prefix to the actual file (that's what we have done in GNAT), but I think it would be nice for the standard to allow/require compilers to accept this character as a formatting character.

By mentioning this in the standard we help to make sure that all compilers will follow this convention.

****************************************************************

From: Tucker Taft
Sent: Thursday, October 25, 2007 2:16 PM

I could see this as Implementation Advice, but not as more than that, since we have generally agreed that source representation is not part of the standard.

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 25, 2007 2:50 PM

It is interesting that this character (16#0000_FEFF#) is defined as Other, Format by Unicode, and *not* a Separator, Space. Thus it is not covered by the existing rules. (Yes, I just went and looked this up.) There is wording in the standard to specifically allow "other_format" to occur in identifiers and the like, but no such wording for the program text outside of composite tokens.
So I think I agree with Robert; there should be some statement that characters in class "other_format" are allowed between tokens (that is, in separators). Otherwise, the fact that they are explicitly allowed in some contexts could suggest that they are not allowed in other contexts (and that is not the intent).

I also agree with Tucker that any specific recommendations about source format (such that it start with a particular character) should be Implementation Advice.

****************************************************************

From: Robert Dewar
Sent: Thursday, October 25, 2007 3:05 PM

> I could see this as Implementation Advice, but
> not as more than that, since we have generally
> agreed that source representation is not part
> of the standard.

But we specify what sequence of characters is allowed, and I am suggesting that we specifically allow this character to appear as the first character of a source program (so that it is a standard part of the text of the program, rather than being considered as a source representation gizmo).

****************************************************************

From: Robert Dewar
Sent: Thursday, October 25, 2007 3:07 PM

> I also agree with Tucker that any specific recommendations about source
> format (such that it start with a particular character) should be
> Implementation Advice.

Right, I am just suggesting that the standard *allow* this other-format character to appear as the first character of the program (and be ignored), not that it be required.

****************************************************************

Summary of private e-mail on this topic between Randy Brukardt and Pascal Leroy, December 2007.

Randy: Since the RM specifically notes the existence of the character class, it also has an obligation to say where it is allowed.
There is explicit wording to allow it in identifiers, character literals, string literals, and comments (the last because of negative wording: it is not in the disallowed list of characters). There is no general permission to allow random characters in various places, so one has to presume the intent is only to allow them at these specific positions.

Pascal: Hmm, I am quite sure that there was such a permission in my mind when I wrote AI 285. The !proposal section of this AI has: "The characters in the category other_format are effectively ignored in most lexical elements, with the exception that they are illegal in string_literals and character_literals". OK, a separator is not a lexical element, so perhaps that bit is missing.

Randy: Right, but that was dropped somewhere along the way. And the e-mail in AI-395 makes it clear that we noticed this and *decided not to fix it*; we only changed character literals and string literals. (Yes, I re-read much of the e-mail on Friday night.)

Pascal: Anyway, I agree that something is broken, because my reading would seem to imply that a control character in the middle of program text is OK. Ah, I see, we've lost the following sentence from RM95 2.1(1): "The only characters allowed outside of comments are the graphic_characters and format_effectors".

So I agree that a BI is needed. I would rather phrase it in terms of lexical elements, than in terms of separators, though. Something like the following, after 2.2(7):

"Characters in category other_format are allowed before or after any lexical element. [These characters have no effect on the meaning of an Ada program.]"

AARM Note: It is not possible to have other_format characters after an identifier: any other_format characters appearing at this place are really part of the identifier. They do not alter the meaning of the identifier anyway.
Randy: I don't like this note as written, because "after" covers a lot of ground, and surely they're allowed after a separator following an identifier -- and a separator is not a "lexical element". Perhaps "immediately following" instead of "after"?

Pascal: No, that doesn't work either, because you want to allow an other_format in the middle of whitespace, and that is not immediately after any lexical element. So I think your solution based on separators, although a bit unpalatable, is the best we can do.

---

Pascal: Maybe we need a sentence like the following, before the last sentence of 2.1(4/2):

"The only characters allowed outside of comments are those in categories other_format, format_effector and graphic_character".

Randy: I'm not sure we need both. You can't put random characters into the text unless they're explicitly allowed. And this is a slight maintenance hazard.

Pascal: I insist that I think this is important to lift ambiguities. Your statement that you cannot put random characters in the text is, at the moment, not supported by RM wording. I might as well argue that you can put any character you like unless they are explicitly forbidden by a syntax rule (and that would allow control characters in whitespace). It's better to be explicit. And I don't care too much about the maintenance problem, because it's not like the character categories that are relevant to the language are changing every day.

****************************************************************
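[Editor's illustration, not part of the mail thread: the category distinction at issue above (Other, Format versus Separator, Space) and the claim that other_format characters do not alter the meaning of an identifier can be checked with a short sketch. This is Python, using the standard unicodedata module, where the general category "Cf" corresponds to other_format and "Zs" to separator_space. The helper names is_other_format and strip_other_format are hypothetical. Note the assumption that Python's Unicode tables are newer than ISO/IEC 10646:2003, so a few characters may be classified differently than Ada 2005 would classify them.]

```python
import unicodedata


def is_other_format(ch: str) -> bool:
    """True if ch is in Unicode general category Cf (Other, Format)."""
    return unicodedata.category(ch) == "Cf"


def strip_other_format(text: str) -> str:
    """Drop every other_format character, as an identifier comparison
    that ignores such characters would effectively do."""
    return "".join(ch for ch in text if not is_other_format(ch))


# 16#FEFF# (the BOM) and soft hyphen are other_format; a plain space is not.
bom_is_format = is_other_format("\ufeff")
soft_hyphen_is_format = is_other_format("\u00ad")
space_category = unicodedata.category(" ")

# An identifier containing a soft hyphen reduces to the plain identifier.
same_identifier = strip_other_format("Foo\u00ad_Bar") == "Foo_Bar"
```

[The same test applied to each character between lexical elements would be one way to implement the proposed rule that other_format characters are allowed anywhere a separator is.]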