!standard 2.2(7)                                   07-12-07  AI05-0079-1/01
!class binding interpretation 07-12-07
!status work item 07-12-07
!status received 07-10-25
!priority Low
!difficulty Easy
!qualifier Omission
!subject Other_Format characters should be allowed wherever separators are allowed

!summary

Other_format characters should be allowed wherever the language allows
separators. These characters have no meaning to an Ada program.

!question

A standard convention is to start a file with a zero width non-breaking space
character, 16#0000_FEFF#. By looking at the first bytes you can tell if the file
is UTF-8, UTF-16 (LE/BE) or UTF-32 (LE/BE) encoded. That's the convention that
Windows XP and Vista use, among others.

The definition of Ada programs does not allow this character (classified as
Other, Format) outside of identifiers, character literals, string literals, and
comments. Thus an Ada compiler has to play games with source representation in
order to follow this convention.

More generally, it seems weird that a character that acts like a space is not
allowed where spaces are. There doesn't seem to be any problem with allowing
these more generally.

Should these characters be allowed? (Yes.)

!recommendation

(See Summary.)

!wording

Add a new paragraph after 2.2(7):

One or more other_format characters are allowed anywhere that a separator is
allowed[; any such characters have no effect on the meaning of an Ada program].

[Editor's note: These brackets indicate text that would be marked as redundant
in the AARM.]

Add an AARM implementation note after 2.1(18.a):

In order to process the ACATS, an implementation will have to have the ability
to process Latin-1 and UTF-8 formatted files. UTF-8 files by convention start
with the character zero width no-break space (16#0000FEFF#); Latin-1 Ada source
files do not start with these characters (the ZWNBS is encoded as 16#EF# 16#BB#
16#BF#; the last two characters are not legal in Ada programs outside of comments).
That means it is possible for a compiler to determine which of these file
formats is used without operator intervention.

!discussion

There is one semi-weird side-effect of this wording. If [ZWNBS] represents zero
width no-break space, then if it appears in a compound delimiter:
    /[ZWNBS]=
this would be interpreted as two (simple) delimiters, as other_format characters
are not allowed between the parts of a compound delimiter. However, an editor may
very well show this such that it looks like a single delimiter (a zero-width
space is unlikely to be very obvious).

We could have fixed this oddity by allowing other_format characters between the
characters of a compound delimiter, but this does not seem very important for
the extra descriptive requirements. (And it would allow characters like soft
hyphen in compound delimiters, which could cause weird-looking programs).

!corrigendum 2.2(7)

@dinsa
One or more separators are allowed between any two adjacent lexical elements,
before the first of each @fa<compilation>, or after the last. At least one
separator is required between an @fa<identifier>, a reserved word, or a
@fa<numeric_literal> and an adjacent @fa<identifier>, reserved word, or
@fa<numeric_literal>.
@dinst
One or more @fa<other_format> characters are allowed anywhere that a separator
is allowed; any such characters have no effect on the meaning of an Ada program.

!ACATS Test

An ACATS C-Test could be tried.

!appendix

From: Robert Dewar
Sent: Thursday, October 25, 2007  1:15 PM

A standard convention is to start a file with a non-breaking zero width space
character, 16#0000_FEFF#. By looking at the first bytes you can tell if the file
is UTF-8, UTF-16 (LE/BE) or UTF-32 (LE/BE) encoded. That's the convention that
for example Windows XP and Vista use. It is certainly nice for Ada compilers to
recognize this.

Now you can always regard the BOM as part of your source representation and
recognize it as a kind of prefix to the actual file (that's what we have done in
GNAT), but I think it would be nice for the standard to allow/require compilers
to accept this character as a formatting character. By mentioning this in the
standard we help to make sure that all compilers will follow this convention.

****************************************************************

From: Tucker Taft
Sent: Thursday, October 25, 2007  2:16 PM

I could see this as Implementation Advice, but not as more than that, since we
have generally agreed that source representation is not part of the standard.

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 25, 2007  2:50 PM

It is interesting that this character (16#0000_FEFF#) is defined as Other,
Format by Unicode, and *not* a Separator, Space. Thus it is not covered by the
existing rules. (Yes, I just went and looked this up.)

There is wording in the standard to specifically allow "other_format" to occur
in identifiers and the like, but no such wording for the program text outside of
composite tokens.

So I think I agree with Robert; there should be some statement that characters
in class "other_format" are allowed between tokens (that is, in separators).
Otherwise, the fact that they are explicitly allowed in some contexts could
suggest that they are not allowed in other contexts (and that is not the intent).

I also agree with Tucker that any specific recommendations about source
format (such that it start with a particular character) should be
Implementation Advice.

****************************************************************

From: Robert Dewar
Sent: Thursday, October 25, 2007  3:05 PM

> I could see this as Implementation Advice, but
> not as more than that, since we have generally
> agreed that source representation is not part
> of the standard.

But we specify what sequence of characters is allowed, and I am suggesting that
we specifically allow this character to appear as the first character of a
source program (so that it is a standard part of the text of the program, rather
than being considered as a source representation gizmo).

****************************************************************

From: Robert Dewar
Sent: Thursday, October 25, 2007  3:07 PM

> I also agree with Tucker that any specific recommendations about source
> format (such that it start with a particular character) should be
> Implementation Advice.

Right, I am just suggesting that the standard *allow* this other-format
character to appear as the first character of the program (and be ignored), not
that it be required.

****************************************************************
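
As a rough illustration of the first-bytes check that the question and the
implementation note describe, the Python sketch below guesses the encoding of a
source file from its leading bytes. It is not taken from this AI or from any
compiler. The names detect_source_encoding and decode_source, and the Latin-1
fallback for files without a mark, are assumptions made for the example. The
sketch leaves the 16#0000_FEFF# character in the decoded text, on the assumption
that a lexer following the proposed wording would then skip it as an
other_format character.

```python
import codecs

# Candidate marks, longest first: the UTF-32 LE mark (FF FE 00 00) begins
# with the UTF-16 LE mark (FF FE), so the 4-byte marks must be tried first.
# Caveat: a UTF-16 LE file whose first character after the mark is U+0000
# would be misread as UTF-32 LE by this simple check.
_MARKS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_source_encoding(data: bytes, default: str = "latin-1"):
    """Return (encoding_name, mark_length) for the given file contents.

    The default of Latin-1 for unmarked files is an assumption of this
    sketch, mirroring the two formats named in the implementation note.
    """
    for mark, name in _MARKS:
        if data.startswith(mark):
            return name, len(mark)
    return default, 0


def decode_source(data: bytes) -> str:
    """Decode the file, keeping any leading U+FEFF in the resulting text."""
    encoding, _ = detect_source_encoding(data)
    # The plain "utf-8", "utf-16-le", ... codecs do not strip the mark,
    # so it survives as the first character of the decoded program text.
    return data.decode(encoding)
```

For example, decode_source(b"\xef\xbb\xbfnull;") yields a string whose first
character is 16#0000_FEFF#, followed by "null;". Treating the mark as a prefix
to be dropped, as described for GNAT in the mail above, would instead slice off
the first mark_length bytes before decoding.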