Version 1.1 of ai05s/ai05-0079-1.txt

!standard 2.2(7)          07-12-07 AI05-0079-1/01
!class binding interpretation 07-12-07
!status work item 07-12-07
!status received 07-10-25
!priority Low
!difficulty Easy
!qualifier Omission
!subject Other_Format characters should be allowed wherever separators are allowed
!summary
Other_format characters should be allowed wherever the language allows separators. These characters have no meaning to an Ada program.
!question
A standard convention is to start a file with a zero width non-breaking space character, 16#0000_FEFF#. By looking at the first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE) or UTF-32 (LE/BE) encoded. That's the convention that Windows XP and Vista use, among others.
The definition of Ada programs does not allow this character (classified as Other, Format) outside of identifiers, character literals, string literals, and comments. Thus an Ada compiler has to play games with source representation in order to follow this convention.
More generally, it seems weird that a character that acts like a space is not allowed where spaces are. There doesn't seem to be any problem with allowing these more generally. Should these characters be allowed? (Yes.)
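As an illustration of the "look at the first bytes" convention described in the question, here is a minimal sketch in Python. It is not taken from any compiler; the function name sniff_encoding and the table _BOMS are hypothetical, and the byte patterns come from Python's standard codecs module. It assumes that source text never begins with a NUL character, since a UTF-16 LE mark followed by NUL is byte-for-byte the same as a UTF-32 LE mark.

```python
import codecs

# Hypothetical lookup table. The UTF-32 marks are listed before the
# UTF-16 marks because the UTF-32 LE mark (FF FE 00 00) begins with the
# same two bytes as the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_encoding(prefix: bytes):
    """Return (encoding name, mark length in bytes) for the first bytes
    of a file, or (None, 0) if the bytes do not start with an encoded
    16#0000_FEFF#."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would read the first four bytes of the file, call sniff_encoding, and skip the reported number of bytes before decoding the rest. A result of (None, 0) says only that no mark is present; the file could still be UTF-8 or Latin-1.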
!recommendation
(See Summary.)
!wording
Add a new paragraph after 2.2(7):
One or more other_format characters are allowed anywhere that a separator is allowed[; any such characters have no effect on the meaning of an Ada program].
[Editor's note: These brackets indicate text that would be marked as redundant in the AARM.]
Add an AARM implementation note after 2.1(18.a):
In order to process the ACATS, an implementation will have to be able to process Latin-1 and UTF-8 formatted files. UTF-8 files by convention start with the character zero width no-break space (16#0000_FEFF#), which UTF-8 encodes as 16#EF# 16#BB# 16#BF#. Latin-1 Ada source files do not start with these characters (the last two are not legal in Ada programs outside of comments). That means it is possible for a compiler to determine which of these file formats is used without operator intervention.
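The byte-level claim in this note can be checked with a short Python sketch. It is offered only as an illustration and is not part of the proposed wording; the names ZWNBS, utf8_bytes, as_latin1, and describe are hypothetical. It encodes 16#0000_FEFF# as UTF-8 and then reads the same three bytes as Latin-1 to show which characters a Latin-1 file would have to begin with in order to be mistaken for UTF-8.

```python
import unicodedata

ZWNBS = "\uFEFF"  # zero width no-break space, 16#0000_FEFF#

# The three bytes that open a UTF-8 file following the convention.
utf8_bytes = ZWNBS.encode("utf-8")

# The same three bytes interpreted as Latin-1, one character per byte.
as_latin1 = utf8_bytes.decode("latin-1")


def describe(text):
    """Return (Ada-style code, Unicode general category) per character."""
    return [(f"16#{ord(ch):02X}#", unicodedata.category(ch)) for ch in text]
```

Here describe(as_latin1) yields the codes 16#EF#, 16#BB#, and 16#BF# with general categories Ll, Pf, and Po: a lower case letter followed by two punctuation characters. Those two punctuation characters are the ones the note describes as not legal outside of comments, so a legal Latin-1 source file cannot start with this byte sequence.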
!discussion
There is one semi-weird side effect of this wording. If [ZWNBS] represents zero width no-break space and it appears inside a compound delimiter:
/[ZWNBS]=
then this would be interpreted as two (simple) delimiters, as other_format characters are not allowed between the parts of a compound delimiter. However, an editor may very well show this such that it looks like a single delimiter (a zero-width space is unlikely to be very obvious).
We could have fixed this oddity by allowing other_format characters between the characters of a compound delimiter, but this does not seem important enough to justify the extra descriptive requirements. (It would also allow characters like soft hyphen in compound delimiters, which could produce weird-looking programs.)
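The side effect discussed here can be reproduced with a toy scanner, sketched in Python. This is an illustration under stated assumptions, not a description of any real Ada lexer: it recognizes only delimiters, treats space, tab, and line feed as the separators, and models the proposed rule by skipping any character whose Unicode general category is Cf (other_format). The names COMPOUND, SIMPLE, is_skippable, and delimiters are hypothetical.

```python
import unicodedata

# Ada compound delimiters and the characters that are simple delimiters.
COMPOUND = {"=>", "..", "**", ":=", "/=", ">=", "<=", "<<", ">>", "<>"}
SIMPLE = set("&'()*+,-./:;<=>|")


def is_skippable(ch):
    """True for a (simplified) separator or, under the proposed rule,
    any other_format character."""
    return ch in " \t\n" or unicodedata.category(ch) == "Cf"


def delimiters(text):
    """Split text made only of delimiters and skippable characters
    into a list of delimiters."""
    out = []
    i = 0
    while i < len(text):
        if is_skippable(text[i]):
            i += 1                      # allowed between lexical elements
        elif text[i:i + 2] in COMPOUND:
            out.append(text[i:i + 2])   # both characters must be adjacent
            i += 2
        elif text[i] in SIMPLE:
            out.append(text[i])
            i += 1
        else:
            raise ValueError(f"not a delimiter: {text[i]!r}")
    return out
```

With this sketch, delimiters("/=") gives the single compound delimiter "/=", while delimiters("/\uFEFF=") gives the two simple delimiters "/" and "=", even though many editors would display both inputs identically. A soft hyphen (16#AD#) in the same position behaves the same way, since it is also an other_format character.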
!corrigendum 2.2(7)
Insert after the paragraph:
One or more separators are allowed between any two adjacent lexical elements, before the first of each compilation, or after the last. At least one separator is required between an identifier, a reserved word, or a numeric_literal and an adjacent identifier, reserved word, or numeric_literal.
the new paragraph:
One or more other_format characters are allowed anywhere that a separator is allowed; any such characters have no effect on the meaning of an Ada program.
!ACATS Test
An ACATS C-Test could be tried.
!appendix

From: Robert Dewar
Sent: Thursday, October 25, 2007  1:15 PM

A standard convention is to start a file with a non-breaking
zero width space character, 16#0000_FEFF#. By looking at the
first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE)
or UTF-32 (LE/BE) encoded. That's the convention that for
example Windows XP and Vista use.

It is certainly nice for Ada compilers to recognize this. Now
you can always regard the BOM as part of your source representation
and recognize it as a kind of prefix to the actual file (that's
what we have done in GNAT), but I think it would be nice for the
standard to allow/require compilers to accept this character as
a formatting character. By mentioning this in the standard we
help to make sure that all compilers will follow this convention.

****************************************************************

From: Tucker Taft
Sent: Thursday, October 25, 2007  2:16 PM

I could see this as Implementation Advice, but
not as more than that, since we have generally
agreed that source representation is not part
of the standard.

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 25, 2007  2:50 PM

It is interesting that this character (16#0000_FEFF#) is defined as Other,
Format by Unicode, and *not* a Separator, Space. Thus it is not covered by
the existing rules. (Yes, I just went and looked this up.)

There is wording in the standard to specifically allow "other_format" to
occur in identifiers and the like, but no such wording for the program text
outside of composite tokens.

So I think I agree with Robert; there should be some statement that
characters in class "other_format" are allowed between tokens (that is, in
separators). Otherwise, the fact that they are explicitly allowed in some
contexts could suggest that they are not allowed in other contexts (and that
is not the intent).

I also agree with Tucker that any specific recommendations about source
format (such that it start with a particular character) should be
Implementation Advice.

****************************************************************

From: Robert Dewar
Sent: Thursday, October 25, 2007  3:05 PM

> I could see this as Implementation Advice, but
> not as more than that, since we have generally
> agreed that source representation is not part
> of the standard.

But we specify what sequence of characters is allowed, and I am
suggesting that we specifically allow this character to appear
as the first character of a source program (so that it is a
standard part of the text of the program, rather than being
considered as a source representation gizmo).

****************************************************************

From: Robert Dewar
Sent: Thursday, October 25, 2007  3:07 PM

> I also agree with Tucker that any specific recommendations about source
> format (such that it start with a particular character) should be
> Implementation Advice.

Right, I am just suggesting that the standard *allow* this other-format
character to appear as the first character of the program (and be 
ignored), not that it be required.

****************************************************************

