Version 1.6 of ai05s/ai05-0079-1.txt
!standard 2.1(4/2) 08-05-21 AI05-0079-1/04
!standard 2.2(7)
!class binding interpretation 07-12-07
!status WG9 Approved 08-06-20
!status ARG Approved 7-0-1 08-02-09
!status work item 07-12-07
!status received 07-10-25
!priority Low
!difficulty Easy
!qualifier Omission
!subject An other_format character should be allowed wherever a separator is allowed
!summary
An other_format character should be allowed wherever the language allows a
separator. These characters have no meaning to an Ada program. Characters
that are not in categories other_format, format_effector and graphic_character
are not allowed outside of comments in an Ada program.
!question
A standard convention is to start a file with a zero width
no-break space character, 16#0000_FEFF#. By looking at the
first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE)
or UTF-32 (LE/BE) encoded. That's the convention that Windows XP
and Vista use, among others.
The definition of Ada programs does not allow this character
(classified as Other, Format) outside of identifiers, character
literals, string literals, and comments. Thus an Ada compiler
has to play games with source representation in order to follow
this convention.
More generally, it seems weird that a character that acts like
a space is not allowed where spaces are. There doesn't seem to
be any problem with allowing these more generally. Should these
characters be allowed? (Yes.)
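As an illustration of the classification at issue, the following minimal Python
sketch looks up Unicode general categories using the standard unicodedata module.
The SAMPLES table and the general_category helper are names invented for this
example; they are not part of any Ada implementation. The category codes are
"Cf" for Other, Format (other_format) and "Zs" for Separator, Space.

```python
import unicodedata

# Hypothetical sample table for this illustration: a few characters
# relevant to the question, keyed by a descriptive label.
SAMPLES = {
    "zero width no-break space (BOM)": "\uFEFF",
    "zero width space": "\u200B",
    "space": "\u0020",
}


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        # Print the code position in Ada based-literal style.
        print(f"16#{ord(ch):04X}# {label}: {general_category(ch)}")
```

The point of the sketch is only that 16#FEFF# does not fall in the space
separator category, so rules written in terms of separators do not cover it.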
!recommendation
(See Summary.)
!wording
Add at the end of 2.1(4/2):
The only characters allowed outside of comments are those in categories
other_format, format_effector, and graphic_character.
Add a new paragraph after 2.2(7):
One or more other_format characters are allowed anywhere that a separator
is[; any such characters have no effect on the meaning of an Ada program].
[Editor's note: These brackets indicate text that would be marked as redundant in
the AARM.]
AARM Ramification: It is not possible to have other_format characters immediately
following an identifier: any other_format characters appearing at that place
are part of the identifier. Such characters do not alter the meaning of the
identifier anyway.
Add an AARM implementation note after 2.1(18.a):
In order to process the ACATS, an implementation will have to have the ability to
process Latin-1 and UTF-8 formatted files. UTF-8 files by convention start with
the character zero width no-break space (16#0000FEFF#), also known as byte order
mark (BOM); Latin-1 Ada source files do not start with these characters (the BOM
is encoded as 16#EF# 16#BB# 16#BF# for UTF-8; the last two characters are not legal
in Ada programs outside of comments). That means it is possible for a compiler to
determine which of these file formats is used without operator intervention.
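The detection described in this note could be sketched roughly as follows, in
Python. This is an illustration under stated assumptions, not a description of
how any particular compiler works: the function names detect_source_encoding
and read_program_text are hypothetical, and only the two encodings named in the
note are considered. The BOM is deliberately kept in the decoded text, since
under the wording above it is an ordinary other_format character appearing
before the first lexical element.

```python
import codecs


def detect_source_encoding(data: bytes) -> str:
    """Choose between UTF-8 and Latin-1 by looking at the first bytes.

    Assumption for this sketch: a UTF-8 source file always starts with
    the encoded BOM (16#EF# 16#BB# 16#BF#); anything else is Latin-1.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return "latin-1"


def read_program_text(data: bytes) -> str:
    """Decode the file; the plain "utf-8" codec does not strip the BOM."""
    return data.decode(detect_source_encoding(data))


# Two hypothetical source files holding the same program text.
PROGRAM = "procedure P is begin null; end P;"
utf8_source = codecs.BOM_UTF8 + PROGRAM.encode("utf-8")
latin1_source = PROGRAM.encode("latin-1")

if __name__ == "__main__":
    for raw in (utf8_source, latin1_source):
        text = read_program_text(raw)
        print(detect_source_encoding(raw), repr(text[0]))
```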
!discussion
We need wording to say that characters not explicitly allowed are prohibited in
programs. This wording was unintentionally dropped from the Ada 95 version (which
had it in 2.1(1)). This wording doesn't have that much force, since such characters
can be defined as part of the source representation. But the goal is that a file
encoded in UTF-8 can be mapped directly to the language rules, without any source
representation games.
There is one semi-weird side-effect of this wording. If [ZWS] represents zero width
space, then if it appears in a compound delimiter:
/[ZWS]=
the program ought to be illegal, as other_format characters are not allowed between
the parts of a compound delimiter (this is not before or after a lexical element).
However, the language rules as amended seem to require this to be treated as
two single delimiters (as no separator is required between delimiters). That seems
bad, as an editor may show this such that it looks like a single compound delimiter
(a zero-width space is unlikely to be very obvious).
We could have fixed this oddity by allowing other_format characters between the
characters of a compound delimiter, but the problem does not seem important enough
to justify the extra descriptive requirements. (And it would allow characters like
soft hyphen in compound delimiters, which could produce weird-looking programs.) We
could also add a special rule to make this case illegal, but that seems clunky at
best. The net effect is that an other_format character can act as a separator.
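The side effect discussed above can be illustrated with a toy scanner, written
here in Python. It is a sketch only: scan_delimiters and is_other_format are
hypothetical names, the scanner handles nothing but delimiters, spaces, and
other_format characters, and other_format is approximated by Unicode general
category "Cf". It skips other_format characters wherever a space would be
skipped, which is one reading of the amended rule.

```python
import unicodedata

# Ada compound and single delimiters (2.2).
COMPOUND_DELIMITERS = ("=>", "..", "**", ":=", "/=", ">=", "<=", "<<", ">>", "<>")
SINGLE_DELIMITERS = set("&'()*+,-./:;<=>|")


def is_other_format(ch: str) -> bool:
    """Approximate other_format as Unicode general category Cf."""
    return unicodedata.category(ch) == "Cf"


def scan_delimiters(text: str) -> list:
    """Split text into delimiters, skipping spaces and other_format."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " " or is_other_format(ch):
            i += 1  # treated like a separator: contributes no token
        elif text[i:i + 2] in COMPOUND_DELIMITERS:
            tokens.append(text[i:i + 2])  # two adjacent characters only
            i += 2
        elif ch in SINGLE_DELIMITERS:
            tokens.append(ch)
            i += 1
        else:
            raise ValueError(f"not handled by this sketch: {ch!r}")
    return tokens


if __name__ == "__main__":
    print(scan_delimiters("/="))        # one compound delimiter
    print(scan_delimiters("/\u200B="))  # zero width space in between
```

With a zero width space between the two characters, the scanner yields two
single delimiters rather than one compound delimiter, even though an editor may
display the two inputs identically.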
!corrigendum 2.1(4/2)
Replace the paragraph:
The coded representation for characters is implementation defined
(it need not be a representation defined within ISO/IEC 10646:2003).
A character whose relative code position in its plane is 16#FFFE# or 16#FFFF#
is not allowed anywhere in the text of a program.
by:
The coded representation for characters is implementation defined
(it need not be a representation defined within ISO/IEC 10646:2003).
A character whose relative code position in its plane is 16#FFFE# or 16#FFFF#
is not allowed anywhere in the text of a program.
The only characters allowed outside of comments are those in categories
other_format, format_effector, and graphic_character.
!corrigendum 2.2(7)
Insert after the paragraph:
One or more separators are allowed between any two adjacent lexical elements, before
the first of each compilation, or after the last. At least one separator is
required between an identifier, a reserved word, or a numeric_literal
and an adjacent identifier, reserved word, or numeric_literal.
the new paragraph:
One or more other_format characters are allowed anywhere that a separator
is; any such characters have no effect on the meaning of an Ada program.
!ACATS Test
An ACATS C-Test could be tried.
!appendix
From: Robert Dewar
Sent: Thursday, October 25, 2007 1:15 PM
A standard convention is to start a file with a non-breaking
zero width space character, 16#0000_FEFF#. By looking at the
first bytes you can tell if the file is UTF-8, UTF-16 (LE/BE)
or UTF-32 (LE/BE) encoded. That's the convention that for
example Windows XP and Vista use.
It is certainly nice for Ada compilers to recognize this. Now
you can always regard the BOM as part of your source representation
and recognize it as a kind of prefix to the actual file (that's
what we have done in GNAT), but I think it would be nice for the
standard to allow/require compilers to accept this character as
a formatting character. By mentioning this in the standard we
help to make sure that all compilers will follow this convention.
****************************************************************
From: Tucker Taft
Sent: Thursday, October 25, 2007 2:16 PM
I could see this as Implementation Advice, but
not as more than that, since we have generally
agreed that source representation is not part
of the standard.
****************************************************************
From: Randy Brukardt
Sent: Thursday, October 25, 2007 2:50 PM
It is interesting that this character (16#0000_FEFF#) is defined as Other,
Format by Unicode, and *not* a Separator, Space. Thus it is not covered by
the existing rules. (Yes, I just went and looked this up.)
There is wording in the standard to specifically allow "other_format" to
occur in identifiers and the like, but no such wording for the program text
outside of composite tokens.
So I think I agree with Robert; there should be some statement that
characters in class "other_format" are allowed between tokens (that is, in
separators). Otherwise, the fact that they are explicitly allowed in some
contexts could suggest that they are not allowed in other contexts (and that
is not the intent).
I also agree with Tucker that any specific recommendations about source
format (such that it start with a particular character) should be
Implementation Advice.
****************************************************************
From: Robert Dewar
Sent: Thursday, October 25, 2007 3:05 PM
> I could see this as Implementation Advice, but
> not as more than that, since we have generally
> agreed that source representation is not part
> of the standard.
But we specify what sequence of characters is allowed, and I am
suggesting that we specifically allow this character to appear
as the first character of a source program (so that it is a
standard part of the text of the program, rather than being
considered as a source representation gizmo).
****************************************************************
From: Robert Dewar
Sent: Thursday, October 25, 2007 3:07 PM
> I also agree with Tucker that any specific recommendations about source
> format (such that it start with a particular character) should be
> Implementation Advice.
Right, I am just suggesting that the standard *allow* this other-format
character to appear as the first character of the program (and be
ignored), not that it be required.
****************************************************************
Summary of private e-mail on this topic between Randy Brukardt and Pascal Leroy,
December 2007.
Randy: Since the RM specifically notes the existence
of the character class, it also has an obligation to say where it is
allowed. There is explicit wording to allow it in identifiers, character
literals, string literals, and comments (the last because of negative
wording: it is not in the disallowed list of characters). There is no
general permission to allow random characters in various places, so one has to
presume the intent is only to allow them at these specific positions.
Pascal: Hmm, I am quite sure that there was such a permission in my mind when
I wrote AI 285. The !proposal section of this AI has: "The characters
in the category other_format are effectively ignored in most lexical
elements, with the exception that they are illegal in string_literals
and character_literals". OK, a separator is not a lexical element, so
perhaps that bit is missing.
Randy: Right, but that was dropped somewhere along the way. And the e-mail
in AI-395 makes it clear that we noticed this and *decided not to fix it*;
we only changed character literals and string literals. (Yes, I re-read much
of the e-mail on Friday night.)
Pascal: Anyway, I agree that something is broken, because my reading would
seem to imply that a control character in the middle of program text
is OK. Ah, I see, we've lost the following sentence from RM95 2.1(1):
"The only characters allowed outside of comments are the
graphic_characters and format_effectors".
So I agree that a BI is needed. I would rather phrase it in terms of
lexical elements, than in terms of separators, though. Something like
the following, after 2.2(7):
"Characters in category other_format are allowed before or after any
lexical element. [These characters have no effect on the meaning of an
Ada program.]"
AARM Note: It is not possible to have other_format characters after an
identifier: any other_format characters appearing at this place are
really part of the identifier. They do not alter the meaning of the
identifier anyway.
Randy: I don't like this note as written, because "after" covers a lot
of ground, and surely they're allowed after a separator following an
identifier -- and a separator is not a "lexical element".
Perhaps "immediately following" instead of "after"?
Pascal: No, that doesn't work either, because you want to allow an
other_format in the middle of whitespace, and that is not immediately
after any lexical element. So I think your solution based on
separators, although a bit unpalatable, is the best we can do.
---
Pascal: Maybe we need a sentence like the following, before the last
sentence of 2.1(4/2): "The only characters allowed outside of
comments are those in categories other_format, format_effector and
graphic_character".
Randy: I'm not sure we need both. You can't put random characters into the text
unless they're explicitly allowed. And this is a slight maintenance hazard.
Pascal: I insist that I think this is important to lift ambiguities. Your
statement that you cannot put random characters in the text is, at the
moment, not supported by RM wording. I might as well argue that you
can put any character you like unless they are explicitly forbidden by
a syntax rule (and that would allow control characters in whitespace).
It's better to be explicit. And I don't care too much about the
maintenance problem, because it's not like the character categories
that are relevant to the language are changing every day.
****************************************************************