Version 1.4 of ai05s/ai05-0286-1.txt
!standard 2.1(16/2) 12-02-24 AI05-0286-1/03
!standard A.16(126/2)
!class Amendment 12-02-10
!status Amendment 2012 12-02-10
!status work item 12-02-10
!status received 11-10-01
!priority Medium
!difficulty Medium
!subject Internationalization of Ada
!summary
An Implementation Requirement is added so that Ada compilers must directly
support source code in UTF-8 encoding.
Equal_Case_Insensitive and Hash_Case_Insensitive are added for Wide_String and
Wide_Wide_String.
We recommend that implementations support UTF-8 encoded input and output for
file and directory operations, exception information, and the command line. But
we have no advice on how to do this at this time; implementation experience is
necessary in order to do this without breaking existing code and/or adding many
rarely used subprograms.
!proposal
Full support for Unicode characters is becoming increasingly important. Ada 2005
added support for international identifiers in Ada programs, yet the support for
Unicode is still incomplete in Ada.
We recommend that Ada adopt some solution so that:
(1) Compilers are required to support Unicode characters in source form, by
requiring some form of standard source representation (presumably UTF-8);
(2) File and directory operations should support Unicode characters (presuming
that the target file system does so);
(3) Exception messages and exception information should support Unicode
characters;
(4) Command lines should support Unicode characters
(presuming that the target system allows these).
[Editor's note: The Swiss comment ends here. See also the discussion section.]
(5) Case insensitive operations should be provided for Wide_String and
Wide_Wide_String in the normal way (not just for String). This should use simple
case folding.
!wording
For (1), add the following before 2.1(16/2) (and allow the paragraph numbers to
change; all following paragraphs are either new, notes, or deleted):
Implementation Requirements
An Ada implementation shall accept Ada source code in UTF-8 encoding, with or
without a BOM (see A.4.11), where every character is represented by its
code point. The character pair Carriage Return/Line Feed (code points 16#0D#
16#0A#) signifies a single end of line (see 2.2); every other occurrence of a
format_effector other than the character whose code point position is 16#09#
also signifies a single end of line.
AARM Reason: This is simply requiring that an Ada implementation be able to
directly process the ACATS, which is provided in the described format. Note that
files that only contain characters with code points in the first 128 (which is
the majority of the ACATS) are represented in the same way in both UTF-8 and in
"plain" string format. The ACATS includes a BOM in files that have any
characters with code points greater than 127. Note that the BOM contains
characters not legal in Ada source code, so an implementation can use that to
automatically distinguish between files formatted as plain Latin-1 strings and
UTF-8 with BOM.
We allow line endings to be represented both as the pair CR LF (as in Windows
and the ACATS) and as single format_effector characters (usually LF, as in
Linux), so that files created by standard tools on most operating systems
will meet the standard format.
This requirement will increase portability by having a format that is accepted
by all Ada compilers. Note that implementations can support other source
representations, including structured representations like a parse tree.
End AARM Reason.
Delete Note 2.1(18) and AARM note 2.1(18.a.1/3).
[Editor's note: "code point" is as defined in ISO/IEC 10646; we mention this
fact in AARM 3.5.2(11.p/3) but not normatively. Formally, this is not necessary
(as we include the definitions of 10646 by reference), but some might find it
confusing.]
For (2), (3), and (4), no solution is recommended at this time.
For (5), modify A.4.7(29/2):
"For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and
Strings.Maps.Constants, and for {library} functions Strings.Hash,
Strings.Fixed.Hash, Strings.Bounded.Hash, [and] Strings.Unbounded.Hash{,
Strings.Hash_Case_Insensitive, Strings.Fixed.Hash_Case_Insensitive,
Strings.Bounded.Hash_Case_Insensitive, Strings.Unbounded.Hash_Case_Insensitive,
Strings.Equal_Case_Insensitive, Strings.Fixed.Equal_Case_Insensitive,
Strings.Bounded.Equal_Case_Insensitive,
Strings.Unbounded.Equal_Case_Insensitive}, the corresponding wide string package
{or function} has the same contents except that"
Same for A.4.8(29/2).
Replace A.4.10(3/3):
Returns True if the strings are the same, that is if they consist of the same
sequence of characters after applying locale-independent simple case folding,
as defined by documents referenced in the note in section 1 of ISO/IEC
10646:2011. Otherwise, returns False. This function uses the same method as
is used to determine whether two identifiers are the same.
AARM Note: For String, this is equivalent to converting to lower case and
comparing. Not so for other string types. For Wide_Strings and
Wide_Wide_Strings, note that this result is a more accurate comparison than
converting the strings to lower case and comparing the results; it is
possible that the lower case conversions are the same but this routine will
report the strings as different. Additionally, Unicode says that the result
of this function will never change for strings made up solely of defined code
points; there is no such guarantee for case conversion to lower case.
[Editor's note: Yes, I verified that simple case folding and convert to lower
case do the same thing for type String.]
!discussion
The implementation requirement for source code simply says that implementations
must directly accept the ACATS tests as input. This is already true of most Ada
implementations, so the requirement should have no effect on implementations
that already support Ada 2005. We are specifying it in the Standard only to
increase visibility and to provide a standard source form.
Note that an implementation which finds it difficult to meet this requirement
can depend upon the "impossible or impractical" exception to following the
standard (see 1.1.3).
The required form is more than the minimum needed to support the ACATS:
supporting only 7-bit ASCII and UTF-8 starting with a BOM, with line endings
always represented by CR LF pairs, would be enough for that. However, it seems
important to make the required format useful on Unix, Linux, and OS X, which
means at least supporting LF alone as a line ending. We do this with just a
small extra requirement beyond the rules of 2.2(2/3); the extra requirement
exists so that all implementations count lines in "standard" source files the
same way.
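As an illustration only (this is not part of the wording), the BOM check
mentioned in the AARM note above can be done with the BOM_8 constant of
Ada.Strings.UTF_Encoding (see A.4.11). The helper below is hypothetical; an
implementation would apply such a test to the first octets read from a source
file:

   with Ada.Strings.UTF_Encoding;  --  provides BOM_8 (see A.4.11)

   --  Hypothetical helper, not proposed wording: True if the given octets
   --  begin with the UTF-8 BOM (16#EF# 16#BB# 16#BF#).  These octets cannot
   --  begin legal Latin-1 Ada source, so their presence marks a UTF-8 file.
   function Starts_With_BOM (First_Octets : String) return Boolean is
      use Ada.Strings.UTF_Encoding;
   begin
      return First_Octets'Length >= BOM_8'Length
        and then First_Octets (First_Octets'First ..
                               First_Octets'First + BOM_8'Length - 1) = BOM_8;
   end Starts_With_BOM;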
There was an intent in Ada 2005 that implementations use some method to support
file and directory names in UTF-8. We did not develop specific advice at that
time because we wanted to see how implementations would develop solutions before
standardizing on existing practice.
Unfortunately, there is not a lot of practice to standardize. Implementers
report that few customers are requesting support for UTF-8 file names and
operations. In addition, a commonly used practice on some operating systems is
to simply use UTF-8 strings without any special support. This works fine on
systems (like many Unix variants) that don't restrict the upper 128 characters
allowed in file names. The names are reported back by implementations via Name
as well. This works so long as the names are not interpreted or acted on in any
way.
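For illustration, the "simply use UTF-8 strings" practice described above
amounts to encoding a Wide_String name as UTF-8 octets and passing the result
to the ordinary Name parameter. The sketch below assumes an implementation
(and file system) that passes those octets through uninterpreted; the file
name used is purely hypothetical:

   with Ada.Text_IO;
   with Ada.Strings.UTF_Encoding.Wide_Strings;

   procedure Open_Unicode_Name is
      use Ada.Strings.UTF_Encoding;
      File : Ada.Text_IO.File_Type;
      --  Encode the Wide_String name as UTF-8 octets; Open just sees a String.
      --  U+00FC is LATIN SMALL LETTER U WITH DIAERESIS ("Gruesse.txt").
      Name : constant UTF_8_String :=
        Wide_Strings.Encode ("Gr" & Wide_Character'Val (16#00FC#) & "sse.txt");
   begin
      Ada.Text_IO.Open (File, Ada.Text_IO.In_File, Name => Name);
      --  ... process the file ...
      Ada.Text_IO.Close (File);
   end Open_Unicode_Name;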
This approach does not work well for the operations in Ada.Directories (because
it generates the names and in addition has operations for composition and
decomposition of file names). And it does not work at all on Windows (which uses
a separate API for wide string operations). It also does not work on OSX (which
restricts the allowed characters in file names).
An earlier version of this AI recommended advice using BOMs to differentiate
UTF-8 strings from existing Latin-1 strings. This prevented compatibility
problems with existing code using Latin-1 file names (especially code that
composes or decomposes such strings); Latin-1 would be used whenever possible
and UTF-8 strings would be the unusual case.
However, this requires interpretation of the strings and thus breaks any code
that "simply uses UTF-8 strings" as described above. Some ARG members oppose
this option for this reason. In addition, it was felt that requiring a BOM is
unnatural and ignores Unicode recommendations.
We briefly considered options using a specified Form parameter, but that does
not work for routines that return a string (such as Name for files and Simple_Name
for directory operations), especially those where the file name may come from
non-Ada sources (like a directory search). In these cases, the representation
has to be compatible with existing practice -- which includes both encoded UTF-8
strings and Latin-1 strings.
We also considered the existing technique of adding overloaded routines for
these operations. That would take the form of Wide_ and Wide_Wide_ versions of
almost every routine that takes or returns a String in each file package and
directory operations. Besides triggering laughter and/or song from some ARG
members ("into the Wide_Wide_Open" - apologies to Tom Petty), the sheer number
of routines needed is obnoxious (20 Wide_ routines for Ada.Directories alone).
This was not a good solution in Ada 2005 (as it doesn't support UTF-8 or UTF-16
very well), and it is simply much worse in Ada 2012.
Some wild solutions were floated, involving defining String to be a UTF-8
string (meaning that any indexing or slicing code is very likely wrong, at least
in the margins -- a situation much like Strings with lower bounds other than 1
-- it rarely fails, but is wrong), or having a generalized Root_String'Class
(with semantics and representation separated).
Finally, we considered Implementation Advice that implementations do something
for UTF-8. But such advice provides no value to users, as it does not help them
create portable code (or even somewhat portable code) -- only
implementation-defined solutions could be used.
It was clear that there was no consensus on any solution, even for simple
Implementation Advice. We will need far more time than remains in Ada 2012 to
research and develop a proper solution. Moreover, this National Body comment is
the first formal feedback on this topic in many years, and implementers have not
reported any customer interest, so this topic has not been on any recent ARG
agenda. Thus we reluctantly adopted no solution for this problem.
Similar comments apply to Command_Line processing.
For Ada.Exceptions, we considered adding Wide_ and Wide_Wide_ versions of
Exception_Information, as this is supposed to include the exception name,
which is already available in Wide_ and Wide_Wide_ versions. But changing
Exception_Message is not necessary, as any string of bytes will be returned
unchanged. Moreover, any changes to Exception_Message to support UTF-8 or
Wide_Strings are likely to break techniques that encode binary information as
part of the Exception_Message; it is not unusual to see code using such
techniques. In addition, the syntax for raising an exception with a message
would become ambiguous if other string forms were allowed. The need to deal
with these problems compatibly requires more development time than we have
remaining for Ada 2012. We felt that the Exception_Information changes were
insufficiently valuable to do alone, and perhaps a better general solution will
be developed for Ada 2020 (making all of these Wide_ and Wide_Wide_ routines
obsolete).
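To illustrate the point that Exception_Message already passes any string of
bytes through, a program can carry a Unicode message today by UTF-8 encoding it
at the raise point and decoding it in the handler. This is only a sketch of
existing practice (with a hypothetical exception and procedure name), not
proposed wording:

   with Ada.Exceptions;
   with Ada.Strings.UTF_Encoding.Wide_Strings;
   with Ada.Wide_Text_IO;

   procedure Report_Failure (File_Name : Wide_String) is
      use Ada.Strings.UTF_Encoding;
      Processing_Error : exception;
   begin
      --  The message octets are UTF-8; Exception_Message returns them as is.
      raise Processing_Error with Wide_Strings.Encode (File_Name);
   exception
      when E : Processing_Error =>
         Ada.Wide_Text_IO.Put_Line
           (Wide_Strings.Decode (Ada.Exceptions.Exception_Message (E)));
   end Report_Failure;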
---
The original comment included an extremely misguided suggestion to provide a
case insensitive comparison routine for file names. But file name comparison is
not recommended on Windows (where it would be most useful), because the local
comparison convention applies to file names. That local convention may be
different on different file systems accessible from the local machine! Windows
does not provide an API to do such comparisons, and strongly recommends that they
be avoided. Moreover, any general Ada-provided routine would use the Unicode
definitions for case comparison, which are locale-independent and thus would
not exactly match those used by the file system.
Indeed, we considered adding such a routine in AI05-0049-1, and rejected doing
so for these and other reasons (detailed in that AI). Since nothing has changed
about the file systems on operating systems (especially Windows), nothing has
changed about the conclusions of that AI.
---
In answering the above, we noticed that we had defined Equal_Case_Insensitive
and Hash_Case_Insensitive for String but not Wide_String. We have rectified that
situation.
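As a usage illustration (assuming the new subprograms follow the existing
naming convention, that is Ada.Strings.Wide_Equal_Case_Insensitive and
Ada.Strings.Wide_Hash_Case_Insensitive), the pair can be used directly to
instantiate a case-insensitive container keyed by Wide_String:

   with Ada.Containers.Indefinite_Hashed_Maps;
   with Ada.Strings.Wide_Equal_Case_Insensitive;
   with Ada.Strings.Wide_Hash_Case_Insensitive;

   package Wide_Name_Maps is
      --  Keys that differ only by simple case folding denote the same entry.
      package Maps is new Ada.Containers.Indefinite_Hashed_Maps
        (Key_Type        => Wide_String,
         Element_Type    => Integer,
         Hash            => Ada.Strings.Wide_Hash_Case_Insensitive,
         Equivalent_Keys => Ada.Strings.Wide_Equal_Case_Insensitive);
   end Wide_Name_Maps;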
!corrigendum 2.1(16/2)
Insert after the paragraph:
In a nonstandard mode, the implementation may support a different character
repertoire; in particular, the set of characters that are considered
identifier_letters can be extended or changed to conform to local
conventions.
the new paragraph:
Implementation Requirements

An Ada implementation shall accept Ada source code in UTF-8 encoding, with or
without a BOM (see A.4.11), where every character is represented by its
code point. The character pair Carriage Return/Line Feed (code points 16#0D#
16#0A#) signifies a single end of line (see 2.2); every other occurrence of a
format_effector other than the character whose code point position is 16#09#
also signifies a single end of line.
*** !corrigendum A.4.7(29/2) TBD
!ACATS Test
The source representation requirement is partially tested simply by processing
the ACATS, since some tests are distributed in UTF-8 format. There would be
value in ignoring the ACATS rules in this case and creating C-Tests to check
that UTF-8 strings work in file I/O and directory operations, but
implementations can only be failed for incorrect implementations, not
non-existent ones.
!ASIS
No change needed.
!appendix
From: Robert Dewar
Sent: Sunday, November 6, 2011 6:48 AM
The RM has never been in the business of source representation, yet in practice
we understand that certain things are likely to work in practice.
Given the onslaught of complexity with Unicode, I think it would be helpful to
define a canonical representation format that all compilers must recognize. In
practice the ACATS defines such a format as lower case ASCII allowing upper half
Latin-1 graphics, and brackets notation for wide characters, but that's awfully
kludgy and suitable only for interchange not actual use.
I suggest we define a canonical representation in UTF-8 encoding that all
compilers must accept.
This could be an auxiliary standard.
****************************************************************
From: Tucker Taft
Sent: Sunday, November 6, 2011 10:20 AM
I would agree that we should start requiring support for UTF-8. It seems to
have emerged as the one true portable standard.
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 10:27 AM
OK, but it's not just "support", we need a mapping document that describes
precisely how Ada sources are expressed in canonical UTF-8 form. You may think
you know, taking the obvious expression, but the RM has nothing whatever to say
about this mapping.
Things like
Is a BOM allowed/required?
How are format effectors represented?
Is brackets notation still allowed?
And of course a basic statement that an A is represented as an A (the RM does
not say this!)
****************************************************************
From: Randy Brukardt
Sent: Monday, November 7, 2011 6:16 PM
> > I would agree that we should start requiring support for UTF-8. It
> > seems to have emerged as the one true portable standard.
We already had this discussion in terms of the ACATS. I already eliminated the
hated "brackets notation" in favor of UTF-8 formatted tests in ACATS 3.0, and
had I ever had enough funding such that I was able to develop/acquire some
character tests (like identifier equivalence using Greek and Cyrillic
characters), those tests would have been in UTF-8.
> OK, but it's not just "support", we need a mapping document that
> describes precisely how Ada sources are expressed in canonical
> UTF-8 form. You may think you know, taking the obvious expression, but
> the RM has nothing whatever to say about this mapping.
>
> Things like
>
> Is a BOM allowed/required?
> How are format effectors represented?
> Is brackets notation still allowed?
>
> And of course a basic statement that an A is represented as an A (the
> RM does not say
> this!)
The obvious "solution" is the one used by the ACATS, which is to use the
standard "Windows" format for the files. (This means in this case the output of
Notepad, if there is any confusion.) This means that (1) a BOM is required; (2)
line endings are <CR><LF>; other format effectors represent themselves (and only
occur in a couple of tests); (3) obviously a compiler can support anything it
likes, but it has no "official" standing (and in my personal opinion, it never
did; it was just an encoded format to be converted to something practical with a
provided tool).
We would need to write something like this up (I thought I had done so in the
ACATS documentation, but it seems that I only added a few small notes in 4.8 and
5.1.3). [The ACATS documentation is part of the ACATS, of course, I don't think
it is on-line anywhere, else I would have given a link.]
****************************************************************
From: Robert Dewar
Sent: Monday, November 7, 2011 6:38 PM
...
> We already had this discussion in terms of the ACATS. I already
> eliminated the hated "brackets notation" in favor of UTF-8 formatted
> tests in ACATS 3.0, and had I ever had enough funding such that I was
> able to develop/acquire some character tests (like identifier
> equivalence using Greek and Cyrillic characters), those tests would have been in UTF-8.
I would regard such tests as an abominable waste of time, reflecting my view
that case equivalence is an evil mistake in Ada 2005.
Note that "hated brackets encoding" had a real function early on of maing ACATS
tests transportable over a wide range of environments.
...
>> Things like
>>
>> Is a BOM allowed/required?
>> How are format effectors represented?
>> Is brackets notation still allowed?
>>
>> And of course a basic statement that an A is represented as an A (the
>> RM does not say this!)
>
> The obvious "solution" is the one used by the ACATS, which is to use
> the standard "Windows" format for the files. (This means in this case
> the output of Notepad, if there is any confusion.) This means that (1)
> a BOM is required; (2) line endings are<CR><LF>; other format
> effectors represent themselves (and only occur in a couple of tests);
> (3) obviously a compiler can support anything it likes, but it has no
> "official" standing (and in my personal opinion, it never did; it was
> just a encoded format to be converted to something practical with a provided
> tool).
I assume this is all in the framework of UTF-8 encoding
> We would need to write something like this up (I though I had done so
> in the ACATS documentation, but it seems that I only added a few small
> notes in 4.8 and 5.1.3). [The ACATS documentation is part of the
> ACATS, of course, I don't think it is on-line anywhere, else I would
> have give a link.]
I did not realize that ACATS specified UTF-8 encoding? In what context?
I know it decided to represent sources using Wide_Wide_Character or some such,
but that has nothing to do with source encoding.
****************************************************************
From: Randy Brukardt
Sent: Monday, November 7, 2011 7:40 PM
I'm not sure what you mean by "specified UTF-8 encoding". The ACATS doesn't
"specify" any encoding, but it is distributed with particular encodings
(described in the documentation), and virtually all Ada compilers choose to
support processing the ACATS tests directly. There are a number of tests in
ACATS 3.0 that are distributed in UTF-8 encoding (they have the ".au"
extension), all of the rest are distributed in 7-bit ASCII. And in both cases,
the files are formatted as for Windows (originally MS-DOS, which got the format
from CP/M, which got the format from some DEC OS...).
I know we discussed this here some years back (because otherwise I surely would
not have changed the distribution format).
****************************************************************
From: Robert Dewar
Sent: Monday, November 7, 2011 7:51 PM
OK, I understand, for interest do the UTF-8 files start with a BOM?
****************************************************************
From: Randy Brukardt
Sent: Monday, November 7, 2011 7:59 PM
Yes, the files start with a BOM. (I just went back and rechecked them with a hex
editor.)
I believe that I used Notepad to create the files (wanted the least common
denominator, and Notepad surely qualifies as "least" ;-).
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 6:55 AM
It's really pretty horrible to use VT in sources to end a line, this is an
ancient bow to old IBM line printers. I think we should define the use of this
format effector as obsolescent, and catch it using No_Obsolescent_Features.
Not sure about FF, it's certainly horrible to use it as a terminator for a
source line, but I have seen people use it in place of pragma Page. I think this
should probably also be considered obsolescent, but am not so concerned about
that one.
This is certainly not a vital issue!
****************************************************************
From: Tucker Taft
Sent: Sunday, November 6, 2011 10:14 AM
I see no harm in treating these as white space.
I think the bizarreness is treating these as line terminators, since no modern
operating system treats them as such, causing line numbers to mismatch between
Ada's line counting and the line counting of other tools.
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 10:22 AM
But you must treat them as line terminators in the logical sense, the RM insists
on this, that is, you must have SOME representation for VT and FF, of course
strictly it does not have to be the corresponding ASCII characters.
BTW, in GNAT, we distinguish between physical line terminators (like CR, LF, or
CR/LF), and logical line terminators (like FF and VT), precisely to avoid the
mismatch you refer to.
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 10:24 AM
It's interesting that for NEL (NEXT LINE, 16#85#) our decision in GNAT is to
treat this in 8-bit mode as a character that can appear freely in comments, but
not in program text.
The RM requires that you recognize an NEL as end of line, so you need some
representation for an NEL, we solve this in GNAT by saying that a NEL is only
recognized in UTF-8 encoding.
****************************************************************
From: Randy Brukardt
Sent: Wednesday, January 11, 2012 11:23 PM
We have received the following comment from Switzerland. I'm posting it here so
that we can discuss it, since we'll have to decide how we're going to respond to
it by the next meeting.
Following is the comment as I received it, the only difference being that I've
copied it into HTML. I'm hopeful that putting it into HTML will make it more
readable for most of us (in the original .DOC file, all I get for most of the
examples are lines of square boxes, so my making a PDF is not going to be
helpful). [Editor's note: It's been converted to plain ASCII here, which
probably will render the Unicode parts even more unreadable.]
I'll hold my comments for another message.
----------------------
The Unicode support in Ada 2012 is incomplete and inconsistent. We would like to
illustrate this with a hypothetical but nevertheless realistic example
application:
The program should sort files from one directory "input" into either of the two
directories "match" or "nomatch" based on whether the filename of the files is
listed in the textfile whose name is passed on the command line. The program
should do this for all files that match the optional wildcard file specification
on the command line. (e.g. "sortfiles matchlist *.ad?") The listfile shall be
treated as in the native encoding of the system unless it has an UTF BOM.
This sounds simple but actually cannot be implemented in Ada 2012 in a way that
it would work for all filenames - at least not on Windows and other operating
systems where filenames support Unicode.
The package Ada.Command_Line does not support Wide_Strings so what happens when
somebody would like to call "sortfiles ?? ?? *.txt"? The same problem applies to
the packages Directories and Environment_Variables.
The next problem would be to open the listfile. This cannot be done with Text_Io
because there is no Wide_String version (for parameter Name) of Open.
Reading the contents of the listfile is also a problem. Should Text_Io or
Wide_Text_Io be used? How will Wide_Text_Io interpret the file? (with or without
a BOM). If Text_Io has to be used to read the file in the native non-UTF system
encoding, how can the returned String be converted into Wide_String?
Ada.Strings.UTF_Encoding does not support this. Most programs that use
Wide_Strings are not purely Unicode - including the data and files they
handle - and therefore they will need conversion routines from and to the native
system (non-UTF) encoding.
To compare the filenames with the lines in the listfile, Wide_String versions in
Ada.Strings.Equal_Case_Insensitive would be needed.
In case of exceptions, it really makes sense to include some information about
what went wrong in the exception message. In the example application above, this
would be the Unicode filename of the file operation that fails. In other cases
this could be some (Unicode) input data that could not be parsed or an
enumeration literal. However neither Ada.Exceptions.Raise_Exception nor
Exception_Message support Wide_Strings.
Due to the fact that Exception_Information usually contains a stack trace and
Ada identifiers can be Unicode, Exception_Information needs to support
Wide_Strings as well. An exception in a "procedure ???(?? : String)" should
create a readable stack trace too.
The inability of standard Ada to fully support Unicode is a serious deficiency
whose importance should not be underestimated. Computers are no longer the sole
preserve of the western hemisphere and accordingly it is unacceptable for
companies to produce products that do not function correctly in every country in
the world or, for that matter, every country in the EU! Despite all the hard
work that has gone into continually improving Ada to make it more viable as the
implementation language of choice, the lack of Unicode support within standard
Ada could yet be a valid reason not to use Ada.
Note: A compiler should support Unicode filenames as well. A package ?? will
most likely be stored in a source file ??.adb. It would probably make sense to
make it mandatory for all Ada 2012 compilers to accept source files in UTF-8
encoding (including an UTF-8 BOM). Otherwise it would be difficult to create
portable source files.
****************************************************************
From: Robert Dewar
Sent: Thursday, January 12, 2012 7:24 AM
I am opposed to doing anything in the context of the current work to respond to
this comment. We have already spent too much effort both at the definition and
implementation level on character set internationalization. Speaking from
AdaCore's point of view, we have not had a single enhancement request or
suggestion, let alone a defect report in this area.
I really think that any further work here should be user based and not language
designer based.
I think it fine to form a specific group to investigate what needs to be done
and gather this user input, but it would be a mistake to rush to add anything to
the current amendment.
****************************************************************
From: Randy Brukardt
Sent: Thursday, January 12, 2012 12:23 AM
First, some general comments:
(1) This comment is really way too late. The sort of massive overhaul of the
standard suggested here would require quite a lot of additional work:
determining what to do, writing wording for it, editing it, and so on. We'd
have to delay the standard at least another year in order to do that.
(2) The ARG has considered most of these issues in the past. In particular, we
discussed Unicode file names, and decided that the name pollution of
"Wide_Open", "Wide_Create", and on and on and on was just over the top.
Instead, we preferred that implementations supported UTF-8 file names.
Indeed, we designed Ada.Strings.UTF_Encoding such that it could be used
easily for this purpose.
There probably is a reasonable comment that this intent is not communicated
very well; I don't think there are even any AARM notes documenting this
intent. Perhaps we should add a bit of Implementation Advice about using
UTF-8 encoding if that is needed on the target system. The same advice could
apply to Exception Message, Command Line, and the other functions noted by
Switzerland.
(3) The entire "Wide_Wide_Wide_Wide_" "solution" is just awful, and making it
worse is not reasonable. We really need to rethink the entire area of string
processing to see if there is some way to decouple encoding from semantics.
Doing so is going to require a lot of work and thought. The alternative is
to continue to make a bigger and bigger mess, especially as we have to
support combinations in packages (Wide filenames in regular Text_IO, regular
filenames in Wide_Text_IO, and all of the other combinations.) But coming up
with an alternative, and getting buy-in, is not possible for Ada 2012; it
has to wait for Ada 2020 because it will take that long.
And a couple of specific comments:
> To compare the filenames with the lines in the listfile, Wide_String versions
> in Ada.Strings.Equal_Case_Insensitive would be needed.
The author is thoroughly confused if they think this is even a good idea, or
possible in general. Microsoft's Windows programming advice is to *never*
compare filenames for equality, and specifically to never do so with the local
machine's case insensitive equality. The reason is that these things depend on
the locale of the local machine and of the file system, which don't need to be
the same. Moreover, the Windows idea of filename equality is likely to be
different than Unicode string equality (which is what I would hope that
Equal_Case_Insensitive uses, since it has nothing to do with file names). And
which Unicode version of string equality is he talking about: "Full" case
folding, "Simple" case folding, or something else? There is no single answer
that is right for all uses, or even for many uses.
Processing of file names can only be done via the file system; the use of
general string routines will always lead to problems in corner cases. (On
networks with multiple operating systems, it's possible that even case
sensitivity will vary depending on the path in use.)
> It would probably make sense to make it mandatory for all Ada 2012 compilers
> to accept source files in UTF-8 encoding (including an UTF-8 BOM).
This is already de-facto required of Ada 2005 implementations, because all Ada
compilers have to process the ACATS, and ACATS 3.0 (and later) include some
files encoded in UTF-8. So that is already true of Ada 2012 implementations as
well, and in fact this is mentioned in the AARM. Requiring a standard source
file format in Ada Standard might be more problematic, as it would require
specifying the exact behavior of line-end combinations among other things, which
Ada has always stayed away from. In any case, Robert Dewar suggested this months
ago, and there will be at least an AI12 on the topic. It seems like a morass to
get into at this late point, however. (Robert's concern about the handling of VT
and FF is just the tip of the iceberg there, and it would be really easy to
introduce a gratuitous incompatibility.)
****************************************************************
From: Tucker Taft
Sent: Thursday, January 12, 2012 8:19 AM
I like your suggestion that we make explicit the recommendation that UTF-8
strings be used by the implementation where Standard.String is specified for
most interactions with the file system, identifier names, command lines, etc.
This is similar to our recommendation in ASIS that Wide_String be interpreted as
UTF-16 in many cases.
I would remove the "way too late" comment. It doesn't seem helpful. Many
delegations don't really get together to do a review until they are forced to do
so. I would rather emphasize (as you do) the complexity of the problem, rather
than the lateness of the comment.
****************************************************************
From: Robert Dewar
Sent: Thursday, January 12, 2012 8:37 AM
> I like your suggestion that we make explicit the recommendation that
> UTF-8 strings be used by the implementation where Standard.String is
> specified for most interactions with the file system, identifier
> names, command lines, etc. This is similar to our recommendation in
> ASIS that Wide_String be interpreted as UTF-16 in many cases.
As implementation advice, I would have no objection
>
> I would remove the "way too late" comment. It doesn't seem helpful.
> Many delegations don't really get together to do a review until they
> are forced to do so. I would rather emphasize (as you do) the
> complexity of the problem, rather than the lateness of the comment.
Implementation advice is a good way to handle the source representation issue as
well, and meets at least one of the delegations suggestions.
****************************************************************
From: Randy Brukardt
Sent: Thursday, January 12, 2012 1:44 PM
> I like your suggestion that we make explicit the recommendation that
> UTF-8 strings be used by the implementation where Standard.String is
> specified for most interactions with the file system, identifier
> names, command lines, etc. This is similar to our recommendation in
> ASIS that Wide_String be interpreted as UTF-16 in many cases.
Good. I note that Robert does as well, so I'll write up a draft this way for
discussion in Houston.
> I would remove the "way too late" comment. It doesn't seem helpful.
> Many delegations don't really get together to do a review until they
> are forced to do so. I would rather emphasize (as you do) the
> complexity of the problem, rather than the lateness of the comment.
This is just a matter of emphasis, isn't it? I said (or meant at least) "it's
way too late because the problem is complex and a lot of changes would be
necessary". You are saying "the problem is complex and a lot of changes would be
necessary, so it is too hard to address in the time remaining". Not much
difference in facts there, just wording.
After all, just because a problem is complex doesn't mean that the ARG shouldn't
try to solve it. The important point is that we don't think it is a good idea to
try to solve it in the time remaining, because we are as likely to get it wrong as
right. And wrong won't help anyone, just make the language more complex.
****************************************************************
From: Tucker Taft
Sent: Thursday, January 12, 2012 2:24 PM
> This is just a matter of emphasis, isn't it?...
Yes, but it seems important as far as being appropriately "responsive" to the
various delegations. I just think we can communicate our reasons without
sounding schoolmarmish.
****************************************************************
From: Randy Brukardt
Sent: Thursday, January 12, 2012 3:02 PM
I'm happy to leave the "political correctness" to others who are better suited
for that than I, I just wanted to make sure that we really agree on the
fundamentals of this topic before I spend several hours crafting an AI.
****************************************************************
From: Ben Brosgol
Sent: Thursday, January 12, 2012 3:17 PM
Don't confuse politeness (a social skill) with political correctness (a
contrived way of expressing things to avoid any possibility of causing offense).
For a blunter definition of "political correctness", see
http://www.urbandictionary.com/define.php?term=politically%20correct
****************************************************************
From: Randy Brukardt
Sent: Friday, February 10, 2012 11:20 PM
Robert Dewar wrote:
> It's really pretty horrible to use VT in sources to end a line, this
> is an ancient bow to old IBM line printers.
> I think we should define the use of this format effector as
> obsolescent, and catch it using No_Obsolescent_Features.
>
> Not sure about FF, it's certainly horrible to use it as a terminator
> for a source line, but I have seen people use it in place of pragma
> Page. I think this should probably also be considered obsolescent, but
> am not so concerned about that one.
>
> This is certainly not a vital issue!
Tucker replied:
> I see no harm in treating these as white space.
> I think the bizarreness is treating these as line terminators, since
> no modern operating system treats them as such, causing line numbers
> to mismatch between Ada's line counting and the line counting of other
> tools.
I would inject a mild note of caution in terms of FF. One could argue that it
makes sense for the interpretation of sources to match the implementation's
Text_IO (so that Ada programs can write source text). If the programmer calls
Text_IO.New_Page, they're probably going to get an FF in their file (that
happens with most of the Ada compilers that I've used). Similarly, reading an FF
will cause the end of a line if it is not already ended (although Text_IO will
probably not write such a file).
I don't give a darn about VT, though, other than to note that there is a
compatibility problem to making a change. (But it is miniscule...)
Robert replied:
> But you must treat them as line terminators in the logical sense, the
> RM insists on this, that is, you must have SOME representation for VT
> and FF, of course strictly it does not have to be the corresponding
> ASCII characters.
The notion that the Standard somehow requires having some representation for
every possible character in every source form is laughable in my view. The
implication that this is required only appears in the AARM and only in a single
note. There is absolutely nothing normative about such a "requirement". It makes
about as much sense as requiring that an Ada compiler only run on a machine with
a two button mouse! A given source format will represent whatever characters it
can (or desires), and that is it.
However, with the proposed introduction of Implementation Advice that compilers
accept UTF-8 encoded files, where every character is represented by its code
point, this becomes more important. If such a UTF-8 file contains a VT
character, then the standard requires it to be treated as a line terminator.
Period. Treating it as white space would require a non-standard mode (where the
"canonical representation" was interpreted other than as recommended by the
standard), or of course ignoring the IA completely. That seems bad if existing
compilers are doing something else with the character.
I'm not sure what the right answer is here. We could add an Implementation
Permission that VT and FF represent 0 line terminators, or just do that for VT
(assuming FF is used in Text_IO files), or say something about Text_IO, or
something else. (We don't need anything to allow <LF><FF> to be treated as a
single line terminator - 2.2(2/3) already says this). For Janus/Ada, I'd
probably not make any change here (the only time I've ever seen a VT in a text
file is in the ACATS test for this character, so I think it is essentially
irrelevant as to how it's handled, and for FF the same handling as Text_IO seems
right), and I'd rather not be forced to do so.
****************************************************************
From: Bob Duff
Sent: Saturday, February 11, 2012 9:03 AM
> The notion that the Standard somehow requires having some
> representation for every possible character in every source form is laughable in my view.
Not sure what you mean by "laughable", but formally speaking, OF COURSE an
implementation must support all the characters in the character set the standard
requires. Refusing to compile programs containing VT would be just as
nonconforming as refusing to compile programs containing the letter "A".
Practically speaking, on the other hand, I agree with "I don't give a darn about
VT". But if it never occurs in Ada programs (other than ACVC tests), they
there's no reason to change the rules.
>...The
> implication that this is required only appears in the AARM and only in
>a single note. There is absolutely nothing normative about such a
>"requirement".
I disagree. Implementations can't just make up legality rules.
> I'm not sure that the right answer is here. We could add an
> Implementation Permission ...
See what I said the other day about Implementation Permissions.
I say, insufficiently broken. And it introduces an incompatibility:
if a source contains "-- blah<FF> X := X + 1;" the suggested change will
comment-out the assignment statement. Not likely to occur, but pretty nasty if
it does.
****************************************************************
From: Robert Dewar
Sent: Sunday, February 12, 2012 10:10 AM
>> The notion that the Standard somehow requires having some
>> representation for every possible character in every source form is laughable in my view.
>
> Not sure what you mean by "laughable", but formally speaking, OF
> COURSE an implementation must support all the characters in the
> character set the standard requires. Refusing to compile programs
> containing VT would be just as nonconforming as refusing to compile
> programs containing the letter "A".
I 100% agree with Bob on this, and do not know where Randy is coming from.
I agree we could have impl permission to ignore VT, but I really think any
change to the handling of FF would generate gratuitous incompatibilities in
existing programs, where the use of FF to get new pages in listings is not
uncommon.
> Practically speaking, on the other hand, I agree with "I don't give a
> darn about VT". But if it never occurs in Ada programs (other than
> ACVC tests), they there's no reason to change the rules.
Right, changing the rules does not help existing implementations after all, it
makes extra work!
>> ...The
>> implication that this is required only appears in the AARM and only
>> in a single note. There is absolutely nothing normative about such a
>> "requirement".
>
> I disagree. Implementations can't just make up legality rules.
Yes, exactly
>> I'm not sure that the right answer is here. We could add an
>> Implementation Permission ...
>
> See what I said the other day about Implementation Permissions.
>
> I say, insufficiently broken. And it introduces an incompatibility:
> if a source contains "-- blah<FF> X := X + 1;" the suggested
> change will comment-out the assignment statement. Not likely to
> occur, but pretty nasty if it does.
Yes, exactly
Let's do nothing here, no reason to make a change, not sufficiently broken!
****************************************************************
From: Randy Brukardt
Sent: Monday, February 11, 2012 2:19 PM
> > The notion that the Standard somehow requires having some
> > representation for every possible character in every source
> form is laughable in my view.
>
> Not sure what you mean by "laughable", but formally speaking, OF
> COURSE an implementation must support all the characters in the
> character set the standard requires. Refusing to compile programs
> containing VT would be just as nonconforming as refusing to compile
> programs containing the letter "A".
But this has *nothing* to do with the representation of the source. What I was
saying is that a source representation does not necessarily have to have a
representation for VT (or PI or euro sign or any other character). I think it is
laughable to think that it ought to.
I definitely agree with if the source representation *does* have a
representation for VT, then it has to follow the standard associated with that
character.
> Practically speaking, on the other hand, I agree with "I don't give a
> darn about VT". But if it never occurs in Ada programs (other than
> ACVC tests), they there's no reason to change the rules.
>
> >...The
> > implication that this is required only appears in the AARM and only
> >in a single note. There is absolutely nothing normative about such a
> >"requirement".
>
> I disagree. Implementations can't just make up legality rules.
I never said anything about *making up legality rules*. And I surely was not
considering *rejecting* programs containing VT. However, I think it would be
perfectly OK if 16#0B# happened to be interpreted as a space in some source
representation; there cannot be a requirement on *all* source representations.
More generally, there is almost no requirement that any particular character be
representable in a particular source form. Ada 83 made this clear, by trying to
accommodate keypunch programs (which was archaic even back in 1980). Pretty much
the only requirement is for the digits, letters, space, some line ending, and
the delimiters defined in 2.2 (minus the allowed replacements). Anything else is
optional. (One could imagine having some $ notation [square brackets not being
in the 64 characters of the Unisys keypunches I used in the Paleozoic era of
computing] for additional characters, but that is not helpful unless the tools
also support it. Otherwise, they're just unintelligible gibberish in the text,
making it much harder to read and understand.)
The only indication to the contrary is the second sentence of 2.1(18.a/2), and
it does not follow from any normative requirements (there is no requirement or
need in Ada to translate *back* from the standard characters of an internal
compiler representation to Ada source). IMHO, that sentence is complete fantasy.
Anyway, this will become irrelevant if we adopt the Implementation Advice for a
standard source form, since that form will contain all of the "standard"
characters. It will still be optional (of course) to support this form, but
implementers that don't support it will have to explain themselves. (Which is
easy to do, at least in my case: no one has asked.)
****************************************************************
From: Randy Brukardt
Sent: Monday, February 11, 2012 2:27 PM
...
> > I say, insufficiently broken. And it introduces an incompatibility:
> > if a source contains "-- blah<FF> X := X + 1;" the suggested
> > change will comment-out the assignment statement. Not likely to
> > occur, but pretty nasty if it does.
>
> Yes, exactly
>
> Let's do nothing here, no reason to make a change, not sufficiently
> broken!
I personally agree with this, there is no important reason for a change.
However, someone posting using "Robert Dewar"s name back in November seemed to
think otherwise (call this person Robert Dewar #1):
> It's really pretty horrible to use VT in sources to end a line, this
> is an ancient bow to old IBM line printers. I think we should define
> the use of this
> format effector as obsolescent, and catch it using No_Obsolescent_Features.
>
> Not sure about FF, it's certainly horrible to use it as a terminator
> for a source line, but I have seen people use it in place of pragma
> Page. I think this
> should probably also be considered obsolescent, but am not so
> concerned about that one.
Tucker jumped in to agree (saying that these both should be interpreted as a
space), and then the topic dropped.
I was prepared to ignore this thought forever, but when we decided to put an
Implementation Advice for a standard UTF-8 source format on the agenda for the
upcoming meeting (as a partial response to the Swiss comment), this seemed to be
more important. After all, in that standard format, every character represents
itself (I included wording to say that, as pointed out by Robert in a different
thread), and that surely includes VT.
So Robert #1 wants a change to the handling of VT, and Robert #2 does not. Not
sure which Robert to pay attention to! Note that this is pretty much our last
chance to make any changes here; once the standard format is in use, changing
its interpretation would be too incompatible to contemplate.
****************************************************************
From: Robert Dewar
Sent: Monday, February 13, 2012 2:31 PM
> Tucker jumped in to agree (saying that these both should be
> interpreted as a space), and then the topic dropped.
Well there is nothing inconsistent between thinking something should be fixed or changed, and deciding that it is not worth the trouble!
>
> I was prepared to ignore this thought forever, but when we decided to
> put an Implementation Advice for a standard UTF-8 source format on the
> agenda for the upcoming meeting (as a partial response to the Swiss
> comment), this seemed to be more important. After all, in that
> standard format, every character represents itself (I included wording
> to say that, as pointed out by Robert in a different thread), and that surely includes VT.
>
> So Robert #1 wants a change to the handling of VT, and Robert #2 does not.
> Not sure which Robert to pay attention to! Note that this is pretty
> much our last chance to make any changes here; once the standard
> format is in use, changing its interpretation would be too incompatible to
> contemplate.
Leave VT as is, insufficiently broken to be worth fixing
****************************************************************
From: Robert Dewar
Sent: Monday, February 13, 2012 2:32 PM
> But this has *nothing* to do with the representation of the source.
> What I was saying is that a source representation does not necessarily
> have to have a representation for VT (or PI or euro sign or any other
> character). I think it is laughable to think that it ought to.
I find this incomprehensible. Of course the source representation must allow all
characters to be represented. As Bob Duff says, refusing to have a
representation for VT would be equivalent to refusing to have a representation for
'a'. There is no distinction.
I am completely non-plussed by Randy's "laughable" view here ????
> Anyway, this will become irrelevant if we adopt the Implementation
> Advice for a standard source form, since that form will contain all of
> the "standard" characters. It will still be optional (of course) to
> support this form, but implementers that don't support it will have to
> explain themselves. (Which is easy to do, at least in my case: no one
> has asked.)
Actually you have a positive requirement to document all failure to follow IA,
whether you are asked or not.
****************************************************************
From: Bob Duff
Sent: Monday, February 13, 2012 3:10 PM
> The only indication to the contrary is the second sentence of
> 2.1(18.a/2),
I'm completely mystified -- I must be totally misunderstanding what you mean.
> Anyway, this will become irrelevant if we adopt the Implementation
> Advice
OK, if it's irrelevant, I won't bother arguing about it. I object to "fixing"
2.1(18.a/2), and I object to adding any normative text that tries to say what
2.1(18.a/2) is saying. If you are not proposing to do either of those things,
then I'll drop the matter. Otherwise, I'll answer in more detail.
****************************************************************
From: Randy Brukardt
Sent: Monday, February 13, 2012 3:16 PM
> > But this has *nothing* to do with the representation of the source.
> > What I was saying is that a source representation does not
> > necessarily have to have a representation for VT (or PI or euro sign
> > or any other character). I think it is laughable to think that it ought to.
>
> I find this incomprehensible. Of course the source representation must
> allow all characters to be represented.
> As Bob Duff says, refusing to have a representation for VT would be
> equivalent to refusing to have a represention for 'a'. There is no
> distinction.
Why? There is nothing in the Standard that requires that. It requires an
interpretation for each character that appears in the source, but it cannot say
anything about which characters can appear in any particular source. How could
it? So why do we care which characters can appear? It's actively harmful to
include hacks like "brackets notation" in source to meet such a non-requirement
in straight 8-bit formats -- to take one such example.
And I agree with you: there is no distinction! It's perfectly OK to not allow
'a' (so long as 'A' is allowed). And indeed, the only reason for saying that you
need either 'a' or 'A' is one of practicality: you can't write useful Ada
programs without the various reserved words including 'A'.
> I am completely non-plussed by Randy's "laughable" view here ????
I'm completely flabbergasted that anyone would think that there is any
requirement or value to a requirement otherwise. Moreover, in the absence of a
customer requirement, why should any Ada implementer spend time on this (in any
way)?
Anyway, this is probably going to be irrelevant down the line, so it probably
does not need to be resolved.
> > Anyway, this will become irrelevant if we adopt the Implementation
> > Advice for a standard source form, since that form will contain all
> > of the "standard" characters. It will still be optional (of course)
> > to support this form, but implementers that don't support it will
> > have to explain themselves. (Which is easy to do, at least in my
> > case: no one has asked.)
>
> Actually you have a positive requirement to document all failure to
> follow IA, whether you are asked or not.
Sorry, you misunderstood: that is what my documentation would say: "We didn't
implement UTF-8 formats, because no one has asked for support for identifiers
and string literals with characters other than those in Latin-1."
****************************************************************
From: Randy Brukardt
Sent: Monday, February 13, 2012 3:23 PM
> > The only indication to the contrary is the second sentence of
> > 2.1(18.a/2),
>
> I'm completely mystified -- I must be totally misunderstanding what
> you mean.
Example: If you have an Ada source in some 6-bit character format (say the old
keypunch), does it have to have some mechanism to represent other characters
than those naturally present in that format. I say no, it would be harmful as
the meaning would be inconsistent to what "normal" tools for that format would
expect.
> > Anyway, this will become irrelevant if we adopt the Implementation
> > Advice
>
> OK, if it's irrelevant, I won't bother arguing about it.
> I object to "fixing" 2.1(18.a/2), and I object to adding any normative
> text that tries to say what 2.1(18.a/2) is saying.
> If you are not proposing to do either of those things, then I'll drop
> the matter. Otherwise, I'll answer in more detail.
The Implementation Advice would require a UTF-8 format where every code point
represents the associated character. Thus it renders 2.1(18.a/2) essentially
irrelevant, as any implementation that follows the advice would trivially meet
the requirement. And any implementation that doesn't would do so for good (and
documented) reasons, and it would seem silly to care beyond that (let the market
decide).
I would suggest deleting that AARM note, along with the associated RM note, if
the advice is added -- but it is not clear-cut and we'll have to discuss this in
Houston.
****************************************************************
From: Bob Duff
Sent: Monday, February 13, 2012 3:39 PM
> I would suggest deleting that AARM note, along with the associated RM
> note, if the advice is added -- but it is not clear-cut and we'll have
> to discuss this in Houston.
OK.
I don't object to deleting 2.1(18). And if we do that, then I don't object to
deleting the following AARM annotations.
The purpose of 2.1(18.a/2) was to explain 2.1(18). People would say things
like, "What stops an impl from saying the source rep is FORTRAN, and thereby
pass off a FORTRAN compiler as a conforming Ada impl." The answer is: you can
only do that if you can explain the mapping FORTRAN<-->Ada, which ain't likely.
;-)
****************************************************************
From: Robert Dewar
Sent: Monday, February 13, 2012 5:42 PM
> Why? There is nothing in the Standard that requires that. It requires
> an interpretation for each character that appears in the source, but
> it cannot say anything about which characters can appear in any
> particular source. How could it? So why do we care which characters
> can appear? It's actively harmful to include hacks like "brackets
> notation" in source to meet such a non-requirement in straight 8-bit formats
> -- to take one such example.
This would say that you regard almost any string literal as non-portable. I find
that ludicrous.
> I'm completely flabbergasted that anyone would think that there is any
> requirement or value to a requirement otherwise. Moreover, in the
> absence of a customer requirement, why should any Ada implementer
> spend time on this (in any way)?
Because the standard specifies the abstract character set that must be accepted.
****************************************************************
From: Robert Dewar
Sent: Monday, February 13, 2012 5:42 PM
> Example: If you have an Ada source in some 6-bit character format (say
> the old keypunch), does it have to have some mechanism to represent
> other characters than those naturally present in that format. I say
> no, it would be harmful as the meaning would be inconsistent to what
> "normal" tools for that format would expect.
Yes, of COURSE it does!
****************************************************************
From: Randy Brukardt
Sent: Monday, February 13, 2012 7:03 PM
> > Why? There is nothing in the Standard that requires that. It
> > requires an interpretation for each character that appears in the
> > source, but it cannot say anything about which characters can appear
> > in any particular source. How could it? So why do we care which
> > characters can appear? It's actively harmful to include hacks like
> > "brackets notation" in source to meet such a non-requirement in
> > straight 8-bit formats -- to take one such example.
>
> This would say that you regard almost any string literal as
> non-portable. I find that ludicrous.
Yes, of course. More generally, Ada (83-95-2005) has nothing to say about source
formats, so by definition there is no portability of Ada source. And there
surely is no requirement *in the Standard* that you can convert from one source
format to another. Indeed, I've always considered this a major hole in Ada's
definition, I'd rather have the standard clearly define this one way or another.
As a practical matter, of course, all Ada compilers support processing the
ACATS, so there is in fact a common interchange format. But with a handful of
exceptions, that only requires 7-bit ASCII support, so if you are using anything
else, it's at least potentially non-portable. And if you use the conversion
tools provided by the target system, you're probably going to lose information.
> > I'm completely flabbergasted that anyone would think that there is
> > any requirement or value to a requirement otherwise. Moreover, in
> > the absence of a customer requirement, why should any Ada
> > implementer spend time on this (in any way)?
>
> Because the standard specifies the abstract character set that must be
> accepted.
Not at all, it defines the handling of each character that *might* appear in Ada
source. It never says anything *requiring* that you can actually write those
characters (and I'm not sure that it can). Please find me *any* text that says
the compiler *must* accept source containing the PI character (to take one
example).
Anyway, we can clearly defuse this question by simply putting in the Standard
that processing UTF-8 is required. And even without *requiring* that, simply
recommending it will definitely reduce the situation (any implementation
following the recommendation will have a clear, common format for Ada source
code). I'd actually be in favor of requiring it, even though that would make
Janus/Ada non-compliant in this area. The only reason for not doing that IMHO is
to avoid making work to implementers for which they have no customer demand.
(And if everyone agrees with you, then there cannot be much actual work involved
for other implementations.)
****************************************************************
Questions? Ask the ACAA Technical Agent