Version 1.1 of ai05s/ai05-0286-1.txt
!standard 2.1(16/2) 12-02-10 AI05-0286-1/01
!standard 11.4.1(19)
!standard A.3.5(0)
!standard A.7(14)
!standard A.15(21)
!standard A.16(126/2)
!class Amendment 12-02-10
!status Amendment 2012 12-02-10
!status work item 12-02-10
!status received 11-10-01
!priority Medium
!difficulty Medium
!subject Internationalization of Ada
!summary
Implementation Advice is added to recommend that Ada compilers directly support
source code in UTF-8 encoding.
Implementation Advice is added to recommend that file and directory operations,
exception information, and the command line accept UTF-8 encoded input and
output.
!proposal
Full support for Unicode characters is becoming increasingly important. Ada 2005
added support for international identifiers in Ada programs, yet Unicode support
in Ada is still incomplete.
We recommend that Ada adopt some solution so that:
(1) Compilers are required to support Unicode characters in source form, by
requiring some form of standard source representation (presumably UTF-8);
(2) File and directory operations should support Unicode characters (presuming
that the target file system does so);
(3) Exception messages and exception information should support Unicode
characters;
(4) Command lines should support Unicode characters
(presuming that the target system allows these).
[Editor's note: The Swiss comment ends here. See also the discussion section.]
(5) The use of VT and FF characters to represent line endings is obsolescent
(especially for VT). Something should be done here.
(6) Simple case folding should be provided as an operation in
Ada.Characters.Handling, so that case insensitive comparisons (as opposed to
case conversions) of strings can be accomplished.
!wording
For (1), add the following after 2.1(16/2):
Implementation Advice
An Ada implementation should accept Ada source code in UTF-8 encoding, with or
without a BOM (see A.4.11), where line endings are marked by the pair Carriage
Return/Line Feed (16#0D# 16#0A#) and every other character is represented by its
code point.
AARM Reason: This is simply recommending that an Ada implementation be able to
directly process the ACATS, which is provided in the described format. Note that
files that only contain characters with code points in the first 128 (which is
the majority of the ACATS) are represented in the same way in both UTF-8 and in
"plain" string format. The ACATS includes a BOM in files that have any
characters with code points greater than 127. Note that the BOM contains
characters not legal in Ada source code, so an implementation can use that to
automatically distinguish between files formatted as plain Latin-1 strings and
UTF-8 with BOM.
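The detection technique described in the AARM note can be sketched as follows.
This is only an illustrative, language-neutral sketch in Python (the function
name is invented and is not part of any Ada implementation): a front end reads
the raw bytes, and uses the presence of the UTF-8 BOM (bytes 16#EF# 16#BB#
16#BF#) to choose between UTF-8 and "plain" Latin-1 decoding.

```python
# Sketch: use the UTF-8 BOM to decide how to decode a source file, as
# the AARM note above suggests. Illustrative only; decode_source is an
# invented name, not part of any real compiler.
BOM_UTF8 = b"\xef\xbb\xbf"

def decode_source(raw: bytes) -> str:
    """Decode an Ada source file: UTF-8 if it starts with a BOM,
    otherwise treat the bytes as "plain" Latin-1 characters."""
    if raw.startswith(BOM_UTF8):
        return raw[len(BOM_UTF8):].decode("utf-8")
    return raw.decode("latin-1")
```

Note that for files containing only code points below 128 (the majority of the
ACATS), both branches produce the same result, which is why no BOM is needed
there.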
Delete AARM note 2.1(18.a.1/3).
[Editor's note: I do not know what to do with the Note 2.1(18) and the
associated AARM note. This is still strictly true, because the language only
recommends (as opposed to specify) a format. OTOH, it seems misleading to
me. My preference is to delete it and move a modified version of the AARM note
onto this new Implementation Advice.]
[Editor's note: "code point" is as defined in ISO/IEC 10646; we mention this fact in AARM 3.5.2(11.p/3)
but not normatively. Formally, this is not necessary (as we include the definitions of 10646
by reference), but some might find it confusing.]
For (2):
Add after A.7(14):
Implementation Advice
Subprograms that accept the name or form of an external file should
allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11)
if the target file system allows characters with code points greater than 255 in
names. Functions that return the name or form of an external file should return a
UTF-8 encoded string starting with a BOM if and only if the result includes
characters with code points greater than 255.
Add after A.16(126/2):
Subprograms in the package Directories and its children that accept
a string should allow the use of UTF-8 encoded strings that start with a BOM
(see A.4.11) if the target file system allows characters with code points
greater than 255 in any part of a full name. Functions in the package
Directories and its children that return a string should return a UTF-8 encoded
string starting with a BOM if and only if the result includes characters with
code points greater than 255.
For (3):
Add after 11.4.1(19):
Functions Exception_Message and Exception_Information should return a UTF-8
encoded string starting with a BOM (see A.4.11) if and only if the result
includes characters with code points greater than 255.
AARM Discussion: Since all of the routines that raise and set the exception
message take a string but do not interpret it, we need to say nothing to
allow passing UTF-8 encoded strings with a BOM. Since storing encoded data in
this string is a common programming idiom, implementations should not modify any exception
message string unless it starts with a BOM and does not contain any characters
with code points greater than 255.
For (4):
Add after A.15(21):
Implementation Advice
Functions Argument and Command_Name should return a UTF-8 encoded string
starting with a BOM (see A.4.11) if and only if the result includes characters
with code points greater than 255.
[Q: Should we do this for Environment_Variables as well? I think not; it's not
necessary (you can always put a UTF-8 encoded string there and get it back out
without any language discussion).]
For (5):
*** TBD *** [Presumably an Implementation Permission in 2.2?]
For (6), add after A.3.5(21/3):
function Equal_Case_Insensitive (Left, Right : Wide_String) return Boolean;
Add after A.3.5(61/3):
function Equal_Case_Insensitive (Left, Right : Wide_String) return Boolean;
Returns True if the strings are the same, that is, if they consist of the same
sequence of characters after applying locale-independent simple case folding,
as defined by documents referenced in the note in section 1 of ISO/IEC
10646:2011. Otherwise, returns False. This function uses the same method
as is used to determine whether two identifiers are the same. Note that
this is a more accurate comparison than converting the strings to
upper case and comparing the results; it is possible for the upper-case
conversions to be the same even though this routine reports the strings as
different.
[Editor's note: Should the last sentence be a user note or an AARM note instead?]
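The difference described in the last sentence of the wording can be seen with
LATIN SMALL LETTER SHARP S. The Python sketch below uses per-character
lower-casing as a crude stand-in for Unicode simple case folding (the real
mapping comes from the Unicode CaseFolding.txt data, statuses C and S); it is
an approximation for illustration only, not the folding an implementation
would use.

```python
def simple_fold(s: str) -> str:
    # Crude per-character stand-in for Unicode *simple* case folding.
    # Simple folding maps each character independently, so "ß" (which
    # only has a *full* folding to "ss") is left unchanged.
    return "".join(c.lower() for c in s)

# Full upper-casing maps "ß" to "SS", so an upper-case comparison says
# "ß" and "ss" match ...
print("ß".upper() == "ss".upper())            # True: both become "SS"
# ... but simple folding keeps "ß" as one character, so a fold-based
# Equal_Case_Insensitive reports the strings as different.
print(simple_fold("ß") == simple_fold("ss"))  # False
```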
!discussion
The Implementation Advice for source code simply recommends that
implementations directly accept the ACATS tests as input. As such, it is
already true of most Ada implementations and thus should not have any effect on
implementations that already support Ada 2005. We are specifying it in the
Standard purely to increase its visibility.
Similarly, the advice to support file and directory names in UTF-8 was developed
for Ada 2005. However, that advice was never put into the Standard (or even the
AARM), and thus it never had the visibility to either users or implementers that
is needed. As a result, implementations have lagged in this area; adding the
Implementation Advice (along with the UTF-8 encoding packages) in Ada 2012
should increase the visibility and improve the situation.
The alternative of adding Wide_ and Wide_Wide_ versions of every routine
involved (that is, Wide_Open, Wide_Create, Wide_Name, and so on) would clutter
the I/O packages with routines that would be rarely used. In addition, doing
that now would prevent developing a better solution for future versions of Ada
(for instance, supporting a representation-independent "Root_String'Class" type
and using that as a parameter to operations like Open and Create).
---
We recommend that routines that return Strings (such as the name of an external
file) return UTF-8 encoded strings only when that is necessary, in order to
maximize compatibility with existing applications. Otherwise, the appearance
of Latin-1 characters in file names would force a representation incompatible
with "plain" strings. The BOM at the head of the UTF-8 string is the marker for
the representation change. The Ada.Strings.UTF_Encoding packages can be used
to convert the string to a Wide_String or even a Wide_Wide_String as necessary.
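That consumer-side conversion can be sketched as follows, modelling an Ada
String as a sequence of Latin-1 code points (the names here are invented for
illustration; in real Ada code the conversion would go through the
Ada.Strings.UTF_Encoding packages):

```python
# Sketch: recover the character sequence from a Name/Form-style result.
# An Ada String is modelled as a Python str whose code points are all
# <= 255; a BOM-prefixed result carries UTF-8 bytes in those positions.
BOM_CHARS = "\xef\xbb\xbf"   # the BOM's three bytes viewed as Latin-1 chars

def name_to_wide(s: str) -> str:
    if s.startswith(BOM_CHARS):
        # The payload after the BOM is UTF-8 stored one byte per
        # character; repack the bytes and decode.
        return bytes(map(ord, s[3:])).decode("utf-8")
    # No BOM: the string already is the character sequence.
    return s
```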
We considered using a predefined form to allow UTF-8 encoded names, but that
does nothing to solve the problem of returning UTF-8 encoded strings from
functions like Name and Form. Using a BOM on both sides is more consistent.
---
We could have made a more blanket statement about using UTF-8 encoded
strings in operations that pass values to/from the target system. That would
be easier (one paragraph to solve issues 2, 3, and 4), but it would separate the
Advice a long way from its uses. Since a primary purpose of this advice is to
increase its visibility to Unicode users, hiding it in Section 1 would
defeat the purpose.
---
The original comment included an extremely misguided suggestion to provide a
case insensitive comparison routine for file names. But file name comparison is
not recommended on Windows (where it would be most useful), because the local
comparison convention applies to file names. That local convention may be
different on different file systems accessible from the local machine! Moreover,
any Ada-provided routine would use the Unicode definitions for case comparison,
which are locale-independent and thus would not exactly match those used by the
file system.
This AI recommends adding a case comparison routine that mimics the equality
test for identifiers (which is simply missing from the package
Wide_Characters.Handling). But using that routine (or, for that matter, any
sort of equality) on file names is a fool's game. [The author made that mistake
on Windows, and has been fighting problems with long and short file names and
case changes for years.]
!corrigendum 2.1(16/2)
Insert after the paragraph:
In a nonstandard mode, the implementation may support a different character
repertoire; in particular, the set of characters that are considered
identifier_letters can be extended or changed to conform to local
conventions.
the new paragraph:
Implementation Advice
An Ada implementation should accept Ada source code in UTF-8 encoding, with or
without a BOM (see A.4.11), where line endings are marked by the pair Carriage
Return/Line Feed (16#0D# 16#0A#) and every other character is represented by its
code point.
!corrigendum 11.4.1(19)
Insert after the paragraph:
Exception_Message (by default) and Exception_Information should produce
information useful for debugging. Exception_Message should be short (about one
line), whereas Exception_Information can be long. Exception_Message should not
include the Exception_Name. Exception_Information should include both the
Exception_Name and the Exception_Message.
the new paragraph:
Functions Exception_Message and Exception_Information should return a UTF-8
encoded string starting with a BOM (see A.4.11) if and only if the result
includes characters with code points greater than 255.
!corrigendum A.3.5(0)
Insert new clause:
[A placeholder to cause a conflict; the real wording is found in the conflict
file.]
!corrigendum A.7(14)
Insert after the paragraph:
The exceptions that can be propagated by the execution of an input-output
subprogram are defined in the package IO_Exceptions; the situations in which
they can be propagated are described following the description of the subprogram
(and in clause A.13). The exceptions Storage_Error and Program_Error may be
propagated. (Program_Error can only be propagated due to errors made by the
caller of the subprogram.) Finally, exceptions can be propagated in certain
implementation-defined situations.
the new paragraph:
Implementation Advice
Subprograms that accept the name or form of an external file should
allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11)
if the target file system allows characters with code points greater than 255
in names. Functions that return the name or form of an external file should return a
UTF-8 encoded string starting with a BOM if and only if the result includes
characters with code points greater than 255.
!corrigendum A.15(21)
Insert after the paragraph:
An alternative declaration is allowed for package Command_Line if different
functionality is appropriate for the external execution environment.
the new paragraph:
Implementation Advice
Functions Argument and Command_Name should return a UTF-8 encoded string
starting with a BOM (see A.4.11) if and only if the result includes characters
with code points greater than 255.
!corrigendum A.16(126/2)
Insert after the paragraph:
Rename should be supported at least when both New_Name and Old_Name are simple
names and New_Name does not identify an existing external file.
the new paragraph:
Subprograms in the package Directories and its children that accept
a string should allow the use of UTF-8 encoded strings that start with a BOM
(see A.4.11) if the target file system allows characters with code points
greater than 255 in any part of a full name. Functions in the package
Directories and its children that return a string should return a UTF-8 encoded
string starting with a BOM if and only if the result includes characters with
code points greater than 255.
!ACATS Test
As this is Implementation Advice, it is not formally testable. There would be value in
ignoring the ACATS rules in this case and creating C-Tests that check that UTF-8 strings
work in file I/O and directory operations, but implementations can only be failed for
incorrect implementations, not non-existent ones.
!ASIS
No change needed.
!appendix
From: Robert Dewar
Sent: Sunday, November 6, 2011 6:48 AM
The RM has never been in the business of source representation, yet in practice
we understand that certain things are likely to work in practice.
Given the onslaught of complexity with Unicode, I think it would be helpful to
define a canonical representation format that all compilers must recognize. In
practice the ACATS defines such a format as lower case ASCII allowing upper half
Latin-1 graphics, and brackets notation for wide characters, but that's awfully
kludgy and suitable only for interchange not actual use.
I suggest we define a canonical representation in UTF-8 encoding that all
compilers must accept.
This could be an auxiliary standard.
****************************************************************
From: Tucker Taft
Sent: Sunday, November 6, 2011 10:20 AM
I would agree that we should start requiring support for UTF-8. It seems to
have emerged as the one true portable standard.
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 10:27 AM
OK, but it's not just "support", we need a mapping document that describes
precisely how Ada sources are expressed in canonical UTF-8 form. You may think
you know, taking the obvious expression, but the RM has nothing whatever to say
about this mapping.
Things like
Is a BOM allowed/required?
How are format effectors represented?
Is brackets notation still allowed?
And of course a basic statement that an A is represented as an A (the RM does
not say this!)
****************************************************************
From: Randy Brukardt
Sent: Monday, November 7, 2011 6:16 PM
> > I would agree that we should start requiring support for UTF-8. It
> > seems to have emerged as the one true portable standard.
We already had this discussion in terms of the ACATS. I already eliminated the
hated "brackets notation" in favor of UTF-8 formatted tests in ACATS 3.0, and
had I ever had enough funding such that I was able to develop/acquire some
character tests (like identifier equivalence using Greek and Cyrillic
characters), those tests would have been in UTF-8.
> OK, but it's not just "support", we need a mapping document that
> describes precisely how Ada sources are expressed in canonical
> UTF-8 form. You may think you know, taking the obvious expression, but
> the RM has nothing whatever to say about this mapping.
>
> Things like
>
> Is a BOM allowed/required?
> How are format effectors represented?
> Is brackets notation still allowed?
>
> And of course a basic statement that an A is represented as an A (the
> RM does not say
> this!)
The obvious "solution" is the one used by the ACATS, which is to use the
standard "Windows" format for the files. (This means in this case the output of
Notepad, if there is any confusion.) This means that (1) a BOM is required; (2)
line endings are <CR><LF>; other format effectors represent themselves (and only
occur in a couple of tests); (3) obviously a compiler can support anything it
likes, but it has no "official" standing (and in my personal opinion, it never
did; it was just an encoded format to be converted to something practical with a
provided tool).
We would need to write something like this up (I thought I had done so in the
ACATS documentation, but it seems that I only added a few small notes in 4.8 and
5.1.3). [The ACATS documentation is part of the ACATS, of course, I don't think
it is on-line anywhere, else I would have given a link.]
****************************************************************
From: Robert Dewar
Sent: Monday, November 7, 2011 6:38 PM
...
> We already had this discussion in terms of the ACATS. I already
> eliminated the hated "brackets notation" in favor of UTF-8 formatted
> tests in ACATS 3.0, and had I ever had enough funding such that I was
> able to develop/acquire some character tests (like identifier
> equivalence using Greek and Cyrillic characters), those tests would have been in UTF-8.
I would regard such tests as an abominable waste of time, reflecting my view
that case equivalence is an evil mistake in Ada 2005.
Note that "hated brackets encoding" had a real function early on of making ACATS
tests transportable over a wide range of environments.
...
>> Things like
>>
>> Is a BOM allowed/required?
>> How are format effectors represented?
>> Is brackets notation still allowed?
>>
>> And of course a basic statement that an A is represented as an A (the
>> RM does not say this!)
>
> The obvious "solution" is the one used by the ACATS, which is to use
> the standard "Windows" format for the files. (This means in this case
> the output of Notepad, if there is any confusion.) This means that (1)
> a BOM is required; (2) line endings are<CR><LF>; other format
> effectors represent themselves (and only occur in a couple of tests);
> (3) obviously a compiler can support anything it likes, but it has no
> "official" standing (and in my personal opinion, it never did; it was
> just an encoded format to be converted to something practical with a provided
> tool).
I assume this is all in the framework of UTF-8 encoding.
> We would need to write something like this up (I though I had done so
> in the ACATS documentation, but it seems that I only added a few small
> notes in 4.8 and 5.1.3). [The ACATS documentation is part of the
> ACATS, of course, I don't think it is on-line anywhere, else I would
> have give a link.]
I did not realize that ACATS specified UTF-8 encoding? In what context?
I know it decided to represent sources using Wide_Wide_Character or some such,
but that has nothing to do with source encoding.
****************************************************************
From: Randy Brukardt
Sent: Monday, November 7, 2011 7:40 PM
I'm not sure what you mean by "specified UTF-8 encoding". The ACATS doesn't
"specify" any encoding, but it is distributed with particular encodings
(described in the documentation), and virtually all Ada compilers choose to
support processing the ACATS tests directly. There are a number of tests in
ACATS 3.0 that are distributed in UTF-8 encoding (they have the ".au"
extension), all of the rest are distributed in 7-bit ASCII. And in both cases,
the files are formatted as for Windows (originally MS-DOS, which got the format
from CP/M, which got the format from some DEC OS...).
I know we discussed this here some years back (because otherwise I surely would
not have changed the distribution format).
****************************************************************
From: Robert Dewar
Sent: Monday, November 7, 2011 7:51 PM
OK, I understand. Out of interest, do the UTF-8 files start with a BOM?
****************************************************************
From: Randy Brukardt
Sent: Monday, November 7, 2011 7:59 PM
Yes, the files start with a BOM. (I just went back and rechecked them with a hex
editor.)
I believe that I used Notepad to create the files (wanted the least common
denominator, and Notepad surely qualifies as "least" ;-).
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 6:55 AM
It's really pretty horrible to use VT in sources to end a line, this is an
ancient bow to old IBM line printers. I think we should define the use of this
format effector as obsolescent, and catch it using No_Obsolescent_Features.
Not sure about FF, it's certainly horrible to use it as a terminator for a
source line, but I have seen people use it in place of pragma Page. I think this
should probably also be considered obsolescent, but am not so concerned about
that one.
This is certainly not a vital issue!
****************************************************************
From: Tucker Taft
Sent: Sunday, November 6, 2011 10:14 AM
I see no harm in treating these as white space.
I think the bizarreness is treating these as line terminators, since no modern
operating system treats them as such, causing line numbers to mismatch between
Ada's line counting and the line counting of other tools.
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 10:22 AM
But you must treat them as line terminators in the logical sense, the RM insists
on this, that is, you must have SOME representation for VT and FF, of course
strictly it does not have to be the corresponding ASCII characters.
BTW, in GNAT, we distinguish between physical line terminators (like CR, LF, or
CR/LF), and logical line terminators (like FF and VT), precisely to avoid the
mismatch you refer to.
****************************************************************
From: Robert Dewar
Sent: Sunday, November 6, 2011 10:24 AM
It's interesting that for NEL (NEXT LINE, 16#85#) our decision in GNAT is to
treat this in 8-bit mode as a character that can appear freely in comments, but
not in program text.
The RM requires that you recognize an NEL as end of line, so you need some
representation for an NEL, we solve this in GNAT by saying that a NEL is only
recognized in UTF-8 encoding.
****************************************************************
From: Randy Brukardt
Sent: Friday, February 10, 2012 11:20 PM
Robert Dewar wrote:
> It's really pretty horrible to use VT in sources to end a line, this
> is an ancient bow to old IBM line printers.
> I think we should define the use of this format effector as
> obsolescent, and catch it using No_Obsolescent_Features.
>
> Not sure about FF, it's certainly horrible to use it as a terminator
> for a source line, but I have seen people use it in place of pragma
> Page. I think this should probably also be considered obsolescent, but
> am not so concerned about that one.
>
> This is certainly not a vital issue!
Tucker replied:
> I see no harm in treating these as white space.
> I think the bizarreness is treating these as line terminators, since
> no modern operating system treats them as such, causing line numbers
> to mismatch between Ada's line counting and the line counting of other
> tools.
I would inject a mild note of caution in terms of FF. One could argue that it
makes sense for the interpretation of sources to match the implementation's
Text_IO (so that Ada programs can write source text). If the programmer calls
Text_IO.New_Page, they're probably going to get an FF in their file (that
happens with most of the Ada compilers that I've used). Similarly, reading an FF
will cause the end of a line if it is not already ended (although Text_IO will
probably not write such a file).
I don't give a darn about VT, though, other than to note that there is a
compatibility problem to making a change. (But it is minuscule...)
Robert replied:
> But you must treat them as line terminators in the logical sense, the
> RM insists on this, that is, you must have SOME representation for VT
> and FF, of course strictly it does not have to be the corresponding
> ASCII characters.
The notion that the Standard somehow requires having some representation for
every possible character in every source form is laughable in my view. The
implication that this is required only appears in the AARM and only in a single
note. There is absolutely nothing normative about such a "requirement". It makes
about as much sense as requiring that an Ada compiler only run on a machine with
a two button mouse! A given source format will represent whatever characters it
can (or desires), and that is it.
However, with the proposed introduction of Implementation Advice that compilers
accept UTF-8 encoded files, where every character is represented by its code
point, this becomes more important. If such a UTF-8 file contains a VT
character, then the standard requires it to be treated as a line terminator.
Period. Treating it as white space would require a non-standard mode (where the
"canonical representation" was interpreted other than as recommended by the
standard), or of course ignoring the IA completely. That seems bad if existing
compilers are doing something else with the character.
I'm not sure what the right answer is here. We could add an Implementation
Permission that VT and FF represent 0 line terminators, or just do that for VT
(assuming FF is used in Text_IO files), or say something about Text_IO, or
something else. (We don't need anything to allow <LF><FF> to be treated as a
single line terminator - 2.2(2/3) already says this). For Janus/Ada, I'd
probably not make any change here (the only time I've ever seen a VT in a text
file is in the ACATS test for this character, so I think it is essentially
irrelevant how it's handled, and for FF the same handling as Text_IO seems
right), and I'd rather not be forced to do so.
****************************************************************
From: Randy Brukardt
Sent: Wednesday, January 11, 2012 11:23 PM
We have received the following comment from Switzerland. I'm posting it here so
that we can discuss it, since we'll have to decide how we're going to respond to
it by the next meeting.
Following is the comment as I received it, the only difference being that I've
copied it into HTML. I'm hopeful that putting it into HTML will make it more
readable for most of us (in the original .DOC file, all I get for most of the
examples are lines of square boxes, so my making a PDF is not going to be
helpful). [Editor's note: It's been converted to plain ASCII here, which
probably will render the Unicode parts even more unreadable.]
I'll hold my comments for another message.
----------------------
The Unicode support in Ada 2012 is incomplete and inconsistent. We would like to
illustrate this with a hypothetical but nevertheless realistic example
application:
The program should sort files from one directory "input" into either of the two
directories "match" or "nomatch" based on whether the filename of the files is
listed in the textfile whose name is passed on the command line. The program
should do this for all files that match the optional wildcard file specification
on the command line. (e.g. "sortfiles matchlist *.ad?") The listfile shall be
treated as in the native encoding of the system unless it has a UTF BOM.
This sounds simple but actually cannot be implemented in Ada 2012 in a way that
it would work for all filenames - at least not on Windows and other operating
systems where filenames support Unicode.
The package Ada.Command_Line does not support Wide_Strings so what happens when
somebody would like to call "sortfiles ?? ?? *.txt"? The same problem applies to
the packages Directories and Environment_Variables.
The next problem would be to open the listfile. This cannot be done with Text_Io
because there is no Wide_String version (for parameter Name) of Open.
Reading the contents of the listfile is also a problem. Should Text_Io or
Wide_Text_Io be used? How will Wide_Text_Io interpret the file? (with or without
a BOM). If Text_Io has to be used to read the file in the native non-UTF system
encoding, how can the returned String be converted into Wide_String?
Ada.Strings.UTF_Encoding does not support this. Most programs that use
Wide_Strings are not purely Unicode - including the data and files they
handle - and therefore will need conversion routines from and to the native
system (non-UTF) encoding.
To compare the filenames with the lines in the listfile, Wide_String versions in
Ada.Strings.Equal_Case_Insensitive would be needed.
In case of exceptions, it really makes sense to include some information about
what went wrong in the exception message. In the example application above, this
would be the Unicode filename of the file operation that fails. In other cases
this could be some (Unicode) input data that could not be parsed or an
enumeration literal. However neither Ada.Exceptions.Raise_Exception nor
Exception_Message support Wide_Strings.
Due to the fact that Exception_Information usually contains a stack trace and
Ada identifiers can be Unicode, Exception_Information needs to support
Wide_Strings as well. An exception in a "procedure ???(?? : String)" should
create a readable stack trace too.
The inability of standard Ada to fully support Unicode is a serious deficiency
whose importance should not be underestimated. Computers are no longer the sole
preserve of the western hemisphere and accordingly it is unacceptable for
companies to produce products that do not function correctly in every country in
the world or, for that matter, every country in the EU! Despite all the hard
work that has gone into continually improving Ada to make it more viable as the
implementation language of choice, the lack of Unicode support within standard
Ada could yet be a valid reason not to use Ada.
Note: A compiler should support Unicode filenames as well. A package ?? will
most likely be stored in a source file ??.adb. It would probably make sense to
make it mandatory for all Ada 2012 compilers to accept source files in UTF-8
encoding (including a UTF-8 BOM). Otherwise it would be difficult to create
portable source files.
****************************************************************
From: Robert Dewar
Sent: Thursday, January 12, 2012 7:24 AM
I am opposed to doing anything in the context of the current work to respond to
this comment. We have already spent too much effort, both at the definition and
implementation level on character set internationalization. Speaking from
AdaCore's point of view, we have not had a single enhancement request or
suggestion, let alone a defect report in this area.
I really think that any further work here should be user based and not language
designer based.
I think it fine to form a specific group to investigate what needs to be done
and gather this user input, but it would be a mistake to rush to add anything to
the current amendment.
****************************************************************
From: Randy Brukardt
Sent: Thursday, January 12, 2012 12:23 AM
First, some general comments:
(1) This comment is really way too late. The sort of massive overhaul of the
standard suggested here would require quite a lot of additional work:
determining what to do, writing wording for it, editing it, and so on. We'd
have to delay the standard at least another year in order to do that.
(2) The ARG has considered most of these issues in the past. In particular, we
discussed Unicode file names, and decided that the name pollution of
"Wide_Open", "Wide_Create", and on and on and on was just over the top.
Instead, we preferred that implementations support UTF-8 file names.
Indeed, we designed Ada.Strings.UTF_Encoding such that it could be used
easily for this purpose.
There probably is a reasonable comment that this intent is not communicated
very well; I don't think there are even any AARM notes documenting this
intent. Perhaps we should add a bit of Implementation Advice about using
UTF-8 encoding if that is needed on the target system. The same advice could
apply to Exception_Message, Command_Line, and the other functions noted by
Switzerland.
(3) The entire "Wide_Wide_Wide_Wide_" "solution" is just awful, and making it
worse is not reasonable. We really need to rethink the entire area of string
processing to see if there is some way to decouple encoding from semantics.
Doing so is going to require a lot of work and thought. The alternative is
to continue to make a bigger and bigger mess, especially as we have to
support combinations in packages (Wide filenames in regular Text_IO, regular
filenames in Wide_Text_IO, and all of the other combinations.) But coming up
with an alternative, and getting buy-in, is not possible for Ada 2012; it
has to wait for Ada 2020 because it will take that long.
And a couple of specific comments:
> To compare the filenames with the lines in the listfile, Wide_String versions
> in Ada.Strings.Equal_Case_Insensitive would be needed.
The author is thoroughly confused if they think this is even a good idea, or
possible in general. Microsoft's Windows programming advice is to *never*
compare filenames for equality, and specifically to never do so with the local
machine's case insensitive equality. The reason is that these things depend on
the locale of the local machine and of the file system, which don't need to be
the same. Moreover, the Windows idea of filename equality is likely to be
different than Unicode string equality (which is what I would hope that
Equal_Case_Insensitive uses, since it has nothing to do with file names). And
which Unicode version of string equality is he talking about: "Full" case
folding, "Simple" case folding, or something else? There is no single answer
that is right for all uses, or even for many uses.
Processing of file names can only be done via the file system; the use of
general string routines will always lead to problems in corner cases. (On
networks with multiple operating systems, it's possible that even case
sensitivity will vary depending on the path in use.)
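The "Full" versus "Simple" folding distinction can be illustrated with a sketch (a hypothetical helper, not proposed wording; it assumes Ada.Wide_Wide_Characters.Handling.To_Lower, whose mappings are 1:1 and can therefore express only simple folding):

```ada
with Ada.Wide_Wide_Characters.Handling;
--  Case-insensitive equality via simple (1:1) folding only.  Full
--  folding can change string length ("ß" folds to "ss"), which a
--  character-by-character mapping like this cannot express.
function Equal_Simple_Fold (L, R : Wide_Wide_String) return Boolean is
   use Ada.Wide_Wide_Characters.Handling;
begin
   if L'Length /= R'Length then
      return False;
   end if;
   for I in 0 .. L'Length - 1 loop
      if To_Lower (L (L'First + I)) /= To_Lower (R (R'First + I)) then
         return False;
      end if;
   end loop;
   return True;
end Equal_Simple_Fold;
```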
> It would probably make sense to make it mandatory for all Ada 2012 compilers
> to accept source files in UTF-8 encoding (including an UTF-8 BOM).
This is already de-facto required of Ada 2005 implementations, because all Ada
compilers have to process the ACATS, and ACATS 3.0 (and later) include some
files encoded in UTF-8. So that is already true of Ada 2012 implementations as
well, and in fact this is mentioned in the AARM. Requiring a standard source
file format in the Ada Standard might be more problematic, as it would require
specifying the exact behavior of line-end combinations among other things, which
Ada has always stayed away from. In any case, Robert Dewar suggested this months
ago, and there will be at least an AI12 on the topic. It seems like a morass to
get into at this late point, however. (Robert's concern about the handling of VT
and FF is just the tip of the iceberg there, and it would be really easy to
introduce a gratuitous incompatibility.)
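Randy's point (2) above, that Ada.Strings.UTF_Encoding already lets UTF-8 file names flow through the existing String-based operations, can be sketched roughly as follows (whether Open actually interprets the name as UTF-8 is implementation-defined; that is what the proposed Implementation Advice would recommend):

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
procedure Open_UTF_8_Name is
   use Ada.Strings.UTF_Encoding;
   --  Encode the Wide_Wide_String name as UTF-8; UTF_8_String is a
   --  subtype of String, so the result can be passed directly to Open.
   Name : constant UTF_8_String :=
     Wide_Wide_Strings.Encode ("Grüße.txt");  --  hypothetical file name
   File : Ada.Text_IO.File_Type;
begin
   Ada.Text_IO.Open (File, Ada.Text_IO.In_File, Name);
   --  ... read the file ...
   Ada.Text_IO.Close (File);
end Open_UTF_8_Name;
```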
****************************************************************
From: Tucker Taft
Sent: Thursday, January 12, 2012 8:19 AM
I like your suggestion that we make explicit the recommendation that UTF-8
strings be used by the implementation where Standard.String is specified for
most interactions with the file system, identifier names, command lines, etc.
This is similar to our recommendation in ASIS that Wide_String be interpreted as
UTF-16 in many cases.
I would remove the "way too late" comment. It doesn't seem helpful. Many
delegations don't really get together to do a review until they are forced to do
so. I would rather emphasize (as you do) the complexity of the problem, rather
than the lateness of the comment.
****************************************************************
From: Robert Dewar
Sent: Thursday, January 12, 2012 8:37 AM
> I like your suggestion that we make explicit the recommendation that
> UTF-8 strings be used by the implementation where Standard.String is
> specified for most interactions with the file system, identifier
> names, command lines, etc. This is similar to our recommendation in
> ASIS that Wide_String be interpreted as UTF-16 in many cases.
As Implementation Advice, I would have no objection.
>
> I would remove the "way too late" comment. It doesn't seem helpful.
> Many delegations don't really get together to do a review until they
> are forced to do so. I would rather emphasize (as you do) the
> complexity of the problem, rather than the lateness of the comment.
Implementation advice is a good way to handle the source representation issue as
well, and meets at least one of the delegation's suggestions.
****************************************************************
From: Randy Brukardt
Sent: Thursday, January 12, 2012 1:44 PM
> I like your suggestion that we make explicit the recommendation that
> UTF-8 strings be used by the implementation where Standard.String is
> specified for most interactions with the file system, identifier
> names, command lines, etc. This is similar to our recommendation in
> ASIS that Wide_String be interpreted as UTF-16 in many cases.
Good. I note that Robert does as well, so I'll write up a draft this way for
discussion in Houston.
> I would remove the "way too late" comment. It doesn't seem helpful.
> Many delegations don't really get together to do a review until they
> are forced to do so. I would rather emphasize (as you do) the
> complexity of the problem, rather than the lateness of the comment.
This is just a matter of emphasis, isn't it? I said (or meant at least) "it's
way too late because the problem is complex and a lot of changes would be
necessary". You are saying "the problem is complex and a lot of changes would be
necessary, so it is too hard to address in the time remaining". Not much
difference in facts there, just wording.
After all, just because a problem is complex doesn't mean that the ARG shouldn't
try to solve it. The important point is that we don't think it is a good idea to
try to solve it in the time remaining, because we are as likely to get it wrong
as right. And wrong won't help anyone, just make the language more complex.
****************************************************************
From: Tucker Taft
Sent: Thursday, January 12, 2012 2:24 PM
> This is just a matter of emphasis, isn't it?...
Yes, but it seems important as far as being appropriately "responsive" to the
various delegations. I just think we can communicate our reasons without
sounding schoolmarmish.
****************************************************************
From: Randy Brukardt
Sent: Thursday, January 12, 2012 3:02 PM
I'm happy to leave the "political correctness" to others who are better suited
for that than I, I just wanted to make sure that we really agree on the
fundamentals of this topic before I spend several hours crafting an AI.
****************************************************************
From: Ben Brosgol
Sent: Thursday, January 12, 2012 3:17 PM
Don't confuse politeness (a social skill) with political correctness (a
contrived way of expressing things to avoid any possibility of causing offense).
For a blunter definition of "political correctness", see
http://www.urbandictionary.com/define.php?term=politically%20correct
****************************************************************
Questions? Ask the ACAA Technical Agent