CVS difference for ai05s/ai05-0286-1.txt

Differences between 1.3 and version 1.4

--- ai05s/ai05-0286-1.txt	2012/02/19 04:54:05	1.3
+++ ai05s/ai05-0286-1.txt	2012/03/13 01:49:19	1.4
@@ -1,8 +1,4 @@
-!standard  2.1(16/2)                                12-02-14    AI05-0286-1/02
-!standard 11.4.1(19)
-!standard  A.3.5(0)
-!standard  A.7(14)
-!standard  A.15(21)
+!standard  2.1(16/2)                                12-02-24    AI05-0286-1/03
 !standard  A.16(126/2)
 !class Amendment 12-02-10
 !status Amendment 2012 12-02-10
@@ -17,10 +13,14 @@
 Implementation Advice is added to recommend that Ada compilers directly support
 source code in UTF-8 encoding.
 
-Implementation Advice is added to recommend that file and directory operations,
-exception information, and the command line accept UTF-8 encoded input and
-output.
+Equal_Case_Insensitive and Hash_Case_Insensitive are added for Wide_String and
+Wide_Wide_String.
 
+We recommend that implementations support UTF-8 encoded input and output for
+file and directory operations, exception information, and the command line. But
+we have no advice on how to do this at this time; implementation experience is
+necessary in order to do this without breaking existing code and/or adding many
+rarely used subprograms.
 
 !proposal
 
@@ -44,22 +44,25 @@
 
 [Editor's note: The Swiss comment ends here. See also the discussion section.]
 
-(5) Simple case folding should be provided as an operation in
-Ada.Characters.Handling, so that case insensitive comparisons (as opposed to
-case conversions) of strings can be accomplished.
+(5) Case insensitive operations should be provided for Wide_String and
+Wide_Wide_String in the normal way (not just for String). This should use simple
+case folding.
 
 !wording
 
-For (1), add the following after 2.1(16/2):
+For (1), add the following before 2.1(16/2) (and allow the paragraph numbers to
+change; all following paragraphs are either new or notes or deleted):
 
-Implementation Advice
+Implementation Requirements
 
-An Ada implementation should accept Ada source code in UTF-8 encoding, with or
-without a BOM (see A.4.11), where line endings are marked by the pair Carriage
-Return/Line Feed (16#0D# 16#0A#) and every other character is represented by its
-code point.
+An Ada implementation shall accept Ada source code in UTF-8 encoding, with or
+without a BOM (see A.4.11), where every character is represented by its
+code point. The character pair Carriage Return/Line Feed (code points 16#0D#
+16#0A#) signifies a single end of line (see 2.2); every other occurrence of a
+format_effector other than the character whose code point position is 16#09#
+also signifies a single end of line.
 
-AARM Reason: This is simply recommending that an Ada implementation be able to
+AARM Reason: This is simply requiring that an Ada implementation be able to
 directly process the ACATS, which is provided in the described format. Note that
 files that only contain characters with code points in the first 128 (which is
 the majority of the ACATS) are represented in the same way in both UTF-8 and in
@@ -68,155 +71,162 @@
 characters not legal in Ada source code, so an implementation can use that to
 automatically distinguish between files formatted as plain Latin-1 strings and
 UTF-8 with BOM.
-
-Delete AARM note 2.1(18.a.1/3).
-
-[Editor's note: I took the simplest approach here, and treated 7-bit ASCII
-input files as a subset of UTF-8. (The ACATS only uses 7-bit ASCII and UTF-8,
-never 8-bit Latin-1.) As such, only a single requirement was needed.
-Practically, however, it's likely that compilers will treat pure 8-bit input
-(with no BOM) differently than UTF-8 input (with a BOM). I wasn't sure if it
-was worthwhile to describe both formats, it mostly seemed like more wording
-to me. But this is easy to change.]
-
-[Editor's note: I do not know what to do with the Note 2.1(18) and the
-associated AARM note. This is still *strictly* true, because the language only
-*recommends* (as opposed to *specify*) a format. OTOH, it seems misleading to
-me. My preference is to delete it and move a modified version of the AARM note
-onto this new Implementation Advice. Bob thinks it would be better to delete
-both.]
-
-[Editor's opinion: I would actually prefer that we made this a requirement,
-rather than Implementation Advice. Then there is no need for 2.1(18) and the
-associated notes. In this case, implementation still could appeal to the
-"impossible or impractical" exception. I've always thought that the lack of
-a standard source format decreased the portability of Ada source code
-unnecessarily. OTOH, we can be somewhat more informal in Implementation Advice,
-so that might help describe this better. And the documentation requirement
-hopefully will reduce the chances of implementers ignoring it.]
-
-[Editor's note: "code point" is as defined in ISO/IEC 10646; we mention this fact in AARM 3.5.2(11.p/3)
-but not normatively. Formally, this is not necessary (as we include the definitions of 10646
-by reference), but some might find it confusing.]
-
-For (2):
-
-Add after A.7(14):
-
-Implementation Advice
-
-Subprograms that accept the name or form of an external file should
-allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11)
-if the target file system allows characters with code points greater than 255 in
-names. Functions that return name or form of an external file should return a
-UTF-8 encoded string starting with a BOM if and only if the result includes
-characters with code points greater than 255.
-
-Add after A.16(126/2):
-
-Subprograms in the package Directories and its children that accept
-a string should allow the use of UTF-8 encoded strings that start with a BOM
-(see A.4.11) if the target file system allows characters with code points
-greater than 255 in any part of a full name. Functions in the package
-Directories and its children that return a string should return a UTF-8 encoded
-string starting with a BOM if and only if the result includes characters with
-code points greater than 255.
 
-For (3):
+We allow line endings to be represented both as the pair CR LF (as in Windows
+and the ACATS) and as single format_effector characters (usually LF, as in
+Linux), in order that files created by standard tools on most operating systems
+will meet the standard format.
+
+This requirement will increase portability by having a format that is accepted
+by all Ada compilers. Note that implementations can support other source
+representations, including structured representations like a parse tree.
+End AARM Reason.
+
+Delete Note 2.1(18) and AARM note 2.1(18.a.1/3).
+
+[Editor's note: "code point" is as defined in ISO/IEC 10646; we mention this
+fact in AARM 3.5.2(11.p/3) but not normatively. Formally, this is not necessary
+(as we include the definitions of 10646 by reference), but some might find it
+confusing.]
+
+For (2), (3), and (4), no solution is recommended at this time.
+
+For (5), modify A.4.7(29/2):
+
+"For each of the packages Strings.Fixed, Strings.Bounded, Strings.Unbounded, and
+Strings.Maps.Constants, and for {library} functions Strings.Hash,
+Strings.Fixed.Hash, Strings.Bounded.Hash, [and] Strings.Unbounded.Hash{,
+Strings.Hash_Case_Insensitive, Strings.Fixed.Hash_Case_Insensitive,
+Strings.Bounded.Hash_Case_Insensitive, Strings.Unbounded.Hash_Case_Insensitive,
+Strings.Equal_Case_Insensitive, Strings.Fixed.Equal_Case_Insensitive,
+Strings.Bounded.Equal_Case_Insensitive,
+Strings.Unbounded.Equal_Case_Insensitive}, the corresponding wide string package
+{or function} has the same contents except that"
 
-Add after 11.4.1(19):
+Same for A.4.8(29/2).
 
-Function Exception_Message and Exception_Information should return a UTF-8
-encoded string starting with a BOM (see A.4.11) if and only if the result
-includes characters with code points greater than 255.
+Replace A.4.10(3/3):
 
-AARM Discussion: Since all of the routines that raise and set the exception
-message take a string but do not interpret it, we need to say nothing to
-allow passing UTF-8 encoded strings with a BOM. Since encoding in this string is
-a common programming idiom, implementations should not modify any exception
-message string unless it starts with a BOM and does not contain any characters
-with code points greater than 255.
-
-For (4):
-
-Add after A.15(21):
-
-Implementation Advice
-
-Functions Argument and Command_Name should return a UTF-8 encoded string
-starting with a BOM (see A.4.11) if and only if the result includes characters
-with code points greater than 255.
-
-
-[Q: Should we do this for Environment_Variables as well? I think not; it's not
-necessary (you can always put a UTF-8 encoded string there and get it back out
-without any language discussion).]
-
-For (5), add after A.3.5(21/3):
-
-function Equal_Case_Insensitive (Left, Right : Wide_String) return Boolean;
-
-Add after A.3.5(61/3):
-
-function Equal_Case_Insensitive (Left, Right : Wide_String) return Boolean;
-
    Returns True if the strings are the same, that is if they consist of the same
    sequence of characters after applying locale-independent simple case folding,
    as defined by documents referenced in the note in section 1 of ISO/IEC
-   10646:2011. Otherwise, returns False. This function uses the same method
-   as is used to determine whether two identifiers are the same. Note that
-   this result is a more accurate comparison than converting the strings to
-   upper case and comparing the results; it is possible that the upper case
-   conversions are the same but this routine will report the strings as
-   different.
-
-[Editor's note: Should the last sentence be a user note or an AARM note
-instead?]
-
-!discussion
+   10646:2011. Otherwise, returns False. This function uses the same method as
+   is used to determine whether two identifiers are the same.
 
-The implementation advice for source code is just saying that it is recommended
-that implementations directly accept the ACATS tests as input. As such, this is
-already true of most Ada implementations and thus should not have any effect on
-Ada implementations that already support Ada 2005. We're just specifying this in
-the Standard to increase the visibility.
-
-Similarly, the advice to support file and directory names in UTF-8 was developed
-for Ada 2005. However, the advice was never put into the Standard (or even the
-AARM), and thus it never had the visibility to either users or implementers that
-is needed. As such implementations have lagged in this area; adding the
-Implementation Advice (along with the UTF-8 encoding packages) in Ada 2012
-should increase the visibility and reduce this situation.
-
-The alternative of adding Wide_ and Wide_Wide_ versions of every routine
-involved (that is, Wide_Open, Wide_Create, Wide_Name, and so on) would clutter
-the I/O packages with routines that would be rarely used. In addition, doing
-that now would prevent developing a better solution for future versions of Ada
-(for instance, supporting a representation-independent "Root_String'Class" type
-and using that as a parameter to operations like Open and Create).
+AARM Note: For String, this is equivalent to converting to lower case and
+   comparing. Not so for other string types. For Wide_Strings and
+   Wide_Wide_Strings, note that this result is a more accurate comparison than
+   converting the strings to lower case and comparing the results; it is
+   possible that the lower case conversions are the same but this routine will
+   report the strings as different. Additionally, Unicode says that the result
+   of this function will never change for strings made up solely of defined code
+   points; there is no such guarantee for case conversion to lower case.
 
----
-
-We require that routines that return Strings (such as the name of an external
-file) only return UTF-8 encoded strings when that is necessary, in order to
-maximize compatibility with existing applications. Otherwise, the appearance
-of Latin-1 characters in file names would cause an incompatible representation
-from "plain" string. The BOM at the head of the UTF-8 string is the marker for
-the representation change. The Ada.Strings.Encoding packages can be used
-to convert the string to a Wide_String or even a Wide_Wide_String as necessary.
-
-We considered using a predefined form to allow UTF-8 encoded names, but that
-does nothing to solve the problem of returning UTF-8 encoded strings from
-functions like Name and Form. Using a BOM on both sides is more consistent.
+[Editor's note: Yes, I verified that simple case folding and convert to lower
+case do the same thing for type String.]
 
----
+!discussion
 
-We could have made some more blanket statement about using UTF-8 encoded
-strings in operations that pass value to/from the target system. That would
-be easier (one paragraph to solve issues 2, 3, and 4), but it would separate the
-Advice a long ways from the uses. Since a primary purpose of this advice is to
-increase the visibility of it for Unicode users, hiding it in Section 1 would
-defeat the purpose.
+The implementation requirement for source code is just saying that it is
+required that implementations directly accept the ACATS tests as input. As such,
+this is already true of most Ada implementations and thus should not have any
+effect on Ada implementations that already support Ada 2005. We're just
+specifying this in the Standard to increase the visibility and provide a
+standard source form.
+
+Note that an implementation which finds it difficult to meet this requirement
+can depend upon the "impossible or impractical" exception to following the
+standard (see 1.1.3).
+
+The required form is more than the minimum needed to support the ACATS, which
+needs only 7-bit ASCII and UTF-8 starting with a BOM, with line endings always
+represented by CR LF pairs. However, it seems important to make the
+required format useful on Unix, Linux, and OSX, which means at least supporting
+LF alone as a line ending. We do this with just a small extra requirement from
+the rules of 2.2(2/3); the extra requirement exists so that implementations all
+count lines in "standard" source files the same way.
+
+
+There was an intent in Ada 2005 that implementations use some method to support
+file and directory names in UTF-8. We did not develop specific advice at that
+time because we wanted to see how implementations would develop solutions before
+standardizing on existing practice.
+
+Unfortunately, there is not a lot of practice to standardize. Implementers
+report that few customers are requesting support for UTF-8 file names and
+operations. In addition, a commonly used practice on some operating systems is
+to simply use UTF-8 strings without any special support. This works fine on
+systems (like many Unix variants) that don't restrict the upper 128 characters
+allowed in file names. The names are reported back by implementations via Name
+as well. This works so long as the names are not interpreted or acted on in any
+way.
+
+This approach does not work well for the operations in Ada.Directories (because
+it generates the names and in addition has operations for composition and
+decomposition of file names). And it does not work at all on Windows (which uses
+a separate API for wide string operations). It also does not work on OSX (which
+restricts the allowed characters in file names).
+
+An earlier version of this AI recommended advice using BOMs to differentiate
+UTF-8 strings from existing Latin-1 strings. This prevented compatibility
+problems with existing code using Latin-1 file names (especially that which
+composes or decomposes such strings); Latin-1 is used whenever possible and UTF-8
+strings are the unusual condition.
+
+However, this requires interpretation of the strings and thus breaks any code
+that "simply uses UTF-8 strings" as described above. Some ARG members oppose this
+option for this reason. In addition, it was felt that requiring a BOM is
+unnatural and ignores Unicode recommendations.
+
+We briefly considered options using a specified Form parameter, but that does
+not work for routines that return a string (such as Name for files and Simple_Name
+for directory operations), especially those where the file name may come from
+non-Ada sources (like a directory search). In these cases, the representation
+has to be compatible with existing practice -- which includes both encoded UTF-8
+strings and Latin-1 strings.
+
+We also considered the existing technique of adding overloaded routines for
+these operations. That would take the form of Wide_ and Wide_Wide_ versions of
+almost every routine that takes or returns a String in each file package and
+directory operations. Besides triggering laughter and/or song from some ARG
+members ("into the Wide_Wide_Open" - apologies to Tom Petty), the sheer number
+of routines needed is obnoxious (20 Wide_ routines for Ada.Directories alone).
+This was not a good solution in Ada 2005 (as it doesn't support UTF-8 or UTF-16
+very well), and it is simply much worse in Ada 2012.
+
+Some wild solutions were floated, involving *defining* String to be a UTF-8
+string (meaning that any indexing or slicing code is very likely wrong, at least
+in the margins -- a situation much like Strings with lower bounds other than 1
+-- it rarely fails, but is wrong), or having a generalized Root_String'Class
+(with semantics and representation separated).
+
+Finally, we considered Implementation Advice that implementations do something
+for UTF-8. But such advice provides no value to users, as it does not help them
+create portable code (or even *somewhat* portable code) -- only
+implementation-defined solutions could be used.
+
+It was clear that there was no consensus on any solution, even for simple
+Implementation Advice. We will need far more time than remains in Ada 2012 to
+research and develop a proper solution. Moreover, this National Body comment is
+the first formal feedback on this topic in many years, nor have implementers
+reported any customer interest, so this topic is not one that has been on any
+recent ARG agenda. Thus we reluctantly adopted no solution for this problem.
+
+Similar comments apply to Command_Line processing.
+
+For Ada.Exceptions, we considered adding Wide_ and Wide_Wide_ versions of
+Exception_Information, as this is supposed to include the Exception_Name,
+which is already available in Wide_ and Wide_Wide_ versions. But changing
+Exception_Message is not necessary, as any string of bytes will be returned.
+Moreover, any changes to Exception_Message to support UTF-8 or Wide_Strings are
+likely to break techniques that encode binary information as part of the
+Exception_Message. It is not unusual to see code using such techniques. In
+addition, the syntax for raising exceptions including a message would become
+ambiguous if other string forms are allowed. The need to compatibly deal with
+these problems requires more development time than we have remaining for Ada
+2012. We felt that the Exception_Information changes were insufficiently
+valuable to do alone, and perhaps a better general solution will be developed
+for Ada 2020 (making all of these Wide_ and Wide_Wide_ routines obsolete).
 
 ---
 
@@ -224,17 +234,23 @@
 case insensitive comparison routine for file names. But file name comparison is
 not recommended on Windows (where it would be most useful), because the local
 comparison convention applies to file names. That local convention may be
-different on different file systems accessible from the local machine! Moreover,
-any Ada-provided routine would use the Unicode definitions for case comparison,
-which are locale-independent and thus would not exactly match those used by the
-file system.
-
-This AI recommends adding a case comparison routine that mimics the equality
-test for *identifiers* (which is just missing from the package Wide_Handling).
-But using that routine (or for that matter, any sort of equality) on file names
-is a fool's game. [The author made that mistake on Windows, and has been fighting
-problems with long and short file names and case changes for years.]
+different on different file systems accessible from the local machine! Windows
+does not provide an API to do such comparisons, and strongly recommends that they
+be avoided. Moreover, any general Ada-provided routine would use the Unicode
+definitions for case comparison, which are locale-independent and thus would
+not exactly match those used by the file system.
+
+Indeed, we considered adding such a routine in AI05-0049-1, and rejected doing
+so for these and other reasons (detailed in that AI). Since nothing has changed
+about the file systems on operating systems (especially of Windows), nothing has
+changed about the conclusions of that AI.
 
+---
+
+In answering the above, we noticed that we had defined Equal_Case_Insensitive
+and Hash_Case_Insensitive for String but not Wide_String. We have rectified that
+situation.
+
 !corrigendum 2.1(16/2)
 
 @dinsa
@@ -249,80 +265,16 @@
 without a BOM (see A.4.11), where line endings are marked by the pair Carriage
 Return/Line Feed (16#0D# 16#0A#) and every other character is represented by its
 code point.
-
-!corrigendum 11.4.1(19)
 
-@dinsa
-Exception_Message (by default) and Exception_Information should produce
-information useful for debugging. Exception_Message should be short (about one
-line), whereas Exception_Information can be long. Exception_Message should not
-include the Exception_Name. Exception_Information should include both the
-Exception_Name and the Exception_Message.
-@dinst
-Function Exception_Message and Exception_Information should return a UTF-8
-encoded string starting with a BOM (see A.4.11) if and only if the result
-includes characters with code points greater than 255.
-
-!corrigendum A.3.5(0)
-
-@dinsc
-[A placeholder to cause a conflict; the real wording is found in the conflict
-file.]
-
-!corrigendum A.7(14)
-
-@dinsa
-The exceptions that can be propagated by the execution of an input-output
-subprogram are defined in the package IO_Exceptions; the situations in which
-they can be propagated are described following the description of the subprogram
-(and in clause A.13). The exceptions Storage_Error and Program_Error may be
-propagated. (Program_Error can only be propagated due to errors made by the
-caller of the subprogram.) Finally, exceptions can be propagated in certain
-implementation-defined situations.
-@dinst
-@s8<@i<Implementation Advice>>
-
-Subprograms that accept the name or form of an external file should
-allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11)
-if the target file system allows characters with code points greater than 255
-in names. Functions that return name or form of an external file should return a
-UTF-8 encoded string starting with a BOM if and only if the result includes
-characters with code points greater than 255.
-
-!corrigendum A.15(21)
-
-@dinsa
-An alternative declaration is allowed for package Command_Line if different
-functionality is appropriate for the external execution environment.
-@dinst
-@s8<@i<Implementation Advice>>
-
-Functions Argument and Command_Name should return a UTF-8 encoded string
-starting with a BOM (see A.4.11) if and only if the result includes characters
-with code points greater than 255.
-
-
-!corrigendum A.16(126/2)
-
-@dinsa
-Rename should be supported at least when both New_Name and Old_Name are simple
-names and New_Name does not identify an existing external file.
-@dinst
-Subprograms in the package Directories and its children that accept
-a string should allow the use of UTF-8 encoded strings that start with a BOM
-(see A.4.11) if the target file system allows characters with code points
-greater than 255 in any part of a full name. Functions in the package
-Directories and its children that return a string should return a UTF-8 encoded
-string starting with a BOM if and only if the result includes characters with
-code points greater than 255.
+*** !corrigendum A.4.7(29/2) TBD
 
 
 !ACATS Test
 
-As this is implementation advice, it is not formally testable. There would be value to
-ignoring the ACATS rules in this case and creating C-Tests that UTF-8 strings work in file I/O
-and directory operations, but implementations can only be failed for incorrect implementations,
-not non-existent ones.
+As this is implementation advice, it is not formally testable. There would be
+value to ignoring the ACATS rules in this case and creating C-Tests that UTF-8
+strings work in file I/O and directory operations, but implementations can only
+be failed for incorrect implementations, not non-existent ones.
 
 !ASIS
 
@@ -1279,7 +1231,8 @@
 > it cannot say anything about which characters can appear in any
 > particular source. How could it? So why do we care which characters
 > can appear? It's actively harmful to include hacks like "brackets
-> notation" in source to meet such a non-requirement in straight 8-bit formats -- to take one such example.
+> notation" in source to meet such a non-requirement in straight 8-bit formats
+> -- to take one such example.
 
 This would say that you regard almost any string literal as non-portable. I find
 that ludicrous.
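
The end-of-line rule in the version 1.4 wording for 2.1 (a CR LF pair
signifies a single end of line, and every other occurrence of a
format_effector other than 16#09# also signifies a single end of line) can be
sketched as follows. This is only an illustration in Python, not part of the
AI: the names UTF8_BOM, LINE_ENDERS, and count_line_ends are hypothetical, and
the set of format_effector characters is assumed from the definition in 2.1 of
the Standard (16#0A#, 16#0B#, 16#0C#, 16#0D#, 16#85#, plus the separator_line
and separator_paragraph categories) rather than taken from this diff.

```python
# Sketch only: counts ends of line in UTF-8 encoded source under the assumed rule.

UTF8_BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF

# Assumed format_effector set minus HT (16#09#). U+2028 and U+2029 are the
# only characters in categories separator_line and separator_paragraph.
LINE_ENDERS = {"\x0a", "\x0b", "\x0c", "\x0d", "\x85", "\u2028", "\u2029"}


def count_line_ends(source: bytes) -> int:
    """Return the number of ends of line in UTF-8 source, with or without a BOM."""
    if source.startswith(UTF8_BOM):
        source = source[len(UTF8_BOM):]
    text = source.decode("utf-8")

    count = 0
    i = 0
    while i < len(text):
        if text[i] == "\r" and text[i + 1:i + 2] == "\n":
            # A CR LF pair is a single end of line.
            count += 1
            i += 2
        elif text[i] in LINE_ENDERS:
            # Any other format_effector except HT is a single end of line.
            count += 1
            i += 1
        else:
            i += 1
    return count
```

Under this sketch a file with Windows-style CR LF endings and the same file
with Linux-style LF endings yield the same line count, which is the stated
purpose of the extra requirement.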
