!standard 2.1(16/2)                              12-02-14  AI05-0286-1/02
!standard 11.4.1(19)
!standard A.3.5(0)
!standard A.7(14)
!standard A.15(21)
!standard A.16(126/2)
!class Amendment 12-02-10
!status Amendment 2012 12-02-10
!status work item 12-02-10
!status received 11-10-01
!priority Medium
!difficulty Medium
!subject Internationalization of Ada

!summary

Implementation Advice is added to recommend that Ada compilers directly support source code in UTF-8 encoding. Implementation Advice is added to recommend that file and directory operations, exception information, and the command line accept UTF-8 encoded input and output.

!proposal

Full support for Unicode characters is becoming increasingly important. Ada 2005 added support for international identifiers in Ada programs, yet the support for Unicode is still incomplete in Ada.

We recommend that Ada adopt some solution so that:

(1) Compilers are required to support Unicode characters in source form, by requiring some form of standard source representation (presumably UTF-8);

(2) File and directory operations should support Unicode characters (presuming that the target file system does so);

(3) Exception messages and exception information should support Unicode characters;

(4) Command lines should support Unicode characters (presuming that the target system allows these).

[Editor's note: The Swiss comment ends here. See also the discussion section.]

(5) Simple case folding should be provided as an operation in Ada.Characters.Handling, so that case-insensitive comparisons (as opposed to case conversions) of strings can be accomplished.

!wording

For (1), add the following after 2.1(16/2):

Implementation Advice

An Ada implementation should accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where line endings are marked by the pair Carriage Return/Line Feed (16#0D# 16#0A#) and every other character is represented by its code point.

AARM Reason: This is simply recommending that an Ada implementation be able to directly process the ACATS, which is provided in the described format. Note that files that only contain characters with code points in the first 128 (which is the majority of the ACATS) are represented in the same way in both UTF-8 and in "plain" string format. The ACATS includes a BOM in files that have any characters with code points greater than 127. Note that the BOM contains characters not legal in Ada source code, so an implementation can use that to automatically distinguish between files formatted as plain Latin-1 strings and UTF-8 with BOM.

Delete AARM note 2.1(18.a.1/3).

[Editor's note: I took the simplest approach here, and treated 7-bit ASCII input files as a subset of UTF-8. (The ACATS only uses 7-bit ASCII and UTF-8, never 8-bit Latin-1.) As such, only a single requirement was needed. Practically, however, it's likely that compilers will treat pure 8-bit input (with no BOM) differently than UTF-8 input (with a BOM). I wasn't sure whether it was worthwhile to describe both formats; it mostly seemed like more wording to me. But this is easy to change.]

[Editor's note: I do not know what to do with the Note 2.1(18) and the associated AARM note. This is still *strictly* true, because the language only *recommends* (as opposed to *specifies*) a format. OTOH, it seems misleading to me. My preference is to delete it and move a modified version of the AARM note onto this new Implementation Advice. Bob thinks it would be better to delete both.]
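[Editor's note: For illustration only (this is not proposed wording): below is a minimal sketch of the BOM test described in the AARM Reason above, assuming the first few characters of the source file have been read into a String. The function name is invented; BOM_8 is the constant declared in Ada.Strings.UTF_Encoding (see A.4.11). An implementation that finds no BOM can process the file as plain Latin-1/ASCII, exactly as before.]

   with Ada.Strings.UTF_Encoding;
   -- True if Source begins with the UTF-8 BOM (16#EF# 16#BB# 16#BF#);
   -- in that case the rest of the file should be decoded as UTF-8,
   -- otherwise it can be processed as a plain Latin-1 string.
   function Starts_With_BOM (Source : String) return Boolean is
      use Ada.Strings.UTF_Encoding;
   begin
      return Source'Length >= BOM_8'Length
        and then Source (Source'First .. Source'First + BOM_8'Length - 1)
                   = BOM_8;
   end Starts_With_BOM;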
[Editor's opinion: I would actually prefer that we made this a requirement, rather than Implementation Advice. Then there is no need for 2.1(18) and the associated notes. In this case, implementations could still appeal to the "impossible or impractical" exception. I've always thought that the lack of a standard source format decreased the portability of Ada source code unnecessarily. OTOH, we can be somewhat more informal in Implementation Advice, so that might help describe this better. And the documentation requirement hopefully will reduce the chances of implementers ignoring it.]

[Editor's note: "code point" is as defined in ISO/IEC 10646; we mention this fact in AARM 3.5.2(11.p/3) but not normatively. Formally, this is not necessary (as we include the definitions of 10646 by reference), but some might find it confusing.]

For (2):

Add after A.7(14):

Implementation Advice

Subprograms that accept the name or form of an external file should allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11) if the target file system allows characters with code points greater than 255 in names. Functions that return the name or form of an external file should return a UTF-8 encoded string starting with a BOM if and only if the result includes characters with code points greater than 255.

Add after A.16(126/2):

Subprograms in the package Directories and its children that accept a string should allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11) if the target file system allows characters with code points greater than 255 in any part of a full name. Functions in the package Directories and its children that return a string should return a UTF-8 encoded string starting with a BOM if and only if the result includes characters with code points greater than 255.

For (3):

Add after 11.4.1(19):

Functions Exception_Message and Exception_Information should return a UTF-8 encoded string starting with a BOM (see A.4.11) if and only if the result includes characters with code points greater than 255.

AARM Discussion: Since all of the routines that raise and set the exception message take a string but do not interpret it, we need to say nothing to allow passing UTF-8 encoded strings with a BOM. Since encoding information in this string is a common programming idiom, implementations should not modify any exception message string unless it starts with a BOM and does not contain any characters with code points greater than 255.

For (4):

Add after A.15(21):

Implementation Advice

Functions Argument and Command_Name should return a UTF-8 encoded string starting with a BOM (see A.4.11) if and only if the result includes characters with code points greater than 255.

[Q: Should we do this for Environment_Variables as well? I think not; it's not necessary (you can always put a UTF-8 encoded string there and get it back out without any language discussion).]

For (5), add after A.3.5(21/3):

   function Equal_Case_Insensitive (Left, Right : Wide_String) return Boolean;

Add after A.3.5(61/3):

   function Equal_Case_Insensitive (Left, Right : Wide_String) return Boolean;

Returns True if the strings are the same, that is, if they consist of the same sequence of characters after applying locale-independent simple case folding, as defined by documents referenced in the note in section 1 of ISO/IEC 10646:2011. Otherwise, returns False. This function uses the same method as is used to determine whether two identifiers are the same.

Note that this result is a more accurate comparison than converting the strings to upper case and comparing the results; it is possible that the upper case conversions are the same but this routine will report the strings as different.

[Editor's note: Should the last sentence be a user note or an AARM note instead?]
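[Editor's note: For illustration only (not proposed wording): a sketch of the intended use of the proposed function, assuming the declaration above is added to the package of A.3.5 as described; the enclosing function Is_Quit_Command is invented. The comparison behaves like the compiler's identifier matching (locale-independent simple case folding), rather than like upper-casing both operands.]

   with Ada.Wide_Characters.Handling;
   -- True if Token names the "Quit" command, ignoring case the same
   -- way that identifier matching does.
   function Is_Quit_Command (Token : Wide_String) return Boolean is
      use Ada.Wide_Characters.Handling;
   begin
      return Equal_Case_Insensitive (Token, "Quit");
   end Is_Quit_Command;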
!discussion

The Implementation Advice for source code just says that implementations are recommended to accept the ACATS tests directly as input. As such, this is already true of most Ada implementations and thus should not have any effect on Ada implementations that already support Ada 2005. We're just specifying this in the Standard to increase its visibility.

Similarly, the advice to support file and directory names in UTF-8 was developed for Ada 2005. However, the advice was never put into the Standard (or even the AARM), and thus it never had the visibility to either users or implementers that is needed. As such, implementations have lagged in this area; adding the Implementation Advice (along with the UTF-8 encoding packages) in Ada 2012 should increase the visibility and help remedy this situation.

The alternative of adding Wide_ and Wide_Wide_ versions of every routine involved (that is, Wide_Open, Wide_Create, Wide_Name, and so on) would clutter the I/O packages with routines that would be rarely used. In addition, doing that now would prevent developing a better solution for future versions of Ada (for instance, supporting a representation-independent "Root_String'Class" type and using that as a parameter to operations like Open and Create).

---

We require that routines that return Strings (such as the name of an external file) only return UTF-8 encoded strings when that is necessary, in order to maximize compatibility with existing applications. Otherwise, the appearance of Latin-1 characters in file names would cause a representation incompatible with "plain" strings. The BOM at the head of the UTF-8 string is the marker for the representation change. The Ada.Strings.UTF_Encoding packages can be used to convert the string to a Wide_String or even a Wide_Wide_String as necessary.

We considered using a predefined form to allow UTF-8 encoded names, but that does nothing to solve the problem of returning UTF-8 encoded strings from functions like Name and Form. Using a BOM on both sides is more consistent.
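[Editor's note: For illustration only (not proposed wording): a sketch of the conversions just described, assuming an implementation that follows the Implementation Advice of this AI. The procedure name and file name are invented; Encode, Decode, and BOM_8 are from Ada.Strings.UTF_Encoding and its Wide_Wide_Strings child (see A.4.11).]

   with Ada.Text_IO;
   with Ada.Strings.UTF_Encoding;
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   procedure UTF8_Name_Demo is
      package UTF renames Ada.Strings.UTF_Encoding;
      package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
      -- A name containing a character outside Latin-1 (Greek small
      -- letter alpha), built with 'Val so that this example stays
      -- 7-bit ASCII:
      Alpha : constant Wide_Wide_Character :=
        Wide_Wide_Character'Val (16#03B1#);
      Name  : constant Wide_Wide_String := "test_" & Alpha & ".txt";
      File  : Ada.Text_IO.File_Type;
   begin
      -- Going in: pass the name as a UTF-8 string headed by a BOM.
      Ada.Text_IO.Create (File, Ada.Text_IO.Out_File,
         Name => WWS.Encode (Name, Output_BOM => True));
      -- Coming out: a BOM marks a UTF-8 result; no BOM means an
      -- ordinary Latin-1 string.
      declare
         Raw : constant String := Ada.Text_IO.Name (File);
      begin
         if Raw'Length >= UTF.BOM_8'Length and then
            Raw (Raw'First .. Raw'First + UTF.BOM_8'Length - 1) = UTF.BOM_8
         then
            declare
               Full_Name : constant Wide_Wide_String :=
                 WWS.Decode (Raw (Raw'First + UTF.BOM_8'Length .. Raw'Last));
            begin
               null;  -- process the decoded Unicode name
            end;
         end if;
      end;
      Ada.Text_IO.Close (File);
   end UTF8_Name_Demo;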
---

We could have made some more blanket statement about using UTF-8 encoded strings in operations that pass values to or from the target system. That would be easier (one paragraph to solve issues 2, 3, and 4), but it would separate the Advice a long way from its uses. Since a primary purpose of this advice is to increase its visibility to Unicode users, hiding it in Section 1 would defeat the purpose.

---

The original comment included an extremely misguided suggestion to provide a case-insensitive comparison routine for file names. But file name comparison is not recommended on Windows (where it would be most useful), because the local comparison convention applies to file names. That local convention may be different on different file systems accessible from the local machine! Moreover, any Ada-provided routine would use the Unicode definitions for case comparison, which are locale-independent and thus would not exactly match those used by the file system.

This AI recommends adding a case comparison routine that mimics the equality test for *identifiers* (which is just missing from the package Wide_Characters.Handling). But using that routine (or, for that matter, any sort of equality) on file names is a fool's game. [The author made that mistake on Windows, and has been fighting problems with long and short file names and case changes for years.]

!corrigendum 2.1(16/2)

@dinsa
In a nonstandard mode, the implementation may support a different character repertoire; in particular, the set of characters that are considered @fa<identifier_letter>s can be extended or changed to conform to local conventions.
@dinst
@s8<@i<Implementation Advice>>

An Ada implementation should accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where line endings are marked by the pair Carriage Return/Line Feed (16#0D# 16#0A#) and every other character is represented by its code point.

!corrigendum 11.4.1(19)

@dinsa
Exception_Message (by default) and Exception_Information should produce information useful for debugging. Exception_Message should be short (about one line), whereas Exception_Information can be long. Exception_Message should not include the Exception_Name. Exception_Information should include both the Exception_Name and the Exception_Message.
@dinst
Functions Exception_Message and Exception_Information should return a UTF-8 encoded string starting with a BOM (see A.4.11) if and only if the result includes characters with code points greater than 255.

!corrigendum A.3.5(0)

@dinsc
[A placeholder to cause a conflict; the real wording is found in the conflict file.]

!corrigendum A.7(14)

@dinsa
The exceptions that can be propagated by the execution of an input-output subprogram are defined in the package IO_Exceptions; the situations in which they can be propagated are described following the description of the subprogram (and in clause A.13). The exceptions Storage_Error and Program_Error may be propagated. (Program_Error can only be propagated due to errors made by the caller of the subprogram.) Finally, exceptions can be propagated in certain implementation-defined situations.
@dinst
@s8<@i<Implementation Advice>>

Subprograms that accept the name or form of an external file should allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11) if the target file system allows characters with code points greater than 255 in names. Functions that return the name or form of an external file should return a UTF-8 encoded string starting with a BOM if and only if the result includes characters with code points greater than 255.

!corrigendum A.15(21)

@dinsa
An alternative declaration is allowed for package Command_Line if different functionality is appropriate for the external execution environment.
@dinst
@s8<@i<Implementation Advice>>

Functions Argument and Command_Name should return a UTF-8 encoded string starting with a BOM (see A.4.11) if and only if the result includes characters with code points greater than 255.

!corrigendum A.16(126/2)

@dinsa
Rename should be supported at least when both New_Name and Old_Name are simple names and New_Name does not identify an existing external file.
@dinst
Subprograms in the package Directories and its children that accept a string should allow the use of UTF-8 encoded strings that start with a BOM (see A.4.11) if the target file system allows characters with code points greater than 255 in any part of a full name. Functions in the package Directories and its children that return a string should return a UTF-8 encoded string starting with a BOM if and only if the result includes characters with code points greater than 255.

!ACATS Test

As this is Implementation Advice, it is not formally testable.
There would be value in ignoring the ACATS rules in this case and creating C-Tests that check that UTF-8 strings work in file I/O and directory operations, but implementations can only be failed for incorrect implementations, not non-existent ones.

!ASIS

No change needed.

!appendix

From: Robert Dewar
Sent: Sunday, November 6, 2011  6:48 AM

The RM has never been in the business of source representation, yet in practice we understand that certain things are likely to work. Given the onslaught of complexity with Unicode, I think it would be helpful to define a canonical representation format that all compilers must recognize.

In practice the ACATS defines such a format as lower-case ASCII allowing upper-half Latin-1 graphics, and brackets notation for wide characters, but that's awfully kludgy and suitable only for interchange, not actual use.

I suggest we define a canonical representation in UTF-8 encoding that all compilers must accept. This could be an auxiliary standard.

****************************************************************

From: Tucker Taft
Sent: Sunday, November 6, 2011 10:20 AM

I would agree that we should start requiring support for UTF-8. It seems to have emerged as the one true portable standard.

****************************************************************

From: Robert Dewar
Sent: Sunday, November 6, 2011 10:27 AM

OK, but it's not just "support"; we need a mapping document that describes precisely how Ada sources are expressed in canonical UTF-8 form. You may think you know, taking the obvious expression, but the RM has nothing whatever to say about this mapping.

Things like:

   Is a BOM allowed/required?
   How are format effectors represented?
   Is brackets notation still allowed?

And of course a basic statement that an A is represented as an A (the RM does not say this!).

****************************************************************

From: Randy Brukardt
Sent: Monday, November 7, 2011 6:16 PM

> I would agree that we should start requiring support for UTF-8. It
> seems to have emerged as the one true portable standard.

We already had this discussion in terms of the ACATS. I already eliminated the hated "brackets notation" in favor of UTF-8 formatted tests in ACATS 3.0, and had I ever had enough funding such that I was able to develop/acquire some character tests (like identifier equivalence using Greek and Cyrillic characters), those tests would have been in UTF-8.

> OK, but it's not just "support"; we need a mapping document that
> describes precisely how Ada sources are expressed in canonical
> UTF-8 form. You may think you know, taking the obvious expression, but
> the RM has nothing whatever to say about this mapping.
>
> Things like:
>
>    Is a BOM allowed/required?
>    How are format effectors represented?
>    Is brackets notation still allowed?
>
> And of course a basic statement that an A is represented as an A (the
> RM does not say this!).

The obvious "solution" is the one used by the ACATS, which is to use the standard "Windows" format for the files. (This means in this case the output of Notepad, if there is any confusion.) This means that (1) a BOM is required; (2) line endings are <CR><LF>; other format effectors represent themselves (and only occur in a couple of tests); (3) obviously a compiler can support anything it likes, but it has no "official" standing (and in my personal opinion, it never did; it was just an encoded format to be converted to something practical with a provided tool).
We would need to write something like this up (I thought I had done so in the ACATS documentation, but it seems that I only added a few small notes in 4.8 and 5.1.3). [The ACATS documentation is part of the ACATS, of course; I don't think it is on-line anywhere, else I would have given a link.]

****************************************************************

From: Robert Dewar
Sent: Monday, November 7, 2011 6:38 PM

...
> We already had this discussion in terms of the ACATS. I already
> eliminated the hated "brackets notation" in favor of UTF-8 formatted
> tests in ACATS 3.0, and had I ever had enough funding such that I was
> able to develop/acquire some character tests (like identifier
> equivalence using Greek and Cyrillic characters), those tests would have been in UTF-8.

I would regard such tests as an abominable waste of time, reflecting my view that case equivalence is an evil mistake in Ada 2005. Note that the "hated brackets encoding" had a real function early on of making ACATS tests transportable over a wide range of environments.

...
>> Things like:
>>
>>    Is a BOM allowed/required?
>>    How are format effectors represented?
>>    Is brackets notation still allowed?
>>
>> And of course a basic statement that an A is represented as an A (the
>> RM does not say this!).
>
> The obvious "solution" is the one used by the ACATS, which is to use
> the standard "Windows" format for the files. (This means in this case
> the output of Notepad, if there is any confusion.) This means that (1)
> a BOM is required; (2) line endings are <CR><LF>; other format
> effectors represent themselves (and only occur in a couple of tests);
> (3) obviously a compiler can support anything it likes, but it has no
> "official" standing (and in my personal opinion, it never did; it was
> just an encoded format to be converted to something practical with a
> provided tool).

I assume this is all in the framework of UTF-8 encoding.

> We would need to write something like this up (I thought I had done so
> in the ACATS documentation, but it seems that I only added a few small
> notes in 4.8 and 5.1.3). [The ACATS documentation is part of the
> ACATS, of course; I don't think it is on-line anywhere, else I would
> have given a link.]

I did not realize that the ACATS specified UTF-8 encoding? In what context? I know it decided to represent sources using Wide_Wide_Character or some such, but that has nothing to do with source encoding.

****************************************************************

From: Randy Brukardt
Sent: Monday, November 7, 2011 7:40 PM

I'm not sure what you mean by "specified UTF-8 encoding". The ACATS doesn't "specify" any encoding, but it is distributed with particular encodings (described in the documentation), and virtually all Ada compilers choose to support processing the ACATS tests directly.

There are a number of tests in ACATS 3.0 that are distributed in UTF-8 encoding (they have the ".au" extension); all of the rest are distributed in 7-bit ASCII. And in both cases, the files are formatted as for Windows (originally MS-DOS, which got the format from CP/M, which got the format from some DEC OS...). I know we discussed this here some years back (because otherwise I surely would not have changed the distribution format).

****************************************************************

From: Robert Dewar
Sent: Monday, November 7, 2011 7:51 PM

OK, I understand. For interest, do the UTF-8 files start with a BOM?
****************************************************************

From: Randy Brukardt
Sent: Monday, November 7, 2011 7:59 PM

Yes, the files start with a BOM. (I just went back and rechecked them with a hex editor.) I believe that I used Notepad to create the files (wanted the least common denominator, and Notepad surely qualifies as "least" ;-).

****************************************************************

From: Robert Dewar
Sent: Sunday, November 6, 2011 6:55 AM

It's really pretty horrible to use VT in sources to end a line; this is an ancient bow to old IBM line printers. I think we should define the use of this format effector as obsolescent, and catch it using No_Obsolescent_Features.

Not sure about FF; it's certainly horrible to use it as a terminator for a source line, but I have seen people use it in place of pragma Page. I think this should probably also be considered obsolescent, but am not so concerned about that one.

This is certainly not a vital issue!

****************************************************************

From: Tucker Taft
Sent: Sunday, November 6, 2011 10:14 AM

I see no harm in treating these as white space. I think the bizarreness is treating these as line terminators, since no modern operating system treats them as such, causing line numbers to mismatch between Ada's line counting and the line counting of other tools.

****************************************************************

From: Robert Dewar
Sent: Sunday, November 6, 2011 10:22 AM

But you must treat them as line terminators in the logical sense; the RM insists on this. That is, you must have SOME representation for VT and FF, though strictly it does not have to be the corresponding ASCII characters.

BTW, in GNAT, we distinguish between physical line terminators (like CR, LF, or CR/LF) and logical line terminators (like FF and VT), precisely to avoid the mismatch you refer to.

****************************************************************

From: Robert Dewar
Sent: Sunday, November 6, 2011 10:24 AM

It's interesting that for NEL (NEXT LINE, 16#85#), our decision in GNAT is to treat this in 8-bit mode as a character that can appear freely in comments, but not in program text. The RM requires that you recognize a NEL as end of line, so you need some representation for a NEL; we solve this in GNAT by saying that a NEL is only recognized in UTF-8 encoding.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 11, 2012 11:23 PM

We have received the following comment from Switzerland. I'm posting it here so that we can discuss it, since we'll have to decide how we're going to respond to it by the next meeting.

Following is the comment as I received it, the only difference being that I've copied it into HTML. I'm hopeful that putting it into HTML will make it more readable for most of us (in the original .DOC file, all I get for most of the examples are lines of square boxes, so my making a PDF is not going to be helpful). [Editor's note: It's been converted to plain ASCII here, which probably will render the Unicode parts even more unreadable.]

I'll hold my comments for another message.

----------------------

The Unicode support in Ada 2012 is incomplete and inconsistent.
We would like to illustrate this with a hypothetical but nevertheless realistic example application: The program should sort files from one directory "input" into either of the two directories "match" or "nomatch", based on whether the filename of the files is listed in the text file whose name is passed on the command line. The program should do this for all files that match the optional wildcard file specification on the command line (e.g. "sortfiles matchlist *.ad?"). The listfile shall be treated as being in the native encoding of the system unless it has a UTF BOM.

This sounds simple but actually cannot be implemented in Ada 2012 in a way that would work for all filenames - at least not on Windows and other operating systems where filenames support Unicode.

The package Ada.Command_Line does not support Wide_Strings, so what happens when somebody would like to call "sortfiles ?? ?? *.txt"? The same problem applies to the packages Directories and Environment_Variables.

The next problem would be to open the listfile. This cannot be done with Text_Io because there is no Wide_String version (for parameter Name) of Open.

Reading the contents of the listfile is also a problem. Should Text_Io or Wide_Text_Io be used? How will Wide_Text_Io interpret the file (with or without a BOM)? If Text_Io has to be used to read the file in the native non-UTF system encoding, how can the returned String be converted into Wide_String? Ada.Strings.UTF_Encoding does not support this. Most programs that use Wide_Strings are not purely Unicode - including the data and files they handle - and therefore they will need conversion routines from and to the native system (non-UTF) encoding.

To compare the filenames with the lines in the listfile, Wide_String versions of Ada.Strings.Equal_Case_Insensitive would be needed.

In case of exceptions, it really makes sense to include some information about what went wrong in the exception message. In the example application above, this would be the Unicode filename of the file operation that fails. In other cases this could be some (Unicode) input data that could not be parsed, or an enumeration literal. However, neither Ada.Exceptions.Raise_Exception nor Exception_Message supports Wide_Strings. Due to the fact that Exception_Information usually contains a stack trace and Ada identifiers can be Unicode, Exception_Information needs to support Wide_Strings as well. An exception in a "procedure ???(?? : String)" should create a readable stack trace too.

The inability of standard Ada to fully support Unicode is a serious deficiency whose importance should not be underestimated. Computers are no longer the sole preserve of the western hemisphere, and accordingly it is unacceptable for companies to produce products that do not function correctly in every country in the world or, for that matter, every country in the EU! Despite all the hard work that has gone into continually improving Ada to make it more viable as the implementation language of choice, the lack of Unicode support within standard Ada could yet be a valid reason not to use Ada.

Note: A compiler should support Unicode filenames as well. A package ?? will most likely be stored in a source file ??.adb. It would probably make sense to make it mandatory for all Ada 2012 compilers to accept source files in UTF-8 encoding (including a UTF-8 BOM). Otherwise it would be difficult to create portable source files.
****************************************************************

From: Robert Dewar
Sent: Thursday, January 12, 2012 7:24 AM

I am opposed to doing anything in the context of the current work to respond to this comment. We have already spent too much effort, both at the definition and implementation level, on character set internationalization. Speaking from AdaCore's point of view, we have not had a single enhancement request or suggestion, let alone a defect report, in this area.

I really think that any further work here should be user based and not language designer based. I think it fine to form a specific group to investigate what needs to be done and gather this user input, but it would be a mistake to rush to add anything to the current amendment.

****************************************************************

From: Randy Brukardt
Sent: Thursday, January 12, 2012 12:23 AM

First, some general comments:

(1) This comment is really way too late. The sort of massive overhaul of the standard suggested here would require quite a lot of additional work: determining what to do, writing wording for it, editing it, and so on. We'd have to delay the standard at least another year in order to do that.

(2) The ARG has considered most of these issues in the past. In particular, we discussed Unicode file names, and decided that the name pollution of "Wide_Open", "Wide_Create", and on and on and on was just over the top. Instead, we preferred that implementations support UTF-8 file names. Indeed, we designed Ada.Strings.UTF_Encoding such that it could be used easily for this purpose.

There probably is a reasonable comment that this intent is not communicated very well; I don't think there are even any AARM notes documenting this intent. Perhaps we should add a bit of Implementation Advice about using UTF-8 encoding if that is needed on the target system. The same advice could apply to Exception Message, Command Line, and the other functions noted by Switzerland.

(3) The entire "Wide_Wide_Wide_Wide_" "solution" is just awful, and making it worse is not reasonable. We really need to rethink the entire area of string processing to see if there is some way to decouple encoding from semantics. Doing so is going to require a lot of work and thought. The alternative is to continue to make a bigger and bigger mess, especially as we have to support combinations in packages (Wide filenames in regular Text_IO, regular filenames in Wide_Text_IO, and all of the other combinations). But coming up with an alternative, and getting buy-in, is not possible for Ada 2012; it has to wait for Ada 2020 because it will take that long.

And a couple of specific comments:

> To compare the filenames with the lines in the listfile, Wide_String versions
> of Ada.Strings.Equal_Case_Insensitive would be needed.

The author is thoroughly confused if they think this is even a good idea, or possible in general. Microsoft's Windows programming advice is to *never* compare filenames for equality, and specifically to never do so with the local machine's case-insensitive equality. The reason is that these things depend on the locale of the local machine and of the file system, which need not be the same. Moreover, the Windows idea of filename equality is likely to be different from Unicode string equality (which is what I would hope that Equal_Case_Insensitive uses, since it has nothing to do with file names).
And which Unicode version of string equality is he talking about: "Full" case folding, "Simple" case folding, or something else? There is no single answer that is right for all uses, or even for many uses.

Processing of file names can only be done via the file system; the use of general string routines will always lead to problems in corner cases. (On networks with multiple operating systems, it's possible that even case sensitivity will vary depending on the path in use.)

> It would probably make sense to make it mandatory for all Ada 2012 compilers
> to accept source files in UTF-8 encoding (including a UTF-8 BOM).

This is already de-facto required of Ada 2005 implementations, because all Ada compilers have to process the ACATS, and ACATS 3.0 (and later) includes some files encoded in UTF-8. So that is already true of Ada 2012 implementations as well, and in fact this is mentioned in the AARM.

Requiring a standard source file format in the Ada Standard might be more problematic, as it would require specifying the exact behavior of line-end combinations, among other things, which Ada has always stayed away from. In any case, Robert Dewar suggested this months ago, and there will be at least an AI12 on the topic. It seems like a morass to get into at this late point, however. (Robert's concern about the handling of VT and FF is just the tip of the iceberg there, and it would be really easy to introduce a gratuitous incompatibility.)

****************************************************************

From: Tucker Taft
Sent: Thursday, January 12, 2012 8:19 AM

I like your suggestion that we make explicit the recommendation that UTF-8 strings be used by the implementation where Standard.String is specified for most interactions with the file system, identifier names, command lines, etc. This is similar to our recommendation in ASIS that Wide_String be interpreted as UTF-16 in many cases.

I would remove the "way too late" comment. It doesn't seem helpful. Many delegations don't really get together to do a review until they are forced to do so. I would rather emphasize (as you do) the complexity of the problem, rather than the lateness of the comment.

****************************************************************

From: Robert Dewar
Sent: Thursday, January 12, 2012 8:37 AM

> I like your suggestion that we make explicit the recommendation that
> UTF-8 strings be used by the implementation where Standard.String is
> specified for most interactions with the file system, identifier
> names, command lines, etc. This is similar to our recommendation in
> ASIS that Wide_String be interpreted as UTF-16 in many cases.

As implementation advice, I would have no objection.

> I would remove the "way too late" comment. It doesn't seem helpful.
> Many delegations don't really get together to do a review until they
> are forced to do so. I would rather emphasize (as you do) the
> complexity of the problem, rather than the lateness of the comment.

Implementation advice is a good way to handle the source representation issue as well, and meets at least one of the delegation's suggestions.

****************************************************************

From: Randy Brukardt
Sent: Thursday, January 12, 2012 1:44 PM

> I like your suggestion that we make explicit the recommendation that
> UTF-8 strings be used by the implementation where Standard.String is
> specified for most interactions with the file system, identifier
> names, command lines, etc.
> This is similar to our recommendation in
> ASIS that Wide_String be interpreted as UTF-16 in many cases.

Good. I note that Robert does as well, so I'll write up a draft this way for discussion in Houston.

> I would remove the "way too late" comment. It doesn't seem helpful.
> Many delegations don't really get together to do a review until they
> are forced to do so. I would rather emphasize (as you do) the
> complexity of the problem, rather than the lateness of the comment.

This is just a matter of emphasis, isn't it? I said (or meant, at least) "it's way too late because the problem is complex and a lot of changes would be necessary". You are saying "the problem is complex and a lot of changes would be necessary, so it is too hard to address in the time remaining". Not much difference in facts there, just wording.

After all, just because a problem is complex doesn't mean that the ARG shouldn't try to solve it. The important point is that we don't think it is a good idea to try to solve it in the time remaining, because we are as likely to get it wrong as right. And wrong won't help anyone, just make the language more complex.

****************************************************************

From: Tucker Taft
Sent: Thursday, January 12, 2012 2:24 PM

> This is just a matter of emphasis, isn't it?...

Yes, but it seems important as far as being appropriately "responsive" to the various delegations. I just think we can communicate our reasons without sounding school-marmish.

****************************************************************

From: Randy Brukardt
Sent: Thursday, January 12, 2012 3:02 PM

I'm happy to leave the "political correctness" to others who are better suited for that than I; I just wanted to make sure that we really agree on the fundamentals of this topic before I spend several hours crafting an AI.

****************************************************************

From: Ben Brosgol
Sent: Thursday, January 12, 2012 3:17 PM

Don't confuse politeness (a social skill) with political correctness (a contrived way of expressing things to avoid any possibility of causing offense). For a blunter definition of "political correctness", see http://www.urbandictionary.com/define.php?term=politically%20correct

****************************************************************

From: Randy Brukardt
Sent: Friday, February 10, 2012 11:20 PM

Robert Dewar wrote:

> It's really pretty horrible to use VT in sources to end a line; this
> is an ancient bow to old IBM line printers. I think we should define
> the use of this format effector as obsolescent, and catch it using
> No_Obsolescent_Features.
>
> Not sure about FF; it's certainly horrible to use it as a terminator
> for a source line, but I have seen people use it in place of pragma
> Page. I think this should probably also be considered obsolescent, but
> am not so concerned about that one.
>
> This is certainly not a vital issue!

Tucker replied:

> I see no harm in treating these as white space.
> I think the bizarreness is treating these as line terminators, since
> no modern operating system treats them as such, causing line numbers
> to mismatch between Ada's line counting and the line counting of other
> tools.

I would inject a mild note of caution in terms of FF. One could argue that it makes sense for the interpretation of sources to match the implementation's Text_IO (so that Ada programs can write source text).
If the programmer calls Text_IO.New_Page, they're probably going to get an FF in their file (that happens with most of the Ada compilers that I've used). Similarly, reading an FF will cause the end of a line if it is not already ended (although Text_IO will probably not write such a file).

I don't give a darn about VT, though, other than to note that there is a compatibility problem to making a change. (But it is minuscule...)

Robert replied:

> But you must treat them as line terminators in the logical sense; the
> RM insists on this. That is, you must have SOME representation for VT
> and FF, though strictly it does not have to be the corresponding
> ASCII characters.

The notion that the Standard somehow requires having some representation for every possible character in every source form is laughable in my view. The implication that this is required only appears in the AARM, and only in a single note. There is absolutely nothing normative about such a "requirement". It makes about as much sense as requiring that an Ada compiler only run on a machine with a two-button mouse! A given source format will represent whatever characters it can (or desires), and that is it.

However, with the proposed introduction of Implementation Advice that compilers accept UTF-8 encoded files, where every character is represented by its code point, this becomes more important. If such a UTF-8 file contains a VT character, then the standard requires it to be treated as a line terminator. Period. Treating it as white space would require a non-standard mode (where the "canonical representation" was interpreted other than as recommended by the standard), or of course ignoring the IA completely. That seems bad if existing compilers are doing something else with the character.

I'm not sure what the right answer is here. We could add an Implementation Permission that VT and FF represent 0 line terminators, or just do that for VT (assuming FF is used in Text_IO files), or say something about Text_IO, or something else. (We don't need anything to allow <CR><LF> to be treated as a single line terminator - 2.2(2/3) already says this.)

For Janus/Ada, I'd probably not make any change here (the only time I've ever seen a VT in a text file is in the ACATS test for this character, so I think it is essentially irrelevant how it's handled, and for FF the same handling as Text_IO seems right), and I'd rather not be forced to do so.

****************************************************************

From: Bob Duff
Sent: Saturday, February 11, 2012 9:03 AM

> The notion that the Standard somehow requires having some
> representation for every possible character in every source form is
> laughable in my view.

Not sure what you mean by "laughable", but formally speaking, OF COURSE an implementation must support all the characters in the character set the standard requires. Refusing to compile programs containing VT would be just as nonconforming as refusing to compile programs containing the letter "A".

Practically speaking, on the other hand, I agree with "I don't give a darn about VT". But if it never occurs in Ada programs (other than ACVC tests), then there's no reason to change the rules.

> ...The
> implication that this is required only appears in the AARM, and only
> in a single note. There is absolutely nothing normative about such a
> "requirement".

I disagree. Implementations can't just make up legality rules.

> I'm not sure what the right answer is here. We could add an
> Implementation Permission ...
See what I said the other day about Implementation Permissions.

I say, insufficiently broken. And it introduces an incompatibility: if a source contains "-- blah <VT> X := X + 1;", the suggested change will comment out the assignment statement. Not likely to occur, but pretty nasty if it does.

****************************************************************

From: Robert Dewar
Sent: Sunday, February 12, 2012 10:10 AM

>> The notion that the Standard somehow requires having some
>> representation for every possible character in every source form is
>> laughable in my view.
>
> Not sure what you mean by "laughable", but formally speaking, OF
> COURSE an implementation must support all the characters in the
> character set the standard requires. Refusing to compile programs
> containing VT would be just as nonconforming as refusing to compile
> programs containing the letter "A".

I 100% agree with Bob on this, and do not know where Randy is coming from. I agree we could have an implementation permission to ignore VT, but I really think any change to the handling of FF would generate gratuitous incompatibilities in existing programs, where the use of FF to get new pages in listings is not uncommon.

> Practically speaking, on the other hand, I agree with "I don't give a
> darn about VT". But if it never occurs in Ada programs (other than
> ACVC tests), then there's no reason to change the rules.

Right, changing the rules does not help existing implementations; after all, it makes extra work!

>> ...The
>> implication that this is required only appears in the AARM, and only
>> in a single note. There is absolutely nothing normative about such a
>> "requirement".
>
> I disagree. Implementations can't just make up legality rules.

Yes, exactly.

>> I'm not sure what the right answer is here. We could add an
>> Implementation Permission ...
>
> See what I said the other day about Implementation Permissions.
>
> I say, insufficiently broken. And it introduces an incompatibility:
> if a source contains "-- blah <VT> X := X + 1;", the suggested
> change will comment out the assignment statement. Not likely to
> occur, but pretty nasty if it does.

Yes, exactly.

Let's do nothing here; no reason to make a change, not sufficiently broken!

****************************************************************

From: Randy Brukardt
Sent: Monday, February 11, 2012 2:19 PM

> > The notion that the Standard somehow requires having some
> > representation for every possible character in every source form is
> > laughable in my view.
>
> Not sure what you mean by "laughable", but formally speaking, OF
> COURSE an implementation must support all the characters in the
> character set the standard requires. Refusing to compile programs
> containing VT would be just as nonconforming as refusing to compile
> programs containing the letter "A".

But this has *nothing* to do with the representation of the source. What I was saying is that a source representation does not necessarily have to have a representation for VT (or PI or the euro sign or any other character). I think it is laughable to think that it ought to. I definitely agree that if the source representation *does* have a representation for VT, then it has to follow the standard associated with that character.

> Practically speaking, on the other hand, I agree with "I don't give a
> darn about VT". But if it never occurs in Ada programs (other than
> ACVC tests), then there's no reason to change the rules.
>
> > ...The
> > implication that this is required only appears in the AARM, and only
> > in a single note.
> > There is absolutely nothing normative about such a "requirement".
>
> I disagree. Implementations can't just make up legality rules.

I never said anything about *making up legality rules*. And I surely was not considering *rejecting* programs containing VT. However, I think it would be perfectly OK if 16#0B# happened to be interpreted as a space in some source representation; there cannot be a requirement on *all* source representations.

More generally, there is almost no requirement that any particular character be representable in a particular source form. Ada 83 made this clear, by trying to accommodate keypunch programs (a practice that was archaic even back in 1980). Pretty much the only requirement is for the digits, letters, space, some line ending, and the delimiters defined in 2.2 (minus the allowed replacements). Anything else is optional. (One could imagine having some $ notation [square brackets not being in the 64 characters of the Unisys keypunches I used in the Paleozoic era of computing] for additional characters, but that is not helpful unless the tools also support it. Otherwise, they're just unintelligible gibberish in the text, making it much harder to read and understand.)

The only indication to the contrary is the second sentence of 2.1(18.a/2), and it does not follow from any normative requirements (there is no requirement or need in Ada to translate *back* from the standard characters of an internal compiler representation to Ada source). IMHO, that sentence is complete fantasy.

Anyway, this will become irrelevant if we adopt the Implementation Advice for a standard source form, since that form will contain all of the "standard" characters. It will still be optional (of course) to support this form, but implementers that don't support it will have to explain themselves. (Which is easy to do, at least in my case: no one has asked.)

****************************************************************

From: Randy Brukardt
Sent: Monday, February 11, 2012 2:27 PM

...
> > I say, insufficiently broken. And it introduces an incompatibility:
> > if a source contains "-- blah <VT> X := X + 1;", the suggested
> > change will comment out the assignment statement. Not likely to
> > occur, but pretty nasty if it does.
>
> Yes, exactly.
>
> Let's do nothing here; no reason to make a change, not sufficiently
> broken!

I personally agree with this; there is no important reason for a change. However, someone posting using "Robert Dewar"'s name back in November seemed to think otherwise (call this person Robert Dewar #1):

> It's really pretty horrible to use VT in sources to end a line; this
> is an ancient bow to old IBM line printers. I think we should define
> the use of this format effector as obsolescent, and catch it using
> No_Obsolescent_Features.
>
> Not sure about FF; it's certainly horrible to use it as a terminator
> for a source line, but I have seen people use it in place of pragma
> Page. I think this should probably also be considered obsolescent, but
> am not so concerned about that one.

Tucker jumped in to agree (saying that these both should be interpreted as a space), and then the topic dropped.

I was prepared to ignore this thought forever, but when we decided to put an Implementation Advice for a standard UTF-8 source format on the agenda for the upcoming meeting (as a partial response to the Swiss comment), this seemed to be more important.
After all, in that standard format, every character represents itself (I included wording to say that, as pointed out by Robert in a different thread), and that surely includes VT.

So Robert #1 wants a change to the handling of VT, and Robert #2 does not. Not sure which Robert to pay attention to! Note that this is pretty much our last chance to make any changes here; once the standard format is in use, changing its interpretation would be too incompatible to contemplate.

****************************************************************

From: Robert Dewar
Sent: Monday, February 13, 2012 2:31 PM

> Tucker jumped in to agree (saying that these both should be
> interpreted as a space), and then the topic dropped.

Well, there is nothing inconsistent between thinking something should be fixed or changed, and deciding that it is not worth the trouble!

> I was prepared to ignore this thought forever, but when we decided to
> put an Implementation Advice for a standard UTF-8 source format on the
> agenda for the upcoming meeting (as a partial response to the Swiss
> comment), this seemed to be more important. After all, in that
> standard format, every character represents itself (I included wording
> to say that, as pointed out by Robert in a different thread), and that
> surely includes VT.
>
> So Robert #1 wants a change to the handling of VT, and Robert #2 does
> not. Not sure which Robert to pay attention to! Note that this is
> pretty much our last chance to make any changes here; once the
> standard format is in use, changing its interpretation would be too
> incompatible to contemplate.

Leave VT as is; insufficiently broken to be worth fixing.

****************************************************************

From: Robert Dewar
Sent: Monday, February 13, 2012 2:32 PM

> But this has *nothing* to do with the representation of the source.
> What I was saying is that a source representation does not necessarily
> have to have a representation for VT (or PI or the euro sign or any
> other character). I think it is laughable to think that it ought to.

I find this incomprehensible. Of course the source representation must allow all characters to be represented. As Bob Duff says, refusing to have a representation for VT would be equivalent to refusing to have a representation for 'a'. There is no distinction.

I am completely nonplussed by Randy's "laughable" view here ????

> Anyway, this will become irrelevant if we adopt the Implementation
> Advice for a standard source form, since that form will contain all of
> the "standard" characters. It will still be optional (of course) to
> support this form, but implementers that don't support it will have to
> explain themselves. (Which is easy to do, at least in my case: no one
> has asked.)

Actually, you have a positive requirement to document any failure to follow IA, whether you are asked or not.

****************************************************************

From: Bob Duff
Sent: Monday, February 13, 2012 3:10 PM

> The only indication to the contrary is the second sentence of
> 2.1(18.a/2),

I'm completely mystified -- I must be totally misunderstanding what you mean.

> Anyway, this will become irrelevant if we adopt the Implementation
> Advice

OK, if it's irrelevant, I won't bother arguing about it.

I object to "fixing" 2.1(18.a/2), and I object to adding any normative text that tries to say what 2.1(18.a/2) is saying. If you are not proposing to do either of those things, then I'll drop the matter. Otherwise, I'll answer in more detail.
****************************************************************

From: Randy Brukardt
Sent: Monday, February 13, 2012 3:16 PM

> > But this has *nothing* to do with the representation of the source.
> > What I was saying is that a source representation does not
> > necessarily have to have a representation for VT (or PI or the euro
> > sign or any other character). I think it is laughable to think that
> > it ought to.
>
> I find this incomprehensible. Of course the source representation must
> allow all characters to be represented.
> As Bob Duff says, refusing to have a representation for VT would be
> equivalent to refusing to have a representation for 'a'. There is no
> distinction.

Why? There is nothing in the Standard that requires that. It requires an interpretation for each character that appears in the source, but it cannot say anything about which characters can appear in any particular source. How could it? So why do we care which characters can appear? It's actively harmful to include hacks like "brackets notation" in source to meet such a non-requirement in straight 8-bit formats -- to take one such example.

And I agree with you: there is no distinction! It's perfectly OK to not allow 'a' (so long as 'A' is allowed). And indeed, the only reason for saying that you need either 'a' or 'A' is one of practicality: you can't write useful Ada programs without the various reserved words including 'A'.

> I am completely nonplussed by Randy's "laughable" view here ????

I'm completely flabbergasted that anyone would think that there is any requirement or value to a requirement otherwise. Moreover, in the absence of a customer requirement, why should any Ada implementer spend time on this (in any way)?

Anyway, this is probably going to be irrelevant down the line, so it probably does not need to be resolved.

> > Anyway, this will become irrelevant if we adopt the Implementation
> > Advice for a standard source form, since that form will contain all
> > of the "standard" characters. It will still be optional (of course)
> > to support this form, but implementers that don't support it will
> > have to explain themselves. (Which is easy to do, at least in my
> > case: no one has asked.)
>
> Actually, you have a positive requirement to document any failure to
> follow IA, whether you are asked or not.

Sorry, you misunderstood: that is what my documentation would say: "We didn't implement UTF-8 formats, because no one has asked for support for identifiers and string literals with characters other than those in Latin-1."

****************************************************************

From: Randy Brukardt
Sent: Monday, February 13, 2012 3:23 PM

> > The only indication to the contrary is the second sentence of
> > 2.1(18.a/2),
>
> I'm completely mystified -- I must be totally misunderstanding what
> you mean.

Example: If you have an Ada source in some 6-bit character format (say the old keypunch), does it have to have some mechanism to represent characters other than those naturally present in that format? I say no; it would be harmful, as the meaning would be inconsistent with what "normal" tools for that format would expect.

> > Anyway, this will become irrelevant if we adopt the Implementation
> > Advice
>
> OK, if it's irrelevant, I won't bother arguing about it.
> I object to "fixing" 2.1(18.a/2), and I object to adding any normative
> text that tries to say what 2.1(18.a/2) is saying.
> If you are not proposing to do either of those things, then I'll drop
> the matter.
> Otherwise, I'll answer in more detail.

The Implementation Advice would require a UTF-8 format where every code point represents the associated character. Thus it renders 2.1(18.a/2) essentially irrelevant, as any implementation that follows the advice would trivially meet the requirement. And any implementation that doesn't would do so for good (and documented) reasons, and it would seem silly to care beyond that (let the market decide).

I would suggest deleting that AARM note, along with the associated RM note, if the advice is added -- but it is not clear-cut and we'll have to discuss this in Houston.

****************************************************************

From: Bob Duff
Sent: Monday, February 13, 2012 3:39 PM

> I would suggest deleting that AARM note, along with the associated RM
> note, if the advice is added -- but it is not clear-cut and we'll have
> to discuss this in Houston.

OK. I don't object to deleting 2.1(18). And if we do that, then I don't object to deleting the following AARM annotations.

The purpose of 2.1(18.a/2) was to explain 2.1(18). People would say things like, "What stops an impl from saying the source rep is FORTRAN, and thereby passing off a FORTRAN compiler as a conforming Ada impl?" The answer is: you can only do that if you can explain the mapping FORTRAN<-->Ada, which ain't likely. ;-)

****************************************************************

From: Robert Dewar
Sent: Monday, February 13, 2012 5:42 PM

> Why? There is nothing in the Standard that requires that. It requires
> an interpretation for each character that appears in the source, but
> it cannot say anything about which characters can appear in any
> particular source. How could it? So why do we care which characters
> can appear? It's actively harmful to include hacks like "brackets
> notation" in source to meet such a non-requirement in straight 8-bit
> formats -- to take one such example.

This would say that you regard almost any string literal as non-portable. I find that ludicrous.

> I'm completely flabbergasted that anyone would think that there is any
> requirement or value to a requirement otherwise. Moreover, in the
> absence of a customer requirement, why should any Ada implementer
> spend time on this (in any way)?

Because the standard specifies the abstract character set that must be accepted.

****************************************************************

From: Robert Dewar
Sent: Monday, February 13, 2012 5:42 PM

> Example: If you have an Ada source in some 6-bit character format (say
> the old keypunch), does it have to have some mechanism to represent
> characters other than those naturally present in that format? I say
> no; it would be harmful, as the meaning would be inconsistent with
> what "normal" tools for that format would expect.

Yes, of COURSE it does!

****************************************************************

From: Randy Brukardt
Sent: Monday, February 13, 2012 7:03 PM

> > Why? There is nothing in the Standard that requires that. It
> > requires an interpretation for each character that appears in the
> > source, but it cannot say anything about which characters can appear
> > in any particular source. How could it? So why do we care which
> > characters can appear? It's actively harmful to include hacks like
> > "brackets notation" in source to meet such a non-requirement in
> > straight 8-bit formats -- to take one such example.
>
> This would say that you regard almost any string literal as
> non-portable. I find that ludicrous.

Yes, of course.
More generally, Ada (83-95-2005) has nothing to say about source formats, so by definition there is no portability of Ada source. And there surely is no requirement *in the Standard* that you can convert from one source format to another. Indeed, I've always considered this a major hole in Ada's definition; I'd rather have the standard clearly define this one way or another.

As a practical matter, of course, all Ada compilers support processing the ACATS, so there is in fact a common interchange format. But with a handful of exceptions, that only requires 7-bit ASCII support, so if you are using anything else, it's at least potentially non-portable. And if you use the conversion tools provided by the target system, you're probably going to lose information.

> > I'm completely flabbergasted that anyone would think that there is
> > any requirement or value to a requirement otherwise. Moreover, in
> > the absence of a customer requirement, why should any Ada
> > implementer spend time on this (in any way)?
>
> Because the standard specifies the abstract character set that must be
> accepted.

Not at all; it defines the handling of each character that *might* appear in Ada source. It never says anything *requiring* that you can actually write those characters (and I'm not sure that it can). Please find me *any* text that says the compiler *must* accept source containing the PI character (to take one example).

Anyway, we can clearly defuse this question by simply putting in the Standard that processing UTF-8 is required. And even without *requiring* that, simply recommending it will definitely improve the situation (any implementation following the recommendation will have a clear, common format for Ada source code). I'd actually be in favor of requiring it, even though that would make Janus/Ada non-compliant in this area. The only reason for not doing that, IMHO, is to avoid making work for implementers who have no customer demand for it. (And if everyone agrees with you, then there cannot be much actual work involved for other implementations.)

****************************************************************