!standard 11.4.1(19)                                19-01-04  AI12-0021-1/06
!standard A.8.1(15)
!standard A.8.2(28.3/4)
!standard A.8.4(18)
!standard A.10.1(85)
!standard A.12.1(26)
!standard A.15.1(0)
!standard A.16.2(0)
!standard A.17.1(0)
!class Amendment 12-03-13
!status Amendment 1-2012 18-12-10
!status ARG Approved 10-0-2  18-12-10
!status work item 12-02-25
!status received 12-02-25
!priority High
!difficulty Hard
!subject Additional internationalization of Ada

!summary

Add support for the use of the entire set of characters from ISO/IEC 10646:2017 for file and directory names by the operations of the Annex A facilities.

!proposal

In addition to the facilities already provided,

(1) File and directory operations should support Unicode characters (presuming that the target file system does so);

(2) Exception messages and exception information should support Unicode characters;

(3) Command lines should support Unicode characters (presuming that the target system allows these).

!wording

Add after 11.4.1(19):

   NOTES
   UTF-8 encoding (see A.4.11) can be used to represent non-ASCII characters in exception messages.

Add after A.8.1(15):

   ...
   -- Enclosing package Ada.Sequential_IO
   package Wide_File_Names is

      -- File management

      procedure Create(File : in out File_Type;
                       Mode : in File_Mode := Out_File;
                       Name : in Wide_String := "";
                       Form : in Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_String;
                      Form : in Wide_String := "");

      function Name (File : in File_Type) return Wide_String;

      function Form (File : in File_Type) return Wide_String;

   end Wide_File_Names;

   package Wide_Wide_File_Names is

      -- File management

      procedure Create(File : in out File_Type;
                       Mode : in File_Mode := Out_File;
                       Name : in Wide_Wide_String := "";
                       Form : in Wide_Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_Wide_String;
                      Form : in Wide_Wide_String := "");

      function Name (File : in File_Type) return Wide_Wide_String;

      function Form (File : in File_Type) return Wide_Wide_String;

   end Wide_Wide_File_Names;

Add after A.8.2(28.3/4):

   The nested package Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_String is used instead of String for the name and form of the external file.

   The nested package Wide_Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_Wide_String is used instead of String for the name and form of the external file.

Add after A.8.4(18):

   ...
   -- Enclosing package Ada.Direct_IO
   package Wide_File_Names is

      -- File management

      procedure Create(File : in out File_Type;
                       Mode : in File_Mode := Inout_File;
                       Name : in Wide_String := "";
                       Form : in Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_String;
                      Form : in Wide_String := "");

      function Name (File : in File_Type) return Wide_String;

      function Form (File : in File_Type) return Wide_String;

   end Wide_File_Names;

   package Wide_Wide_File_Names is

      -- File management

      procedure Create(File : in out File_Type;
                       Mode : in File_Mode := Inout_File;
                       Name : in Wide_Wide_String := "";
                       Form : in Wide_Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_Wide_String;
                      Form : in Wide_Wide_String := "");

      function Name (File : in File_Type) return Wide_Wide_String;

      function Form (File : in File_Type) return Wide_Wide_String;

   end Wide_Wide_File_Names;

Add after A.10.1(85):

   ...
   -- Enclosing package Ada.Text_IO
   package Wide_File_Names is

      -- File management

      procedure Create (File : in out File_Type;
                        Mode : in File_Mode := Out_File;
                        Name : in Wide_String := "";
                        Form : in Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_String;
                      Form : in Wide_String := "");

      function Name (File : in File_Type) return Wide_String;

      function Form (File : in File_Type) return Wide_String;

   end Wide_File_Names;

   package Wide_Wide_File_Names is

      -- File management

      procedure Create (File : in out File_Type;
                        Mode : in File_Mode := Out_File;
                        Name : in Wide_Wide_String := "";
                        Form : in Wide_Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_Wide_String;
                      Form : in Wide_Wide_String := "");

      function Name (File : in File_Type) return Wide_Wide_String;

      function Form (File : in File_Type) return Wide_Wide_String;

   end Wide_Wide_File_Names;

Add after A.12.1(26):

   ...
   -- Enclosing package Ada.Stream_IO
   package Wide_File_Names is

      -- File management

      procedure Create (File : in out File_Type;
                        Mode : in File_Mode := Out_File;
                        Name : in Wide_String := "";
                        Form : in Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_String;
                      Form : in Wide_String := "");

      function Name (File : in File_Type) return Wide_String;

      function Form (File : in File_Type) return Wide_String;

   end Wide_File_Names;

   package Wide_Wide_File_Names is

      -- File management

      procedure Create (File : in out File_Type;
                        Mode : in File_Mode := Out_File;
                        Name : in Wide_Wide_String := "";
                        Form : in Wide_Wide_String := "");

      procedure Open (File : in out File_Type;
                      Mode : in File_Mode;
                      Name : in Wide_Wide_String;
                      Form : in Wide_Wide_String := "");

      function Name (File : in File_Type) return Wide_Wide_String;

      function Form (File : in File_Type) return Wide_Wide_String;

   end Wide_Wide_File_Names;

Add section A.15.1:

A.15.1 The Packages Wide_Command_Line and Wide_Wide_Command_Line

The packages Wide_Command_Line and Wide_Wide_Command_Line allow a program to obtain the values of its arguments and to set the exit status code to be returned on normal termination.

Static Semantics

The specification of package Wide_Command_Line is the same as for Command_Line, except that each occurrence of String is replaced by Wide_String.

The specification of package Wide_Wide_Command_Line is the same as for Command_Line, except that each occurrence of String is replaced by Wide_Wide_String.

Add section A.16.2:

A.16.2 The Packages Wide_Directories and Wide_Wide_Directories

The packages Wide_Directories and Wide_Wide_Directories provide operations for manipulating files and directories, and their names.

Static Semantics

The specification of package Wide_Directories is the same as for Directories (including its optional child packages Information and Hierarchical_File_Names), except that each occurrence of String is replaced by Wide_String.
The specification of package Wide_Wide_Directories is the same as for Directories (including its optional child packages Information and Hierarchical_File_Names), except that each occurrence of String is replaced by Wide_Wide_String.

Add section A.17.1:

A.17.1 The Packages Wide_Environment_Variables and Wide_Wide_Environment_Variables

The packages Wide_Environment_Variables and Wide_Wide_Environment_Variables allow a program to read or modify environment variables.

Static Semantics

The specification of package Wide_Environment_Variables is the same as for Environment_Variables, except that each occurrence of String is replaced by Wide_String.

The specification of package Wide_Wide_Environment_Variables is the same as for Environment_Variables, except that each occurrence of String is replaced by Wide_Wide_String.

!discussion

These issues defy an easy solution. Changing the behavior of the existing routines would break existing workarounds (which on some targets, like most Linux systems, have no problems with directly using UTF-8 strings) and other commonly used functionality (like encoding binary data in exception messages). Adding even more Wide_Wide_ packages and routines leads to a combinatorial explosion.

The crux of this problem is that the semantics and representation of strings have become commingled. What we really need to do is to separate these; the difficulty with that is mostly with retaining adequate performance.

The way-out solution would be to declare a semi-magic Root_String interface (or perhaps an abstract type); string literals, "lvalue"s and indexing already can be supported with existing Ada 2020 facilities. Something along the lines of:

   package General_Strings is
      type Root_String is interface
         with Constant_Indexing => Get_Char,
              Variable_Indexing => Set_Char,
              String_Literal => Assign; -- A string literal calls this procedure.
      function Get_Char (A : Root_String; I : Positive) return Wide_Wide_Character;
         -- Returns the Ith character of A (regardless of the representation of A).

      type LValue (D : access Wide_Wide_Character)
         with Implicit_Dereferencing => D is null record;

      function Set_Char (A : in out Root_String; I : Positive) return LValue;
         -- Returns a reference to the Ith character of A (regardless of the
         -- representation of A).

      function Slice (A : Root_String; L : Positive; R : Natural) return Wide_Wide_String;
         -- Returns a slice of the string. (Not sure if this is a good idea, or best left out.)
         -- One imagines an LValue slice as well; I'll leave that as an exercise for the reader.

      procedure Assign (Trg : in out Root_String; Src : in Wide_Wide_String);
         -- Assigns Src to Trg, converting the representation as needed.

      function Value (Obj : Root_String) return Wide_Wide_String;
         -- Retrieves the value of Obj as a Wide_Wide_String.

      -- "&" defined in the expected way.

      -- The stream attributes would be expected to work for these; perhaps they'd need
      -- to be part of the interface.

   end General_Strings;

[Note: I didn't try to think of good names for these routines and parameters; that would need to be done, of course.]

Then we'd have a bunch of concrete instances:

   type Latin_1_String (L, R : Positive) is new General_Strings.Root_String with
      Obj : String (L .. R);
   -- The obvious implementations of the routines.

   type Bounded_UTF_8_String (Byte_Len : Natural) is new General_Strings.Root_String with
      Obj : UTF_8_String (1 .. Byte_Len);
   -- The not-quite-so-obvious implementations of the routines.

and so on for every interesting representation.

In addition, we'd have Ada.Strings.General (which would have approximately the contents of Ada.Strings.Fixed, with all of the String parameters converted to Root_String'Class). And most of the IO routines that take strings would have versions that would take Root_String'Class (these would need different names or packages, unfortunately, to avoid ambiguity).
Similarly for exception messages, and so on.

The real key here is that the string types would carry their representation along when passed into routines (which have to be new for this reason). Once that is available, then any problems can be dealt with by simply using whatever representation is appropriate for the target system.

====

The above was discussed at the Lisbon 2018 meeting and considered too ambitious for the Ada 2020 timescales. It was considered useful, though, to add nested packages Wide_File_Names and Wide_Wide_File_Names for each I/O package, containing just those operations that take a filename as a parameter, and Wide_ and Wide_Wide_ versions of Ada.Directories, Ada.Command_Line and Ada.Environment_Variables.

We do not try to add wide versions of exception messages. We want existing code to work unmodified. However, a wide exception message would either make existing syntax incompatible by making it ambiguous, or would make it painful to use wide messages by not having syntax as an option. Having implementations use multiple formats for exception messages would break techniques where the values of objects are streamed as part of the message (a common work-around to attach values to a raised exception).

Instead, we recommend that projects that require Wide_Wide_Character messages use UTF-8 encoding. Note that UTF-8 encoding needs to be applied by projects, not implementations; if an implementation were to use UTF-8 encoding for messages, streamed values would possibly be destroyed (as upper-128 characters are expanded into two octets).

!corrigendum 11.4.1(19)

@dinsa
Exception_Message (by default) and Exception_Information should produce information useful for debugging. Exception_Message should be short (about one line), whereas Exception_Information can be long. Exception_Message should not include the Exception_Name. Exception_Information should include both the Exception_Name and the Exception_Message.
@dinst @xindent<@s9> !corrigendum A.8.1(15) @dinsa @xcode< Status_Error : @b IO_Exceptions.Status_Error; Mode_Error : @b IO_Exceptions.Mode_Error; Name_Error : @b IO_Exceptions.Name_Error; Use_Error : @b IO_Exceptions.Use_Error; Device_Error : @b IO_Exceptions.Device_Error; End_Error : @b IO_Exceptions.End_Error; Data_Error : @b IO_Exceptions.Data_Error;> @dinss @xcode< @b Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create(File : @b File_Type; Mode : @b File_Mode := Out_File; Name : @b Wide_String := ""; Form : @b Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_String; Form : @b Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_String;> @xcode< @b Wide_File_Names;> @xcode< @b Wide_Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create(File : @b File_Type; Mode : @b File_Mode := Out_File; Name : @b Wide_Wide_String := ""; Form : @b Wide_Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_Wide_String; Form : @b Wide_Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Wide_Wide_File_Names;> !corrigendum A.8.2(28.3/4) @dinsa @xindent @dinss The nested package Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_Wide_String is used instead of String for the name and form of the external file. 
!corrigendum A.8.4(18) @dinsa @xcode< Status_Error : @b IO_Exceptions.Status_Error; Mode_Error : @b IO_Exceptions.Mode_Error; Name_Error : @b IO_Exceptions.Name_Error; Use_Error : @b IO_Exceptions.Use_Error; Device_Error : @b IO_Exceptions.Device_Error; End_Error : @b IO_Exceptions.End_Error; Data_Error : @b IO_Exceptions.Data_Error;> @dinss @xcode< @b Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create(File : @b File_Type; Mode : @b File_Mode := Inout_File; Name : @b Wide_String := ""; Form : @b Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_String; Form : @b Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_String;> @xcode< @b Wide_File_Names;> @xcode< @b Wide_Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create(File : @b File_Type; Mode : @b File_Mode := Inout_File; Name : @b Wide_Wide_String := ""; Form : @b Wide_Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_Wide_String; Form : @b Wide_Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Wide_Wide_File_Names;> !corrigendum A.10.1(85) @drepl @xcode< Status_Error : @b IO_Exceptions.Status_Error; Mode_Error : @b IO_Exceptions.Mode_Error; Name_Error : @b IO_Exceptions.Name_Error; Use_Error : @b IO_Exceptions.Use_Error; Device_Error : @b IO_Exceptions.Device_Error; End_Error : @b IO_Exceptions.End_Error; Data_Error : @b IO_Exceptions.Data_Error; Layout_Error : @b IO_Exceptions.Layout_Error; @b ... 
-- @ft<@i> @b Ada.Text_IO;> @dby @xcode< Status_Error : @b IO_Exceptions.Status_Error; Mode_Error : @b IO_Exceptions.Mode_Error; Name_Error : @b IO_Exceptions.Name_Error; Use_Error : @b IO_Exceptions.Use_Error; Device_Error : @b IO_Exceptions.Device_Error; End_Error : @b IO_Exceptions.End_Error; Data_Error : @b IO_Exceptions.Data_Error; Layout_Error : @b IO_Exceptions.Layout_Error;> @xcode< @b Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create (File : @b File_Type; Mode : @b File_Mode := Out_File; Name : @b Wide_String := ""; Form : @b Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_String; Form : @b Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_String;> @xcode< @b Wide_File_Names;> @xcode< @b Wide_Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create (File : @b File_Type; Mode : @b File_Mode := Out_File; Name : @b Wide_Wide_String := ""; Form : @b Wide_Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_Wide_String; Form : @b Wide_Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Wide_Wide_File_Names;> @xcode<@b ... 
-- @ft<@i> @b Ada.Text_IO;> !corrigendum A.12.1(26) @dinsa @xcode< Status_Error : @b IO_Exceptions.Status_Error; Mode_Error : @b IO_Exceptions.Mode_Error; Name_Error : @b IO_Exceptions.Name_Error; Use_Error : @b IO_Exceptions.Use_Error; Device_Error : @b IO_Exceptions.Device_Error; End_Error : @b IO_Exceptions.End_Error; Data_Error : @b IO_Exceptions.Data_Error;> @dinss @xcode< @b Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create (File : @b File_Type; Mode : @b File_Mode := Out_File; Name : @b Wide_String := ""; Form : @b Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_String; Form : @b Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_String;> @xcode< @b Wide_File_Names;> @xcode< @b Wide_Wide_File_Names @b> @xcode< -- @ft<@i>> @xcode< @b Create (File : @b File_Type; Mode : @b File_Mode := Out_File; Name : @b Wide_Wide_String := ""; Form : @b Wide_Wide_String := "");> @xcode< @b Open (File : @b File_Type; Mode : @b File_Mode; Name : @b Wide_Wide_String; Form : @b Wide_Wide_String := "");> @xcode< @b Name (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Form (File : @b File_Type) @b Wide_Wide_String;> @xcode< @b Wide_Wide_File_Names;> !corrigendum A.15.1(0) @dinsc The packages Wide_Command_Line and Wide_Wide_Command_Line allow a program to obtain the values of its arguments and to set the exit status code to be returned on normal termination. @s8<@i> The specification of package Wide_Command_Line is the same as for Command_Line, except that each occurrence of String is replaced by Wide_String. The specification of package Wide_Wide_Command_Line is the same as for Command_Line, except that each occurrence of String is replaced by Wide_Wide_String. !corrigendum A.16.2(0) @dinsc The packages Wide_Directories and Wide_Wide_Directories provide operations for manipulating files and directories, and their names. 
@s8<@i>

The specification of package Wide_Directories is the same as for Directories (including its optional child packages Information and Hierarchical_File_Names), except that each occurrence of String is replaced by Wide_String.

The specification of package Wide_Wide_Directories is the same as for Directories (including its optional child packages Information and Hierarchical_File_Names), except that each occurrence of String is replaced by Wide_Wide_String.

!corrigendum A.17.1(0)

@dinsc

The packages Wide_Environment_Variables and Wide_Wide_Environment_Variables allow a program to read or modify environment variables.

@s8<@i>

The specification of package Wide_Environment_Variables is the same as for Environment_Variables, except that each occurrence of String is replaced by Wide_String.

The specification of package Wide_Wide_Environment_Variables is the same as for Environment_Variables, except that each occurrence of String is replaced by Wide_Wide_String.

!ACATS test

ACATS C-Tests are needed for the new packages (nested and otherwise).

!appendix

This AI was created from the ashes of AI05-0286-1 (that is, those portions that defied an easy solution).

****************************************************************

From: Gautier de Montmollin
Sent: Thursday, March 14, 2013  5:51 AM

!topic Ada.Directories: Form parameter for all subprograms with file or directory names
!reference Ada 2012 RM A.16
!from Author Gautier de Montmollin 2013-03-14
!keywords directories
!discussion

In Ada.Directories, only a few of the subprograms that take a String for a file or directory name provide a Form parameter. This prevents an implementation from providing the same facility as for Ada.Text_IO, for instance recognizing the "encoding=utf-8" sub-string. As a result Ada.Directories becomes practically useless for software meant to run on file systems with international character sets.
****************************************************************

From: Adam Beneschan
Sent: Thursday, March 14, 2013  10:23 AM

The Form parameter is implementation-dependent, so any solution based on adding a Form parameter is going to be implementation-dependent. Because of this, it might be better to request AdaCore or whoever your compiler vendor is to add their own package Ada.Directories.Extensions to provide the functionality you want, since it's going to be implementation-dependent anyway.

It appears that the ARG is starting to think about a more permanent solution, to the general problem of allowing Wide_String and Wide_Wide_String in places where only String is currently allowed. See AI12-0021. Personally, I think something like this ought to be done.

The current "solution", in which a String is used to hold UTF-8 sequences in some situations, is an obnoxious hack. A String is an array of characters; and to me, the idea that a String whose 'Length is (say) 23 can be used to represent a string that really has 19 characters in it, is an abuse. It's been allowed as a temporary compromise, because something was needed and a real solution is difficult. But it's still an abuse. If we're going to entrench the idea of using String types to hold non-String data such as UTF-8 bit encodings, we might as well give up and start programming in C.

So I'm not in favor of adding anything like this to the language standard.

****************************************************************

From: Jeff Cousins
Sent: Thursday, March 14, 2013  10:56 AM

Thanks for replying on this topic, Adam. It does seem to be something that is going to be very hard to make rules about; it looks like it's going to be up to whatever implementation is used to offer something sensible for the platform used.
The discussion of AI05-0286-1/02 in the minutes of the 46th ARG meeting, publicly available at http://www.ada-auth.org/arg-minutes.html, might show that it has been thought about, and how hard an issue it is to tackle.

****************************************************************

From: Gautier de Montmollin
Sent: Thursday, March 14, 2013  3:04 PM

Anyway, a Wide_String version with no rule is fundamentally better than the String version with no rule! For Wide_String, implementors will follow the UTF-16 de facto standard in place for 20 years at least and that's it. For the String version Ada.Directories is just dysfunctional... So please don't wait too long...

****************************************************************

From: Adam Beneschan
Sent: Thursday, March 14, 2013  6:33 PM

I think there's some widespread and fundamental confusion when it comes to UTF and encodings. A Wide_String is just an array of Wide_Characters. A Wide_Character is, fundamentally, just a number between 0 and 65535, where each number represents a character that has been assigned that number in the Unicode Basic Multilingual Plane. A Wide_String is an array of those numbers. There should be no "encoding" involved, UTF-16 or otherwise. A Wide_String will normally be represented as just an array of 16-bit integers that mean themselves.

This means that a Wide_Character in a Wide_String can't represent a character in a different plane, i.e. from U+10000 and up. But that's what Wide_Wide_String is for. And I am certain that the ARG will not produce a solution that allows Wide_Strings as file names that doesn't also allow Wide_Wide_Strings.

So UTF-16 has no place in this discussion. If Ada.Directories allows Wide_Strings and Wide_Wide_Strings, the implementation may need to convert them to UTF-8 in order to communicate with the OS, but the Ada program that uses Ada.Directories doesn't need to know about this implementation detail.
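[Note: the distinction drawn here between characters and their encoding, and the earlier remark that a String's 'Length need not equal the number of characters it represents, can be illustrated with the standard conversion package of A.4.11. The following is a minimal sketch and not part of the original message; the procedure name UTF8_Length_Demo and the sample text are hypothetical.]

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Length_Demo is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  Five characters: 'n', 'a', U+00EF (i with diaeresis), 'v', 'e'.
   --  Each element of a Wide_Wide_String is one code point.
   Name : constant Wide_Wide_String :=
     "na" & Wide_Wide_Character'Val (16#EF#) & "ve";

   --  UTF_8_String is a subtype of String: each element is one octet
   --  of the encoding, not one character.
   Encoded : constant UTF.UTF_8_String :=
     UTF.Wide_Wide_Strings.Encode (Name);

   --  Decoding recovers the original sequence of code points.
   Decoded : constant Wide_Wide_String :=
     UTF.Wide_Wide_Strings.Decode (Encoded);
begin
   Ada.Text_IO.Put_Line ("Characters:" & Integer'Image (Name'Length));
   Ada.Text_IO.Put_Line ("Octets    :" & Integer'Image (Encoded'Length));

   if Decoded = Name then
      Ada.Text_IO.Put_Line ("Round trip preserved the text");
   end if;
end UTF8_Length_Demo;
```

[U+00EF occupies two octets in UTF-8, so the sketch reports 5 characters but 6 octets. An implementation of the wide file-name operations could perform such a conversion internally when the target system expects UTF-8, which is an assumption about one possible implementation rather than a requirement.]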
I feel like there are a lot of people who aren't clear on the distinctions between the concepts (and the use of things like charset="utf-8" in HTML files just adds to the confusion, since UTF-8 is really an encoding algorithm and not a character set). Hopefully I've helped clear up some confusion among a few people, but I feel like this is a losing battle.

****************************************************************

From: Gautier de Montmollin
Sent: Thursday, March 14, 2013  7:10 PM

Could not agree more! Go for Ada.Directories with String's, Wide_String's and Wide_Wide_String's !

****************************************************************

From: Randy Brukardt
Sent: Thursday, March 14, 2013  7:54 PM

That's easy, but it doesn't fix anything. That's because you have to be able to Create and Open files with the result. And pass Forms that contain file names. And retrieve Names and Forms. And on and on. Note that this is completely orthogonal to what kind of I/O the package supports: it has nothing to do with Wide_Text_IO, for instance.

To follow the Wide_xxx and Wide_Wide_xxx to its logical limit, you'd have to add Wide_ and Wide_Wide_ versions of all of the file manipulation routines in *every* existing I/O package. Which is insane (no, we do not want "Wide_Wide_Open" and "Wide_Wide_Name").

Moreover, we would have to decide what happens if the name of a file whose name contains 32-bit characters is retrieved via Name. And recall that Name is required to return a "name that uniquely identifies the file", which usually means including the full path. In which case, there could be 32-bit characters in the result returned by Name even if the simple name only is ASCII (if for instance the user's login name and thus home directory contained such characters).
The obvious solution to this problem is to raise an exception -- but that would be incompatible with existing practice on Linux (where UTF-8 can be used in type String without any interpretation) as well as practice involving Form parameters. And it would be incompatible for anyone unfortunate enough to run their program from a directory named using characters above position 256. We need to avoid run-time incompatibilities if at all possible (because there is no automatic way to detect them); while this particular case mostly involves implementation-defined behavior, the effect would be just as dangerous to programs that depend on it.

It was cases like these that caused the ARG to discard the rough proposal that I had made for Ada 2012 and decide to defer any change until the next version of Ada (whenever that will be).

As far as I can tell, the only real solution is to blow it all up and start over using a tagged root type (tentatively named Root_String'Class, although maybe we'd use it only for file names in which case it would get an appropriate name for that). If the tagged root type allowed string literals (the only real change needed to the language), there wouldn't be much user-level change, and the implementations could include properly typed UTF-8 and UTF-16 strings, along with anything else that might make sense.

Note that there are similar problems with Ada.Command_Line, Ada.Environment_Variables, Ada.Exceptions (the exception message part), and probably other packages that we haven't thought about. This makes blowing it up more attractive, because adding dozens of new routines that hardly anyone will want to use, and adding incompatibilities as well, does not seem like a good plan.

I don't know if there will be the will to "blow it up", but in any case, there is nothing simple or easy about this problem, and it does everyone a disservice to claim that there is an "easy" solution.
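[Note: for the exception-message case listed above, the wording of this AI does not add wide versions; it notes instead that UTF-8 encoding (see A.4.11) can be used, applied by the project rather than by the implementation. The following is a minimal sketch of that approach and not part of the original message. The names UTF8_Message_Demo and Demo_Error are hypothetical, and the message is kept short because an implementation is permitted to truncate long exception messages.]

```ada
with Ada.Exceptions;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF8_Message_Demo is
   package UTF renames Ada.Strings.UTF_Encoding;

   Demo_Error : exception;  --  hypothetical exception, for illustration only

   --  A message containing two occurrences of U+00E9 (e with acute).
   Text : constant Wide_Wide_String :=
     "r" & Wide_Wide_Character'Val (16#E9#) & "sum" &
     Wide_Wide_Character'Val (16#E9#) & ".txt not found";
begin
   --  The project encodes the text itself; the implementation simply
   --  carries the resulting String as the exception message.
   raise Demo_Error with UTF.Wide_Wide_Strings.Encode (Text);
exception
   when E : Demo_Error =>
      declare
         --  The handler decodes the message back to code points.
         Decoded : constant Wide_Wide_String :=
           UTF.Wide_Wide_Strings.Decode
             (Ada.Exceptions.Exception_Message (E));
      begin
         if Decoded = Text then
            Ada.Text_IO.Put_Line ("Message round-tripped intact");
         else
            Ada.Text_IO.Put_Line ("Message was altered");
         end if;
      end;
end UTF8_Message_Demo;
```

[Because the encoding and decoding are both under the project's control, code that streams raw values into exception messages is unaffected, as long as it does not also pass them through Decode.]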
****************************************************************

From: Robert Leif
Sent: Friday, March 15, 2013  11:25 AM

I believe that an alternative solution to the problem is to proceed one step up in abstraction. A linear array generic type could be made that included the string operations of Text_IO. It could be instantiated with any type of character including 4 bit characters or even 1 bit characters. Then it could be the basis of Root_String'Class or whatever you want to call it.

If anyone is interested, I have spent the last few years writing XML schemas (CytometryML.org) in the XML Schema Definition Language (XSD). XSD 1.1 includes assertions and restriction (generics). I basically fake datatype declarations in Ada specifications.

****************************************************************

From: Gautier de Montmollin
Sent: Saturday, March 16, 2013  8:51 AM

Another idea would be not to change the standard at all about this, and persuade at least one major compiler vendor to use utf-8 for file or directory names in Ada.Directories. For instance GNAT has been applying the equivalent tactic for arguments in Ada.Command_Line since 2008. From the Development Log, NF-62-HB07-027-gnat:

"Unicode characters on Windows command line
On Windows Ada.Command_Line now supports Unicode characters. Arguments are returned encoded in UTF-8 allowing better handling of Unicode file names names as arguments."

****************************************************************

From: Vadim Godunko
Sent: Thursday, April 25, 2013  9:00 AM

> Another idea would be not to change the standard at all about this,
> and persuade at least one major compiler vendor to use utf-8 for file
> or directory names in Ada.Directories.

Use of some UTF-XX is fine for Windows and MacOSX, which are UTF-based. On POSIX systems any encoding can be selected by the user, and it is important to use it consistently for each call to imported libraries and for input/output operations.
****************************************************************

From: Florian Weimer
Sent: Sunday, April 28, 2013  9:28 AM

There's also an expectation that it's possible to access files whose names are not in the encoding range of the current locale.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, May 1, 2013  1:13 PM

Huh? UTF-8 covers all locales as all possible characters are in it; there should be no adjustment afterwards or there is something quite wrong going on. Locales only apply to pure 8-bit encodings, and that's impossible to do anything sensible with.

There is an issue on Windows about file name equivalence, but that's something that simply never, ever should be used, because it's impossible to work out (in part because of the locale issue). Linux and Unix don't have that problem.

Anyway, there are two problems here, and they're at cross-purposes. One is the desire to let IO routines open and manipulate any file that could exist on the system. Second is the desire to portably be able to create and manipulate the *name* of any file that Text_IO can *create*. On a system with no file name rules, it's clearly not possible to do both (you've got to have some rules in order to do portable manipulation).

The purpose of Ada.Directories is exclusively the second - *portable* manipulation of files and names. That means that by definition it will have to be more limited than "everything the system can do". Thus, using UTF-8 exclusively would be sufficient for it.

OTOH, we probably don't want such a restriction in Text_IO.Open (for example). That seems OK to me, as those "old" interfaces aren't going anywhere even if a new set is created using Root_String'Class or Wide_Wide_String. If you need bizarre capabilities on Linux (such as EBCDIC file names), use the old interfaces.
**************************************************************** From: Yannick Duchene Sent: Wednesday, May 1, 2013 1:47 PM While it is true Unicode covers most languages and locales characters requirements, it does not cover everything possibly needed. There are two main reasons: the first, Unicode is always defining new characters as the standard evolves (which implies all possible characters are not necessarily in it), the second, Unicode is not so much welcome in some countries (like Japan) where some lobbies (official or not) do all they can to preserve their own encoding as the official encoding, arguing Unicode is missing too many specificities of their writing system. But Unicode also has private use areas, which enable enough additional local definitions (this requires a local agreement between the parties involved, an issue the Ada standard does not have to bother with). Unicode is the good choice, but will not make every one happy before a long time. I would say well-formed UTF-8, with the requirement to be transparent with code-points from private use areas: no attempt to transform, interpret or decide whether such a code-point is valid or not for a file-name and always accept it as valid. (hope I did not miss the point, as I have not read all the mails on this issue) **************************************************************** From: Randy Brukardt Sent: Wednesday, May 1, 2013 6:08 PM ... > > Huh? UTF-8 covers all locales as all possible characters are in it; > > While it is true Unicode covers most languages and locales characters > requirements, it does not cover everything possibly needed. There are two > main > reasons: the first, Unicode is always defining new characters as the > standard evolves (which implies all possible characters are not > necessarily in it), If they're not in Unicode, they're not anywhere. In any case, added characters are not an issue. > ... 
the second, Unicode is > not so much welcome in some countries (like Japan) where some lobbies > (official or not) do all they can to preserve their own encoding as > the official encoding, arguing Unicode is missing too many > specificities of their writing system. But Unicode also has private > use areas, which enable enough additional local definitions (this > requires a local agreement between the parties involved, an issue the > Ada standard does not have to bother with). > > Unicode is the good choice, but will not make every one happy before a > long time. > > I would say well-formed UTF-8, with the requirement to be transparent > with code-points from private use areas: no attempt to transform, > interpret or decide whether such a code-point is valid or > not for a file-name and always accept it as valid. What's a legal file name is implementation-defined, and I certainly don't see that changing. Some characters are not allowed in Windows file names, for example, and the Ada standard cannot try to insist that they're allowed. So I find this irrelevant -- indeed, whether there is any support at all for UTF-8 file names will always be implementation-defined. The problem now is that we don't have any sane way to *allow* it -- there will never be a *requirement* to support it. **************************************************************** From: Justin Squirek Sent: Tuesday, November 13, 2018 1:29 PM Hey Jeff, I made some very minor wording edits. [This is version /03 of the AI -Editor.] **************************************************************** From: Randy Brukardt Sent: Wednesday, December 5, 2018 6:47 PM Here are some editorial comments on this: (1) I realize you're new here, Justin, but when wording says "modify" some paragraph, the changes have to be marked with {} for insertions and [] for deletions. This is the preferred form, because it makes my job easier and it is easier to see the changes. Otherwise, you need to use "Replace". 
In any case, you are not supposed to remove those marks from Jeff's version, nor are you allowed to change existing wording without showing the changes. Moreover, why would anyone change "declarations are repeated" to "declarations get repeated"? There's nothing wrong with the first wording, and we don't change existing wording just because someone would like a different verb. I've ignored this change completely. (2) You changed the spacing of the various packages. Jeff copied the spacing of the original IO Sequential_IO and Direct_IO packages exactly. I'd agree that the original spacing is suboptimal, but when putting new things directly next to old things, we generally copy the original style, rather than invent a new one. (With RM wording, context is important.) So I left the spacing as Jeff had it. OTOH, Jeff got the spacing wrong for Text_IO and Stream_IO (these are the better spacing). So I did use your changes there. (Don't you love consistency???) (3) You changed the default mode for Create to "InOut_File" for most of these packages. Jeff did have the wrong for Direct_IO and Stream_IO, but you have it spelled wrong (the 'o' should be in lower case). And you added it to Text_IO, but Text_IO doesn't even have that mode. Another change I ignored. (4) In A.15.1, Jeff has "is the same as", which you changed to "identical". Jeff is just copying the style of the existing similar wording, and it is not a good idea to use a different form (recall the issues that I previously noted about consistency?) We also try to avoid words like "identical" and "equivalent" because it is rarely true. "identical except blah" is *not* identical, after all! I left the wording Jeff had. [Editor's note: These comments apply to version /03 as posted.] **************************************************************** From: Jeff Cousins Sent: Wednesday, November 14, 2018 12:13 PM Thanks again Justin. 
11.4.1.(19) “non-ASCII” seems a bit too colloquial, I still prefer “non-ASCII characters”. A.15.1, A.16.2 “is the same as” seems to be the more normal RM-speak than “is identical to”, e.g. A.11. Otherwise fine. **************************************************************** From: Randy Brukardt Sent: Wednesday, November 14, 2018 6:07 PM > 11.4.1.(19) "non-ASCII" seems a bit too colloquial, I still prefer "non-ASCII characters". It doesn't matter, as the term "ASCII" isn't defined for a normative description of characters as it is not an ISO standard, so you shouldn't use it here (or anywhere in normative wording; it's OK in AARM notes). You could tie this text to the contents of the package ASCII, something like "characters not present in package ASCII". But since the package ASCII is obsolescent, I'd recommend against that. Ada.Characters.Handling uses "ISO_646" for this purpose (that being the ISO standard in question), so you could say something like "characters not in ISO 646" or you could even reference the subtype directly "characters not in Characters.Handling.ISO_646". Finally, you could simply say what you really mean and talk about character code points: "characters whose code point is greater than 127". Moral: everything about characters is harder than it seems. :-) **************************************************************** From: Randy Brukardt Sent: Wednesday, December 5, 2018 7:10 PM Here are my technical comments on this AI: [Aside: "you" in most of these cases was originally Jeff, but both authors share responsibility.] >Add after 11.4.1(19): > >It is recommended that exception messages requiring non-ASCII use UTF-8 >encoding. You have this in an Implementation Advice section. But this is advice to the programmer -- the implementation cannot and must not do this on its own. (How would the user of Exception_Message know that it is encoded in UTF-8 if the implementation did that itself? 
Moreover, what would that do to messages that include a streamed [binary] portion? Only a project can decide to use UTF-8 encoding universally for messages.) So this should be a user note. If it is a user note, we can be less rigorous, so saying "non-ASCII" arguably would be better than "non ISO-646". Additionally, the discussion needs an explanation of why this cannot be changed. (Answer: it appears in existing syntax and pragmas that would become ambiguous or illegal if the definition of the message changed. That would be a substantial compatibility problem. Similarly, existing code that does not assume UTF-8 encoding must continue to work unmodified, including code that encodes values into the messages; silently breaking code would be an even worse compatibility problem.) --- >Add at the end of A.8.2: You need to provide the paragraph number. If you had done that, you might have seen that the end of this subclause is an Implementation Permissions section, in which this text is totally inappropriate. --- In A.16.2: >The specification of package Wide_Directories is the same as for >Directories (including its optional child package >Hierarchical_File_Names), except that each occurrence of String is replaced >by Wide_String. I think you need to mention it's not-optional child of Information as well. The contents of Information are not specified by the language, but it needs to be present. (And suggested contents are given in the AARM for Windows and Linux implementations.) See A.16(124/2). We don't want Wide_Directories.Information to be any different than Directories.Information. **************************************************************** From: Jeff Cousins Sent: Thursday, December 6, 2018 10:46 AM (3) You changed the default mode for Create to "InOut_File" for most of these packages. Jeff did have the wrong for Direct_IO and Stream_IO, but you have it spelled wrong (the 'o' should be in lower case). 
And you added it to Text_IO, but Text_IO doesn't even have that mode. Another change I ignored. ?? The Modes I used were as per the parents, i.e. Out_File for Sequential_IO, InOut_File (though, as Randy says, I should have said Inout_File) for Direct_IO, and Out_File for Text_IO and Stream_IO. **************************************************************** From: Randy Brukardt Sent: Friday, December 7, 2018 12:49 AM >(3) You changed the default mode for Create to "InOut_File" for most of >these packages. Jeff did have the wrong for Direct_IO and Stream_IO, but you >have it spelled wrong (the 'o' should be in lower case). And you added it to >Text_IO, but Text_IO doesn't even have that mode. Another change I ignored. >?? The Modes I used were as per the parents, i.e. Out_File for Sequential_IO, >InOut_File (though, as Randy says, I should have said Inout_File) for >Direct_IO, and Out_File for Text_IO and Stream_IO. You're right, your version was originally correct. Justin's version confused me enough to get them all messed up. I corrected them in a new AI version. ... >> You have this in an Implementation Advice section ... >> Additionally, the discussion needs an explanation ... ... >Good points. I've done both of these things. >>>Add at the end of A.8.2: >>You need to provide the paragraph number. If you had done that, you might >>have seen that the end of this subclause is an Implementation Permissions >>section, in which this text is totally inappropriate >Agreed, I don’t know how I missed it. I fixed this, too, in a new draft. >>>The specification of package Wide_Directories is the same as for Directories >>>(including its optional child package Hierarchical_File_Names), except that >>>each occurrence of String is replaced by Wide_String. >>I think you need to mention it's not-optional child of Information as well. >>The contents of Information are not specified by the language, but it needs >>to be present. 
(And suggested contents are given in the AARM for Windows and >>Linux implementations.) See A.16(124/2). We don't want >>Wide_Directories.Information to be any different than >>Directories.Information. >I must admit that the existence of child package Information had totally passed >me by. Though if the underlying OS doesn’t provide any additional information, >then won’t the child package not exist, or are you saying that it will exist >but be empty? If the latter, then I think it should be shown in the Static >Semantics section, even if it just contains a “ ... -- not specified by the >language” type of comment. Restriction No_Implementation_Identifiers treats this package as language-defined with implementation-defined contents, just like Machine_Code. That's probably the model that we should use here. This package is hidden as much as it is because we can't mention Windows or Linux in the normative Standard - if we could have, we would have required minimum contents on those systems and allowed implementations to add to them. Its presence had escaped Robert Dewar, too, probably because he famously refused to look in the AARM. Not sure if AdaCore ever fixed this oversight. (If I could think of a sane way to test this in the ACATS I would.) Anyway, I added it to your existing text by mentioning "optional child packages Information and Hierarchical_File_Names". It's in the index, people can find it if they have to. **************************************************************** From: Randy Brukardt Sent: Wednesday, December 12, 2018 1:33 AM When I put these into the RM, I noticed a number of issues. (1) I dropped the change to A.8.2(1) because it isn't relevant: the wide and wide wide forms of text io are defined by equivalence, so they don't need to be mentioned here (and it is bad precedent, we'd have to make a change like that in other general places as well). 
(2) The A.8.2(28.3/4) addition read: The nested package Wide_File_Names provides operations equivalent to those in regular Sequential_IO except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations equivalent to those in regular Sequential_IO except that Wide_Wide_String is used instead of String for the name and form of the external file. But this subclause applies to all 4 of the IO packages: sequential io, direct io, text io, and stream io. It's plain wrong to put some text that only applies to sequential io into this clause (we have A.8.3 for that). Moreover, this rule is plenty generic to apply to all of the packages. So I replaced it by: The nested package Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_Wide_String is used instead of String for the name and form of the external file. (3) Now that there is a general description, we don't need to repeat it at A.8.2(20), A.10.2(5), and A.12.1(33). So those changes were dropped as well. **************************************************************** From: Randy Brukardt Sent: Wednesday, December 12, 2018 1:43 AM What happens when someone successfully opens a file with a Wide_Wide_String (say, with some emojis) and then calls the Name that returns a String? One hopes that the result is implementation-defined. Do we need to say something about that, or is it OK to leave it unspecified in the Standard because file names are implementation-defined in the first place? A similar question can be asked about Form parameters. 
The reason I ask is that Name is something that is typically implemented in the Ada runtime (the names of the files are generally not retrievable from open file handles - at least not in Windows or (especially) Unix/Linux, so the Ada runtime has to store this information). Storing the information in multiple formats seems bizarre at best, storing it in the most general format is going to waste a lot of space, take a full scan to convert to String, etc. I was thinking about possible Janus/Ada implementations of this feature and it isn't that clear what it is required to do. **************************************************************** From: Tucker Taft Sent: Wednesday, December 12, 2018 7:25 AM I would agree we can leave it implementation defined, though if you use the same package for opening and calling Name, I would hope you get back what you put in. If I were to implement it, I would use a UTF-8 representation for Wide and Wide_Wide, with some kind of BOM to avoid ambiguity with the Latin-1 used for normal strings. **************************************************************** From: Randy Brukardt Sent: Monday, January 7, 2019 11:16 PM I would tend to agree that at least using the same package for both should do something reproducible (the ACATS assumes that for String, of course, and I'd expect the same for tests for the Wide and Wide_Wide versions), and that Wide_Wide_Characters should still be recognizable when using the Wide_Wide Name. It seems that we should at least mention in the AARM this expectation. Perhaps something like: Existing A.8.2(28.4/5): The nested package Wide_File_Names provides operations equivalent to the operations of the same name of the outer package except that Wide_String is used instead of String for the name and form of the external file. 
New note: AARM Implementation Note: The expectation is that if one opens a file with this package and then requests the name with Name, the result will include the same wide_characters that were used to open the file (rather than some encoded form). Since file names are completely implementation-defined, we can't say this normatively. We have no expectations when retrieving the name of a file opened with this package using the String Name. Similar considerations apply to the nested package Wide_Wide_File_Names (see below). If this is OK, I'll stick this into the AARM note AI (it's one that we never vote on, it just holds mail like this). Or we could put it into a cleanup AI if we want more work. :-) **************************************************************** From: Tucker Taft Sent: Monday, January 7, 2019 11:32 PM Your suggestion for an AARM note sounds fine. The point is to explain our intent, so an implementor can try to do the right thing. ****************************************************************
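The expectation in the proposed AARM note can be sketched as a small program. This is only an illustration under assumptions: it compiles only where the runtime already provides the nested package Ada.Text_IO.Wide_Wide_File_Names from this AI, the target file system must accept the name, and the procedure name Name_Round_Trip and the file name are hypothetical. Because Name may return a full path, the check looks for the original characters inside the result rather than testing equality.

```ada
with Ada.Strings.Wide_Wide_Fixed;
with Ada.Text_IO;

procedure Name_Round_Trip is
   use Ada.Text_IO;

   --  "cafe.txt" with an e-acute (code point 16#E9#), built from the code
   --  point so that this source file itself stays plain ASCII.
   File_Name : constant Wide_Wide_String :=
     "caf" & Wide_Wide_Character'Val (16#E9#) & ".txt";

   F : File_Type;
begin
   --  Open and query through the same nested package, which is the only
   --  case for which the note states an expectation.
   Wide_Wide_File_Names.Create (F, Out_File, File_Name);

   if Ada.Strings.Wide_Wide_Fixed.Index
        (Source  => Wide_Wide_File_Names.Name (F),
         Pattern => File_Name) > 0
   then
      Put_Line ("Name returned the characters used to create the file");
   else
      Put_Line ("Name returned some other (perhaps encoded) form");
   end if;

   --  Remove the temporary file again.
   Delete (F);
end Name_Round_Trip;
```

Calling the String-returning Ada.Text_IO.Name (F) on the same file is deliberately left out, since the note sets no expectation for that combination.

****************************************************************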