!standard 11.4.1(19) 18-10-29 AI12-0021-1/02 !standard A.8.1(15) !standard A.8.2(1) !standard A.8.2(29) !standard A.8.4(18) !standard A.8.4(20) !standard A.10.1(85) !standard A.10.2(5) !standard A.12.1(26) !standard A.12.1(33) !standard A.15.1(0) !standard A.16.2(0) !standard A.17.1(0) !class Amendment 12-03-13 !status work item 12-02-25 !status received 12-02-25 !priority High !difficulty Hard !subject Additional internationalization of Ada !summary Support is added for the use of the entire set of characters from ISO/IEC 10646:2017 for file and directory names by the operations of the Annex A facilities. !proposal In addition to the facilities already provided, (1) File and directory operations should support Unicode characters (presuming that the target file system does so); (2) Exception messages and exception information should support Unicode characters; (3) Command lines should support Unicode characters (presuming that the target system allows these). !wording Add after 11.4.1(19): It is recommended that UTF-8 encoding be used for encoding exception messages that require non-ASCII characters. Add after A.8.1(15): ... -- Enclosing package Ada.Sequential_IO package Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := Out_File; Name : in Wide_String := ""; Form : in Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_String; Form : in Wide_String := ""); function Name (File : in File_Type) return Wide_String; function Form (File : in File_Type) return Wide_String; end Wide_File_Names; package Wide_Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := Out_File; Name : in Wide_Wide_String := ""; Form : in Wide_Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_Wide_String; Form : in Wide_Wide_String := ""); function Name (File : in File_Type) return Wide_Wide_String; function Form (File : in File_Type) return Wide_Wide_String; end Wide_Wide_File_Names; Modify A.8.2(1): The procedures and functions described in this subclause provide for the control of external files; their declarations are repeated in each of the packages for sequential, direct, text, {wide text, wide wide text, }and stream input-output. For text input-output, the procedures Create, Open, and Reset have additional effects described in subclause A.10.2. Add at the end of A.8.2: The nested package Wide_File_Names provides operations that are the equivalent of those for regular Sequential_IO except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations that are the equivalent of those for regular Sequential_IO except that Wide_Wide_String is used instead of String for the name and form of the external file. Add after A.8.4(18): ... -- Enclosing package Ada.Direct_IO package Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := InOut_File; Name : in Wide_String := ""; Form : in Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_String; Form : in Wide_String := ""); function Name (File : in File_Type) return Wide_String; function Form (File : in File_Type) return Wide_String; end Wide_File_Names; package Wide_Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := InOut_File; Name : in Wide_Wide_String := ""; Form : in Wide_Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_Wide_String; Form : in Wide_Wide_String := ""); function Name (File : in File_Type) return Wide_Wide_String; function Form (File : in File_Type) return Wide_Wide_String; end Wide_Wide_File_Names; Add at the end of A.8.4: The nested package Wide_File_Names provides operations that are the equivalent of those for regular Direct_IO except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations that are the equivalent of those for regular Direct_IO except that Wide_Wide_String is used instead of String for the name and form of the external file. Add after A.10.1(85): ... -- Enclosing package Ada.Text_IO package Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := Out_File; Name : in Wide_String := ""; Form : in Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_String; Form : in Wide_String := ""); function Name (File : in File_Type) return Wide_String; function Form (File : in File_Type) return Wide_String; end Wide_File_Names; package Wide_Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := Out_File; Name : in Wide_Wide_String := ""; Form : in Wide_Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_Wide_String; Form : in Wide_Wide_String := ""); function Name (File : in File_Type) return Wide_Wide_String; function Form (File : in File_Type) return Wide_Wide_String; end Wide_Wide_File_Names; Add at the end of A.10.2: The nested package Wide_File_Names provides operations that are the equivalent of those for regular Text_IO except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations that are the equivalent of those for regular Text_IO except that Wide_Wide_String is used instead of String for the name and form of the external file. Add after A.12.1(26): ... -- Enclosing package Ada.Stream_IO package Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := Out_File; Name : in Wide_String := ""; Form : in Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_String; Form : in Wide_String := ""); function Name (File : in File_Type) return Wide_String; function Form (File : in File_Type) return Wide_String; end Wide_File_Names; package Wide_Wide_File_Names is -- File management procedure Create(File : in out File_Type; Mode : in File_Mode := Out_File; Name : in Wide_Wide_String := ""; Form : in Wide_Wide_String := ""); procedure Open (File : in out File_Type; Mode : in File_Mode; Name : in Wide_Wide_String; Form : in Wide_Wide_String := ""); function Name (File : in File_Type) return Wide_Wide_String; function Form (File : in File_Type) return Wide_Wide_String; end Wide_Wide_File_Names; Add after A.12.1(33): The nested package Wide_File_Names provides operations that are the equivalent of those for regular Stream_IO except that Wide_String is used instead of String for the name and form of the external file. The nested package Wide_Wide_File_Names provides operations that are the equivalent of those for regular Stream_IO except that Wide_Wide_String is used instead of String for the name and form of the external file. Add section A.15.1: A.15.1 The Packages Wide_Command_Line and Wide_Wide_Command_Line The packages Wide_Command_Line and Wide_Wide_Command_Line allow a program to obtain the values of its arguments and to set the exit status code to be returned on normal termination. Static Semantics The specification of package Wide_Command_Line is the same as for Command_Line, except that each occurrence of String is replaced by Wide_String. The specification of package Wide_Wide_Command_Line is the same as for Command_Line, except that each occurrence of String is replaced by Wide_Wide_String. Add section A.16.2: A.16.2 The Packages Wide_Directories and Wide_Wide_Directories The packages Wide_Directories and Wide_Wide_Directories provide operations for manipulating files and directories, and their names. Static Semantics The specification of package Wide_Directories is the same as for Directories (including its optional child package Hierarchical_File_Names), except that each occurrence of String is replaced by Wide_String. The specification of package Wide_Wide_Directories is the same as for Directories (including its optional child package Hierarchical_File_Names), except that each occurrence of String is replaced by Wide_Wide_String. Add section A.17.1: A.17.1 The Packages Wide_Environment_Variables and Wide_Wide_Environment_Variables The packages Wide_Environment_Variables and Wide_Wide_Environment_Variables allow a program to read or modify environment variables. Static Semantics The specification of package Wide_Environment_Variables is the same as for Environment_Variables, except that each occurrence of String is replaced by Wide_String. The specification of package Wide_Wide_Environment_Variables is the same as for Environment_Variables, except that each occurrence of String is replaced by Wide_Wide_String. !discussion These issues defy an easy solution. Changing the behavior of the existing routines would break existing workarounds (which on some targets, like most Linux systems, have no problems with directly using UTF-8 strings) and other commonly used functionality (like encoding binary data in exception messages). Adding even more Wide_Wide_ packages and routines is a combinational explosion. The crux of this problem is that the semantics and representation of strings have become co-mingled. What we really need to do is to separate these; the problem with that mostly with retaining adequate performance. The way-out solution would be to declare a semi-magic Root_String interface (or perhaps an abstract type); the magic would be string literals (since "lvalue"s and indexing already can be supported with existing Ada 2012 facilities). Something on the line of: package General_Strings is type Root_String is interface with Constant_Indexing => Get_Char, Variable_Indexing => Set_Char, String_Literal => Assign; -- A string literal calls this procedure. function Get_Char (A : Root_String; I : Positive) return Wide_Wide_Character; -- Returns the Ith character of A (regardless of the representation of A). type LValue (D : access Wide_Wide_Character) with Implicit_Dereferencing => D is null record; function Set_Char (A : in out Root_String; I : Positive) return LValue; -- Returns a reference to the Ith character of A (regardless of the -- representation of A). function Slice (A : Root_String; L : Positive; R : Natural) return Wide_Wide_String; -- Returns a slice of the string. (Not sure if this is a good idea, or best left out.) -- One imagines an LValue slice as well; I'll leave that as an exercise for the reader. procedure Assign (Trg : in out Root_String; Src : in Wide_Wide_String); -- Assigns Src to Trg, converting the representation as needed. function Value (Obj : Root_String) return Wide_Wide_String; -- Retrieves the value of Obj as a Wide_Wide_String. -- "&" defined in the expected way. -- The stream attributes would be expected to work for these; perhaps they'd need -- to be part of the interface. end General_Strings; [Note: I didn't try to think of good names for these routines and parameters; that would need to done, of course.] Then we'd have a bunch of concrete instances: type Latin_1_String (L, R : Positive) is new General_Strings.Root_String with Obj : String (L .. R); -- The obvious implementations of the routines. type Bounded_UTF_8_String (Byte_Len : Natural) is new General_Strings.Root_String with Obj : UTF_8_String (1 .. Byte_Len); -- The not-quite-so-obvious implementations of the routines. and so on for every interesting representation. In addition, we'd have Ada.Strings.General (which would have approximately the contents of Ada.Strings.Fixed, with all of the String parameters converted to Root_String'Class). And most of the IO routines that take strings would have versions that would take Root_String'Class (these would need different names or packages, unfortunately, to avoid ambiguity). Similarly for exception messages, and so on. The real key here is that the string types would carry their representation along when passed into routines (which have to be new for this reason). Once that is available, then any problems can be dealt with by simply using whatever representation is appropriate for the target system. ==== The above was discussed at the Lisbon 2018 meeting and considered too ambitious for the Ada 2020 timescales. It was considered useful though to add child packages Wide_File_Names and Wide_Wide_File_Names for each I/O package, containing just those operations that take a filename as a parameter, and Wide_ and Wide_Wide_ versions of Ada.Directories, Ada.Command_Line and Ada.Environment_Variables. Also, UTF-8 encoding should be recommended for exception messages. !ACATS test ACATS C-Tests are needed for the new packages (nested and otherwise). !appendix This AI was created from the ashes of AI05-0286-1 (that is, those portions that defied an easy solution). **************************************************************** From: Gautier de Montmollin Sent: Thursday, March 14, 2013 5:51 AM !topic Ada.Directories: Form parameter for all subprograms with file or directory names !reference Ada 2012 RM A.16 !from Author Gautier de Montmollin 2013-03-14 !keywords directories !discussion In Ada.Directories, only a few subprograms of those having a String for file or directory name provide a Form parameter. It prevents an implementation providing the same as for Ada.Text_IO, for instance recognizing the "encoding=utf-8" sub-string. As a result Ada.Directories becomes practically useless for software meant to run on file systems with international character sets. **************************************************************** From: Adam Beneschan Sent: Thursday, March 14, 2013 10:23 AM The Form parameter is implementation-dependent, so any solution based on adding a Form parameter is going to be implementation-dependent. Because of this, it might be better to request AdaCore or whoever your compiler vendor is to add their own package Ada.Directories.Extensions to provide the functionality you want, since it's going to be implementation-dependent anyway. It appears that the ARG is starting to think about a more permanent solution, to the general problem of allowing Wide_String and Wide_Wide_String in places where only String is currently allowed. See AI12-0021. Personally, I think something like this ought to be done. The current "solution", in which a String is used to hold UTF-8 sequences in some situations, is an obnoxious hack. A String is an array of characters; and to me, the idea that a String whose 'Length is (say) 23 can be used to represent a string that really has 19 characters in it, is an abuse. It's been allowed as a temporary compromise, because something was needed and a real solution is difficult. But it's still an abuse. If we're going to entrench the idea of using String types to hold non-String data such as UTF-8 bit encodings, we might as well give up and start programming in C. So I'm not in favor of adding anything like this to the language standard. **************************************************************** From: Jeff Cousins Sent: Thursday, March 14, 2013 10:56 AM Thanks for replying on this topic Adam. It does seem to be something that it is going to be very hard to make rules about, it looks like it's going to be up to whatever implementation is used to offer something sensible for the platform used. The discussion of AI05-0286-1/02 in the minutes of the 46th ARG meeting, publicly available at http://www.ada-auth.org/arg-minutes.html, might show that it has been thought about, but how hard an issue it is to tackle. **************************************************************** From: Gautier de Montmollin Sent: Thursday, March 14, 2013 3:04 PM Anyway, a Wide_String version with no rule is fundamentally better than the String version with no rule! For Wide_String, implementors will follow the UTF-16 de facto standard in place for 20 years at least and that's it. For the String version Ada.Directories is just dysfunctional... So please don't wait too long... **************************************************************** From: Adam Beneschan Sent: Thursday, March 14, 2013 6:33 PM I think there's some widespread and fundamental confusion when it comes to UTF and encodings. A Wide_String is just an array of Wide_Characters. A Wide_Character is, fundamentally, just a number between 0 and 65535, where each number represents a character that has been assigned that number in the Unicode Basic Multilingual Plane. A Wide_String is an array of those numbers. There should be no "encoding" involved, UTF-16 or otherwise. A Wide_String will normally be represented as just an array of 16-bit integers that mean themselves. This means that a Wide_Character in a Wide_String can't represent a character in a different plane, i.e. from U+10000 and up. But that's what Wide_Wide_String is for. And I am certain that the ARG will not produce a solution that allows Wide_Strings as file names that doesn't also allow Wide_Wide_Strings. So UTF-16 has no place in this discussion. If Ada.Directories allows Wide_Strings and Wide_Wide_Strings, the implementation may need to convert them to UTF-8 in order to communicate with the OS, but the Ada program that uses Ada.Directories doesn't need to know about this implementation detail. I feel like there are a lot of people who aren't clear on the distinctions between the concepts (and the use of things like charset="utf-8" in HTML files just adds to the confusion, since UTF-8 is really an encoding algorithm and not a character set). Hopefully I've helped clear up some confusion among a few people, but I feel like this is a losing battle. **************************************************************** From: Gautier de Montmollin Sent: Thursday, March 14, 2013 7:10 PM Could not agree more! Go for Ada.Directories with String's, Wide_String's and Wide_Wide_String's ! **************************************************************** From: Randy Brukardt Sent: Thursday, March 14, 2013 7:54 PM That's easy, but it doesn't fix anything. That's because you have to be able to Create and Open files with the result. And pass Forms that contain file names. And retrieve Names and Forms. And on and on. Note that this is completely orthogonal to what kind of I/O the package supports: it has nothing to do with Wide_Text_IO, for instance. To follow the Wide_xxx and Wide_Wide_xxx to it's logical limit, you'd have to add Wide_ and Wide_Wide_ versions of all of the file manipulation routines in *every* existing I/O package. Which is insane (no, we do not want "Wide_Wide_Open" and "Wide_Wide_Name"). Moreover, we would have to decide what happens if the name of a file that contains 32-bit characters name is retrieved via Name. And recall that Name is required to return a "name that uniquely identifies the file", which usually means including the full path. In which case, there could be 32-bit characters in the result returned by Name even if the simple name only is ASCII (if for instance the user's login name and thus home directory contained such characters). The obvious solution to this problem is to raise an exception -- but that would be incompatible with existing practice on Linux (where UTF-8 can be used in type String without any interpretation) as well as practice involving Form parameters. And it would be incompatible for anyone unfortunate enough to run their program from a directory named using characters above position 256. We need to avoid run-time incompatibilities if at all possible (because there is no automatic way to detect them); while this particular case mostly involves implementation-defined behavior, the effect would be just as dangerous to programs that depend on it. It was cases like these that caused the ARG to discard the rough proposal that I had made for Ada 2012 and decide to defer any change until the next version of Ada (whenever that will be). As far as I can tell, the only real solution is to blow it all up and start over using a tagged root type (tentatively named Root_String'Class, although maybe we'd use it only for file names in which case it would get an appropriate name for that). If the tagged root type allowed string literals (the only real change needed to the language), there wouldn't be much user-level change, and the implementations could include properly typed UTF-8 and UTF-16 strings, along with anything else that might make sense. Note that there are similar problems with Ada.Command_Line, Ada.Environment_Variables, Ada.Exceptions (the exception message part), and probably other packages that we haven't thought about. This makes blowing it up more attractive, because adding dozens of new routines that hardly anyone will want to use, and adding incompatibilities as well, does not seem like a good plan. I don't know if there will be the will to "blow it up", but in any case, there is nothing simple or easy about this problem, and it does everyone a disservice to claim that there is an "easy" solution. **************************************************************** From: Robert Leif Sent: Friday, March 15, 2013 11:25 AM I believe that an alternative solution to the problem is to proceed one step up in abstraction. A linear array generic type could be made that included the string operations of Text_IO. It could be instantiated with any type of character including 4 bit characters or even 1 bit characters. Then it could be the basis of Root_String'Class or whatever you want to call it. If anyone is interested, I have been spent my last years writing XML schemas (CytometryML.org) written in the XML Schema Definition Language (XSD). XSD 1.1 includes assertions and restriction (generics). I basically fake datatype declarations in Ada specifications. **************************************************************** From: Gautier de Montmollin Sent: Saturday, March 16, 2013 8:51 AM Another idea would be not to change the standard at all about this, and persuade at least one major compiler vendor to use utf-8 for file or directory names in Ada.Directories. For instance GNAT is applying the equivalent tactic for arguments in Ada.Command_Line, since 2008. From the Devlopment Log, NF-62-HB07-027-gnat: "Unicode characters on Windows command line On Windows Ada.Command_Line now supports Unicode characters. Arguments are returned encoded in UTF-8 allowing better handling of Unicode file names names as arguments." **************************************************************** From: Vadim Godunko Sent: Thursday, April 25, 2013 9:00 AM > Another idea would be not to change the standard at all about this, > and persuade at least one major compiler vendor to use utf-8 for file > or directory names in Ada.Directories. Use of some UTF-XX is fine for Windows and MacOSX which is UTF-based. On POSIX systems any encoding can be selected by user and it is important to use it consistently for each call to imported libraries and to do input/output operations. **************************************************************** From: Florian Weimer Sent: Sunday, April 28, 2013 9:28 AM There's also an expectation that it's possible to access files whose names are not in the encoding range of the current locale. **************************************************************** From: Randy Brukardt Sent: Wednesday, May 1, 2013 1:13 PM Huh? UTF-8 covers all locales as all possible characters are in it; there should be no adjustment afterwards or there is something quite wrong going on. Locales only apply to pure 8-bit encodings, and that's impossible to do anything sensible with. There is an issue on Windows about file name equivalence, but that's something that simply never, even should be used, because it's impossible to work out (in part because of the locale issue). Linux and Unix don't have that problem. Anyway, there are two problems here, and they're at cross-purposes. One is the desire to let IO routines open and manipulate any file that could exist on the system. Second is the desire to portably be able to create and manipulate the *name* of any file that Text_IO can *create*. On a system with no file name rules, it's clearly not possible to do both (you've got to have some rules in order to portable manipulation). The purpose of Ada.Directories is exclusively the second - *portable* manipulation of files and names. That means that by definition it will have to be more limited than "everything the system can do". Thus, using UTF-8 exclusively would be sufficient for it. OTOH, we probably don't want such a restriction in Text_IO.Open (for example). That seems OK to me, as those "old" interfaces aren't going anywhere even if a new set is created using Root_String'Class or Wide_Wide_String. If you need bizarre capabilities on Linux (such as EBCDIC file names), use the old interfaces. **************************************************************** From: Yannick Duchene Sent: Wednesday, May 1, 2013 1:47 PM While it is true Unicode covers most languages and locales characters requirements, it does cover everything possibly needed. There are two main reasons: the first, Unicode is always defining new characters as the standard evolves (which implies all possible characters are not necessarily in it), the second, Unicode is not so much welcome in some countries (like Japan) where some lobbies (official or not) do all they can to preserve their own encoding as the official encoding, arguing Unicode is missing too many specificities of their writing system. But Unicode also has private use areas, which enable enough additional local definitions (this requires a local agreements between the parties involved, an issue the Ada standard does not have to bother with). Unicode is the good choice, but will not make every one happy before a long time. I would say well-formed UTF-8, with the requirement to be transparent with code-points from private use areas: no attempt to transform, interpret or decide if whether or not such a code-point is valid or not for a file-name and always accept it as valid. (hope I did not missed the point, as I have not read all the mails on this issue) **************************************************************** From: Randy Brukardt Sent: Wednesday, May 1, 2013 6:08 PM ... > > Huh? UTF-8 covers all locales as all possible characters are in it; > > While it is true Unicode covers most languages and locales characters > requirements, it does cover everything possibly needed. There are two > main > reasons: the first, Unicode is always defining new characters as the > standard evolves (which implies all possible characters are not > necessarily in it), If they're not in Unicode, they're not anywhere. In any case, added characters are not an issue. > ... the second, Unicode is > not so much welcome in some countries (like Japan) where some lobbies > (official or not) do all they can to preserve their own encoding as > the official encoding, arguing Unicode is missing too many > specificities of their writing system. But Unicode also has private > use areas, which enable enough additional local definitions (this > requires a local agreements between the parties involved, an issue the > Ada standard does not have to bother with). > > Unicode is the good choice, but will not make every one happy before a > long time. > > I would say well-formed UTF-8, with the requirement to be transparent > with code-points from private use areas: no attempt to transform, > interpret or decide if whether or not such a code-point is valid or > not for a file-name and always accept it as valid. What's a legal file name is implementation-defined, and I certainly don't see that changing. Some characters are not allowed in Windows file names, for example, and the Ada standard cannot try to insist that they're allowed. So I find this irrelevant -- indeed, if there is any support at all for UTF-8 file names will always be implementation-defined. The problem now is that we don't have any sane way to *allow* it -- there will never be a *requirement* to support it. ****************************************************************