CVS difference for ai12s/ai12-0021-1.txt

Differences between 1.1 and version 1.2

--- ai12s/ai12-0021-1.txt	2012/03/13 07:01:10	1.1
+++ ai12s/ai12-0021-1.txt	2013/05/09 23:26:40	1.2
@@ -28,22 +28,22 @@
 
 !discussion
 
-These issues defy an easy solution. Changing the behavior of the existing routines
-would break existing workarounds (which on some targets, like most Linux systems,
-have no problems with directly using UTF-8 strings) and other commonly used
-functionality (like encoding binary data in exception messages).
-
-Adding even more Wide_Wide_ packages and routines is madness, and may not even make
-sense on many targets.
-
-The crux of this problem is that the semantics and representation of strings have
-become co-mingled. What we really need to do is to separate these; the problem with
-that mostly with retaining adequate performance.
+These issues defy an easy solution. Changing the behavior of the existing
+routines would break existing workarounds (which on some targets, like most
+Linux systems, have no problems with directly using UTF-8 strings) and other
+commonly used functionality (like encoding binary data in exception messages).
+
+Adding even more Wide_Wide_ packages and routines is madness, and may not even
+make sense on many targets.
+
+The crux of this problem is that the semantics and representation of strings
+have become co-mingled. What we really need to do is to separate these; the
+problem with that is mostly with retaining adequate performance.
 
 The way-out solution would be to declare a semi-magic Root_String interface (or
-perhaps an abstract type); the
-magic would be string literals (since "lvalue"s and indexing already can be
-supported with existing Ada 2012 facilities). Something on the line of:
+perhaps an abstract type); the magic would be string literals (since "lvalue"s
+and indexing already can be supported with existing Ada 2012 facilities).
+Something along the lines of:
 
         package General_Strings is
             type Root_String is interface with
@@ -82,8 +82,8 @@
         end General_Strings;
 
 
-[Note: I didn't try to think of good names for these routines and parameters; that would need to
-done, of course.]
+[Note: I didn't try to think of good names for these routines and parameters;
+that would need to be done, of course.]
 
 Then we'd have a bunch of concrete instances:
 
@@ -92,7 +92,7 @@
 
         -- The obvious implementations of the routines.
 
-        
+
         type Bounded_UTF_8_String (Byte_Len : Natural) is new General_Strings.Root_String with
             Obj : UTF_8_String (1 .. Byte_Len);
 
@@ -100,15 +100,17 @@
 
 and so on for every interesting representation.
 
-In addition, we'd have Ada.Strings.General (which would have approximately the contents of
-Ada.Strings.Fixed, with all of the String parameters converted to Root_String'Class). And
-most of the IO routines that take strings would have versions that would take Root_String'Class
-(these would need different names or packages, unfortunately, to avoid ambiguity). Similarly
-for exception messages, and so on.
-
-The real key here is that the string types would carry their representation along when passed
-into routines (which have to be new for this reason). Once that is available, then any problems
-can be dealt with by simply using whatever representation is appropriate for the target system.
+In addition, we'd have Ada.Strings.General (which would have approximately the
+contents of Ada.Strings.Fixed, with all of the String parameters converted to
+Root_String'Class). And most of the IO routines that take strings would have
+versions that would take Root_String'Class (these would need different names or
+packages, unfortunately, to avoid ambiguity). Similarly for exception messages,
+and so on.
+
+The real key here is that the string types would carry their representation
+along when passed into routines (which have to be new for this reason). Once
+that is available, then any problems can be dealt with by simply using whatever
+representation is appropriate for the target system.
 
 This clearly will need a lot of work...
 
@@ -120,6 +122,325 @@
 
 This AI was created from the ashes of AI05-0286-1 (that is, those portions
 that defied an easy solution).
+
+****************************************************************
+
+From: Gautier de Montmollin
+Sent: Thursday, March 14, 2013  5:51 AM
+
+!topic Ada.Directories: Form parameter for all subprograms with file or directory names
+!reference Ada 2012 RM A.16
+!from Author Gautier de Montmollin 2013-03-14
+!keywords directories
+!discussion
+
+In Ada.Directories, only a few of the subprograms taking a String file or
+directory name provide a Form parameter. This prevents an implementation from
+offering what Ada.Text_IO does, for instance recognizing the "encoding=utf-8"
+sub-string. As a result Ada.Directories becomes practically useless for software
+meant to run on file systems with international character sets.
+
+****************************************************************
+
+From: Adam Beneschan
+Sent: Thursday, March 14, 2013  10:23 AM
+
+The Form parameter is implementation-dependent, so any solution based on adding
+a Form parameter is going to be implementation-dependent. Because of this, it
+might be better to request AdaCore or whoever your compiler vendor is to add
+their own package Ada.Directories.Extensions to provide the functionality you
+want, since it's going to be implementation-dependent anyway.
+
+It appears that the ARG is starting to think about a more permanent solution, to
+the general problem of allowing Wide_String and Wide_Wide_String in places where
+only String is currently allowed. See AI12-0021.  Personally, I think something
+like this ought to be done.  The current "solution", in which a String is used
+to hold UTF-8 sequences in some situations, is an obnoxious hack.  A String is
+an array of characters; and to me, the idea that a String whose 'Length is (say)
+23 can be used to represent a string that really has 19 characters in it, is an
+abuse.  It's been allowed as a temporary compromise, because something was
+needed and a real solution is difficult.  But it's still an abuse.  If we're
+going to entrench the idea of using String types to hold non-String data such as
+UTF-8 bit encodings, we might as well give up and start programming in C.
+
+So I'm not in favor of adding anything like this to the language standard.
+
+****************************************************************
+
+From: Jeff Cousins
+Sent: Thursday, March 14, 2013  10:56 AM
+
+Thanks for replying on this topic Adam.  It does seem to be something that it is
+going to be very hard to make rules about, it looks like it's going to be up to
+whatever implementation is used to offer something sensible for the platform
+used.  The discussion of AI05-0286-1/02 in the minutes of the 46th ARG meeting,
+publicly available at http://www.ada-auth.org/arg-minutes.html, might show that
+it has been thought about, and how hard an issue it is to tackle.
+
+****************************************************************
+
+From: Gautier de Montmollin
+Sent: Thursday, March 14, 2013  3:04 PM
+
+Anyway, a Wide_String version with no rule is fundamentally better than the
+String version with no rule! For Wide_String, implementors will follow the
+UTF-16 de facto standard in place for 20 years at least and that's it. For the
+String version Ada.Directories is just dysfunctional... So please don't wait too
+long...
+
+****************************************************************
+
+From: Adam Beneschan
+Sent: Thursday, March 14, 2013  6:33 PM
+
+I think there's some widespread and fundamental confusion when it comes to UTF
+and encodings.  A Wide_String is just an array of Wide_Characters.  A
+Wide_Character is, fundamentally, just a number between 0 and 65535, where each
+number represents a character that has been assigned that number in the Unicode
+Basic Multilingual Plane.  A Wide_String is an array of those numbers.  There
+should be no "encoding" involved, UTF-16 or otherwise.  A Wide_String will
+normally be represented as just an array of 16-bit integers that mean
+themselves.
+
+This means that a Wide_Character in a Wide_String can't represent a character in
+a different plane, i.e. from U+10000 and up.  But that's what Wide_Wide_String
+is for.  And I am certain that the ARG will not produce a solution that allows
+Wide_Strings as file names that doesn't also allow Wide_Wide_Strings.
+
+So UTF-16 has no place in this discussion.  If Ada.Directories allows
+Wide_Strings and Wide_Wide_Strings, the implementation may need to convert them
+to UTF-8 in order to communicate with the OS, but the Ada program that uses
+Ada.Directories doesn't need to know about this implementation detail.
+
+I feel like there are a lot of people who aren't clear on the distinctions
+between the concepts (and the use of things like charset="utf-8" in HTML files
+just adds to the confusion, since UTF-8 is really an encoding algorithm and not
+a character set).  Hopefully I've helped clear up some confusion among a few
+people, but I feel like this is a losing battle.
+
+****************************************************************
+
+From: Gautier de Montmollin
+Sent: Thursday, March 14, 2013  7:10 PM
+
+Could not agree more!
+Go for Ada.Directories with String's, Wide_String's and Wide_Wide_String's !
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, March 14, 2013  7:54 PM
+
+That's easy, but it doesn't fix anything. That's because you have to be able to
+Create and Open files with the result. And pass Forms that contain file names.
+And retrieve Names and Forms. And on and on.
+
+Note that this is completely orthogonal to what kind of I/O the package
+supports: it has nothing to do with Wide_Text_IO, for instance.
+
+To follow the Wide_xxx and Wide_Wide_xxx to its logical limit, you'd have to
+add Wide_ and Wide_Wide_ versions of all of the file manipulation routines in
+*every* existing I/O package. Which is insane (no, we do not want
+"Wide_Wide_Open" and "Wide_Wide_Name").
+
+Moreover, we would have to decide what happens if the name of a file that
+contains 32-bit characters is retrieved via Name. And recall that Name is
+required to return a "name that uniquely identifies the file", which usually
+means including the full path. In which case, there could be 32-bit characters
+in the result returned by Name even if the simple name only is ASCII (if for
+instance the user's login name and thus home directory contained such
+characters).
+
+The obvious solution to this problem is to raise an exception -- but that would
+be incompatible with existing practice on Linux (where UTF-8 can be used in type
+String without any interpretation) as well as practice involving Form
+parameters. And it would be incompatible for anyone unfortunate enough to run
+their program from a directory named using characters above position 256. We
+need to avoid run-time incompatibilities if at all possible (because there is no
+automatic way to detect them); while this particular case mostly involves
+implementation-defined behavior, the effect would be just as dangerous to
+programs that depend on it.
+
+It was cases like these that caused the ARG to discard the rough proposal that I
+had made for Ada 2012 and decide to defer any change until the next version of
+Ada (whenever that will be).
+
+As far as I can tell, the only real solution is to blow it all up and start over
+using a tagged root type (tentatively named Root_String'Class, although maybe
+we'd use it only for file names in which case it would get an appropriate name
+for that). If the tagged root type allowed string literals (the only real change
+needed to the language), there wouldn't be much user-level change, and the
+implementations could include properly typed UTF-8 and UTF-16 strings, along
+with anything else that might make sense.
+
+Note that there are similar problems with Ada.Command_Line,
+Ada.Environment_Variables, Ada.Exceptions (the exception message part), and
+probably other packages that we haven't thought about. This makes blowing it up
+more attractive, because adding dozens of new routines that hardly anyone will
+want to use, and adding incompatibilities as well, does not seem like a good
+plan.
+
+I don't know if there will be the will to "blow it up", but in any case, there
+is nothing simple or easy about this problem, and it does everyone a disservice
+to claim that there is an "easy" solution.
+
+****************************************************************
+
+From: Robert Leif
+Sent: Friday, March 15, 2013  11:25 AM
+
+I believe that an alternative solution to the problem is to proceed one step up
+in abstraction. A linear array generic type could be made that included the
+string operations of Text_IO. It could be instantiated with any type of
+character including 4 bit characters or even 1 bit characters. Then it could be
+the basis of Root_String'Class or whatever you want to call it.
+
+If anyone is interested, I have spent my last years writing XML schemas
+(CytometryML.org) written in the XML Schema Definition Language (XSD). XSD
+1.1 includes assertions and restriction (generics). I basically fake datatype
+declarations in Ada specifications.
+
+****************************************************************
+
+From: Gautier de Montmollin
+Sent: Saturday, March 16, 2013  8:51 AM
+
+Another idea would be not to change the standard at all about this, and persuade
+at least one major compiler vendor to use utf-8 for file or directory names in
+Ada.Directories. For instance GNAT is applying the equivalent tactic for
+arguments in Ada.Command_Line, since 2008. From the Devlopment Log,
+NF-62-HB07-027-gnat:
+
+"Unicode characters on Windows command line On Windows Ada.Command_Line now
+supports Unicode characters. Arguments are returned encoded in UTF-8 allowing
+better handling of Unicode file names as arguments."
+
+****************************************************************
+
+From: Vadim Godunko
+Sent: Thursday, April 25, 2013  9:00 AM
+
+> Another idea would be not to change the standard at all about this,
+> and persuade at least one major compiler vendor to use utf-8 for file
+> or directory names in Ada.Directories.
+
+Use of some UTF-XX is fine for Windows and MacOSX, which are UTF-based. On POSIX
+systems any encoding can be selected by the user and it is important to use it
+consistently for each call to imported libraries and to do input/output
+operations.
+
+****************************************************************
+
+From: Florian Weimer
+Sent: Sunday, April 28, 2013  9:28 AM
+
+There's also an expectation that it's possible to access files whose names are
+not in the encoding range of the current locale.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, May  1, 2013  1:13 PM
+
+Huh? UTF-8 covers all locales as all possible characters are in it; there should
+be no adjustment afterwards or there is something quite wrong going on. Locales
+only apply to pure 8-bit encodings, and that's impossible to do anything
+sensible with. There is an issue on Windows about file name equivalence, but
+that's something that simply never, ever should be used, because it's impossible
+to work out (in part because of the locale issue). Linux and Unix don't have
+that problem.
+
+Anyway, there are two problems here, and they're at cross-purposes.
+
+One is the desire to let IO routines open and manipulate any file that could
+exist on the system.
+
+Second is the desire to portably be able to create and manipulate the *name* of
+any file that Text_IO can *create*.
+
+On a system with no file name rules, it's clearly not possible to do both
+(you've got to have some rules in order to have portable manipulation).
+
+The purpose of Ada.Directories is exclusively the second - *portable*
+manipulation of files and names. That means that by definition it will have to
+be more limited than "everything the system can do". Thus, using UTF-8
+exclusively would be sufficient for it.
+
+OTOH, we probably don't want such a restriction in Text_IO.Open (for example).
+That seems OK to me, as those "old" interfaces aren't going anywhere even if a
+new set is created using Root_String'Class or Wide_Wide_String. If you need
+bizarre capabilities on Linux (such as EBCDIC file names), use the old
+interfaces.
+
+****************************************************************
+
+From: Yannick Duchene
+Sent: Wednesday, May  1, 2013  1:47 PM
+
+While it is true Unicode covers most languages and locales characters
+requirements, it does not cover everything possibly needed. There are two main
+reasons: the first, Unicode is always defining new characters as the standard
+evolves (which implies all possible characters are not necessarily in it), the
+second, Unicode is not so much welcome in some countries (like Japan) where some
+lobbies (official or not) do all they can to preserve their own encoding as the
+official encoding, arguing Unicode is missing too many specificities of their
+writing system. But Unicode also has private use areas, which enable enough
+additional local definitions (this requires a local agreement between the
+parties involved, an issue the Ada standard does not have to bother with).
+
+Unicode is the good choice, but will not make every one happy before a long
+time.
+
+I would say well-formed UTF-8, with the requirement to be transparent with
+code-points from private use areas: no attempt to transform, interpret or decide
+whether or not such a code-point is valid for a file-name, and always
+accept it as valid.
+
+(hope I did not miss the point, as I have not read all the mails on this
+issue)
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, May  1, 2013  6:08 PM
+
+...
+> > Huh? UTF-8 covers all locales as all possible characters are in it;
+>
+> While it is true Unicode covers most languages and locales characters
+> requirements, it does not cover everything possibly needed. There are two
+> main
+> reasons: the first, Unicode is always defining new characters as the
+> standard evolves (which implies all possible characters are not
+> necessarily in it),
+
+If they're not in Unicode, they're not anywhere. In any case, added characters
+are not an issue.
+
+> ... the second, Unicode is
+> not so much welcome in some countries (like Japan) where some lobbies
+> (official or not) do all they can to preserve their own encoding as
+> the official encoding, arguing Unicode is missing too many
+> specificities of their writing system. But Unicode also has private
+> use areas, which enable enough additional local definitions (this
+> requires a local agreement between the parties involved, an issue the
+> Ada standard does not have to bother with).
+>
+> Unicode is the good choice, but will not make every one happy before a
+> long time.
+>
+> I would say well-formed UTF-8, with the requirement to be transparent
+> with code-points from private use areas: no attempt to transform,
+> interpret or decide whether or not such a code-point is valid for a
+> file-name, and always accept it as valid.
+
+What's a legal file name is implementation-defined, and I certainly don't see
+that changing. Some characters are not allowed in Windows file names, for
+example, and the Ada standard cannot try to insist that they're allowed. So I
+find this irrelevant -- indeed, whether there is any support at all for UTF-8
+file names will always be implementation-defined. The problem now is that we
+don't have any sane way to *allow* it -- there will never be a *requirement* to
+support it.
 
 ****************************************************************
 
