Version 1.2 of ai12s/ai12-0021-1.txt

Unformatted version of ai12s/ai12-0021-1.txt version 1.2
Other versions for file ai12s/ai12-0021-1.txt

!standard A.16(0/3)          12-03-13 AI12-0021-1/01
!class Amendment 12-03-13
!status work item 12-02-25
!status received 12-02-25
!priority High
!difficulty Hard
!subject Additional internationalization of Ada
!summary
**TBD.
!proposal
In addition to the facilities already provided,
(1) File and directory operations should support Unicode characters (presuming that the target file system does so);
(2) Exception messages and exception information should support Unicode characters;
(3) Command lines should support Unicode characters (presuming that the target system allows these).
!wording
** TBD.
!discussion
These issues defy an easy solution. Changing the behavior of the existing routines would break existing workarounds (which on some targets, like most Linux systems, have no problems with directly using UTF-8 strings) and other commonly used functionality (like encoding binary data in exception messages).
Adding even more Wide_Wide_ packages and routines is madness, and may not even make sense on many targets.
The crux of this problem is that the semantics and representation of strings have become co-mingled. What we really need to do is to separate these; the problem with that mostly with retaining adequate performance.
The way-out solution would be to declare a semi-magic Root_String interface (or perhaps an abstract type); the magic would be string literals (since "lvalue"s and indexing already can be supported with existing Ada 2012 facilities). Something on the line of:
package General_Strings is type Root_String is interface with Constant_Indexing => Get_Char, Variable_Indexing => Set_Char, String_Literal => Assign; -- A string literal calls this procedure.
function Get_Char (A : Root_String; I : Positive) return Wide_Wide_Character; -- Returns the Ith character of A (regardless of the representation of A).
type LValue (D : access Wide_Wide_Character) with Implicit_Dereferencing => D is null record;
function Set_Char (A : in out Root_String; I : Positive) return LValue; -- Returns a reference to the Ith character of A (regardless of the -- representation of A).
function Slice (A : Root_String; L : Positive; R : Natural) return Wide_Wide_String; -- Returns a slice of the string. (Not sure if this is a good idea, or best left out.)
-- One imagines an LValue slice as well; I'll leave that as an exercise for the reader.
procedure Assign (Trg : in out Root_String; Src : in Wide_Wide_String); -- Assigns Src to Trg, converting the representation as needed.
function Value (Obj : Root_String) return Wide_Wide_String; -- Retrieves the value of Obj as a Wide_Wide_String.
-- "&" defined in the expected way.
-- The stream attributes would be expected to work for these; perhaps they'd need -- to be part of the interface.
end General_Strings;
[Note: I didn't try to think of good names for these routines and parameters; that would need to done, of course.]
Then we'd have a bunch of concrete instances:
type Latin_1_String (L, R : Positive) is new General_Strings.Root_String with Obj : String (L .. R);
-- The obvious implementations of the routines.
type Bounded_UTF_8_String (Byte_Len : Natural) is new General_Strings.Root_String with Obj : UTF_8_String (1 .. Byte_Len);
-- The not-quite-so-obvious implementations of the routines.
and so on for every interesting representation.
In addition, we'd have Ada.Strings.General (which would have approximately the contents of Ada.Strings.Fixed, with all of the String parameters converted to Root_String'Class). And most of the IO routines that take strings would have versions that would take Root_String'Class (these would need different names or packages, unfortunately, to avoid ambiguity). Similarly for exception messages, and so on.
The real key here is that the string types would carry their representation along when passed into routines (which have to be new for this reason). Once that is available, then any problems can be dealt with by simply using whatever representation is appropriate for the target system.
This clearly will need a lot of work...
!ACATS test
** TBD.
!appendix

This AI was created from the ashes of AI05-0286-1 (that is, those portions
that defied an easy solution).

****************************************************************

From: Gautier de Montmollin
Sent: Thursday, March 14, 2013  5:51 AM

!topic Ada.Directories: Form parameter for all subprograms with file or directory names
!reference Ada 2012 RM A.16
!from Author Gautier de Montmollin 2013-03-14
!keywords directories
!discussion

In Ada.Directories, only a few subprograms of those having a String for file or
directory name provide a Form parameter. It prevents an implementation providing
the same as for Ada.Text_IO, for instance recognizing the "encoding=utf-8"
sub-string. As a result Ada.Directories becomes practically useless for software
meant to run on file systems with international character sets.

****************************************************************

From: Adam Beneschan
Sent: Thursday, March 14, 2013  10:23 AM

The Form parameter is implementation-dependent, so any solution based on adding
a Form parameter is going to be implementation-dependent. Because of this, it
might be better to request AdaCore or whoever your compiler vendor is to add
their own package Ada.Directories.Extensions to provide the functionality you
want, since it's going to be implementation-dependent anyway.

It appears that the ARG is starting to think about a more permanent solution, to
the general problem of allowing Wide_String and Wide_Wide_String in places where
only String is currently allowed. See AI12-0021.  Personally, I think something
like this ought to be done.  The current "solution", in which a String is used
to hold UTF-8 sequences in some situations, is an obnoxious hack.  A String is
an array of characters; and to me, the idea that a String whose 'Length is (say)
23 can be used to represent a string that really has 19 characters in it, is an
abuse.  It's been allowed as a temporary compromise, because something was
needed and a real solution is difficult.  But it's still an abuse.  If we're
going to entrench the idea of using String types to hold non-String data such as
UTF-8 bit encodings, we might as well give up and start programming in C.

So I'm not in favor of adding anything like this to the language standard.

****************************************************************

From: Jeff Cousins
Sent: Thursday, March 14, 2013  10:56 AM

Thanks for replying on this topic Adam.  It does seem to be something that it is
going to be very hard to make rules about, it looks like it's going to be up to
whatever implementation is used to offer something sensible for the platform
used.  The discussion of AI05-0286-1/02 in the minutes of the 46th ARG meeting,
publicly available at http://www.ada-auth.org/arg-minutes.html, might show that
it has been thought about, but how hard an issue it is to tackle.

****************************************************************

From: Gautier de Montmollin
Sent: Thursday, March 14, 2013  3:04 PM

Anyway, a Wide_String version with no rule is fundamentally better than the
String version with no rule! For Wide_String, implementors will follow the
UTF-16 de facto standard in place for 20 years at least and that's it. For the
String version Ada.Directories is just dysfunctional... So please don't wait too
long...

****************************************************************

From: Adam Beneschan
Sent: Thursday, March 14, 2013  6:33 PM

I think there's some widespread and fundamental confusion when it comes to UTF
and encodings.  A Wide_String is just an array of Wide_Characters.  A
Wide_Character is, fundamentally, just a number between 0 and 65535, where each
number represents a character that has been assigned that number in the Unicode
Basic Multilingual Plane.  A Wide_String is an array of those numbers.  There
should be no "encoding" involved, UTF-16 or otherwise.  A Wide_String will
normally be represented as just an array of 16-bit integers that mean
themselves.

This means that a Wide_Character in a Wide_String can't represent a character in
a different plane, i.e. from U+10000 and up.  But that's what Wide_Wide_String
is for.  And I am certain that the ARG will not produce a solution that allows
Wide_Strings as file names that doesn't also allow Wide_Wide_Strings.

So UTF-16 has no place in this discussion.  If Ada.Directories allows
Wide_Strings and Wide_Wide_Strings, the implementation may need to convert them
to UTF-8 in order to communicate with the OS, but the Ada program that uses
Ada.Directories doesn't need to know about this implementation detail.

I feel like there are a lot of people who aren't clear on the distinctions
between the concepts (and the use of things like charset="utf-8" in HTML files
just adds to the confusion, since UTF-8 is really an encoding algorithm and not
a character set).  Hopefully I've helped clear up some confusion among a few
people, but I feel like this is a losing battle.

****************************************************************

From: Gautier de Montmollin
Sent: Thursday, March 14, 2013  7:10 PM

Could not agree more!
Go for Ada.Directories with String's, Wide_String's and Wide_Wide_String's !

****************************************************************

From: Randy Brukardt
Sent: Thursday, March 14, 2013  7:54 PM

That's easy, but it doesn't fix anything. That's because you have to be able to
Create and Open files with the result. And pass Forms that contain file names.
And retrieve Names and Forms. And on and on.

Note that this is completely orthogonal to what kind of I/O the package
supports: it has nothing to do with Wide_Text_IO, for instance.

To follow the Wide_xxx and Wide_Wide_xxx to it's logical limit, you'd have to
add Wide_ and Wide_Wide_ versions of all of the file manipulation routines in
*every* existing I/O package. Which is insane (no, we do not want
"Wide_Wide_Open" and "Wide_Wide_Name").

Moreover, we would have to decide what happens if the name of a file that
contains 32-bit characters name is retrieved via Name. And recall that Name is
required to return a "name that uniquely identifies the file", which usually
means including the full path. In which case, there could be 32-bit characters
in the result returned by Name even if the simple name only is ASCII (if for
instance the user's login name and thus home directory contained such
characters).

The obvious solution to this problem is to raise an exception -- but that would
be incompatible with existing practice on Linux (where UTF-8 can be used in type
String without any interpretation) as well as practice involving Form
parameters. And it would be incompatible for anyone unfortunate enough to run
their program from a directory named using characters above position 256. We
need to avoid run-time incompatibilities if at all possible (because there is no
automatic way to detect them); while this particular case mostly involves
implementation-defined behavior, the effect would be just as dangerous to
programs that depend on it.

It was cases like these that caused the ARG to discard the rough proposal that I
had made for Ada 2012 and decide to defer any change until the next version of
Ada (whenever that will be).

As far as I can tell, the only real solution is to blow it all up and start over
using a tagged root type (tentatively named Root_String'Class, although maybe
we'd use it only for file names in which case it would get an appropriate name
for that). If the tagged root type allowed string literals (the only real change
needed to the language), there wouldn't be much user-level change, and the
implementations could include properly typed UTF-8 and UTF-16 strings, along
with anything else that might make sense.

Note that there are similar problems with Ada.Command_Line,
Ada.Environment_Variables, Ada.Exceptions (the exception message part), and
probably other packages that we haven't thought about. This makes blowing it up
more attractive, because adding dozens of new routines that hardly anyone will
want to use, and adding incompatibilities as well, does not seem like a good
plan.

I don't know if there will be the will to "blow it up", but in any case, there
is nothing simple or easy about this problem, and it does everyone a disservice
to claim that there is an "easy" solution.

****************************************************************

From: Robert Leif
Sent: Friday, March 15, 2013  11:25 AM

I believe that an alternative solution to the problem is to proceed one step up
in abstraction. A linear array generic type could be made that included the
string operations of Text_IO. It could be instantiated with any type of
character including 4 bit characters or even 1 bit characters. Then it could be
the basis of Root_String'Class or whatever you want to call it.

If anyone is interested, I have been spent my last years writing XML schemas
(CytometryML.org) written in the XML Schema Definition Language (XSD). XSD
1.1 includes assertions and restriction (generics). I basically fake datatype
declarations in Ada specifications.

****************************************************************

From: Gautier de Montmollin
Sent: Saturday, March 16, 2013  8:51 AM

Another idea would be not to change the standard at all about this, and persuade
at least one major compiler vendor to use utf-8 for file or directory names in
Ada.Directories. For instance GNAT is applying the equivalent tactic for
arguments in Ada.Command_Line, since 2008. From the Devlopment Log,
NF-62-HB07-027-gnat:

"Unicode characters on Windows command line On Windows Ada.Command_Line now
supports Unicode characters. Arguments are returned encoded in UTF-8 allowing
better handling of Unicode file names names as arguments."

****************************************************************

From: Vadim Godunko
Sent: Thursday, April 25, 2013  9:00 AM

> Another idea would be not to change the standard at all about this,
> and persuade at least one major compiler vendor to use utf-8 for file
> or directory names in Ada.Directories.

Use of some UTF-XX is fine for Windows and MacOSX which is UTF-based. On POSIX
systems any encoding can be selected by user and it is important to use it
consistently for each call to imported libraries and to do input/output
operations.

****************************************************************

From: Florian Weimer
Sent: Sunday, April 28, 2013  9:28 AM

There's also an expectation that it's possible to access files whose names are
not in the encoding range of the current locale.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, May  1, 2013  1:13 PM

Huh? UTF-8 covers all locales as all possible characters are in it; there should
be no adjustment afterwards or there is something quite wrong going on. Locales
only apply to pure 8-bit encodings, and that's impossible to do anything
sensible with. There is an issue on Windows about file name equivalence, but
that's something that simply never, even should be used, because it's impossible
to work out (in part because of the locale issue). Linux and Unix don't have
that problem.

Anyway, there are two problems here, and they're at cross-purposes.

One is the desire to let IO routines open and manipulate any file that could
exist on the system.

Second is the desire to portably be able to create and manipulate the *name* of
any file that Text_IO can *create*.

On a system with no file name rules, it's clearly not possible to do both
(you've got to have some rules in order to portable manipulation).

The purpose of Ada.Directories is exclusively the second - *portable*
manipulation of files and names. That means that by definition it will have to
be more limited than "everything the system can do". Thus, using UTF-8
exclusively would be sufficient for it.

OTOH, we probably don't want such a restriction in Text_IO.Open (for example).
That seems OK to me, as those "old" interfaces aren't going anywhere even if a
new set is created using Root_String'Class or Wide_Wide_String. If you need
bizarre capabilities on Linux (such as EBCDIC file names), use the old
interfaces.

****************************************************************

From: Yannick Duchene
Sent: Wednesday, May  1, 2013  1:47 PM

While it is true Unicode covers most languages and locales characters
requirements, it does cover everything possibly needed. There are two main
reasons: the first, Unicode is always defining new characters as the standard
evolves (which implies all possible characters are not necessarily in it), the
second, Unicode is not so much welcome in some countries (like Japan) where some
lobbies (official or not) do all they can to preserve their own encoding as the
official encoding, arguing Unicode is missing too many specificities of their
writing system. But Unicode also has private use areas, which enable enough
additional local definitions (this requires a local agreements between the
parties involved, an issue the Ada standard does not have to bother with).

Unicode is the good choice, but will not make every one happy before a long
time.

I would say well-formed UTF-8, with the requirement to be transparent with
code-points from private use areas: no attempt to transform, interpret or decide
if whether or not such a code-point is valid or not for a file-name and always
accept it as valid.

(hope I did not missed the point, as I have not read all the mails on this
issue)

****************************************************************

From: Randy Brukardt
Sent: Wednesday, May  1, 2013  6:08 PM

...
> > Huh? UTF-8 covers all locales as all possible characters are in it;
>
> While it is true Unicode covers most languages and locales characters
> requirements, it does cover everything possibly needed. There are two
> main
> reasons: the first, Unicode is always defining new characters as the
> standard evolves (which implies all possible characters are not
> necessarily in it),

If they're not in Unicode, they're not anywhere. In any case, added characters
are not an issue.

> ... the second, Unicode is
> not so much welcome in some countries (like Japan) where some lobbies
> (official or not) do all they can to preserve their own encoding as
> the official encoding, arguing Unicode is missing too many
> specificities of their writing system. But Unicode also has private
> use areas, which enable enough additional local definitions (this
> requires a local agreements between the parties involved, an issue the
> Ada standard does not have to bother with).
>
> Unicode is the good choice, but will not make every one happy before a
> long time.
>
> I would say well-formed UTF-8, with the requirement to be transparent
> with code-points from private use areas: no attempt to transform,
> interpret or decide if whether or not such a code-point is valid or
> not for a file-name and always accept it as valid.

What's a legal file name is implementation-defined, and I certainly don't see
that changing. Some characters are not allowed in Windows file names, for
example, and the Ada standard cannot try to insist that they're allowed. So I
find this irrelevant -- indeed, if there is any support at all for UTF-8 file
names will always be implementation-defined. The problem now is that we don't
have any sane way to *allow* it -- there will never be a *requirement* to
support it.

****************************************************************


Questions? Ask the ACAA Technical Agent