Version 1.2 of ais/ai-00285.txt

Unformatted version of ais/ai-00285.txt version 1.2
Other versions for file ais/ai-00285.txt

!standard A.3.2(49)          02-01-23 AI95-00285/00
!class amendment 02-01-23
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Latin-9 and Ada.Characters.Handling
!summary
!problem
Latin-9 has been introduced.
!proposal
!discussion
!example
!ACATS test
!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002  4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a
variant of the Latin 1 character set that is designated Latin 9.

A web page describing Latin 9 can be viewed at:

  http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor
differences between Latin 1 and Latin 9:

  ISO Latin 9 as compared with ISO Latin 1

  The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
  ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
  sign and some national letters used e.g. in French and Finnish have been
  introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9,
analogous to Ada.Characters.Latin_1, to define character constants for this
new character set.

Robert Dewar asked me to post the following remarks from him
re Latin-9 and Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official
names of all characters, and I have copied its abbreviated naming style
for the new characters in Latin-9.

I have a gripe with the RM here. The setup for Ada.Characters.Latin_1 is
to have separate packages for separate character sets, which makes perfectly
good sense:

27   An implementation may provide additional packages as children of
Ada.Characters, to declare names for the symbols of the local character set
or other character sets.

But for Characters.Handling, we have the odd statement:

49   If an implementation provides a localized definition of Character or
Wide_Character, then the effects of the subprograms in Characters.Handling
should reflect the localizations.  See also 3.5.2.

which implies that some mysterious transformation happens on this package
(under what circumstnaces?) I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd
for the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program,
whereas the design for Ada.Characters.Latin_1 etc seems to accomodate this.

My recommendation is that an implementation generate separate packages,
called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002  5:05 PM

>   The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
>   ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
>   sign and some national letters used e.g. in French and Finnish have been
>   introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and that
we now can write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002  1:44 AM

Better put that on the agenda for the next ARG. Ada 2005
should use Latin 9 rather than Latin 1.  A minor change.
Might be a few incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002  12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to
write French makes it look very interesting to me.

On the other hand, it is not too useful for Ada to support Latin 9 if the
OSes don't: if I emit the character OE and it print out as 1/4 on my screen,
I didn't gain much.

So while I agree that we should consider supporting Latin 9 _in_addition_ to
Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I
am ready to bet that we will still have Latin 1 OSes ten years from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002  1:33 AM

It was somewhat of a jokey suggestion as I am sure you are aware.

Indeed I had a big problem when writing my book and
displaying the type Character. I wrote it in QuarkXpress on
a PC and it was fine. The publishers moved it to a Mac
before printing and some characters came out wrong.  One of
them came out as a picture of an apple. Moreover, someone
had bitten a lump out of it. So much for standards I
thought.

But supporting Latin-9 would be nice. All those adverts on
the Paris Metro for eating an oeuf can then be printed
properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002  1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual
where type Character lives.  I think Randy had some trouble with the
updated RM, too.  At least we didn't try to show type Wide_Character in
its full glory.  ;-)

7-bit ascii will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002  2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose
from.  ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
> being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

That makes sense.

But I think the RM statement you complain about is envisioning a
nonstandard version of Standard.[Wide_]Character, which is a separate
issue.  I don't see that as a big deal -- if you don't think it's a good
idea, don't implement any such thing.  I tend to agree that compiler
switches and the like shouldn't normally be meddling with the semantics
of packages Standard and Characters.Handling without a very good reason.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> But I think the RM statement you complain about is envisioning a
> nonstandard version of Standard.[Wide_]Character, which is a separate
> issue.

If you use Latin 9 for Standard.Character, this is certainly a
non-standard version, and Ada.Characters.Handling has to be modified
to remain useful.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> Better put that on the agenda for the next ARG. Ada 2005
> should use Latin 9 rather than Latin 1.  A minor change.
> Might be a few incompatibilities.

I disagree.  With Latin 9, the mapping from Character to
Wide_Character is less straightforward, and this could have unexpected
results.

OTOH, it seems that Wide_Character is not widely used (unless you are
forced to do so by ASIS), so this might not matter much.

In addition, we really should add Wide_Wide_Character (which covers
the sixteen additional planes), or make Wide_Character itself wider.
Otherwise, using Unicode with standard Ada will be rather painful.

*************************************************************

From: Florian Weimer
Sent: Saturday, April 20, 2002  3:18 AM

ISO 10636-1:2000 extends the Universal Character Set beyond 16 bits,
and 10646-2:2001 allocates characters outside the Basic Multilingual
Plane.

Not too long ago, quite a few people assumed that characters beyond
the BMP would be interesting only for rather esoteric scholarly use
(Linear B is a perfect example).  However, we now have got at least
different sets of code positions outside the BMP which will see more
widespread use eventually: the mathematical alphabets and Plane 14
Language Tags (which are required to make some Japanese people happy
who fear that Japanese characters are rendered using Chinese glyphs).

Therefore, I think Ada 200X should somehow support characters outside
the BMP.

A few random thoughts (sorry, I'm probably not using strict ISO 10646
terminology):

  * Several major vendors have adopted ISO 10646-1:1993 early, using a
    16 bit representation for characters (i.e. wchar_t in C is 16
    bits).

These vendors include Sun (Java) and Microsoft (Windows), and probably
most proprietary UNIX vendors.  These vendor implementations now cover
the code positions beyond the BMP using UTF-16, which uses surrogate
pairs (a single character is represented using two 16 bit values from
reserved ranges in the BMP).

UTF-16 has got a few drawbacks: the ordering (in terms UCS code
positions) is no longer lexicographic (which leads us to such brain
damage as CESU-8), dealing with individual characters is complicated,
and you cannot implement the C wide character functions properly.

For Ada, numerous changes would be required if we want to expose the
UTF-16 representation to programmers, for example by declaring
Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
longer be arrays of characters indexed by position).

GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
characters in a single 32 bit value, that is, UTF-32), and while this
is certainly not the "industry standard" (it is encouraged by ISO
9899:1999, though), I really hope we can use this approach (UTF-32
internal representation) for Ada, as it simplifies things
considerably, especially if we want to add character properties
support (see below).

  * We could add Wide_Wide_Character and Wide_Wide_String types to
    pacakge Standard (and extending the Ada.Strings hierarchy), which
    are encoded in UTF-32.

I don't know if this is necessary.  IIRC, Robert Dewar once told that
the only applications using Wide_Character are based on ASIS, where
using Wide_Character is not really voluntarily.  Maybe it is possible
to bump Wide_Character'Size to 32 bits instead, without really
breaking backwards compatibility.

Of course, we would need a way to converted UTF-32 strings to UTF-16
strings and vice versa (the UTF-16 string type could become a
second-class citizen, though, without full support in the Ada.Strings
hierarchy).

  * External representation of UCS characters is rapidly moving
    towards UTF-8 (especially in Internet standards).

Ada should provide an interface for converting between the wide string
type(s) and UTF-8 octet sequences.  It should be possible to use
string literals where UTF-8 strings are expected.

  * Supporting higher levels of Unicode (e.g. accessing the character
    properties database, normalization forms) would be interesting,
    too.

Such documents will eventually follow in the ISO 10646 series, but I
don't know if the ISO standard will be ready for Ada 200X.  Currently,
only the Unicode Consortium has standardized or documented issues like
character properties or terminal behavior in detail.

I don't know how ISO reacts if ISO standards refer to competing
standardization efforts.  IEEE POSIX.1 (and probably, or already, ISO
POSIX) standardizes the BSD sockets interface, and not OSI, so maybe
this isn't an issue.

In any case, this point is mostly a library issue which can be
addressed by a community implementation effort, it does not require
changes in the Ada language (adding Wide_Wide_Character does, for
example).

*************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002  8:32 AM

> ISO 10636-1:2000 extends the Universal Character Set beyond 16 bits,
> and 10646-2:2001 allocates characters outside the Basic Multilingual
> Plane.
>
> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

The normalization of new character sets (both as part of 10646 and of 8859)
was actually discussed at the last ARG meeting, and I was given an action
item to somehow integrate them in the language, probably as some kind of
amendment AI.

> A few random thoughts (sorry, I'm probably not using strict ISO 10646
> terminology):
>
>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>     16 bit representation for characters (i.e. wchar_t in C is 16
>     bits).

Which is fine as it maps directly to Ada's wide character.  I still think
that we want to retain the capacity of using 16-bit blobs to represent
characters in the BMP, as 99.5% of practical applications will only need the
BMP.

> For Ada, numerous changes would be required if we want to expose the
> UTF-16 representation to programmers, for example by declaring
> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
> longer be arrays of characters indexed by position).

Changes to Wide_Character and Wide_String are pretty much out of the
question.  On the other hand, the type that is intended for interfacing with
C is Interfaces.C.wchar_array, and it would be straightforward to provide
(in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
compiler does).

> I really hope we can use this approach (UTF-32
> internal representation) for Ada, as it simplifies things
> considerably, especially if we want to add character properties
> support (see below).

I would think that we would want to use UCS-4, since it's an ISO standard.
Moreover, UTF-32 has a number of consistency rules (eg code points below
16#10ffff#) which seem irrelevant for internal manipulation of strings.

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     pacakge Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

Wide_Wide_ types seem like the natural way to add this capability to the
language, except that some compilers may not be quite prepared to deal with
enumeration types with 2 ** 32 literals (ours isn't).

> (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

As far as I can tell, there is no support for UTF-16, only for UCS-2.
Anyway, I don't think it is reasonable to force applications to go to the
full 32-bit overhead just because they use, say, the french OE ligature.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use
> string literals where UTF-8 strings are expected.

External representation is best handled by Text_IO and friends, typically by
using a form parameter to specify the encoding (and there are many more
encodings than just UCS and UTF).  The ARG won't get into the business of
specifying the details of the form parameter, so this is something that will
remain non-portable for the foreseeable future.  (Where do we stop?  Do we
want to require all validated compilers to support UTF-8?  What about the
chinese Big5 or the JIS encodings?)

>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

We certainly don't want to get into that business.  The designers of Ada 95
wisely decided to lump all of the characters in the range 16#0100# ..
16#FFFD# into the category special_character, so that they don't have to
decide which is a letter, a number, etc.  Similarly they didn't provide
classification functions or upper/lower conversions for wide characters.
This seems reasonable if we don't want to have to amend Ada each time a
bunch of characters are added to 10646.

*************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002  7:31 PM

> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

I agree.

> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
> characters in a single 32 bit value, that is, UTF-32), and while this is
> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
> though), I really hope we can use this approach (UTF-32 internal
> representation) for Ada, as it simplifies things considerably, especially
> if we want to add character properties support (see below).

I agree very strongly!

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     pacakge Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

I must say I would prefer the identifiers Universal_Character and
Universal_String. I see the logic of Wide_Wide_ but it seems clumsy!

> I don't know if this is necessary.  IIRC, Robert Dewar once told that
> the only applications using Wide_Character are based on ASIS, where
> using Wide_Character is not really voluntarily.  Maybe it is possible
> to bump Wide_Character'Size to 32 bits instead, without really
> breaking backwards compatibility.

I disagree with this idea.

> Of course, we would need a way to converted UTF-32 strings to UTF-16
> strings and vice versa (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

Possibly these support packages should be in an optional annex.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use string
> literals where UTF-8 strings are expected.
>
>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

Again, perhaps all this should really be in (or moved into) an optional annex.

*************************************************************

From: Robert Dewar
Sent: Wednesday, April 24, 2002  9:50 PM

I suspect that the work on wide_wide_character will in practice turn
out to be nearly useless in the short or medium term. We certainly
put in a lot of work in GNAT in implementing wide character with many
different representation schemes, but this feature has been very little
used (ASIS being the main use :-). In practice I think the 16-bit character
type defined in Ada now will be adequate for almost all use, and I see no
reason in requring implementations to go beyond this in the absence of
real market demand.

Yes, it's fun to talk about character set issues (after all I was chair of
the CRG, so I appreciate this), but there is no point in increasing
implementation burdens unless it's really valuable.

I would just give clear permission for an implementation to add additional
character types in standard (indeed that permission exists today in Ada 95),
and leave it at that.

*************************************************************

From: John Barnes
Sent: Thursday, April 25, 2002  1:46 AM

The BSI is looking at character set issues across languages
and your message reminded me of the CRG. Was there ever a
final report that I could refer to?

*************************************************************

From: Robert Dewar
Sent: Thursday, April 26, 2002  10:25 PM

I think there was a final report, perhaps Jim could track it down.

*************************************************************

From: Randy Brukardt
Sent: Thursday, April 25, 2002  3:44 PM

> We certainly put in a lot of work in GNAT in implementing wide
> character with many different representation schemes, but this
> feature has been very little used (ASIS being the main use :-).

To add another data point: Claw was designed so that a wide character version
could be easily created. But we've never implemented that version, mainly
because we've never had a paying customer ask for it. So I have to wonder how
important "Really_Wide_Character" would be.

*************************************************************



Questions? Ask the ACAA Technical Agent