Version 1.4 of ais/ai-00285.txt


!standard A.3.2(49)          02-09-24 AI95-00285/01
!class amendment 02-01-23
!status work item 02-09-24
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Latin-9, Ada.Characters.Handling, and 32-bit characters
!summary
!problem
Latin-9 has been introduced by ISO/IEC 8859-15:1999. Moreover, the working draft of ISO/IEC 10646:2003 makes use of planes other than the BMP.
!proposal
In order to support Latin-9, we allow an implementation to provide a package named Ada.Characters.Latin_9, but we strictly restrict its contents to correspond to the characters defined by ISO/IEC 8859-15:1999.
If an application chooses to use both Latin-1 and Latin-9, the package Ada.Characters.Handling is quite problematic, as it seems to assume that Character always corresponds to Latin-1, and it would get the classifications and the conversions wrong for some Latin-9 characters. We deal with this by recognizing that this package is really specific to the character encoding being used, and making it a child of the proper Latin_n package. A library-level renaming is provided for compatibility with existing applications.
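The Latin-1 assumption can be seen with a minimal sketch, assuming only the standard Ada.Characters.Handling package (the procedure name is illustrative). Position 188 is a vulgar fraction in Latin-1 but the capital OE in Latin-9, so the existing classification gives the wrong answer for Latin-9 text:

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;

procedure Latin_1_Assumption is
   --  Position 188 is VULGAR FRACTION ONE QUARTER in Latin-1,
   --  but the capital OE character in Latin-9.
   OE : constant Character := Character'Val (188);
begin
   --  Ada.Characters.Handling classifies by Latin-1 rules (A.3.2(24)),
   --  so this prints FALSE even when the data is really Latin-9.
   Ada.Text_IO.Put_Line
     ("Is_Letter (188) = "
      & Boolean'Image (Ada.Characters.Handling.Is_Letter (OE)));
end Latin_1_Assumption;
```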
The constant Character_Set in Ada.Strings.Wide_Maps.Wide_Constants is similarly problematic as it is defined by reference to Ada.Characters.Handling. We define both the Latin-1 and the Latin-9 sets in Wide_Constants, and provide a renaming for compatibility. <<We should probably give permission to add more constants if an implementation wants to support other character encodings.>>
ISO/IEC 8859-15 defines as a letter and as a ligature. There seems to be no reason to make a distinction between the two, so for the purposes of Ada.Characters.Latin_9.Handling, we classify both as letters.
Note that we are not proposing to change the lexical rules of the language, so it's still the case that only characters from row 00 of the BMP are allowed in identifiers (row 00 of the BMP is not affected by Latin-9).
In order to support 32-bit characters, we allow an implementation to add new declarations to Standard. If it does, it must provide the appropriate predefined units for 32-bit characters, and new attributes to convert discrete values to and from 32-bit strings.
Again, we are not proposing to change the lexical rules of the language, so the character and string literals appearing in the program can only make use of the graphic symbols from the BMP. The characters from other planes cannot be represented in literals; they must be obtained by evaluating more complex expressions; for instance, by evaluating Wide_Wide_Character'Val (16#1001B#) it is possible to access the Linear B syllable NI.
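As a sketch of the 'Val approach, assuming an implementation that provides Wide_Wide_Character and Wide_Wide_String in Standard as permitted by this proposal (the procedure name is illustrative):

```ada
with Ada.Text_IO;

procedure Non_BMP_Character is
   --  Linear B syllable NI, obtained by position because it cannot
   --  appear in a character literal.
   NI : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1001B#);

   --  BMP characters can still be written as literals and mixed
   --  with computed characters.
   S : constant Wide_Wide_String := "NI: " & NI;
begin
   Ada.Text_IO.Put_Line
     ("Position =" & Integer'Image (Wide_Wide_Character'Pos (NI)));
   Ada.Text_IO.Put_Line ("Length   =" & Integer'Image (S'Length));
end Non_BMP_Character;
```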
!wording
An implementation is allowed to provide a library package named Ada.Characters.Latin_9. This package shall be identical to Ada.Characters.Latin_1, except for the following differences:
- It doesn't declare the constants Currency_Sign, Broken_Bar, Diaeresis, Acute, Cedilla, Fraction_One_Quarter, Fraction_One_Half, and Fraction_Three_Quarters.
- It declares the following constants:
    Euro_Sign       : constant Character := ''; -- Character'Val (164)
    UC_S_Caron      : constant Character := ''; -- Character'Val (166)
    LC_S_Caron      : constant Character := ''; -- Character'Val (168)
    UC_Z_Caron      : constant Character := ''; -- Character'Val (180)
    LC_Z_Caron      : constant Character := ''; -- Character'Val (184)
    UC_OE_Diphthong : constant Character := ''; -- Character'Val (188)
    LC_OE_Diphthong : constant Character := ''; -- Character'Val (189)
    UC_Y_Diaeresis  : constant Character := ''; -- Character'Val (190)
In addition, an implementation which provides Ada.Characters.Latin_9 must provide two library packages named Ada.Characters.Latin_1.Handling and Ada.Characters.Latin_9.Handling, respectively. Ada.Characters.Latin_1.Handling must have the same contents and semantics as the package Ada.Characters.Handling defined in section A.3.2 of ISO/IEC 8652:1995 with COR.1:2000 (except of course for the library unit name). For compatibility with existing applications, the following library-level renaming must also be provided:
package Ada.Characters.Handling renames
Ada.Characters.Latin_1.Handling;
The package Ada.Characters.Latin_9.Handling has the same specification as Ada.Characters.Latin_1.Handling, but the following semantic differences:
- The function Is_Letter returns True for the characters at positions 166, 168, 180, 184, 188, 189 and 190 (in addition to those for which it is defined to return True in A.3.2(24)).
- The function Is_Lower returns True for the characters at positions 168, 184 and 189 (in addition to those for which it is defined to return True in A.3.2(25)).
- The function Is_Upper returns True for the characters at positions 166, 180, 188 and 190 (in addition to those for which it is defined to return True in A.3.2(26)).
- The function Is_Basic returns True for the characters at positions 188 and 189 (in addition to those for which it is defined to return True in A.3.2(27)).
- The upper-case form of '' is '' for the purposes of function To_Upper.
- The function Is_Character returns True if the Wide_Character Item has a name in ISO/IEC 10646 which is the name of some Character in ISO/IEC 8859-15.
- The function To_Character returns the Character which has the same name in ISO/IEC 8859-15 as the Wide_Character Item in ISO/IEC 10646.
- The function To_Wide_Character returns the Wide_Character which has the same name in ISO/IEC 10646 as the Character Item in ISO/IEC 8859-15.
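The rules above can be sketched as follows. Is_Letter_9 and To_Wide_9 are hypothetical stand-ins for the proposed Ada.Characters.Latin_9.Handling operations, not an existing API. The ISO/IEC 10646 code points are those of the eight characters that differ between Latin-1 and Latin-9; every other position is assumed to map to the same position in row 00, as in Latin-1.

```ada
with Ada.Text_IO;
with Ada.Characters.Handling;

procedure Latin_9_Sketch is

   --  Hypothetical Is_Letter for Latin-9: the Latin-1 answer plus the
   --  seven positions listed in the wording.
   function Is_Letter_9 (Item : Character) return Boolean is
      Pos : constant Natural := Character'Pos (Item);
   begin
      case Pos is
         when 166 | 168 | 180 | 184 | 188 | 189 | 190 =>
            return True;
         when others =>
            return Ada.Characters.Handling.Is_Letter (Item);
      end case;
   end Is_Letter_9;

   --  Hypothetical To_Wide_Character for Latin-9: match by name.
   function To_Wide_9 (Item : Character) return Wide_Character is
      Pos : constant Natural := Character'Pos (Item);
   begin
      case Pos is
         when 164 => return Wide_Character'Val (16#20AC#);  --  euro sign
         when 166 => return Wide_Character'Val (16#0160#);  --  S caron
         when 168 => return Wide_Character'Val (16#0161#);  --  s caron
         when 180 => return Wide_Character'Val (16#017D#);  --  Z caron
         when 184 => return Wide_Character'Val (16#017E#);  --  z caron
         when 188 => return Wide_Character'Val (16#0152#);  --  capital OE
         when 189 => return Wide_Character'Val (16#0153#);  --  small oe
         when 190 => return Wide_Character'Val (16#0178#);  --  Y diaeresis
         when others =>
            --  Same position as in Latin-1, i.e. row 00 of the BMP
            return Wide_Character'Val (Pos);
      end case;
   end To_Wide_9;

begin
   Ada.Text_IO.Put_Line
     ("Is_Letter_9 (189): "
      & Boolean'Image (Is_Letter_9 (Character'Val (189))));
   Ada.Text_IO.Put_Line
     ("To_Wide_9 (164):"
      & Integer'Image
          (Wide_Character'Pos (To_Wide_9 (Character'Val (164)))));
end Latin_9_Sketch;
```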
The declaration of Character_Set in Ada.Strings.Wide_Maps.Wide_Constants is removed. It is replaced by:
    Latin_1_Character_Set : constant Wide_Maps.Wide_Character_Set;
    Latin_9_Character_Set : constant Wide_Maps.Wide_Character_Set;
    Character_Set : Wide_Maps.Wide_Character_Set renames Latin_1_Character_Set;
An implementation is allowed to add the following declarations to package Standard:
    type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF, 00010000, ..., 7FFFFFFF);
    type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
    pragma Pack (Wide_Wide_String);
The type Wide_Wide_Character has 2 ** 31 values. Its first 2 ** 16 positions must have the same contents as type Wide_Character.
If an implementation provides these two types, it must also provide the following packages:
    Ada.Strings.Wide_Wide_Bounded
    Ada.Strings.Wide_Wide_Fixed
    Ada.Strings.Wide_Wide_Maps
    Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants
    Ada.Strings.Wide_Wide_Unbounded
    Ada.Wide_Wide_Text_IO
These packages are similar to their Wide_ equivalents, with Wide_Wide_ substituted for Wide_ everywhere. In addition the following declaration is present in Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants:
Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
It contains each Wide_Wide_Character value in the BMP of ISO/IEC 10646.
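A set with the same contents can be sketched by hand, assuming that Ada.Strings.Wide_Wide_Maps mirrors Ada.Strings.Wide_Maps as described above. The names BMP and BMP_Set_Sketch are illustrative only.

```ada
with Ada.Text_IO;
with Ada.Strings.Wide_Wide_Maps;

procedure BMP_Set_Sketch is
   use Ada.Strings.Wide_Wide_Maps;

   --  Every position from 0 to 16#FFFF#, i.e. the whole BMP
   BMP : constant Wide_Wide_Character_Set :=
     To_Set
       (Wide_Wide_Character_Range'
          (Low  => Wide_Wide_Character'Val (0),
           High => Wide_Wide_Character'Val (16#FFFF#)));
begin
   --  The euro sign lies in the BMP; Linear B syllable NI does not.
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Wide_Wide_Character'Val (16#20AC#), BMP)));
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Wide_Wide_Character'Val (16#1001B#), BMP)));
end BMP_Set_Sketch;
```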
The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width must also be provided. Their definition is similar to that of Wide_Image, Wide_Value and Wide_Width, respectively, with Wide_Character and Wide_String replaced by Wide_Wide_Character and Wide_Wide_String.
The semantics of Wide_Image are modified as follows: the image has the same sequence of graphic characters as that defined for S'Wide_Wide_Image if all the graphic characters are defined in Wide_Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).
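A short sketch of the three attributes, assuming an implementation that provides them along with Ada.Wide_Wide_Text_IO (the type Colour is illustrative):

```ada
with Ada.Wide_Wide_Text_IO;

procedure Image_Sketch is
   type Colour is (Red, Green, Blue);

   --  Same rules as 'Wide_Image, but yielding a Wide_Wide_String
   Img : constant Wide_Wide_String := Colour'Wide_Wide_Image (Green);

   --  The literal is resolved as a Wide_Wide_String here
   Val : constant Colour := Colour'Wide_Wide_Value ("BLUE");
begin
   Ada.Wide_Wide_Text_IO.Put_Line (Img);
   Ada.Wide_Wide_Text_IO.Put_Line (Colour'Wide_Wide_Image (Val));

   --  Maximum image length over all values of the subtype
   Ada.Wide_Wide_Text_IO.Put_Line
     (Integer'Wide_Wide_Image (Colour'Wide_Wide_Width));
end Image_Sketch;
```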
!discussion
See proposal.
!example
!ACATS test
!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002  4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a
variant of the Latin 1 character set that is designated Latin 9.

A web page describing Latin 9 can be viewed at:

  http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor
differences between Latin 1 and Latin 9:

  ISO Latin 9 as compared with ISO Latin 1

  The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
  ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
  sign and some national letters used e.g. in French and Finnish have been
  introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9,
analogous to Ada.Characters.Latin_1, to define character constants for this
new character set.

Robert Dewar asked me to post the following remarks from him
re Latin-9 and Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official
names of all characters, and I have copied its abbreviated naming style
for the new characters in Latin-9.

I have a gripe with the RM here. The setup for Ada.Characters.Latin_1 is
to have separate packages for separate character sets, which makes perfectly
good sense:

27   An implementation may provide additional packages as children of
Ada.Characters, to declare names for the symbols of the local character set
or other character sets.

But for Characters.Handling, we have the odd statement:

49   If an implementation provides a localized definition of Character or
Wide_Character, then the effects of the subprograms in Characters.Handling
should reflect the localizations.  See also 3.5.2.

which implies that some mysterious transformation happens on this package
(under what circumstances?) I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd
for the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program,
whereas the design for Ada.Characters.Latin_1 etc seems to accommodate this.

My recommendation is that an implementation generate separate packages,
called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002  5:05 PM

>   The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
>   ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
>   sign and some national letters used e.g. in French and Finnish have been
>   introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and that
we now can write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002  1:44 AM

Better put that on the agenda for the next ARG. Ada 2005
should use Latin 9 rather than Latin 1.  A minor change.
Might be a few incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002  12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to
write French makes it look very interesting to me.

On the other hand, it is not too useful for Ada to support Latin 9 if the
OSes don't: if I emit the character OE and it prints out as 1/4 on my screen,
I didn't gain much.

So while I agree that we should consider supporting Latin 9 _in_addition_ to
Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I
am ready to bet that we will still have Latin 1 OSes ten years from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002  1:33 AM

It was somewhat of a jokey suggestion as I am sure you are aware.

Indeed I had a big problem when writing my book and
displaying the type Character. I wrote it in QuarkXpress on
a PC and it was fine. The publishers moved it to a Mac
before printing and some characters came out wrong.  One of
them came out as a picture of an apple. Moreover, someone
had bitten a lump out of it. So much for standards I
thought.

But supporting Latin-9 would be nice. All those adverts on
the Paris Metro for eating an oeuf can then be printed
properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002  1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual
where type Character lives.  I think Randy had some trouble with the
updated RM, too.  At least we didn't try to show type Wide_Character in
its full glory.  ;-)

7-bit ascii will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002  2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose
from.  ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
> being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

That makes sense.

But I think the RM statement you complain about is envisioning a
nonstandard version of Standard.[Wide_]Character, which is a separate
issue.  I don't see that as a big deal -- if you don't think it's a good
idea, don't implement any such thing.  I tend to agree that compiler
switches and the like shouldn't normally be meddling with the semantics
of packages Standard and Characters.Handling without a very good reason.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> But I think the RM statement you complain about is envisioning a
> nonstandard version of Standard.[Wide_]Character, which is a separate
> issue.

If you use Latin 9 for Standard.Character, this is certainly a
non-standard version, and Ada.Characters.Handling has to be modified
to remain useful.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> Better put that on the agenda for the next ARG. Ada 2005
> should use Latin 9 rather than Latin 1.  A minor change.
> Might be a few incompatibilities.

I disagree.  With Latin 9, the mapping from Character to
Wide_Character is less straightforward, and this could have unexpected
results.

OTOH, it seems that Wide_Character is not widely used (unless you are
forced to do so by ASIS), so this might not matter much.

In addition, we really should add Wide_Wide_Character (which covers
the sixteen additional planes), or make Wide_Character itself wider.
Otherwise, using Unicode with standard Ada will be rather painful.

*************************************************************

From: Florian Weimer
Sent: Saturday, April 20, 2002  3:18 AM

ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
and 10646-2:2001 allocates characters outside the Basic Multilingual
Plane.

Not too long ago, quite a few people assumed that characters beyond
the BMP would be interesting only for rather esoteric scholarly use
(Linear B is a perfect example).  However, we now have got at least
two different sets of code positions outside the BMP which will see more
widespread use eventually: the mathematical alphabets and Plane 14
Language Tags (which are required to make some Japanese people happy
who fear that Japanese characters are rendered using Chinese glyphs).

Therefore, I think Ada 200X should somehow support characters outside
the BMP.

A few random thoughts (sorry, I'm probably not using strict ISO 10646
terminology):

  * Several major vendors have adopted ISO 10646-1:1993 early, using a
    16 bit representation for characters (i.e. wchar_t in C is 16
    bits).

These vendors include Sun (Java) and Microsoft (Windows), and probably
most proprietary UNIX vendors.  These vendor implementations now cover
the code positions beyond the BMP using UTF-16, which uses surrogate
pairs (a single character is represented using two 16 bit values from
reserved ranges in the BMP).
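
[Editorial sketch, not part of the original message.] The surrogate-pair scheme described above can be illustrated as follows. The type and function names are hypothetical; the constants 16#D800# and 16#DC00# are the starts of the two reserved ranges.

```ada
with Ada.Text_IO;

procedure Surrogate_Sketch is
   type Unit_16 is mod 2 ** 16;

   --  Code positions that need a surrogate pair in UTF-16
   subtype Supplementary is Natural range 16#1_0000# .. 16#10_FFFF#;

   --  Top 10 bits of the 20-bit offset, placed in the high range
   function High_Surrogate (Code : Supplementary) return Unit_16 is
   begin
      return Unit_16 (16#D800# + (Code - 16#1_0000#) / 2 ** 10);
   end High_Surrogate;

   --  Bottom 10 bits of the 20-bit offset, placed in the low range
   function Low_Surrogate (Code : Supplementary) return Unit_16 is
   begin
      return Unit_16 (16#DC00# + (Code - 16#1_0000#) mod 2 ** 10);
   end Low_Surrogate;
begin
   --  Linear B syllable NI, position 16#1001B#
   Ada.Text_IO.Put_Line
     (Unit_16'Image (High_Surrogate (16#1001B#))
      & Unit_16'Image (Low_Surrogate (16#1001B#)));
end Surrogate_Sketch;
```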

UTF-16 has got a few drawbacks: the ordering (in terms of UCS code
positions) is no longer lexicographic (which leads us to such brain
damage as CESU-8), dealing with individual characters is complicated,
and you cannot implement the C wide character functions properly.

For Ada, numerous changes would be required if we want to expose the
UTF-16 representation to programmers, for example by declaring
Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
longer be arrays of characters indexed by position).

GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
characters in a single 32 bit value, that is, UTF-32), and while this
is certainly not the "industry standard" (it is encouraged by ISO
9899:1999, though), I really hope we can use this approach (UTF-32
internal representation) for Ada, as it simplifies things
considerably, especially if we want to add character properties
support (see below).

  * We could add Wide_Wide_Character and Wide_Wide_String types to
    package Standard (and extending the Ada.Strings hierarchy), which
    are encoded in UTF-32.

I don't know if this is necessary.  IIRC, Robert Dewar once told that
the only applications using Wide_Character are based on ASIS, where
using Wide_Character is not really voluntary.  Maybe it is possible
to bump Wide_Character'Size to 32 bits instead, without really
breaking backwards compatibility.

Of course, we would need a way to convert UTF-32 strings to UTF-16
strings and vice versa (the UTF-16 string type could become a
second-class citizen, though, without full support in the Ada.Strings
hierarchy).

  * External representation of UCS characters is rapidly moving
    towards UTF-8 (especially in Internet standards).

Ada should provide an interface for converting between the wide string
type(s) and UTF-8 octet sequences.  It should be possible to use
string literals where UTF-8 strings are expected.
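
[Editorial sketch, not part of the original message.] An interface of this kind was later standardized in Ada 2012 as Ada.Strings.UTF_Encoding and its children. The sketch below assumes an Ada 2012 compiler and uses that package; it is not something available under this proposal.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF_8_Sketch is
   --  Euro sign (in the BMP) followed by Linear B syllable NI (outside it)
   Text : constant Wide_Wide_String :=
     Wide_Wide_Character'Val (16#20AC#)
     & Wide_Wide_Character'Val (16#1001B#);

   --  Encode to UTF-8 octets, without a byte order mark by default
   Octets : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text);
begin
   --  Three octets for the euro sign plus four for the non-BMP character
   Ada.Text_IO.Put_Line ("Octets:" & Integer'Image (Octets'Length));
end UTF_8_Sketch;
```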

  * Supporting higher levels of Unicode (e.g. accessing the character
    properties database, normalization forms) would be interesting,
    too.

Such documents will eventually follow in the ISO 10646 series, but I
don't know if the ISO standard will be ready for Ada 200X.  Currently,
only the Unicode Consortium has standardized or documented issues like
character properties or terminal behavior in detail.

I don't know how ISO reacts if ISO standards refer to competing
standardization efforts.  IEEE POSIX.1 (and probably, or already, ISO
POSIX) standardizes the BSD sockets interface, and not OSI, so maybe
this isn't an issue.

In any case, this point is mostly a library issue which can be
addressed by a community implementation effort, it does not require
changes in the Ada language (adding Wide_Wide_Character does, for
example).

*************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002  8:32 AM

> ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
> and 10646-2:2001 allocates characters outside the Basic Multilingual
> Plane.
>
> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

The normalization of new character sets (both as part of 10646 and of 8859)
was actually discussed at the last ARG meeting, and I was given an action
item to somehow integrate them in the language, probably as some kind of
amendment AI.

> A few random thoughts (sorry, I'm probably not using strict ISO 10646
> terminology):
>
>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>     16 bit representation for characters (i.e. wchar_t in C is 16
>     bits).

Which is fine as it maps directly to Ada's wide character.  I still think
that we want to retain the capacity of using 16-bit blobs to represent
characters in the BMP, as 99.5% of practical applications will only need the
BMP.

> For Ada, numerous changes would be required if we want to expose the
> UTF-16 representation to programmers, for example by declaring
> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
> longer be arrays of characters indexed by position).

Changes to Wide_Character and Wide_String are pretty much out of the
question.  On the other hand, the type that is intended for interfacing with
C is Interfaces.C.wchar_array, and it would be straightforward to provide
(in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
compiler does).

> I really hope we can use this approach (UTF-32
> internal representation) for Ada, as it simplifies things
> considerably, especially if we want to add character properties
> support (see below).

I would think that we would want to use UCS-4, since it's an ISO standard.
Moreover, UTF-32 has a number of consistency rules (eg code points below
16#10ffff#) which seem irrelevant for internal manipulation of strings.

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     package Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

Wide_Wide_ types seem like the natural way to add this capability to the
language, except that some compilers may not be quite prepared to deal with
enumeration types with 2 ** 32 literals (ours isn't).

> (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

As far as I can tell, there is no support for UTF-16, only for UCS-2.
Anyway, I don't think it is reasonable to force applications to go to the
full 32-bit overhead just because they use, say, the French OE ligature.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use
> string literals where UTF-8 strings are expected.

External representation is best handled by Text_IO and friends, typically by
using a form parameter to specify the encoding (and there are many more
encodings than just UCS and UTF).  The ARG won't get into the business of
specifying the details of the form parameter, so this is something that will
remain non-portable for the foreseeable future.  (Where do we stop?  Do we
want to require all validated compilers to support UTF-8?  What about the
Chinese Big5 or the JIS encodings?)

>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

We certainly don't want to get into that business.  The designers of Ada 95
wisely decided to lump all of the characters in the range 16#0100# ..
16#FFFD# into the category special_character, so that they don't have to
decide which is a letter, a number, etc.  Similarly they didn't provide
classification functions or upper/lower conversions for wide characters.
This seems reasonable if we don't want to have to amend Ada each time a
bunch of characters are added to 10646.

*************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002  7:31 PM

> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

I agree.

> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
> characters in a single 32 bit value, that is, UTF-32), and while this is
> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
> though), I really hope we can use this approach (UTF-32 internal
> representation) for Ada, as it simplifies things considerably, especially
> if we want to add character properties support (see below).

I agree very strongly!

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     package Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

I must say I would prefer the identifiers Universal_Character and
Universal_String. I see the logic of Wide_Wide_ but it seems clumsy!

> I don't know if this is necessary.  IIRC, Robert Dewar once told that
> the only applications using Wide_Character are based on ASIS, where
> using Wide_Character is not really voluntary.  Maybe it is possible
> to bump Wide_Character'Size to 32 bits instead, without really
> breaking backwards compatibility.

I disagree with this idea.

> Of course, we would need a way to convert UTF-32 strings to UTF-16
> strings and vice versa (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

Possibly these support packages should be in an optional annex.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use string
> literals where UTF-8 strings are expected.
>
>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

Again, perhaps all this should really be in (or moved into) an optional annex.

*************************************************************

From: Robert Dewar
Sent: Wednesday, April 24, 2002  9:50 PM

I suspect that the work on wide_wide_character will in practice turn
out to be nearly useless in the short or medium term. We certainly
put in a lot of work in GNAT in implementing wide character with many
different representation schemes, but this feature has been very little
used (ASIS being the main use :-). In practice I think the 16-bit character
type defined in Ada now will be adequate for almost all use, and I see no
reason in requiring implementations to go beyond this in the absence of
real market demand.

Yes, it's fun to talk about character set issues (after all I was chair of
the CRG, so I appreciate this), but there is no point in increasing
implementation burdens unless it's really valuable.

I would just give clear permission for an implementation to add additional
character types in standard (indeed that permission exists today in Ada 95),
and leave it at that.

*************************************************************

From: John Barnes
Sent: Thursday, April 25, 2002  1:46 AM

The BSI is looking at character set issues across languages
and your message reminded me of the CRG. Was there ever a
final report that I could refer to?

*************************************************************

From: Robert Dewar
Sent: Thursday, April 26, 2002  10:25 PM

I think there was a final report, perhaps Jim could track it down.

*************************************************************

From: Randy Brukardt
Sent: Thursday, April 25, 2002  3:44 PM

> We certainly put in a lot of work in GNAT in implementing wide
> character with many different representation schemes, but this
> feature has been very little used (ASIS being the main use :-).

To add another data point: Claw was designed so that a wide character version
could be easily created. But we've never implemented that version, mainly
because we've never had a paying customer ask for it. So I have to wonder how
important "Really_Wide_Character" would be.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002  5:41 AM

> I suspect that the work on wide_wide_character will in practice turn
> out to be nearly useless in the short or medium term.

Using Ada for internationalized applications on GNU systems (using GNU
facilities) almost requires 32 bit Wide_Wide_Character support, since
GNU uses a 32 bit wchar_t internally.

(See a similar discussion on the GCC development list.)

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002  7:32 AM

We have seen zero demand for such functionality, so would not invest any time
at all in either design or implementation work here. If such a feature is
added to Ada, I would definitely suggest it be optional.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002  6:00 AM

>>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>>     16 bit representation for characters (i.e. wchar_t in C is 16
>>     bits).
>
> Which is fine as it maps directly to Ada's wide character.  I still think
> that we want to retain the capacity of using 16-bit blobs to represent
> characters in the BMP, as 99.5% of practical applications will only need the
> BMP.

Quite a few people have already changed their minds about the 99.5%
figure (mathematical characters and Plane 14 Language Tags being the
reason).  Maybe it's true for the character count, but I doubt it for
the application count.

> Changes to Wide_Character and Wide_String are pretty much out of the
> question.

Okay, accepted.

> On the other hand, the type that is intended for interfacing with
> C is Interfaces.C.wchar_array, and it would be straightforward to provide
> (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
> Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
> compiler does).

I doubt that C compilers can use UTF-16 for wchar_t.  You cannot apply
iswlower() to a single surrogate character. :-/

> I would think that we would want to use UCS-4, since it's an ISO standard.
> Moreover, UTF-32 has a number of consistency rules (eg code points below
> 16#10ffff#) which seem irrelevant for internal manipulation of strings.

Yes, UCS-4 is indeed the correct encoding form to use.

>>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>>     package Standard (and extending the Ada.Strings hierarchy), which
>>     are encoded in UTF-32.
>
> Wide_Wide_ types seem like the natural way to add this capability to the
> language, except that some compilers may not be quite prepared to deal with
> enumeration types with 2 ** 32 literals (ours isn't).

Ah, this could be a problem indeed, together with the large
universal_integer returned by Wide_Wide_Character'Pos.

>> (the UTF-16 string type could become a
>> second-class citizen, though, without full support in the Ada.Strings
>> hierarchy).
>
> As far as I can tell, there is no support for UTF-16, only for UCS-2.

At the moment, yes, but I think we need some UTF-16 support, too,
because many operating system interfaces use it.

> Anyway, I don't think it is reasonable to force applications to go to the
> full 32-bit overhead just because they use, say, the French OE ligature.

Most people apparently refuse to use Wide_Character, too, for the same
reason.  They either go for ISO 8859-15 or Windows 1252, or don't use
the OE ligature at all.

> External representation is best handled by Text_IO and friends, typically by
> using a form parameter to specify the encoding (and there are many more
> encodings than just UCS and UTF).

There was a recent discussion to add other I/O facilities.  UTF-8 is
becoming more and more common in the Internet context, and often, you
can determine the encoding of a file only after reading the first
couple of lines (think of a MIME-encoded mail message).  Furthermore,
UTF-8 already plays an important role in interacting with other
libraries (not written in Ada).

> (Where do we stop?  Do we want to require all validated compilers to
> support UTF-8?

Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

> What about the Chinese Big5 or the JIS encodings?)

If there is support for UCS-4, handling these encodings could be
performed by a mechanism similar to POSIX iconv().

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002  7:43 AM

> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

Why not = because there is no real demand. Especially this time around we need
to be very careful not to require things that no one is really interested in.
If we do this, the vendors will simply ignore any new standard. In fact I
think if there is a new standard, it will only be implemented as a result of
direct customer interest in features in this standard. The value of formal
conformance and validation has largely disappeared from the Ada marketplace
at this stage (in terms of customer demand). That's not to say that the Ada
marketplace is not very vital and dynamic, we get dozens of requests for
enhancements from our users every month, but there is precious little
intersection between the things users seem to need and want and this
kind of discussion.

In GNAT, we put a lot of effort into implementing multiple character sets
(we just added the new Latin set with the Euro symbol, because customers
needed that for example). Some of it has been useful (like this Euro
addition), but mostly these features are of entertainment and advertising
value only. In fact the only serious user that we have for Wide_Character
and Wide_String is us (from ASIS :-)

One thing to remember here is that very little is needed in the way of
language support for fancy character sets (most of the effort in GNAT
for example for 8-bit sets is in csets, which gives proper case mapping
for identifiers, and it is easy enough to add new tables to this -- someone
contributed a new Cyrillic table just a few months ago). Most of the issues
are representational issues, and the Ada standard has nothing to say about
source representation (and this should not change in any new standard).

*************************************************************

From: Pascal Leroy
Sent: Tuesday, May 21, 2002  4:03 AM

> > Which is fine as it maps directly to Ada's wide character.  I still think
> > that we want to retain the capacity of using 16-bit blobs to represent
> > characters in the BMP, as 99.5% of practical applications will only need the
> > BMP.
>
> Quite a few people have already changed their minds about the 99.5%
> figure (mathematical characters and Plane 14 Language being the
> reason).  Maybe it's true for the character count, but I doubt it for
> the application count.

Remember, we are talking Ada applications here.  There are probably many
applications out there that deal with mathematical symbols or with Tengwar, but
I doubt that they are written in Ada.

> > External representation is best handled by Text_IO and friends, typically by
> > using a form parameter to specify the encoding (and there are many more
> > encodings than just UCS and UTF).
>
> There was a recent discussion to add other I/O facilities.  UTF-8 is
> becoming more and more common in the Internet context, and often, you
> can determine the encoding of a file only after reading the first
> couple of lines (think of a MIME-encoded mail message).  Furthermore,
> UTF-8 already plays an important role in interacting with other
> libraries (not written in Ada).

Maybe we need a predefined unit to convert UCS-2 to/from UTF-8.  But then such
conversion functions could easily be written by the user, too, or provided by
some public domain stuff.
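
To give an idea of how small such a user-written conversion can be, here is a
minimal sketch of the UCS-2 to UTF-8 direction. It is an illustration only:
the names UCS2_To_UTF8_Demo, To_UTF8, Code_Value and Append are hypothetical
and not part of any predefined unit, and the sketch assumes that each
Wide_Character holds a single BMP code position (it does not treat the
surrogate range specially).

```ada
with Ada.Text_IO;

procedure UCS2_To_UTF8_Demo is

   --  Hypothetical local type: wide enough for any Wide_Character'Pos
   --  value and for the octet arithmetic below.
   type Code_Value is range 0 .. 16#FFFF#;

   --  Hypothetical helper, not a predefined unit: encodes a Wide_String
   --  holding UCS-2 (BMP) code positions as UTF-8 octets in a String.
   function To_UTF8 (Item : Wide_String) return String is
      --  A BMP character needs at most three octets in UTF-8.
      Result : String (1 .. 3 * Item'Length);
      Last   : Natural := 0;

      procedure Append (Octet : Code_Value) is
      begin
         Last := Last + 1;
         Result (Last) := Character'Val (Octet);
      end Append;

   begin
      for I in Item'Range loop
         declare
            Code : constant Code_Value := Wide_Character'Pos (Item (I));
         begin
            if Code < 16#80# then
               --  One octet: 0xxxxxxx
               Append (Code);
            elsif Code < 16#800# then
               --  Two octets: 110xxxxx 10xxxxxx
               Append (16#C0# + Code / 64);
               Append (16#80# + Code mod 64);
            else
               --  Three octets: 1110xxxx 10xxxxxx 10xxxxxx
               Append (16#E0# + Code / 4096);
               Append (16#80# + (Code / 64) mod 64);
               Append (16#80# + Code mod 64);
            end if;
         end;
      end loop;
      return Result (1 .. Last);
   end To_UTF8;

   --  16#0153# is LATIN SMALL LIGATURE OE.
   Encoded : constant String :=
     To_UTF8 ((1 => Wide_Character'Val (16#0153#)));

begin
   --  Print the resulting octets as decimal values.
   for I in Encoded'Range loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (Encoded (I))));
   end loop;
   Ada.Text_IO.New_Line;
end UCS2_To_UTF8_Demo;
```

For the OE ligature the sketch produces the two octets 16#C5# 16#93#, so the
cost in UTF-8 is two bytes rather than a full 32-bit character. The reverse
direction (UTF-8 to UCS-2) would additionally have to decide what to do with
malformed or over-long sequences, which is not shown here.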

> > (Where do we stop?  Do we want to require all validated compilers to
> > support UTF-8?
>
> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

You don't sell many compilers if you don't support 8859-1.  As for UCS-2, well,
that's pretty much the default representation of wide characters anyway.  Other
than that, it would seem that we should let the market decide.  Speaking for
Rational, we have had wide character support for about 7 years, and I don't
recall seeing a single bug report or request for enhancement on this topic.
This may indicate that our technology is perfect, but there are other
explanations ;-).  (As a matter of fact we probably have very few licenses
installed in countries where 8859-1 is not sufficient to write the native
language -- ignoring the problem with the OE ligature in French.)

One option would be to add Wide_Wide_Character in a new annex, and let users
decide if they want their vendors to support this annex. Of course, chances are
that nobody would care, in which case that would be a lot of standardization
effort for nothing.

*************************************************************

From: Robert Dewar
Sent: Tuesday, May 21, 2002  4:39 AM

I agree with everything Pascal had to say about wide character. We do have
one Japanese customer using wide characters, and as I mentioned earlier,
ASIS uses wide strings to represent source texts, but other than that,
we have heard very little about wide strings. The only real input we have
got from customers on character set issues was the request to support
Latin-9 with the new Euro symbol and we got contributed tables for
Cyrillic from a Russian enthusiast (not a customer, but it seemed a
harmless addition :-)

*************************************************************

From: Florian Weimer
Sent: Tuesday, May 21, 2002  1:42 PM

> I agree with everything Pascal had to say about wide character. We do have
> one Japanese customer using wide characters, and as I mentioned earlier,
> ASIS uses wide strings to represent source texts, but other than that,
> we have heard very little about wide strings.

I guess this customer doesn't use Wide_Character in the way it was
intended (for storing ISO 10646 code positions), so this example is a
bit dubious.

> The only real input we have got from customers on character set
> issues was the request to support Latin-9 with the new Euro symbol

Even in this rather innocent case, Wide_Character is no longer using
UCS-2 with GNAT.

*************************************************************


