CVS difference for ais/ai-00285.txt

Differences between 1.1 and version 1.2

--- ais/ai-00285.txt	2002/01/24 04:54:13	1.1
+++ ais/ai-00285.txt	2002/04/26 20:15:17	1.2
@@ -214,3 +214,295 @@
 
 *************************************************************
 
+From: Florian Weimer
+Sent: Saturday, April 20, 2002  3:18 AM
+
+ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
+and 10646-2:2001 allocates characters outside the Basic Multilingual
+Plane.
+
+Not too long ago, quite a few people assumed that characters beyond
+the BMP would be interesting only for rather esoteric scholarly use
+(Linear B is a perfect example).  However, we now have at least two
+different sets of code positions outside the BMP which will see more
+widespread use eventually: the mathematical alphabets and Plane 14
+Language Tags (which are required to make some Japanese people happy
+who fear that Japanese characters are rendered using Chinese glyphs).
+
+Therefore, I think Ada 200X should somehow support characters outside
+the BMP.
+
+A few random thoughts (sorry, I'm probably not using strict ISO 10646
+terminology):
+
+  * Several major vendors have adopted ISO 10646-1:1993 early, using a
+    16 bit representation for characters (i.e. wchar_t in C is 16
+    bits).
+
+These vendors include Sun (Java) and Microsoft (Windows), and probably
+most proprietary UNIX vendors.  These vendor implementations now cover
+the code positions beyond the BMP using UTF-16, which uses surrogate
+pairs (a single character is represented using two 16 bit values from
+reserved ranges in the BMP).
+
+UTF-16 has got a few drawbacks: the ordering (in terms of UCS code
+positions) is no longer lexicographic (which leads us to such brain
+damage as CESU-8), dealing with individual characters is complicated,
+and you cannot implement the C wide character functions properly.
+
+For Ada, numerous changes would be required if we want to expose the
+UTF-16 representation to programmers, for example by declaring
+Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
+longer be arrays of characters indexed by position).
+
+GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
+characters in a single 32 bit value, that is, UTF-32), and while this
+is certainly not the "industry standard" (it is encouraged by ISO
+9899:1999, though), I really hope we can use this approach (UTF-32
+internal representation) for Ada, as it simplifies things
+considerably, especially if we want to add character properties
+support (see below).
+
+  * We could add Wide_Wide_Character and Wide_Wide_String types to
+    package Standard (and extending the Ada.Strings hierarchy), which
+    are encoded in UTF-32.
+
+I don't know if this is necessary.  IIRC, Robert Dewar once said that
+the only applications using Wide_Character are based on ASIS, where
+using Wide_Character is not really voluntary.  Maybe it is possible
+to bump Wide_Character'Size to 32 bits instead, without really
+breaking backwards compatibility.
+
+Of course, we would need a way to convert UTF-32 strings to UTF-16
+strings and vice versa (the UTF-16 string type could become a
+second-class citizen, though, without full support in the Ada.Strings
+hierarchy).
+
+  * External representation of UCS characters is rapidly moving
+    towards UTF-8 (especially in Internet standards).
+
+Ada should provide an interface for converting between the wide string
+type(s) and UTF-8 octet sequences.  It should be possible to use
+string literals where UTF-8 strings are expected.
+
+  * Supporting higher levels of Unicode (e.g. accessing the character
+    properties database, normalization forms) would be interesting,
+    too.
+
+Such documents will eventually follow in the ISO 10646 series, but I
+don't know if the ISO standard will be ready for Ada 200X.  Currently,
+only the Unicode Consortium has standardized or documented issues like
+character properties or terminal behavior in detail.
+
+I don't know how ISO reacts if ISO standards refer to competing
+standardization efforts.  IEEE POSIX.1 (and probably, or already, ISO
+POSIX) standardizes the BSD sockets interface, and not OSI, so maybe
+this isn't an issue.
+
+In any case, this point is mostly a library issue which can be
+addressed by a community implementation effort; unlike adding
+Wide_Wide_Character, for example, it does not require changes in the
+Ada language.
+
+*************************************************************
+
+From: Pascal Leroy
+Sent: Monday, April 22, 2002  8:32 AM
+
+> ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
+> and 10646-2:2001 allocates characters outside the Basic Multilingual
+> Plane.
+>
+> Therefore, I think Ada 200X should somehow support characters outside
+> the BMP.
+
+The normalization of new character sets (both as part of 10646 and of 8859)
+was actually discussed at the last ARG meeting, and I was given an action
+item to somehow integrate them in the language, probably as some kind of
+amendment AI.
+
+> A few random thoughts (sorry, I'm probably not using strict ISO 10646
+> terminology):
+>
+>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
+>     16 bit representation for characters (i.e. wchar_t in C is 16
+>     bits).
+
+Which is fine as it maps directly to Ada's wide character.  I still think
+that we want to retain the capacity of using 16-bit blobs to represent
+characters in the BMP, as 99.5% of practical applications will only need the
+BMP.
+
+> For Ada, numerous changes would be required if we want to expose the
+> UTF-16 representation to programmers, for example by declaring
+> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
+> longer be arrays of characters indexed by position).
+
+Changes to Wide_Character and Wide_String are pretty much out of the
+question.  On the other hand, the type that is intended for interfacing with
+C is Interfaces.C.wchar_array, and it would be straightforward to provide
+(in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
+Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
+compiler does).
+
+> I really hope we can use this approach (UTF-32
+> internal representation) for Ada, as it simplifies things
+> considerably, especially if we want to add character properties
+> support (see below).
+
+I would think that we would want to use UCS-4, since it's an ISO standard.
+Moreover, UTF-32 has a number of consistency rules (e.g. code points below
+16#10ffff#) which seem irrelevant for internal manipulation of strings.
+
+>   * We could add Wide_Wide_Character and Wide_Wide_String types to
+>     package Standard (and extending the Ada.Strings hierarchy), which
+>     are encoded in UTF-32.
+
+Wide_Wide_ types seem like the natural way to add this capability to the
+language, except that some compilers may not be quite prepared to deal with
+enumeration types with 2 ** 32 literals (ours isn't).
+
+> (the UTF-16 string type could become a
+> second-class citizen, though, without full support in the Ada.Strings
+> hierarchy).
+
+As far as I can tell, there is no support for UTF-16, only for UCS-2.
+Anyway, I don't think it is reasonable to force applications to go to the
+full 32-bit overhead just because they use, say, the French OE ligature.
+
+>   * External representation of UCS characters is rapidly moving
+>     towards UTF-8 (especially in Internet standards).
+>
+> Ada should provide an interface for converting between the wide string
+> type(s) and UTF-8 octet sequences.  It should be possible to use
+> string literals where UTF-8 strings are expected.
+
+External representation is best handled by Text_IO and friends, typically by
+using a form parameter to specify the encoding (and there are many more
+encodings than just UCS and UTF).  The ARG won't get into the business of
+specifying the details of the form parameter, so this is something that will
+remain non-portable for the foreseeable future.  (Where do we stop?  Do we
+want to require all validated compilers to support UTF-8?  What about the
+Chinese Big5 or the JIS encodings?)
+
+>   * Supporting higher levels of Unicode (e.g. accessing the character
+>     properties database, normalization forms) would be interesting,
+>     too.
+
+We certainly don't want to get into that business.  The designers of Ada 95
+wisely decided to lump all of the characters in the range 16#0100# ..
+16#FFFD# into the category special_character, so that they don't have to
+decide which is a letter, a number, etc.  Similarly they didn't provide
+classification functions or upper/lower conversions for wide characters.
+This seems reasonable if we don't want to have to amend Ada each time a
+bunch of characters are added to 10646.
+
+*************************************************************
+
+From: Nick Roberts
+Sent: Wednesday, April 24, 2002  7:31 PM
+
+> Therefore, I think Ada 200X should somehow support characters outside
+> the BMP.
+
+I agree.
+
+> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
+> characters in a single 32 bit value, that is, UTF-32), and while this is
+> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
+> though), I really hope we can use this approach (UTF-32 internal
+> representation) for Ada, as it simplifies things considerably, especially
+> if we want to add character properties support (see below).
+
+I agree very strongly!
+
+>   * We could add Wide_Wide_Character and Wide_Wide_String types to
+>     package Standard (and extending the Ada.Strings hierarchy), which
+>     are encoded in UTF-32.
+
+I must say I would prefer the identifiers Universal_Character and
+Universal_String. I see the logic of Wide_Wide_ but it seems clumsy!
+
+> I don't know if this is necessary.  IIRC, Robert Dewar once said that
+> the only applications using Wide_Character are based on ASIS, where
+> using Wide_Character is not really voluntary.  Maybe it is possible
+> to bump Wide_Character'Size to 32 bits instead, without really
+> breaking backwards compatibility.
+
+I disagree with this idea.
+
+> Of course, we would need a way to convert UTF-32 strings to UTF-16
+> strings and vice versa (the UTF-16 string type could become a
+> second-class citizen, though, without full support in the Ada.Strings
+> hierarchy).
+
+Possibly these support packages should be in an optional annex.
+
+>   * External representation of UCS characters is rapidly moving
+>     towards UTF-8 (especially in Internet standards).
+>
+> Ada should provide an interface for converting between the wide string
+> type(s) and UTF-8 octet sequences.  It should be possible to use string
+> literals where UTF-8 strings are expected.
+>
+>   * Supporting higher levels of Unicode (e.g. accessing the character
+>     properties database, normalization forms) would be interesting,
+>     too.
+
+Again, perhaps all this should really be in (or moved into) an optional annex.
+
+*************************************************************
+
+From: Robert Dewar
+Sent: Wednesday, April 24, 2002  9:50 PM
+
+I suspect that the work on wide_wide_character will in practice turn
+out to be nearly useless in the short or medium term. We certainly
+put in a lot of work in GNAT in implementing wide character with many
+different representation schemes, but this feature has been very little
+used (ASIS being the main use :-). In practice I think the 16-bit character
+type defined in Ada now will be adequate for almost all use, and I see no
+reason to require implementations to go beyond this in the absence of
+real market demand.
+
+Yes, it's fun to talk about character set issues (after all I was chair of
+the CRG, so I appreciate this), but there is no point in increasing
+implementation burdens unless it's really valuable.
+
+I would just give clear permission for an implementation to add additional
+character types in standard (indeed that permission exists today in Ada 95),
+and leave it at that.
+
+*************************************************************
+
+From: John Barnes
+Sent: Thursday, April 25, 2002  1:46 AM
+
+The BSI is looking at character set issues across languages
+and your message reminded me of the CRG. Was there ever a
+final report that I could refer to?
+
+*************************************************************
+
+From: Robert Dewar
+Sent: Thursday, April 26, 2002  10:25 PM
+
+I think there was a final report, perhaps Jim could track it down.
+
+*************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, April 25, 2002  3:44 PM
+
+> We certainly put in a lot of work in GNAT in implementing wide
+> character with many different representation schemes, but this
+> feature has been very little used (ASIS being the main use :-).
+
+To add another data point: Claw was designed so that a wide character version
+could be easily created. But we've never implemented that version, mainly
+because we've never had a paying customer ask for it. So I have to wonder how
+important "Really_Wide_Character" would be.
+
+*************************************************************
+
+
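The first message in the thread above says that UTF-16 represents a character
outside the BMP as a surrogate pair, and that UTF-16 ordering no longer matches
UCS code position order. The following is a minimal Python sketch of both
points. It is not part of the original thread; the helper name
to_utf16_units and the two sample code points (U+1D400 from the mathematical
alphabets, U+FF21 from the BMP) are chosen here purely for illustration.

```python
import struct


def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one UCS code point.

    Hypothetical helper written for this illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]              # BMP: one 16-bit unit
    offset = code_point - 0x10000        # 20 bits left to encode
    high = 0xD800 + (offset >> 10)       # high surrogate: upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # low surrogate: lower 10 bits
    return [high, low]


MATH_BOLD_A = 0x1D400   # outside the BMP (mathematical alphabets)
FULLWIDTH_A = 0xFF21    # inside the BMP, above the surrogate range

pair = to_utf16_units(MATH_BOLD_A)

# The hand-built pair agrees with Python's own UTF-16 encoder.
assert struct.pack(">2H", *pair) == chr(MATH_BOLD_A).encode("utf-16-be")

# Code position order and UTF-16 code unit order disagree ...
assert FULLWIDTH_A < MATH_BOLD_A
assert to_utf16_units(FULLWIDTH_A) > pair

# ... whereas UTF-8 byte order follows code position order for this pair.
assert chr(FULLWIDTH_A).encode("utf-8") < chr(MATH_BOLD_A).encode("utf-8")
```

With a 32-bit representation (UTF-32 or UCS-4), each character is a single
array element and comparison is a plain integer comparison, which is the
simplification the first message argues for.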
