CVS difference for ais/ai-00285.txt

Differences between version 1.2 and version 1.3

--- ais/ai-00285.txt	2002/04/26 20:15:17	1.2
+++ ais/ai-00285.txt	2002/05/25 03:42:19	1.3
@@ -505,4 +505,236 @@
 
 *************************************************************
 
+From: Florian Weimer
+Sent: Saturday, May 18, 2002  5:41 AM
+
+> I suspect that the work on wide_wide_character will in practice turn
+> out to be nearly useless in the short or medium term.
+
+Using Ada for internationalized applications on GNU systems (using GNU
+facilities) almost requires 32 bit Wide_Wide_Character support, since
+GNU uses a 32 bit wchar_t internally.
+
+(See a similar discussion on the GCC development list.)
+
+*************************************************************
+
+From: Robert Dewar
+Sent: Saturday, May 18, 2002  7:32 AM
+
+We have seen zero demand for such functionality, so would not invest any time
+at all in either design or implementation work here. If such a feature is
+added to Ada, I would definitely suggest it be optional.
+
+*************************************************************
+
+From: Florian Weimer
+Sent: Saturday, May 18, 2002  6:00 AM
+
+>>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
+>>     16 bit representation for characters (i.e. wchar_t in C is 16
+>>     bits).
+>
+> Which is fine as it maps directly to Ada's wide character.  I still think
+> that we want to retain the capacity of using 16-bit blobs to represent
+> characters in the BMP, as 99.5% of practical applications will only need the
+> BMP.
+
+Quite a few people have already changed their minds about the 99.5%
+figure (mathematical characters and Plane 14 Language being the
+reason).  Maybe it's true for the character count, but I doubt it for
+the application count.
+
+> Changes to Wide_Character and Wide_String are pretty much out of the
+> question.
+
+Okay, accepted.
+
+> On the other hand, the type that is intended for interfacing with
+> C is Interfaces.C.wchar_array, and it would be straightforward to provide
+> (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
+> Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
+> compiler does).
+
+I doubt that C compilers can use UTF-16 for wchar_t.  You cannot apply
+iswlower() to a single surrogate character. :-/
+
+> I would think that we would want to use UCS-4, since it's an ISO standard.
+> Moreover, UTF-32 has a number of consistency rules (e.g. code points below
+> 16#10ffff#) which seem irrelevant for internal manipulation of strings.
+
+Yes, UCS-4 is indeed the correct encoding form to use.
+
+>>   * We could add Wide_Wide_Character and Wide_Wide_String types to
+>>     package Standard (and extend the Ada.Strings hierarchy), which
+>>     are encoded in UTF-32.
+>
+> Wide_Wide_ types seem like the natural way to add this capability to the
+> language, except that some compilers may not be quite prepared to deal with
+> enumeration types with 2 ** 32 literals (ours isn't).
+
+Ah, this could be a problem indeed, together with the large
+universal_integer returned by Wide_Wide_Character'Pos.
+
+>> (the UTF-16 string type could become a
+>> second-class citizen, though, without full support in the Ada.Strings
+>> hierarchy).
+>
+> As far as I can tell, there is no support for UTF-16, only for UCS-2.
+
+At the moment, yes, but I think we need some UTF-16 support, too,
+because many operating system interfaces use it.
+
+> Anyway, I don't think it is reasonable to force applications to go to the
+> full 32-bit overhead just because they use, say, the French OE ligature.
+
+Most people apparently refuse to use Wide_Character, too, for the same
+reason.  They either go for ISO 8859-15 or Windows 1252, or don't use
+the OE ligature at all.
+
+> External representation is best handled by Text_IO and friends, typically by
+> using a form parameter to specify the encoding (and there are many more
+> encodings than just UCS and UTF).
+
+There was a recent discussion to add other I/O facilities.  UTF-8 is
+becoming more and more common in the Internet context, and often, you
+can determine the encoding of a file only after reading the first
+couple of lines (think of a MIME-encoded mail message).  Furthermore,
+UTF-8 already plays an important role in interacting with other
+libraries (not written in Ada).
+
+> (Where do we stop?  Do we want to require all validated compilers to
+> support UTF-8?
+
+Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?
+
+> What about the Chinese Big5 or the JIS encodings?)
+
+If there is support for UCS-4, handling these encodings could be
+performed by a mechanism similar to POSIX iconv().
+
+*************************************************************
+
+From: Robert Dewar
+Sent: Saturday, May 18, 2002  7:43 AM
+
+> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?
+
+Why not = because there is no real demand. Especially this time around we need
+to be very careful not to require things that no one is really interested in.
+If we do this, the vendors will simply ignore any new standard. In fact I
+think if there is a new standard, it will only be implemented as a result of
+direct customer interest in features in this standard. The value of formal
+conformance and validation has largely disappeared from the Ada marketplace
+at this stage (in terms of customer demand). That's not to say that the Ada
+marketplace is not very vital and dynamic, we get dozens of requests for
+enhancements from our users every month, but there is precious little
+intersection between the things users seem to need and want and these
+kinds of discussions.
+
+In GNAT, we put a lot of effort into implementing multiple character sets
+(we just added the new Latin set with the Euro symbol, because customers
+needed that for example). Some of it has been useful (like this Euro
+addition), but mostly these features are of entertainment and advertising
+value only. In fact the only serious user that we have for Wide_Character
+and Wide_String is us (from ASIS :-)
+
+One thing to remember here is that very little is needed in the way of
+language support for fancy character sets (most of the effort in GNAT
+for example for 8-bit sets is in csets, which gives proper case mapping
+for identifiers, and it is easy enough to add new tables to this -- someone
+contributed a new Cyrillic table just a few months ago). Most of the issues
+are representational issues, and the Ada standard has nothing to say about
+source representation (and this should not change in any new standard).
+
+*************************************************************
+
+From: Pascal Leroy
+Sent: Tuesday, May 21, 2002  4:03 AM
+
+> > Which is fine as it maps directly to Ada's wide character.  I still think
+> > that we want to retain the capacity of using 16-bit blobs to represent
+> > characters in the BMP, as 99.5% of practical applications will only need the
+> > BMP.
+>
+> Quite a few people have already changed their minds about the 99.5%
+> figure (mathematical characters and Plane 14 Language being the
+> reason).  Maybe it's true for the character count, but I doubt it for
+> the application count.
+
+Remember, we are talking Ada applications here.  There are probably many
+applications out there that deal with mathematical symbols or with Tengwar, but
+I doubt that they are written in Ada.
+
+> > External representation is best handled by Text_IO and friends, typically by
+> > using a form parameter to specify the encoding (and there are many more
+> > encodings than just UCS and UTF).
+>
+> There was a recent discussion to add other I/O facilities.  UTF-8 is
+> becoming more and more common in the Internet context, and often, you
+> can determine the encoding of a file only after reading the first
+> couple of lines (think of a MIME-encoded mail message).  Furthermore,
+> UTF-8 already plays an important role in interacting with other
+> libraries (not written in Ada).
+
+Maybe we need a predefined unit to convert UCS-2 to/from UTF-8.  But then such
+conversion functions could easily be written by the user, too, or provided by
+some public domain stuff.
+
+> > (Where do we stop?  Do we want to require all validated compilers to
+> > support UTF-8?
+>
+> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?
+
+You don't sell many compilers if you don't support 8859-1.  As for UCS-2, well,
+that's pretty much the default representation of wide characters anyway.  Other
+than that, it would seem that we should let the market decide.  Speaking for
+Rational, we have had wide character support for about 7 years, and I don't
+recall seeing a single bug report or request for enhancement on this topic.
+This may indicate that our technology is perfect, but there are other
+explanations ;-).  (As a matter of fact we probably have very few licenses
+installed in countries where 8859-1 is not sufficient to write the native
+language -- ignoring the problem with the OE ligature in French.)
+
+One option would be to add Wide_Wide_Character in a new annex, and let users
+decide if they want their vendors to support this annex. Of course, chances are
+that nobody would care, in which case that would be a lot of standardization
+effort for nothing.
+
+*************************************************************
+
+From: Robert Dewar
+Sent: Tuesday, May 21, 2002  4:39 AM
+
+I agree with everything Pascal had to say about wide character. We do have
+one Japanese customer using wide characters, and as I mentioned earlier,
+ASIS uses wide strings to represent source texts, but other than that,
+we have heard very little about wide strings. The only real input we have
+got from customers on character set issues was the request to support
+Latin-9 with the new Euro symbol and we got contributed tables for
+Cyrillic from a Russian enthusiast (not a customer, but it seemed a
+harmless addition :-)
+
+*************************************************************
+
+From: Florian Weimer
+Sent: Tuesday, May 21, 2002  1:42 PM
+
+> I agree with everything Pascal had to say about wide character. We do have
+> one Japanese customer using wide characters, and as I mentioned earlier,
+> ASIS uses wide strings to represent source texts, but other than that,
+> we have heard very little about wide strings.
+
+I guess this customer doesn't use Wide_Character in the way it was
+intended (for storing ISO 10646 code positions), so this example is a
+bit dubious.
+
+> The only real input we have got from customers on character set
+> issues was the request to support Latin-9 with the new Euro symbol
+
+Even in this rather innocent case, Wide_Character is no longer using
+UCS-2 with GNAT.
+
+*************************************************************
+
 

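Illustration (not part of the archived thread): two points in the
discussion above are easier to follow with concrete values. One is the
remark that a single surrogate cannot be classified on its own, which is
why a 16-bit type can hold UCS-2 but needs two units per character for
anything outside the BMP. The other is the suggestion that conversion to
UTF-8 is simple enough to be written by the user. The sketch below is in
Python rather than Ada, and the function names utf16_units and utf8_bytes
are hypothetical helpers invented for this note; they are not taken from
any Ada library or from the AI text.

```python
def utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("code point cannot be represented in UTF-16")
    if code_point < 0x10000:
        # BMP character: one unit, identical to its UCS-2 value.
        return [code_point]
    # Outside the BMP: split the 20-bit offset across a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def utf8_bytes(code_point):
    """Return the UTF-8 encoding of one code point (up to 16#10FFFF#)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("code point cannot be represented in UTF-8")
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

With these helpers, the lowercase oe ligature (16#0153#) fits in one
16-bit unit, while a mathematical alphanumeric symbol such as 16#1D400#
needs a surrogate pair, and neither half of that pair is a character by
itself. That is the case a per-unit classification routine cannot handle.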