!standard A.3.2(49)                                    02-01-23  AI95-00285/00
!class amendment 02-01-23
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Latin-9 and Ada.Characters.Handling

!summary

!problem

Latin-9 has been introduced.

!proposal

!discussion

!example

!ACATS test

!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002 4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a variant
of the Latin 1 character set that is designated Latin 9.  A web page
describing Latin 9 can be viewed at:

   http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor
differences between Latin 1 and Latin 9:

   ISO Latin 9 as compared with ISO Latin 1

   The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
   ISO Latin 1 (ISO 8859-1) character set in a few positions only.  The
   euro sign and some national letters used e.g. in French and Finnish
   have been introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9,
analogous to Ada.Characters.Latin_1, to define character constants for this
new character set.

Robert Dewar asked me to post the following remarks from him re Latin-9 and
Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official names
of all characters, and I have copied its abbreviated naming style for the
new characters in Latin-9.

I have a gripe with the RM here.  The setup for Ada.Characters.Latin_1 is
to have separate packages for separate character sets, which makes
perfectly good sense:

   27  An implementation may provide additional packages as children of
       Ada.Characters, to declare names for the symbols of the local
       character set or other character sets.
But for Characters.Handling, we have the odd statement:

   49  If an implementation provides a localized definition of Character
       or Wide_Character, then the effects of the subprograms in
       Characters.Handling should reflect the localizations.  See also
       3.5.2.

which implies that some mysterious transformation happens on this package
(under what circumstances?).

I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd for
   the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program,
   whereas the design for Ada.Characters.Latin_1 etc. seems to accommodate
   this.

My recommendation is that an implementation generate separate packages,
called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
being a renaming of Ada.Characters.Handling_Latin_1 perhaps?).

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002 5:05 PM

> The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
> ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
> sign and some national letters used e.g. in French and Finnish have been
> introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and
that we can now write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002 1:44 AM

Better put that on the agenda for the next ARG.  Ada 2005 should use
Latin 9 rather than Latin 1.  A minor change.  Might be a few
incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002 12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to
write French makes it look very interesting to me.
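The "few positions" at stake here are exactly eight.  As an illustrative
sketch (using Python's built-in codec names, independent of any Ada
package), they can be enumerated by decoding each byte under both
character sets and comparing:

```python
# The eight code positions where ISO 8859-15 (Latin-9) differs from
# ISO 8859-1 (Latin-1): the euro sign and letters such as OE/oe/Y-diaeresis
# replace rarely used symbols like the currency sign and the fractions.
changed = [b for b in range(0x00, 0x100)
           if bytes([b]).decode("iso-8859-1")
              != bytes([b]).decode("iso-8859-15")]

for b in changed:
    print(f"16#{b:02X}#: Latin-1 {bytes([b]).decode('iso-8859-1')!r}"
          f"  ->  Latin-9 {bytes([b]).decode('iso-8859-15')!r}")

print(len(changed))  # 8
```

The loop prints the positions 16#A4#, 16#A6#, 16#A8#, 16#B4#, 16#B8#,
16#BC#, 16#BD#, and 16#BE#; everything else is unchanged, which is why the
two sets are so easily confused in practice.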
On the other hand, it is not too useful for Ada to support Latin 9 if the
OSes don't: if I emit the character OE and it prints out as 1/4 on my
screen, I didn't gain much.

So while I agree that we should consider supporting Latin 9 _in_addition_
to Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1,
because I am ready to bet that we will still have Latin 1 OSes ten years
from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002 1:33 AM

It was somewhat of a jokey suggestion, as I am sure you are aware.

Indeed I had a big problem when writing my book and displaying the type
Character.  I wrote it in QuarkXpress on a PC and it was fine.  The
publishers moved it to a Mac before printing and some characters came out
wrong.  One of them came out as a picture of an apple.  Moreover, someone
had bitten a lump out of it.  So much for standards, I thought.

But supporting Latin-9 would be nice.  All those adverts on the Paris
Metro for eating an oeuf can then be printed properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002 1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual
where type Character lives.  I think Randy had some trouble with the
updated RM, too.

At least we didn't try to show type Wide_Character in its full glory. ;-)
7-bit ASCII will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002 2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose
from. ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g.
> Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
> being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

That makes sense.

But I think the RM statement you complain about is envisioning a
nonstandard version of Standard.[Wide_]Character, which is a separate
issue.  I don't see that as a big deal -- if you don't think it's a good
idea, don't implement any such thing.  I tend to agree that compiler
switches and the like shouldn't normally be meddling with the semantics of
packages Standard and Characters.Handling without a very good reason.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002 6:58 AM

> But I think the RM statement you complain about is envisioning a
> nonstandard version of Standard.[Wide_]Character, which is a separate
> issue.

If you use Latin 9 for Standard.Character, this is certainly a nonstandard
version, and Ada.Characters.Handling has to be modified to remain useful.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002 6:58 AM

> Better put that on the agenda for the next ARG. Ada 2005
> should use Latin 9 rather than Latin 1. A minor change.
> Might be a few incompatibilities.

I disagree.  With Latin 9, the mapping from Character to Wide_Character is
less straightforward, and this could have unexpected results.  OTOH, it
seems that Wide_Character is not widely used (unless you are forced to do
so by ASIS), so this might not matter much.

In addition, we really should add Wide_Wide_Character (which covers the
sixteen additional planes), or make Wide_Character itself wider.
Otherwise, using Unicode with standard Ada will be rather painful.

*************************************************************

From: Florian Weimer
Sent: Saturday, April 20, 2002 3:18 AM

ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits, and
10646-2:2001 allocates characters outside the Basic Multilingual Plane.
Not too long ago, quite a few people assumed that characters beyond the
BMP would be interesting only for rather esoteric scholarly use (Linear B
is a perfect example).  However, we have now got at least two sets of code
positions outside the BMP which will see more widespread use eventually:
the mathematical alphabets and Plane 14 Language Tags (which are required
to make some Japanese people happy who fear that Japanese characters are
rendered using Chinese glyphs).

Therefore, I think Ada 200X should somehow support characters outside the
BMP.  A few random thoughts (sorry, I'm probably not using strict ISO
10646 terminology):

* Several major vendors have adopted ISO 10646-1:1993 early, using a 16
  bit representation for characters (i.e. wchar_t in C is 16 bits).
  These vendors include Sun (Java) and Microsoft (Windows), and probably
  most proprietary UNIX vendors.

  These vendor implementations now cover the code positions beyond the
  BMP using UTF-16, which uses surrogate pairs (a single character is
  represented using two 16 bit values from reserved ranges in the BMP).
  UTF-16 has got a few drawbacks: the ordering (in terms of UCS code
  positions) is no longer lexicographic (which leads us to such brain
  damage as CESU-8), dealing with individual characters is complicated,
  and you cannot implement the C wide character functions properly.

  For Ada, numerous changes would be required if we want to expose the
  UTF-16 representation to programmers, for example by declaring
  Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
  longer be arrays of characters indexed by position).
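The surrogate-pair mechanism described above is plain bit arithmetic.  A
small Python sketch (illustrative only, not tied to any Ada proposal)
shows how a code position beyond the BMP splits into two 16-bit values
from the reserved ranges:

```python
def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code position beyond the BMP (U+10000..U+10FFFF) into the
    UTF-16 high/low surrogate pair that represents it."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return high, low

# U+1D400 (MATHEMATICAL BOLD CAPITAL A), one of the mathematical
# alphabet characters mentioned above:
print([hex(u) for u in utf16_surrogates(0x1D400)])  # ['0xd835', '0xdc00']
```

Note that the high surrogates (16#D800#..16#DBFF#) sort below 16#E000#
as 16-bit values yet represent code positions above 16#FFFF#, which is
exactly why naive 16-bit-unit comparison of UTF-16 text is not in UCS
code-position order.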
  GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
  characters in a single 32 bit value, that is, UTF-32), and while this is
  certainly not the "industry standard" (it is encouraged by ISO
  9899:1999, though), I really hope we can use this approach (UTF-32
  internal representation) for Ada, as it simplifies things considerably,
  especially if we want to add character properties support (see below).

* We could add Wide_Wide_Character and Wide_Wide_String types to package
  Standard (and extending the Ada.Strings hierarchy), which are encoded
  in UTF-32.

  I don't know if this is necessary.  IIRC, Robert Dewar once said that
  the only applications using Wide_Character are based on ASIS, where
  using Wide_Character is not really voluntary.  Maybe it is possible to
  bump Wide_Character'Size to 32 bits instead, without really breaking
  backwards compatibility.

  Of course, we would need a way to convert UTF-32 strings to UTF-16
  strings and vice versa (the UTF-16 string type could become a
  second-class citizen, though, without full support in the Ada.Strings
  hierarchy).

* External representation of UCS characters is rapidly moving towards
  UTF-8 (especially in Internet standards).

  Ada should provide an interface for converting between the wide string
  type(s) and UTF-8 octet sequences.  It should be possible to use string
  literals where UTF-8 strings are expected.

* Supporting higher levels of Unicode (e.g. accessing the character
  properties database, normalization forms) would be interesting, too.

  Such documents will eventually follow in the ISO 10646 series, but I
  don't know if the ISO standard will be ready for Ada 200X.  Currently,
  only the Unicode Consortium has standardized or documented issues like
  character properties or terminal behavior in detail.  I don't know how
  ISO reacts if ISO standards refer to competing standardization efforts.
  IEEE POSIX.1 (and probably, or already, ISO POSIX) standardizes the BSD
  sockets interface, and not OSI, so maybe this isn't an issue.

  In any case, this point is mostly a library issue which can be addressed
  by a community implementation effort; it does not require changes in the
  Ada language (adding Wide_Wide_Character does, for example).

*************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002 8:32 AM

> ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
> and 10646-2:2001 allocates characters outside the Basic Multilingual
> Plane.
>
> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

The normalization of new character sets (both as part of 10646 and of
8859) was actually discussed at the last ARG meeting, and I was given an
action item to somehow integrate them in the language, probably as some
kind of amendment AI.

> A few random thoughts (sorry, I'm probably not using strict ISO 10646
> terminology):
>
> * Several major vendors have adopted ISO 10646-1:1993 early, using a
>   16 bit representation for characters (i.e. wchar_t in C is 16
>   bits).

Which is fine as it maps directly to Ada's wide character.  I still think
that we want to retain the capacity of using 16-bit blobs to represent
characters in the BMP, as 99.5% of practical applications will only need
the BMP.

> For Ada, numerous changes would be required if we want to expose the
> UTF-16 representation to programmers, for example by declaring
> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
> longer be arrays of characters indexed by position).

Changes to Wide_Character and Wide_String are pretty much out of the
question.
On the other hand, the type that is intended for interfacing with C is
Interfaces.C.wchar_array, and it would be straightforward to provide (in
some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the
C compiler does).

> I really hope we can use this approach (UTF-32
> internal representation) for Ada, as it simplifies things
> considerably, especially if we want to add character properties
> support (see below).

I would think that we would want to use UCS-4, since it's an ISO standard.
Moreover, UTF-32 has a number of consistency rules (e.g. code points below
16#10ffff#) which seem irrelevant for internal manipulation of strings.

> * We could add Wide_Wide_Character and Wide_Wide_String types to
>   package Standard (and extending the Ada.Strings hierarchy), which
>   are encoded in UTF-32.

Wide_Wide_ types seem like the natural way to add this capability to the
language, except that some compilers may not be quite prepared to deal
with enumeration types with 2 ** 32 literals (ours isn't).

> (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

As far as I can tell, there is no support for UTF-16, only for UCS-2.

Anyway, I don't think it is reasonable to force applications to go to the
full 32-bit overhead just because they use, say, the French OE ligature.

> * External representation of UCS characters is rapidly moving
>   towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use
> string literals where UTF-8 strings are expected.

External representation is best handled by Text_IO and friends, typically
by using a form parameter to specify the encoding (and there are many more
encodings than just UCS and UTF).
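The wide-string-to-UTF-8 conversion requested above is mechanical enough
that either a vendor or a user could supply it.  A Python sketch of the
per-character encoding step (an illustration of the UTF-8 octet layout,
not a proposed Ada interface):

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one UCS code position (up to 21 bits) as UTF-8 octets."""
    if code_point < 0x80:                        # 1 octet:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                       # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                     # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 4 octets for beyond-BMP positions
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# The euro sign, U+20AC, encodes as three octets:
print(utf8_bytes(0x20AC).hex())  # e282ac
```

Decoding is the mirror image; the point is simply that a
Wide_Wide_String-to-octet-sequence conversion needs no character tables at
all, only this bit arithmetic per character.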
The ARG won't get into the business of specifying the details of the form
parameter, so this is something that will remain non-portable for the
foreseeable future.  (Where do we stop?  Do we want to require all
validated compilers to support UTF-8?  What about the Chinese Big5 or the
JIS encodings?)

> * Supporting higher levels of Unicode (e.g. accessing the character
>   properties database, normalization forms) would be interesting,
>   too.

We certainly don't want to get into that business.  The designers of Ada
95 wisely decided to lump all of the characters in the range 16#0100# ..
16#FFFD# into the category special_character, so that they don't have to
decide which is a letter, a number, etc.  Similarly, they didn't provide
classification functions or upper/lower conversions for wide characters.
This seems reasonable if we don't want to have to amend Ada each time a
bunch of characters are added to 10646.

*************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002 7:31 PM

> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

I agree.

> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
> characters in a single 32 bit value, that is, UTF-32), and while this is
> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
> though), I really hope we can use this approach (UTF-32 internal
> representation) for Ada, as it simplifies things considerably, especially
> if we want to add character properties support (see below).

I agree very strongly!

> * We could add Wide_Wide_Character and Wide_Wide_String types to
>   package Standard (and extending the Ada.Strings hierarchy), which
>   are encoded in UTF-32.

I must say I would prefer the identifiers Universal_Character and
Universal_String.  I see the logic of Wide_Wide_ but it seems clumsy!

> I don't know if this is necessary.
> IIRC, Robert Dewar once said that
> the only applications using Wide_Character are based on ASIS, where
> using Wide_Character is not really voluntary.  Maybe it is possible
> to bump Wide_Character'Size to 32 bits instead, without really
> breaking backwards compatibility.

I disagree with this idea.

> Of course, we would need a way to convert UTF-32 strings to UTF-16
> strings and vice versa (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

Possibly these support packages should be in an optional annex.

> * External representation of UCS characters is rapidly moving
>   towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use string
> literals where UTF-8 strings are expected.
>
> * Supporting higher levels of Unicode (e.g. accessing the character
>   properties database, normalization forms) would be interesting,
>   too.

Again, perhaps all this should really be in (or moved into) an optional
annex.

*************************************************************

From: Robert Dewar
Sent: Wednesday, April 24, 2002 9:50 PM

I suspect that the work on wide_wide_character will in practice turn out
to be nearly useless in the short or medium term.

We certainly put in a lot of work in GNAT in implementing wide characters
with many different representation schemes, but this feature has been very
little used (ASIS being the main use :-).

In practice I think the 16-bit character type defined in Ada now will be
adequate for almost all uses, and I see no reason in requiring
implementations to go beyond this in the absence of real market demand.

Yes, it's fun to talk about character set issues (after all, I was chair
of the CRG, so I appreciate this), but there is no point in increasing
implementation burdens unless it's really valuable.
I would just give clear permission for an implementation to add additional
character types in Standard (indeed that permission exists today in Ada
95), and leave it at that.

*************************************************************

From: John Barnes
Sent: Thursday, April 25, 2002 1:46 AM

The BSI is looking at character set issues across languages, and your
message reminded me of the CRG.  Was there ever a final report that I
could refer to?

*************************************************************

From: Robert Dewar
Sent: Thursday, April 26, 2002 10:25 PM

I think there was a final report; perhaps Jim could track it down.

*************************************************************

From: Randy Brukardt
Sent: Thursday, April 25, 2002 3:44 PM

> We certainly put in a lot of work in GNAT in implementing wide
> character with many different representation schemes, but this
> feature has been very little used (ASIS being the main use :-).

To add another data point: Claw was designed so that a wide character
version could be easily created.  But we've never implemented that
version, mainly because we've never had a paying customer ask for it.  So
I have to wonder how important "Really_Wide_Character" would be.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002 5:41 AM

> I suspect that the work on wide_wide_character will in practice turn
> out to be nearly useless in the short or medium term.

Using Ada for internationalized applications on GNU systems (using GNU
facilities) almost requires 32 bit Wide_Wide_Character support, since GNU
uses a 32 bit wchar_t internally.  (See a similar discussion on the GCC
development list.)

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002 7:32 AM

We have seen zero demand for such functionality, so would not invest any
time at all in either design or implementation work here.
If such a feature is added to Ada, I would definitely suggest it be
optional.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002 6:00 AM

>> * Several major vendors have adopted ISO 10646-1:1993 early, using a
>>   16 bit representation for characters (i.e. wchar_t in C is 16
>>   bits).
>
> Which is fine as it maps directly to Ada's wide character. I still think
> that we want to retain the capacity of using 16-bit blobs to represent
> characters in the BMP, as 99.5% of practical applications will only need
> the BMP.

Quite a few people have already changed their minds about the 99.5% figure
(mathematical characters and Plane 14 Language Tags being the reason).
Maybe it's true for the character count, but I doubt it for the
application count.

> Changes to Wide_Character and Wide_String are pretty much out of the
> question.

Okay, accepted.

> On the other hand, the type that is intended for interfacing with
> C is Interfaces.C.wchar_array, and it would be straightforward to
> provide (in some new child of Interfaces.C, I guess) subprograms to
> convert a 32-bit Wide_Wide_String to a wchar_array (and back) using
> UTF-16 (or whatever the C compiler does).

I doubt that C compilers can use UTF-16 for wchar_t.  You cannot apply
iswlower() to a single surrogate character. :-/

> I would think that we would want to use UCS-4, since it's an ISO
> standard.  Moreover, UTF-32 has a number of consistency rules (e.g. code
> points below 16#10ffff#) which seem irrelevant for internal manipulation
> of strings.

Yes, UCS-4 is indeed the correct encoding form to use.

>> * We could add Wide_Wide_Character and Wide_Wide_String types to
>>   package Standard (and extending the Ada.Strings hierarchy), which
>>   are encoded in UTF-32.
>
> Wide_Wide_ types seem like the natural way to add this capability to the
> language, except that some compilers may not be quite prepared to deal
> with enumeration types with 2 ** 32 literals (ours isn't).
Ah, this could be a problem indeed, together with the large
universal_integer returned by Wide_Wide_Character'Pos.

>> (the UTF-16 string type could become a
>> second-class citizen, though, without full support in the Ada.Strings
>> hierarchy).
>
> As far as I can tell, there is no support for UTF-16, only for UCS-2.

At the moment, yes, but I think we need some UTF-16 support, too, because
many operating system interfaces use it.

> Anyway, I don't think it is reasonable to force applications to go to
> the full 32-bit overhead just because they use, say, the French OE
> ligature.

Most people apparently refuse to use Wide_Character, too, for the same
reason.  They either go for ISO 8859-15 or Windows 1252, or don't use the
OE ligature at all.

> External representation is best handled by Text_IO and friends,
> typically by using a form parameter to specify the encoding (and there
> are many more encodings than just UCS and UTF).

There was a recent discussion about adding other I/O facilities.  UTF-8 is
becoming more and more common in the Internet context, and often, you can
determine the encoding of a file only after reading the first couple of
lines (think of a MIME-encoded mail message).  Furthermore, UTF-8 already
plays an important role in interacting with other libraries (not written
in Ada).

> (Where do we stop?  Do we want to require all validated compilers to
> support UTF-8?

Yes, why not?  Why should all compilers support ISO 8859-1?  Why UCS-2?

> What about the Chinese Big5 or the JIS encodings?)

If there is support for UCS-4, handling these encodings could be performed
by a mechanism similar to POSIX iconv().

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002 7:43 AM

> Yes, why not?  Why should all compilers support ISO 8859-1?  Why UCS-2?

Why not = because there is no real demand.  Especially this time around we
need to be very careful not to require things that no one is really
interested in.
If we do this, the vendors will simply ignore any new standard.  In fact I
think that if there is a new standard, it will only be implemented as a
result of direct customer interest in features in this standard.  The
value of formal conformance and validation has largely disappeared from
the Ada marketplace at this stage (in terms of customer demand).

That's not to say that the Ada marketplace is not very vital and dynamic;
we get dozens of requests for enhancements from our users every month, but
there is precious little intersection between the things users seem to
need and want and these kinds of discussions.

In GNAT, we put a lot of effort into implementing multiple character sets
(we just added the new Latin set with the Euro symbol, because customers
needed that, for example).  Some of it has been useful (like this Euro
addition), but mostly these features are of entertainment and advertising
value only.  In fact the only serious user that we have for Wide_Character
and Wide_String is us (from ASIS :-)

One thing to remember here is that very little is needed in the way of
language support for fancy character sets (most of the effort in GNAT for
8-bit sets, for example, is in csets, which gives proper case mapping for
identifiers, and it is easy enough to add new tables to this -- someone
contributed a new Cyrillic table just a few months ago).  Most of the
issues are representational issues, and the Ada standard has nothing to
say about source representation (and this should not change in any new
standard).

*************************************************************

From: Pascal Leroy
Sent: Tuesday, May 21, 2002 4:03 AM

> > Which is fine as it maps directly to Ada's wide character. I still
> > think that we want to retain the capacity of using 16-bit blobs to
> > represent characters in the BMP, as 99.5% of practical applications
> > will only need the BMP.
>
> Quite a few people have already changed their minds about the 99.5%
> figure (mathematical characters and Plane 14 Language Tags being the
> reason).  Maybe it's true for the character count, but I doubt it for
> the application count.

Remember, we are talking Ada applications here.  There are probably many
applications out there that deal with mathematical symbols or with
Tengwar, but I doubt that they are written in Ada.

> > External representation is best handled by Text_IO and friends,
> > typically by using a form parameter to specify the encoding (and there
> > are many more encodings than just UCS and UTF).
>
> There was a recent discussion about adding other I/O facilities.  UTF-8
> is becoming more and more common in the Internet context, and often,
> you can determine the encoding of a file only after reading the first
> couple of lines (think of a MIME-encoded mail message).  Furthermore,
> UTF-8 already plays an important role in interacting with other
> libraries (not written in Ada).

Maybe we need a predefined unit to convert UCS-2 to/from UTF-8.  But then
such conversion functions could easily be written by the user, too, or
provided by some public domain stuff.

> > (Where do we stop?  Do we want to require all validated compilers to
> > support UTF-8?
>
> Yes, why not?  Why should all compilers support ISO 8859-1?  Why UCS-2?

You don't sell many compilers if you don't support 8859-1.  As for UCS-2,
well, that's pretty much the default representation of wide characters
anyway.  Other than that, it would seem that we should let the market
decide.

Speaking for Rational, we have had wide character support for about 7
years, and I don't recall seeing a single bug report or request for
enhancement on this topic.  This may indicate that our technology is
perfect, but there are other explanations ;-)
(As a matter of fact we probably have very few licenses installed in
countries where 8859-1 is not sufficient to write the native language --
ignoring the problem with the OE ligature in French.)

One option would be to add Wide_Wide_Character in a new annex, and let
users decide if they want their vendors to support this annex.  Of course,
chances are that nobody would care, in which case that would be a lot of
standardization effort for nothing.

*************************************************************

From: Robert Dewar
Sent: Tuesday, May 21, 2002 4:39 AM

I agree with everything Pascal had to say about wide character.  We do
have one Japanese customer using wide characters, and as I mentioned
earlier, ASIS uses wide strings to represent source texts, but other than
that, we have heard very little about wide strings.

The only real input we have got from customers on character set issues was
the request to support Latin-9 with the new Euro symbol, and we got
contributed tables for Cyrillic from a Russian enthusiast (not a customer,
but it seemed a harmless addition :-)

*************************************************************

From: Florian Weimer
Sent: Tuesday, May 21, 2002 1:42 PM

> I agree with everything Pascal had to say about wide character.  We do
> have one Japanese customer using wide characters, and as I mentioned
> earlier, ASIS uses wide strings to represent source texts, but other
> than that, we have heard very little about wide strings.

I guess this customer doesn't use Wide_Character in the way it was
intended (for storing ISO 10646 code positions), so this example is a bit
dubious.

> The only real input we have got from customers on character set
> issues was the request to support Latin-9 with the new Euro symbol

Even in this rather innocent case, Wide_Character is no longer using UCS-2
with GNAT.

*************************************************************