!standard A.3.2(49)                                   02-01-23  AI95-00285/00
!class amendment 02-01-23
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Latin-9 and Ada.Characters.Handling

!summary

!problem

Latin-9 has been introduced.

!proposal

!discussion

!example

!ACATS test

!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002  4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a
variant of the Latin 1 character set that is designated Latin 9.

A web page describing Latin 9 can be viewed at:

   http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor
differences between Latin 1 and Latin 9:

   ISO Latin 9 as compared with ISO Latin 1

   The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
   ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
   sign and some national letters used e.g. in French and Finnish have been
   introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9,
analogous to Ada.Characters.Latin_1, to define character constants for
this new character set.

Robert Dewar asked me to post the following remarks from him re Latin-9
and Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official
names of all characters, and I have copied its abbreviated naming style
for the new characters in Latin-9.

I have a gripe with the RM here. The setup for Ada.Characters.Latin_1
is to have separate packages for separate character sets, which makes
perfectly good sense:

   27   An implementation may provide additional packages as children of
        Ada.Characters, to declare names for the symbols of the local
        character set or other character sets.

But for Characters.Handling, we have the odd statement:

   49   If an implementation provides a localized definition of Character or
        Wide_Character, then the effects of the subprograms in
        Characters.Handling should reflect the localizations. See also 3.5.2.

which implies that some mysterious transformation happens on this package
(under what circumstances?)

I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd for
   the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program, whereas
   the design for Ada.Characters.Latin_1 etc seems to accommodate this.

My recommendation is that an implementation generate separate packages,
called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002  5:05 PM

> The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
> ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
> sign and some national letters used e.g. in French and Finnish have been
> introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and that
we now can write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002  1:44 AM

Better put that on the agenda for the next ARG.  Ada 2005
should use Latin 9 rather than Latin 1. A minor change.
Might be a few incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002  12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to write
French makes it look very interesting to me.
On the other hand, it is not too useful for Ada to support Latin 9 if the OSes
don't: if I emit the character OE and it prints out as 1/4 on my screen, I
didn't gain much.

So while I agree that we should consider supporting Latin 9 _in_addition_ to
Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I
am ready to bet that we will still have Latin 1 OSes ten years from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002  1:33 AM

It was somewhat of a jokey suggestion as I am sure you are
aware.

Indeed I had a big problem when writing my book and
displaying the type Character. I wrote it in QuarkXPress on
a PC and it was fine. The publishers moved it to a Mac
before printing and some characters came out wrong. One of
them came out as a picture of an apple. Moreover, someone
had bitten a lump out of it. So much for standards I
thought.

But supporting Latin-9 would be nice. All those adverts on
the Paris Metro for eating an oeuf can then be printed
properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002  1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual
where type Character lives.  I think Randy had some trouble with the
updated RM, too.  At least we didn't try to show type Wide_Character in
its full glory.  ;-)

7-bit ASCII will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002  2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose
from.  ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g.
> Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
> being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

That makes sense.

But I think the RM statement you complain about is envisioning a
nonstandard version of Standard.[Wide_]Character, which is a separate
issue.  I don't see that as a big deal -- if you don't think it's a good
idea, don't implement any such thing.  I tend to agree that compiler
switches and the like shouldn't normally be meddling with the semantics
of packages Standard and Characters.Handling without a very good reason.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> But I think the RM statement you complain about is envisioning a
> nonstandard version of Standard.[Wide_]Character, which is a separate
> issue.

If you use Latin 9 for Standard.Character, this is certainly a
non-standard version, and Ada.Characters.Handling has to be modified
to remain useful.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> Better put that on the agenda for the next ARG.  Ada 2005
> should use Latin 9 rather than Latin 1. A minor change.
> Might be a few incompatibilities.

I disagree.  With Latin 9, the mapping from Character to
Wide_Character is less straightforward, and this could have unexpected
results.  OTOH, it seems that Wide_Character is not widely used
(unless you are forced to do so by ASIS), so this might not matter
much.

In addition, we really should add Wide_Wide_Character (which covers
the sixteen additional planes), or make Wide_Character itself wider.
Otherwise, using Unicode with standard Ada will be rather painful.

*************************************************************

From: Florian Weimer
Sent: Saturday, April 20, 2002  3:18 AM

ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
and 10646-2:2001 allocates characters outside the Basic Multilingual
Plane.
Not too long ago, quite a few people assumed that characters beyond
the BMP would be interesting only for rather esoteric scholarly use
(Linear B is a perfect example).  However, we now have got at least
two different sets of code positions outside the BMP which will see
more widespread use eventually: the mathematical alphabets and Plane
14 Language Tags (which are required to make some Japanese people
happy who fear that Japanese characters are rendered using Chinese
glyphs).

Therefore, I think Ada 200X should somehow support characters outside
the BMP.

A few random thoughts (sorry, I'm probably not using strict ISO 10646
terminology):

* Several major vendors have adopted ISO 10646-1:1993 early, using a
  16 bit representation for characters (i.e. wchar_t in C is 16
  bits).  These vendors include Sun (Java) and Microsoft (Windows),
  and probably most proprietary UNIX vendors.

  These vendor implementations now cover the code positions beyond the
  BMP using UTF-16, which uses surrogate pairs (a single character is
  represented using two 16 bit values from reserved ranges in the
  BMP).  UTF-16 has got a few drawbacks: the ordering (in terms of UCS
  code positions) is no longer lexicographic (which leads us to such
  brain damage as CESU-8), dealing with individual characters is
  complicated, and you cannot implement the C wide character functions
  properly.

  For Ada, numerous changes would be required if we want to expose the
  UTF-16 representation to programmers, for example by declaring
  Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
  longer be arrays of characters indexed by position).
  GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding
  UCS characters in a single 32 bit value, that is, UTF-32), and while
  this is certainly not the "industry standard" (it is encouraged by
  ISO 9899:1999, though), I really hope we can use this approach
  (UTF-32 internal representation) for Ada, as it simplifies things
  considerably, especially if we want to add character properties
  support (see below).

* We could add Wide_Wide_Character and Wide_Wide_String types to
  package Standard (and extending the Ada.Strings hierarchy), which
  are encoded in UTF-32.

  I don't know if this is necessary.  IIRC, Robert Dewar once told that
  the only applications using Wide_Character are based on ASIS, where
  using Wide_Character is not really voluntary.  Maybe it is possible
  to bump Wide_Character'Size to 32 bits instead, without really
  breaking backwards compatibility.

  Of course, we would need a way to convert UTF-32 strings to UTF-16
  strings and vice versa (the UTF-16 string type could become a
  second-class citizen, though, without full support in the Ada.Strings
  hierarchy).

* External representation of UCS characters is rapidly moving
  towards UTF-8 (especially in Internet standards).

  Ada should provide an interface for converting between the wide string
  type(s) and UTF-8 octet sequences.  It should be possible to use
  string literals where UTF-8 strings are expected.

* Supporting higher levels of Unicode (e.g. accessing the character
  properties database, normalization forms) would be interesting,
  too.

  Such documents will eventually follow in the ISO 10646 series, but I
  don't know if the ISO standard will be ready for Ada 200X.
  Currently, only the Unicode Consortium has standardized or
  documented issues like character properties or terminal behavior in
  detail.  I don't know how ISO reacts if ISO standards refer to
  competing standardization efforts.
  IEEE POSIX.1 (and probably, or already, ISO POSIX) standardizes
  the BSD sockets interface, and not OSI, so maybe this isn't an
  issue.

  In any case, this point is mostly a library issue which can be
  addressed by a community implementation effort, it does not require
  changes in the Ada language (adding Wide_Wide_Character does, for
  example).

*************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002  8:32 AM

> ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
> and 10646-2:2001 allocates characters outside the Basic Multilingual
> Plane.
>
> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

The normalization of new character sets (both as part of 10646 and of 8859)
was actually discussed at the last ARG meeting, and I was given an action
item to somehow integrate them in the language, probably as some kind of
amendment AI.

> A few random thoughts (sorry, I'm probably not using strict ISO 10646
> terminology):
>
> * Several major vendors have adopted ISO 10646-1:1993 early, using a
>   16 bit representation for characters (i.e. wchar_t in C is 16
>   bits).

Which is fine as it maps directly to Ada's wide character.  I still think
that we want to retain the capacity of using 16-bit blobs to represent
characters in the BMP, as 99.5% of practical applications will only need
the BMP.

> For Ada, numerous changes would be required if we want to expose the
> UTF-16 representation to programmers, for example by declaring
> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
> longer be arrays of characters indexed by position).

Changes to Wide_Character and Wide_String are pretty much out of the question.
On the other hand, the type that is intended for interfacing with C is
Interfaces.C.wchar_array, and it would be straightforward to provide (in
some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the
C compiler does).

> I really hope we can use this approach (UTF-32
> internal representation) for Ada, as it simplifies things
> considerably, especially if we want to add character properties
> support (see below).

I would think that we would want to use UCS-4, since it's an ISO standard.
Moreover, UTF-32 has a number of consistency rules (e.g. code points below
16#10ffff#) which seem irrelevant for internal manipulation of strings.

> * We could add Wide_Wide_Character and Wide_Wide_String types to
>   package Standard (and extending the Ada.Strings hierarchy), which
>   are encoded in UTF-32.

Wide_Wide_ types seem like the natural way to add this capability to the
language, except that some compilers may not be quite prepared to deal with
enumeration types with 2 ** 32 literals (ours isn't).

> (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

As far as I can tell, there is no support for UTF-16, only for UCS-2.
Anyway, I don't think it is reasonable to force applications to go to the
full 32-bit overhead just because they use, say, the French OE ligature.

> * External representation of UCS characters is rapidly moving
>   towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use
> string literals where UTF-8 strings are expected.

External representation is best handled by Text_IO and friends, typically
by using a form parameter to specify the encoding (and there are many more
encodings than just UCS and UTF).
The ARG won't get into the business of specifying the details of the form
parameter, so this is something that will remain non-portable for the
foreseeable future.  (Where do we stop?  Do we want to require all
validated compilers to support UTF-8?  What about the Chinese Big5 or the
JIS encodings?)

> * Supporting higher levels of Unicode (e.g. accessing the character
>   properties database, normalization forms) would be interesting,
>   too.

We certainly don't want to get into that business.  The designers of Ada 95
wisely decided to lump all of the characters in the range 16#0100# ..
16#FFFD# into the category special_character, so that they don't have to
decide which is a letter, a number, etc.  Similarly they didn't provide
classification functions or upper/lower conversions for wide characters.
This seems reasonable if we don't want to have to amend Ada each time a
bunch of characters are added to 10646.

*************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002  7:31 PM

> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

I agree.

> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
> characters in a single 32 bit value, that is, UTF-32), and while this is
> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
> though), I really hope we can use this approach (UTF-32 internal
> representation) for Ada, as it simplifies things considerably, especially
> if we want to add character properties support (see below).

I agree very strongly!

> * We could add Wide_Wide_Character and Wide_Wide_String types to
>   package Standard (and extending the Ada.Strings hierarchy), which
>   are encoded in UTF-32.

I must say I would prefer the identifiers Universal_Character and
Universal_String.  I see the logic of Wide_Wide_ but it seems clumsy!

> I don't know if this is necessary.
> IIRC, Robert Dewar once told that
> the only applications using Wide_Character are based on ASIS, where
> using Wide_Character is not really voluntary.  Maybe it is possible
> to bump Wide_Character'Size to 32 bits instead, without really
> breaking backwards compatibility.

I disagree with this idea.

> Of course, we would need a way to convert UTF-32 strings to UTF-16
> strings and vice versa (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

Possibly these support packages should be in an optional annex.

> * External representation of UCS characters is rapidly moving
>   towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use string
> literals where UTF-8 strings are expected.
>
> * Supporting higher levels of Unicode (e.g. accessing the character
>   properties database, normalization forms) would be interesting,
>   too.

Again, perhaps all this should really be in (or moved into) an optional
annex.

*************************************************************

From: Robert Dewar
Sent: Wednesday, April 24, 2002  9:50 PM

I suspect that the work on wide_wide_character will in practice turn out to
be nearly useless in the short or medium term.  We certainly put in a lot
of work in GNAT in implementing wide character with many different
representation schemes, but this feature has been very little used (ASIS
being the main use :-).  In practice I think the 16-bit character type
defined in Ada now will be adequate for almost all use, and I see no reason
in requiring implementations to go beyond this in the absence of real
market demand.

Yes, it's fun to talk about character set issues (after all I was chair of
the CRG, so I appreciate this), but there is no point in increasing
implementation burdens unless it's really valuable.
I would just give clear permission for an implementation to add additional
character types in standard (indeed that permission exists today in Ada 95),
and leave it at that.

*************************************************************

From: John Barnes
Sent: Thursday, April 25, 2002  1:46 AM

The BSI is looking at character set issues across languages
and your message reminded me of the CRG.  Was there ever a
final report that I could refer to?

*************************************************************

From: Robert Dewar
Sent: Thursday, April 26, 2002  10:25 PM

I think there was a final report, perhaps Jim could track it down.

*************************************************************

From: Randy Brukardt
Sent: Thursday, April 25, 2002  3:44 PM

> We certainly put in a lot of work in GNAT in implementing wide
> character with many different representation schemes, but this
> feature has been very little used (ASIS being the main use :-).

To add another data point: Claw was designed so that a wide character
version could be easily created.  But we've never implemented that version,
mainly because we've never had a paying customer ask for it.  So I have to
wonder how important "Really_Wide_Character" would be.

*************************************************************
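
The remark earlier in this thread that, with Latin 9, the mapping from
Character to Wide_Character is less straightforward can be illustrated
with a minimal sketch.  It assumes that Character values are to be
interpreted as ISO 8859-15.  The names Latin_9_Demo and Latin_9_To_Wide
are hypothetical, invented for this illustration; they are not part of the
RM, nor of the GNAT package Ada.Characters.Latin_9 mentioned above.  Only
the eight code positions where ISO 8859-15 differs from ISO 8859-1 need
special handling; every other position maps to the ISO 10646 code position
with the same value.

```ada
with Ada.Text_IO;

procedure Latin_9_Demo is

   --  Hypothetical helper (an assumption for illustration, not an RM or
   --  GNAT facility): interpret C as an ISO 8859-15 (Latin 9) character
   --  and return the corresponding ISO 10646 BMP character.
   function Latin_9_To_Wide (C : Character) return Wide_Character is
   begin
      case C is
         when Character'Val (16#A4#) =>   --  euro sign
            return Wide_Character'Val (16#20AC#);
         when Character'Val (16#A6#) =>   --  capital S with caron
            return Wide_Character'Val (16#0160#);
         when Character'Val (16#A8#) =>   --  small s with caron
            return Wide_Character'Val (16#0161#);
         when Character'Val (16#B4#) =>   --  capital Z with caron
            return Wide_Character'Val (16#017D#);
         when Character'Val (16#B8#) =>   --  small z with caron
            return Wide_Character'Val (16#017E#);
         when Character'Val (16#BC#) =>   --  capital ligature OE
            return Wide_Character'Val (16#0152#);
         when Character'Val (16#BD#) =>   --  small ligature oe
            return Wide_Character'Val (16#0153#);
         when Character'Val (16#BE#) =>   --  capital Y with diaeresis
            return Wide_Character'Val (16#0178#);
         when others =>
            --  All remaining positions coincide with Latin 1, whose
            --  code positions equal those of ISO 10646.
            return Wide_Character'Val (Character'Pos (C));
      end case;
   end Latin_9_To_Wide;

   OE : constant Character := Character'Val (16#BC#);

begin
   --  Print the ISO 10646 code positions (in decimal) for the Latin 9
   --  OE ligature and for an unaffected character.
   Ada.Text_IO.Put_Line
     (Integer'Image (Wide_Character'Pos (Latin_9_To_Wide (OE))));
   Ada.Text_IO.Put_Line
     (Integer'Image (Wide_Character'Pos (Latin_9_To_Wide ('A'))));
end Latin_9_Demo;
```

Under the Latin 1 interpretation, position 16#BC# is the fraction one
quarter, which is the "OE prints as 1/4" effect described above; a
per-character-set conversion of this kind is one way a separate package
such as the suggested Ada.Characters.Handling_Latin_9 might be organized,
though nothing in this thread fixes that design.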