Version 1.5 of ais/ai-00285.txt


!standard A.3.2(49)          02-09-24 AI95-00285/01
!class amendment 02-01-23
!status work item 02-09-24
!status received 02-01-15
!priority Medium
!difficulty Hard
!subject Latin-9, Ada.Characters.Handling, and 32-bit characters
!summary
!problem
Latin-9 has been introduced by ISO/IEC 8859-15:1999. Moreover, the working draft of ISO/IEC 10646:2003 makes use of planes other than the BMP.
!proposal
In order to support Latin-9, we allow an implementation to provide a package named Ada.Characters.Latin_9, but we strictly restrict its contents to correspond to the characters defined by ISO/IEC 8859-15:1999.
If an application chooses to use both Latin-1 and Latin-9, the package Ada.Characters.Handling is quite problematic, as it seems to assume that Character always corresponds to Latin-1, and it would get the classifications and the conversions wrong for some Latin-9 characters. We deal with this by recognizing that this package is really specific to the character encoding being used, and making it a child of the proper Latin_n package. A library-level renaming is provided for compatibility with existing applications.
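The misclassification can be seen with a minimal sketch, assuming an implementation whose Ada.Characters.Handling follows Latin-1 as in the current standard. The procedure name is hypothetical. Position 189 is the fraction one half in Latin-1 but the lower-case OE ligature in Latin-9:

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Latin_1_Misclassifies is
   --  Position 189: FRACTION ONE HALF in Latin-1, but the lower-case
   --  OE ligature in Latin-9.
   OE : constant Character := Character'Val (189);
begin
   --  A Latin-1 based Ada.Characters.Handling reports a non-letter here,
   --  which is the wrong answer for data encoded in Latin-9.
   Ada.Text_IO.Put_Line
     ("Is_Letter (189) = "
      & Boolean'Image (Ada.Characters.Handling.Is_Letter (OE)));
end Latin_1_Misclassifies;
```

On such an implementation this should print FALSE, whereas a Latin-9 aware Handling package would be expected to answer True.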
The constant Character_Set in Ada.Strings.Wide_Maps.Wide_Constants is similarly problematic as it is defined by reference to Ada.Characters.Handling. We define both the Latin-1 and the Latin-9 sets in Wide_Constants, and provide a renaming for compatibility. <<We should probably give permission to add more constants if an implementation wants to support other character encodings.>>
ISO/IEC 8859-15 defines as a letter and as a ligature. There seems to be no reason to make a distinction between the two, so for the purposes of Ada.Characters.Latin_9.Handling, we classify both as letters.
Note that we are not proposing to change the lexical rules of the language, so it's still the case that only characters from row 00 of the BMP are allowed in identifiers (row 00 of the BMP is not affected by Latin-9).
In order to support 32-bit characters, we allow an implementation to add new declarations to Standard. If it does, it must provide the appropriate predefined units for 32-bit characters, and new attributes to convert discrete values to and from 32-bit strings.
Again, we are not proposing to change the lexical rules of the language, so the character and string literals appearing in the program can only make use of the graphic symbols from the BMP. The characters from other planes cannot be represented in literals; they must be obtained by evaluating more complex expressions; for instance, by evaluating Wide_Wide_Character'Val (16#1001B#) it is possible to access the Linear B syllable NI.
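As a minimal sketch of the expression-based approach, assuming an implementation that declares Wide_Wide_Character and Wide_Wide_String in Standard as this proposal permits (the procedure name is hypothetical):

```ada
with Ada.Text_IO;

procedure Linear_B_Demo is
   --  Assumes Wide_Wide_Character and Wide_Wide_String are available.
   --  The character cannot be written as a literal, so it is obtained
   --  from its code position.
   NI   : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#1001B#);
   Text : constant Wide_Wide_String := (1 => NI);
begin
   --  The value lies outside the BMP, so its position exceeds 16#FFFF#.
   Ada.Text_IO.Put_Line
     ("Position:"
      & Integer'Image (Wide_Wide_Character'Pos (Text (Text'First))));
end Linear_B_Demo;
```

The program prints the decimal position 65563, which is 16#1001B#.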
!wording
An implementation is allowed to provide a library package named Ada.Characters.Latin_9. This package shall be identical to Ada.Characters.Latin_1, except for the following differences:
- It doesn't declare the constants Currency_Sign, Broken_Bar, Diaeresis, Acute, Cedilla, Fraction_One_Quarter, Fraction_One_Half, and Fraction_Three_Quarters.
- It declares the following constants:
    Euro_Sign       : constant Character := '€'; -- Character'Val (164)
    UC_S_Caron      : constant Character := 'Š'; -- Character'Val (166)
    LC_S_Caron      : constant Character := 'š'; -- Character'Val (168)
    UC_Z_Caron      : constant Character := 'Ž'; -- Character'Val (180)
    LC_Z_Caron      : constant Character := 'ž'; -- Character'Val (184)
    UC_OE_Diphthong : constant Character := 'Œ'; -- Character'Val (188)
    LC_OE_Diphthong : constant Character := 'œ'; -- Character'Val (189)
    UC_Y_Diaeresis  : constant Character := 'Ÿ'; -- Character'Val (190)
In addition, an implementation which provides Ada.Characters.Latin_9 must provide two library packages named Ada.Characters.Latin_1.Handling and Ada.Characters.Latin_9.Handling, respectively. Ada.Characters.Latin_1.Handling must have the same contents and semantics as the package Ada.Characters.Handling defined in section A.3.2 of ISO/IEC 8652:1995 with COR.1:2000 (except of course for the library unit name). For compatibility with existing applications, the following library-level renaming must also be provided:
package Ada.Characters.Handling renames
Ada.Characters.Latin_1.Handling;
The package Ada.Characters.Latin_9.Handling has the same specification as Ada.Characters.Latin_1.Handling, but the following semantic differences:
- The function Is_Letter returns True for the characters at positions 166, 168, 180, 184, 188, 189 and 190 (in addition to those for which it is defined to return True in A.3.2(24)).
- The function Is_Lower returns True for the characters at positions 168, 184 and 189 (in addition to those for which it is defined to return True in A.3.2(25)).
- The function Is_Upper returns True for the characters at positions 166, 180, 188 and 190 (in addition to those for which it is defined to return True in A.3.2(26)).
- The function Is_Basic returns True for the characters at positions 188 and 189 (in addition to those for which it is defined to return True in A.3.2(27)).
- The upper-case form of 'ÿ' is 'Ÿ' for the purposes of function To_Upper.
- The function Is_Character returns True if the Wide_Character Item has a name in ISO/IEC 10646 which is the name of some Character in ISO/IEC 8859-15.
- The function To_Character returns the Character which has the same name in ISO/IEC 8859-15 as the Wide_Character Item in ISO/IEC 10646.
- The function To_Wide_Character returns the Wide_Character which has the same name in ISO/IEC 10646 as the Character Item in ISO/IEC 8859-15.
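The rules above can be sketched as follows. This is only an illustration: Is_Letter_L9 and To_Wide_Character_L9 are hypothetical stand-ins for the corresponding functions of Ada.Characters.Latin_9.Handling, built on top of the existing Latin-1 package. The ISO/IEC 10646 code positions in the second function are those of the eight characters that Latin-9 introduces.

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Latin_9_Handling_Sketch is

   --  Hypothetical stand-in for Latin_9.Handling.Is_Letter: the Latin-1
   --  answer, plus the seven positions listed in the wording.
   function Is_Letter_L9 (Item : Character) return Boolean is
      Pos : constant Natural := Character'Pos (Item);
   begin
      case Pos is
         when 166 | 168 | 180 | 184 | 188 | 189 | 190 =>
            return True;
         when others =>
            return Ada.Characters.Handling.Is_Letter (Item);
      end case;
   end Is_Letter_L9;

   --  Hypothetical stand-in for Latin_9.Handling.To_Wide_Character: the
   --  eight positions where Latin-9 differs from Latin-1 map to their
   --  ISO/IEC 10646 code positions; every other position maps to itself.
   function To_Wide_Character_L9 (Item : Character) return Wide_Character is
      Pos : constant Natural := Character'Pos (Item);
   begin
      case Pos is
         when 164    => return Wide_Character'Val (16#20AC#);  --  euro sign
         when 166    => return Wide_Character'Val (16#0160#);  --  S caron
         when 168    => return Wide_Character'Val (16#0161#);  --  s caron
         when 180    => return Wide_Character'Val (16#017D#);  --  Z caron
         when 184    => return Wide_Character'Val (16#017E#);  --  z caron
         when 188    => return Wide_Character'Val (16#0152#);  --  OE
         when 189    => return Wide_Character'Val (16#0153#);  --  oe
         when 190    => return Wide_Character'Val (16#0178#);  --  Y diaeresis
         when others => return Wide_Character'Val (Pos);
      end case;
   end To_Wide_Character_L9;

begin
   Ada.Text_IO.Put_Line
     ("Is_Letter_L9 (189) = "
      & Boolean'Image (Is_Letter_L9 (Character'Val (189))));
   Ada.Text_IO.Put_Line
     ("Euro sign maps to position"
      & Integer'Image
          (Wide_Character'Pos (To_Wide_Character_L9 (Character'Val (164)))));
end Latin_9_Handling_Sketch;
```

The first line reports TRUE and the second reports position 8364 (16#20AC#). Is_Character and To_Character would use the inverse of the same table.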
The declaration of Character_Set in Ada.Strings.Wide_Maps.Wide_Constants is removed. It is replaced by:
    Latin_1_Character_Set : constant Wide_Maps.Wide_Character_Set;
    Latin_9_Character_Set : constant Wide_Maps.Wide_Character_Set;
    Character_Set : Wide_Maps.Wide_Character_Set renames Latin_1_Character_Set;
An implementation is allowed to add the following declarations to package Standard:
    type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF, 00010000, ..., 7FFFFFFF);
    type Wide_Wide_String is array (Positive range <>) of Wide_Wide_Character;
    pragma Pack (Wide_Wide_String);
The type Wide_Wide_Character has 2 ** 31 values. Its first 2 ** 16 positions must have the same contents as type Wide_Character.
If an implementation provides these two types, it must also provide the following packages:
    Ada.Strings.Wide_Wide_Bounded
    Ada.Strings.Wide_Wide_Fixed
    Ada.Strings.Wide_Wide_Maps
    Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants
    Ada.Strings.Wide_Wide_Unbounded
    Ada.Wide_Wide_Text_IO
These packages are similar to their Wide_ equivalents, with Wide_Wide_ substituted for Wide_ everywhere. In addition the following declaration is present in Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants:
Wide_Character_Set : constant Wide_Wide_Maps.Wide_Wide_Character_Set;
It contains each Wide_Wide_Character value in the BMP of ISO/IEC 10646.
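A set with this content could be built as in the following sketch, which assumes that Ada.Strings.Wide_Wide_Maps mirrors Ada.Strings.Wide_Maps (Wide_Wide_Character_Range, To_Set, Is_In). The procedure name and the local constant BMP are hypothetical.

```ada
with Ada.Strings.Wide_Wide_Maps;
with Ada.Text_IO;

procedure BMP_Set_Sketch is
   use Ada.Strings.Wide_Wide_Maps;

   --  Every Wide_Wide_Character whose position is in 0 .. 16#FFFF#,
   --  that is, the Basic Multilingual Plane.
   BMP : constant Wide_Wide_Character_Set :=
     To_Set (Wide_Wide_Character_Range'
               (Low  => Wide_Wide_Character'Val (0),
                High => Wide_Wide_Character'Val (16#FFFF#)));
begin
   --  The euro sign (16#20AC#) is in the BMP.
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Wide_Wide_Character'Val (16#20AC#), BMP)));
   --  Linear B syllable NI (16#1001B#) is not.
   Ada.Text_IO.Put_Line
     (Boolean'Image (Is_In (Wide_Wide_Character'Val (16#1001B#), BMP)));
end BMP_Set_Sketch;
```

The two lines printed are TRUE and FALSE.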
The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width must also be provided. Their definition is similar to that of Wide_Image, Wide_Value and Wide_Width, respectively, with Wide_Character and Wide_String replaced by Wide_Wide_Character and Wide_Wide_String.
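A minimal usage sketch, assuming the implementation provides these attributes with the semantics described (the procedure name is hypothetical):

```ada
with Ada.Text_IO;

procedure Wide_Wide_Image_Sketch is
   --  Assumes the Wide_Wide_Image and Wide_Wide_Value attributes exist.
   Img : constant Wide_Wide_String := Integer'Wide_Wide_Image (42);
   Val : constant Integer := Integer'Wide_Wide_Value (Img);
begin
   --  As with 'Image, a non-negative value gets a leading space,
   --  so the image of 42 has three characters.
   Ada.Text_IO.Put_Line ("Length:" & Integer'Image (Img'Length));
   Ada.Text_IO.Put_Line ("Round trip:" & Integer'Image (Val));
end Wide_Wide_Image_Sketch;
```

The program reports a length of 3 and a round-trip value of 42.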
The semantics of Wide_Image are modified as follows: the image has the same sequence of graphic characters as that defined for S'Wide_Wide_Image if all the graphic characters are defined in Wide_Character; otherwise the sequence of characters is implementation defined (but no shorter than that of S'Wide_Wide_Image for the same value of Arg).
!discussion
See proposal.
!example
!ACATS test
!appendix

From: Gary Dismukes
Sent: Tuesday, January 15, 2002  4:14 PM

Ben Brosgol recently pointed out to us (ACT) the introduction of a
variant of the Latin 1 character set that is designated Latin 9.

A web page describing Latin 9 can be viewed at:

  http://www.cs.tut.fi/~jkorpela/latin9.html

Here's the summary blurb on that page describing the relatively minor
differences between Latin 1 and Latin 9:

  ISO Latin 9 as compared with ISO Latin 1

  The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
  ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
  sign and some national letters used e.g. in French and Finnish have been
  introduced and some rarely used special characters omitted.

We've added a new package to the GNAT library named Ada.Characters.Latin_9,
analogous to Ada.Characters.Latin_1, to define character constants for this
new character set.

Robert Dewar asked me to post the following remarks from him
re Latin-9 and Ada.Characters.Handling:

----------

Note that the Ada package Latin-1 did not exactly follow the official
names of all characters, and I have copied its abbreviated naming style
for the new characters in Latin-9.

I have a gripe with the RM here. The setup for Ada.Characters.Latin_1 is
to have separate packages for separate character sets, which makes perfectly
good sense:

27   An implementation may provide additional packages as children of
Ada.Characters, to declare names for the symbols of the local character set
or other character sets.

But for Characters.Handling, we have the odd statement:

49   If an implementation provides a localized definition of Character or
Wide_Character, then the effects of the subprograms in Characters.Handling
should reflect the localizations.  See also 3.5.2.

which implies that some mysterious transformation happens on this package
(under what circumstances?). I think this is a bad idea for two reasons:

a) it requires specialized mechanisms in the compiler, and it seems odd
for the meaning of this package to depend on some compiler switch etc.

b) it precludes handling multiple character sets in the same program,
whereas the design for Ada.Characters.Latin_1 etc seems to accommodate this.

My recommendation is that an implementation generate separate packages,
called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

Robert Dewar

*************************************************************

From: Pascal Leroy
Sent: Tuesday, January 15, 2002  5:05 PM

>   The ISO Latin 9 (ISO 8859-15) character set differs from the well-known
>   ISO Latin 1 (ISO 8859-1) character set in a few positions only. The euro
>   sign and some national letters used e.g. in French and Finnish have been
>   introduced and some rarely used special characters omitted.

Oh boy, good to see that the OE and oe ligatures are now available, and that
we now can write French without having to use Unicode!

*************************************************************

From: John Barnes
Sent: Wednesday, January 16, 2002  1:44 AM

Better put that on the agenda for the next ARG. Ada 2005
should use Latin 9 rather than Latin 1.  A minor change.
Might be a few incompatibilities.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, January 16, 2002  12:53 PM

As I mentioned in a mail yesterday, the fact that you can use Latin 9 to
write French makes it look very interesting to me.

On the other hand, it is not too useful for Ada to support Latin 9 if the
OSes don't: if I emit the character OE and it prints out as 1/4 on my screen,
I didn't gain much.

So while I agree that we should consider supporting Latin 9 _in_addition_ to
Latin 1 in Ada 05, I don't think Latin 9 should _replace_ Latin 1, because I
am ready to bet that we will still have Latin 1 OSes ten years from now.

*************************************************************

From: John Barnes
Sent: Thursday, January 17, 2002  1:33 AM

It was somewhat of a jokey suggestion as I am sure you are aware.

Indeed I had a big problem when writing my book and
displaying the type Character. I wrote it in QuarkXpress on
a PC and it was fine. The publishers moved it to a Mac
before printing and some characters came out wrong.  One of
them came out as a picture of an apple. Moreover, someone
had bitten a lump out of it. So much for standards I
thought.

But supporting Latin-9 would be nice. All those adverts on
the Paris Metro for eating an oeuf can then be printed
properly.

*************************************************************

From: Bob Duff
Sent: Thursday, January 17, 2002  1:14 PM

> Indeed I had a big problem when writing my book and
> displaying the type Character.

I had a great deal of trouble writing the part of the Reference Manual
where type Character lives.  I think Randy had some trouble with the
updated RM, too.  At least we didn't try to show type Wide_Character in
its full glory.  ;-)

7-bit ascii will live forever, I suppose.

*************************************************************

From: Bob Duff
Sent: Wednesday, January 16, 2002  2:15 PM

> Ben Brosgol recently pointed out to us (ACT) the introduction of a
> variant of the Latin 1 character set that is designated Latin 9.

The nice thing about standards is that there are so many to choose
from.  ;-)

> My recommendation is that an implementation generate separate packages,
> called e.g. Ada.Characters.Handling_Latin_9 (with Ada.Characters.Handling
> being a renaming of Ada.Characters.Handling_Latin_1 perhaps?)

That makes sense.

But I think the RM statement you complain about is envisioning a
nonstandard version of Standard.[Wide_]Character, which is a separate
issue.  I don't see that as a big deal -- if you don't think it's a good
idea, don't implement any such thing.  I tend to agree that compiler
switches and the like shouldn't normally be meddling with the semantics
of packages Standard and Characters.Handling without a very good reason.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> But I think the RM statement you complain about is envisioning a
> nonstandard version of Standard.[Wide_]Character, which is a separate
> issue.

If you use Latin 9 for Standard.Character, this is certainly a
non-standard version, and Ada.Characters.Handling has to be modified
to remain useful.

*************************************************************

From: Florian Weimer
Sent: Friday, January 18, 2002  6:58 AM

> Better put that on the agenda for the next ARG. Ada 2005
> should use Latin 9 rather than Latin 1.  A minor change.
> Might be a few incompatibilities.

I disagree.  With Latin 9, the mapping from Character to
Wide_Character is less straightforward, and this could have unexpected
results.

OTOH, it seems that Wide_Character is not widely used (unless you are
forced to do so by ASIS), so this might not matter much.

In addition, we really should add Wide_Wide_Character (which covers
the sixteen additional planes), or make Wide_Character itself wider.
Otherwise, using Unicode with standard Ada will be rather painful.

*************************************************************

From: Florian Weimer
Sent: Saturday, April 20, 2002  3:18 AM

ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
and 10646-2:2001 allocates characters outside the Basic Multilingual
Plane.

Not too long ago, quite a few people assumed that characters beyond
the BMP would be interesting only for rather esoteric scholarly use
(Linear B is a perfect example).  However, we now have got at least
two different sets of code positions outside the BMP which will see more
widespread use eventually: the mathematical alphabets and Plane 14
Language Tags (which are required to make some Japanese people happy
who fear that Japanese characters are rendered using Chinese glyphs).

Therefore, I think Ada 200X should somehow support characters outside
the BMP.

A few random thoughts (sorry, I'm probably not using strict ISO 10646
terminology):

  * Several major vendors have adopted ISO 10646-1:1993 early, using a
    16 bit representation for characters (i.e. wchar_t in C is 16
    bits).

These vendors include Sun (Java) and Microsoft (Windows), and probably
most proprietary UNIX vendors.  These vendor implementations now cover
the code positions beyond the BMP using UTF-16, which uses surrogate
pairs (a single character is represented using two 16 bit values from
reserved ranges in the BMP).
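
A minimal sketch of the surrogate-pair arithmetic, assuming the usual UTF-16 reserved ranges 16#D800# .. 16#DBFF# (high) and 16#DC00# .. 16#DFFF# (low); the procedure name is hypothetical:

```ada
with Ada.Text_IO;

procedure Surrogate_Sketch is
   Code   : constant := 16#1001B#;         --  a code position beyond the BMP
   Offset : constant := Code - 16#10000#;  --  20-bit offset past the BMP

   --  The upper 10 bits of the offset select the high surrogate and the
   --  lower 10 bits select the low surrogate.
   High : constant Natural := 16#D800# + Offset / 16#400#;
   Low  : constant Natural := 16#DC00# + Offset mod 16#400#;
begin
   Ada.Text_IO.Put_Line ("High surrogate:" & Natural'Image (High));
   Ada.Text_IO.Put_Line ("Low surrogate: " & Natural'Image (Low));
end Surrogate_Sketch;
```

For this code position the program prints 55296 (16#D800#) and 56347 (16#DC1B#).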

UTF-16 has got a few drawbacks: the ordering (in terms of UCS code
positions) is no longer lexicographic (which leads us to such brain
damage as CESU-8), dealing with individual characters is complicated,
and you cannot implement the C wide character functions properly.

For Ada, numerous changes would be required if we want to expose the
UTF-16 representation to programmers, for example by declaring
Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
longer be arrays of characters indexed by position).

GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
characters in a single 32 bit value, that is, UTF-32), and while this
is certainly not the "industry standard" (it is encouraged by ISO
9899:1999, though), I really hope we can use this approach (UTF-32
internal representation) for Ada, as it simplifies things
considerably, especially if we want to add character properties
support (see below).

  * We could add Wide_Wide_Character and Wide_Wide_String types to
    package Standard (and extending the Ada.Strings hierarchy), which
    are encoded in UTF-32.

I don't know if this is necessary.  IIRC, Robert Dewar once said that
the only applications using Wide_Character are based on ASIS, where
using Wide_Character is not really voluntary.  Maybe it is possible
to bump Wide_Character'Size to 32 bits instead, without really
breaking backwards compatibility.

Of course, we would need a way to convert UTF-32 strings to UTF-16
strings and vice versa (the UTF-16 string type could become a
second-class citizen, though, without full support in the Ada.Strings
hierarchy).

  * External representation of UCS characters is rapidly moving
    towards UTF-8 (especially in Internet standards).

Ada should provide an interface for converting between the wide string
type(s) and UTF-8 octet sequences.  It should be possible to use
string literals where UTF-8 strings are expected.

  * Supporting higher levels of Unicode (e.g. accessing the character
    properties database, normalization forms) would be interesting,
    too.

Such documents will eventually follow in the ISO 10646 series, but I
don't know if the ISO standard will be ready for Ada 200X.  Currently,
only the Unicode Consortium has standardized or documented issues like
character properties or terminal behavior in detail.

I don't know how ISO reacts if ISO standards refer to competing
standardization efforts.  IEEE POSIX.1 (and probably, or already, ISO
POSIX) standardizes the BSD sockets interface, and not OSI, so maybe
this isn't an issue.

In any case, this point is mostly a library issue which can be
addressed by a community implementation effort, it does not require
changes in the Ada language (adding Wide_Wide_Character does, for
example).

*************************************************************

From: Pascal Leroy
Sent: Monday, April 22, 2002  8:32 AM

> ISO 10646-1:2000 extends the Universal Character Set beyond 16 bits,
> and 10646-2:2001 allocates characters outside the Basic Multilingual
> Plane.
>
> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

The normalization of new character sets (both as part of 10646 and of 8859)
was actually discussed at the last ARG meeting, and I was given an action
item to somehow integrate them in the language, probably as some kind of
amendment AI.

> A few random thoughts (sorry, I'm probably not using strict ISO 10646
> terminology):
>
>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>     16 bit representation for characters (i.e. wchar_t in C is 16
>     bits).

Which is fine as it maps directly to Ada's wide character.  I still think
that we want to retain the capacity of using 16-bit blobs to represent
characters in the BMP, as 99.5% of practical applications will only need the
BMP.

> For Ada, numerous changes would be required if we want to expose the
> UTF-16 representation to programmers, for example by declaring
> Wide_String to be encoded in UTF-16 instead of UCS-2 (strings would no
> longer be arrays of characters indexed by position).

Changes to Wide_Character and Wide_String are pretty much out of the
question.  On the other hand, the type that is intended for interfacing with
C is Interfaces.C.wchar_array, and it would be straightforward to provide
(in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
compiler does).

> I really hope we can use this approach (UTF-32
> internal representation) for Ada, as it simplifies things
> considerably, especially if we want to add character properties
> support (see below).

I would think that we would want to use UCS-4, since it's an ISO standard.
Moreover, UTF-32 has a number of consistency rules (eg code points below
16#10ffff#) which seem irrelevant for internal manipulation of strings.

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     package Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

Wide_Wide_ types seem like the natural way to add this capability to the
language, except that some compilers may not be quite prepared to deal with
enumeration types with 2 ** 32 literals (ours isn't).

> (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

As far as I can tell, there is no support for UTF-16, only for UCS-2.
Anyway, I don't think it is reasonable to force applications to go to the
full 32-bit overhead just because they use, say, the French OE ligature.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use
> string literals where UTF-8 strings are expected.

External representation is best handled by Text_IO and friends, typically by
using a form parameter to specify the encoding (and there are many more
encodings than just UCS and UTF).  The ARG won't get into the business of
specifying the details of the form parameter, so this is something that will
remain non-portable for the foreseeable future.  (Where do we stop?  Do we
want to require all validated compilers to support UTF-8?  What about the
Chinese Big5 or the JIS encodings?)

>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

We certainly don't want to get into that business.  The designers of Ada 95
wisely decided to lump all of the characters in the range 16#0100# ..
16#FFFD# into the category special_character, so that they don't have to
decide which is a letter, a number, etc.  Similarly they didn't provide
classification functions or upper/lower conversions for wide characters.
This seems reasonable if we don't want to have to amend Ada each time a
bunch of characters are added to 10646.

*************************************************************

From: Nick Roberts
Sent: Wednesday, April 24, 2002  7:31 PM

> Therefore, I think Ada 200X should somehow support characters outside
> the BMP.

I agree.

> GNU libc (and thus, GNU/Linux) is using a 32 bit wchar_t (encoding UCS
> characters in a single 32 bit value, that is, UTF-32), and while this is
> certainly not the "industry standard" (it is encouraged by ISO 9899:1999,
> though), I really hope we can use this approach (UTF-32 internal
> representation) for Ada, as it simplifies things considerably, especially
> if we want to add character properties support (see below).

I agree very strongly!

>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>     package Standard (and extending the Ada.Strings hierarchy), which
>     are encoded in UTF-32.

I must say I would prefer the identifiers Universal_Character and
Universal_String. I see the logic of Wide_Wide_ but it seems clumsy!

> I don't know if this is necessary.  IIRC, Robert Dewar once said that
> the only applications using Wide_Character are based on ASIS, where
> using Wide_Character is not really voluntary.  Maybe it is possible
> to bump Wide_Character'Size to 32 bits instead, without really
> breaking backwards compatibility.

I disagree with this idea.

> Of course, we would need a way to convert UTF-32 strings to UTF-16
> strings and vice versa (the UTF-16 string type could become a
> second-class citizen, though, without full support in the Ada.Strings
> hierarchy).

Possibly these support packages should be in an optional annex.

>   * External representation of UCS characters is rapidly moving
>     towards UTF-8 (especially in Internet standards).
>
> Ada should provide an interface for converting between the wide string
> type(s) and UTF-8 octet sequences.  It should be possible to use string
> literals where UTF-8 strings are expected.
>
>   * Supporting higher levels of Unicode (e.g. accessing the character
>     properties database, normalization forms) would be interesting,
>     too.

Again, perhaps all this should really be in (or moved into) an optional annex.

*************************************************************

From: Robert Dewar
Sent: Wednesday, April 24, 2002  9:50 PM

I suspect that the work on wide_wide_character will in practice turn
out to be nearly useless in the short or medium term. We certainly
put in a lot of work in GNAT in implementing wide character with many
different representation schemes, but this feature has been very little
used (ASIS being the main use :-). In practice I think the 16-bit character
type defined in Ada now will be adequate for almost all use, and I see no
reason in requiring implementations to go beyond this in the absence of
real market demand.

Yes, it's fun to talk about character set issues (after all I was chair of
the CRG, so I appreciate this), but there is no point in increasing
implementation burdens unless it's really valuable.

I would just give clear permission for an implementation to add additional
character types in standard (indeed that permission exists today in Ada 95),
and leave it at that.

*************************************************************

From: John Barnes
Sent: Thursday, April 25, 2002  1:46 AM

The BSI is looking at character set issues across languages
and your message reminded me of the CRG. Was there ever a
final report that I could refer to?

*************************************************************

From: Robert Dewar
Sent: Thursday, April 26, 2002  10:25 PM

I think there was a final report, perhaps Jim could track it down.

*************************************************************

From: Randy Brukardt
Sent: Thursday, April 25, 2002  3:44 PM

> We certainly put in a lot of work in GNAT in implementing wide
> character with many different representation schemes, but this
> feature has been very little used (ASIS being the main use :-).

To add another data point: Claw was designed so that a wide character version
could be easily created. But we've never implemented that version, mainly
because we've never had a paying customer ask for it. So I have to wonder how
important "Really_Wide_Character" would be.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002  5:41 AM

> I suspect that the work on wide_wide_character will in practice turn
> out to be nearly useless in the short or medium term.

Using Ada for internationalized applications on GNU systems (using GNU
facilities) almost requires 32 bit Wide_Wide_Character support, since
GNU uses a 32 bit wchar_t internally.

(See a similar discussion on the GCC development list.)

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002  7:32 AM

We have seen zero demand for such functionality, so would not invest any time
at all in either design or implementation work here. If such a feature is
added to Ada, I would definitely suggest it be optional.

*************************************************************

From: Florian Weimer
Sent: Saturday, May 18, 2002  6:00 AM

>>   * Several major vendors have adopted ISO 10646-1:1993 early, using a
>>     16 bit representation for characters (i.e. wchar_t in C is 16
>>     bits).
>
> Which is fine as it maps directly to Ada's wide character.  I still think
> that we want to retain the capacity of using 16-bit blobs to represent
> characters in the BMP, as 99.5% of practical applications will only need the
> BMP.

Quite a few people have already changed their minds about the 99.5%
figure (mathematical characters and Plane 14 Language Tags being the
reason).  Maybe it's true for the character count, but I doubt it for
the application count.

> Changes to Wide_Character and Wide_String are pretty much out of the
> question.

Okay, accepted.

> On the other hand, the type that is intended for interfacing with
> C is Interfaces.C.wchar_array, and it would be straightforward to provide
> (in some new child of Interfaces.C, I guess) subprograms to convert a 32-bit
> Wide_Wide_String to a wchar_array (and back) using UTF-16 (or whatever the C
> compiler does).

I doubt that C compilers can use UTF-16 for wchar_t.  You cannot apply
iswlower() to a single surrogate character. :-/

> I would think that we would want to use UCS-4, since it's an ISO standard.
> Moreover, UTF-32 has a number of consistency rules (eg code points below
> 16#10ffff#) which seem irrelevant for internal manipulation of strings.

Yes, UCS-4 is indeed the correct encoding form to use.

>>   * We could add Wide_Wide_Character and Wide_Wide_String types to
>>     package Standard (and extending the Ada.Strings hierarchy), which
>>     are encoded in UTF-32.
>
> Wide_Wide_ types seem like the natural way to add this capability to the
> language, except that some compilers may not be quite prepared to deal with
> enumeration types with 2 ** 32 literals (ours isn't).

Ah, this could be a problem indeed, together with the large
universal_integer returned by Wide_Wide_Character'Pos.

>> (the UTF-16 string type could become a
>> second-class citizen, though, without full support in the Ada.Strings
>> hierarchy).
>
> As far as I can tell, there is no support for UTF-16, only for UCS-2.

At the moment, yes, but I think we need some UTF-16 support, too,
because many operating system interfaces use it.

> Anyway, I don't think it is reasonable to force applications to go to the
> full 32-bit overhead just because they use, say, the French OE ligature.

Most people apparently refuse to use Wide_Character, too, for the same
reason.  They either go for ISO 8859-15 or Windows 1252, or don't use
the OE ligature at all.

> External representation is best handled by Text_IO and friends, typically by
> using a form parameter to specify the encoding (and there are many more
> encodings than just UCS and UTF).

There was a recent discussion to add other I/O facilities.  UTF-8 is
becoming more and more common in the Internet context, and often, you
can determine the encoding of a file only after reading the first
couple of lines (think of a MIME-encoded mail message).  Furthermore,
UTF-8 already plays an important role in interacting with other
libraries (not written in Ada).

> (Where do we stop?  Do we want to require all validated compilers to
> support UTF-8?

Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

> What about the Chinese Big5 or the JIS encodings?)

If there is support for UCS-4, handling these encodings could be
performed by a mechanism similar to POSIX iconv().

*************************************************************

From: Robert Dewar
Sent: Saturday, May 18, 2002  7:43 AM

> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

Why not = because there is no real demand. Especially this time around we need
to be very careful not to require things that no one is really interested in.
If we do this, the vendors will simply ignore any new standard. In fact I
think if there is a new standard, it will only be implemented as a result of
direct customer interest in features in this standard. The value of formal
conformance and validation has largely disappeared from the Ada marketplace
at this stage (in terms of customer demand). That's not to say that the Ada
marketplace is not very vital and dynamic, we get dozens of requests for
enhancements from our users every month, but there is precious little
intersection between the things users seem to need and want and these
kind of discussions.

In GNAT, we put a lot of effort into implementing multiple character sets
(we just added the new Latin set with the Euro symbol, because customers
needed that for example). Some of it has been useful (like this Euro
addition), but mostly these features are of entertainment and advertising
value only. In fact the only serious user that we have for Wide_Character
and Wide_String is us (from ASIS :-)

One thing to remember here is that very little is needed in the way of
language support for fancy character sets (most of the effort in GNAT
for example for 8-bit sets is in csets, which gives proper case mapping
for identifiers, and it is easy enough to add new tables to this -- someone
contributed a new Cyrillic table just a few months ago). Most of the issues
are representational issues, and the Ada standard has nothing to say about
source representation (and this should not change in any new standard).

*************************************************************

From: Pascal Leroy
Sent: Tuesday, May 21, 2002  4:03 AM

> > Which is fine as it maps directly to Ada's wide character.  I still think
> > that we want to retain the capacity of using 16-bit blobs to represent
> > characters in the BMP, as 99.5% of practical applications will only need the
> > BMP.
>
> Quite a few people have already changed their minds about the 99.5%
> figure (mathematical characters and Plane 14 Language being the
> reason).  Maybe it's true for the character count, but I doubt it for
> the application count.

Remember, we are talking Ada applications here.  There are probably many
applications out there that deal with mathematical symbols or with Tengwar, but
I doubt that they are written in Ada.

> > External representation is best handled by Text_IO and friends, typically by
> > using a form parameter to specify the encoding (and there are many more
> > encodings than just UCS and UTF).
>
> There was a recent discussion to add other I/O facilities.  UTF-8 is
> becoming more and more common in the Internet context, and often, you
> can determine the encoding of a file only after reading the first
> couple of lines (think of a MIME-encoded mail message).  Furthermore,
> UTF-8 already plays an important role in interacting with other
> libraries (not written in Ada).

Maybe we need a predefined unit to convert UCS-2 to/from UTF-8.  But then such
conversion functions could easily be written by the user, too, or provided by
some public domain stuff.

> > (Where do we stop?  Do we want to require all validated compilers to
> > support UTF-8?
>
> Yes, why not?  Why shall all compilers support ISO 8859-1?  Why UCS-2?

You don't sell many compilers if you don't support 8859-1.  As for UCS-2, well,
that's pretty much the default representation of wide characters anyway.  Other
than that, it would seem that we should let the market decide.  Speaking for
Rational, we have had wide character support for about 7 years, and I don't
recall seeing a single bug report or request for enhancement on this topic.
This may indicate that our technology is perfect, but there are other
explanations ;-).  (As a matter of fact we probably have very few licenses
installed in countries where 8859-1 is not sufficient to write the native
language -- ignoring the problem with the OE ligature in French.)

One option would be to add Wide_Wide_Character in a new annex, and let users
decide if they want their vendors to support this annex. Of course, chances are
that nobody would care, in which case that would be a lot of standardization
effort for nothing.

*************************************************************

From: Robert Dewar
Sent: Tuesday, May 21, 2002  4:39 AM

I agree with everything Pascal had to say about wide character. We do have
one Japanese customer using wide characters, and as I mentioned earlier,
ASIS uses wide strings to represent source texts, but other than that,
we have heard very little about wide strings. The only real input we have
got from customers on character set issues was the request to support
Latin-9 with the new Euro symbol and we got contributed tables for
Cyrillic from a Russian enthusiast (not a customer, but it seemed a
harmless addition :-)

*************************************************************

From: Florian Weimer
Sent: Tuesday, May 21, 2002  1:42 PM

> I agree with everything Pascal had to say about wide character. We do have
> one Japanese customer using wide characters, and as I mentioned earlier,
> ASIS uses wide strings to represent source texts, but other than that,
> we have heard very little about wide strings.

I guess this customer doesn't use Wide_Character in the way it was
intended (for storing ISO 10646 code positions), so this example is a
bit dubious.

> The only real input we have got from customers on character set
> issues was the request to support Latin-9 with the new Euro symbol

Even in this rather innocent case, Wide_Character is no longer using
UCS-2 with GNAT.

*************************************************************

From: Michael F. Yoder
Sent: Monday, October 21, 2002  10:58 AM

This is one of the items on my homework list.

UTF = UCS Transformation Format. UCS = Universal Multiple-Octet Coded
Character Set. I guess the MOC is silent.  :-)

UTF-8 encodes 31-bit values as sequences of one to six 8-bit values
(octets), as follows.

0xxxxxxx                     encodes itself (the coding is ASCII-compatible)
110xxxxx 10Y                 encodes xxxxxY where Y stands for yyyyyy
1110xxxx 10Y 10Z             encodes xxxxYZ
11110xxx 10Y 10Z 10U         encodes xxxYZU
111110xx 10Y 10Z 10U 10V     encodes xxYZUV
1111110x 10Y 10Z 10U 10V 10W encodes xYZUVW

The octets 11111110 and 11111111 aren't used in the encoding. So,
excepting these 2, octets starting with 11 are headers, those starting
with 10 are trailers, and those starting with 0 are singletons.

It's forbidden to use the redundant encodings (you must use the shortest
encoding allowed). There are security reasons for this, aside from the
fact that doing so breaks the string search property mentioned below.

The encoding is self-synchronizing: if you start in the middle of a
string of octets, you skip octets of the form 10xxxxxx to get to the
next start of character.

If the encoding is proper, string searches for an encoded pattern within
an encoded string will work as desired to yield all occurrences of the
pattern. (For case-folded searches and the like this only works if the
string is mapped before being converted to UTF-8.)
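
As an illustration of the table above, here is a minimal Ada sketch of an
encoder that always picks the shortest form. The names Octet_Array, Encode
and Put_Octets are made up for this example and do not come from any
existing or proposed predefined unit; the use of package Interfaces for the
bit operations is likewise just one convenient choice.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF8_Encode_Demo is

   --  Hypothetical type for a sequence of octets (not a predefined type).
   type Octet_Array is array (Positive range <>) of Unsigned_8;

   Hex_Digits : constant String := "0123456789ABCDEF";

   --  Hypothetical helper: encode one 31-bit value using the shortest
   --  form from the table above.
   function Encode (Code : Unsigned_32) return Octet_Array is
      Length : Positive;
   begin
      if Code > 16#7FFF_FFFF# then
         raise Constraint_Error;   --  needs more than 31 bits
      elsif Code < 16#80# then
         return (1 => Unsigned_8 (Code));   --  singleton, ASCII-compatible
      elsif Code < 16#800# then
         Length := 2;
      elsif Code < 16#1_0000# then
         Length := 3;
      elsif Code < 16#20_0000# then
         Length := 4;
      elsif Code < 16#400_0000# then
         Length := 5;
      else
         Length := 6;
      end if;

      declare
         Result : Octet_Array (1 .. Length);
         Rest   : Unsigned_32 := Code;
         --  Header prefix 110, 1110, 11110, 111110 or 1111110.
         Header : constant Unsigned_8 := Shift_Left (16#FF#, 8 - Length);
      begin
         --  Trailers carry six bits each, least significant group last.
         for I in reverse 2 .. Length loop
            Result (I) := 16#80# or Unsigned_8 (Rest and 16#3F#);
            Rest := Shift_Right (Rest, 6);
         end loop;
         Result (1) := Header or Unsigned_8 (Rest);
         return Result;
      end;
   end Encode;

   --  Print the octets in hexadecimal, separated by spaces.
   procedure Put_Octets (Item : Octet_Array) is
   begin
      for I in Item'Range loop
         Ada.Text_IO.Put (Hex_Digits (Natural (Item (I) / 16) + 1));
         Ada.Text_IO.Put (Hex_Digits (Natural (Item (I) mod 16) + 1));
         Ada.Text_IO.Put (' ');
      end loop;
      Ada.Text_IO.New_Line;
   end Put_Octets;

begin
   Put_Octets (Encode (16#41#));     --  LATIN CAPITAL LETTER A: 41
   Put_Octets (Encode (16#20AC#));   --  EURO SIGN: E2 82 AC
end UTF8_Encode_Demo;
```

The Euro sign is a three-octet case: 16#20AC# splits into xxxx = 0010,
Y = 000010, Z = 101100, giving the octets E2 82 AC.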

*************************************************************

From: Robert Dewar
Sent: Monday, October 21, 2002  11:03 AM

Is anyone using UTF-8 encoding with Ada? We have some customers using wide
character encodings but none to our knowledge uses UTF-8.

*************************************************************

From: Robert A. Duff
Sent: Monday, October 21, 2002  11:43 AM

> It's forbidden to use the redundant encodings (you must use the shortest
> encoding allowed). There are security reasons for this,...

I'm curious: why is that?  (Not quite curious enough to go RTFM.  ;-))

>... aside from the
> fact that doing so breaks the string search property mentioned below.

Yes, I understand that.

*************************************************************

From: Michael F. Yoder
Sent: Monday, October 21, 2002  1:15 PM

This problem is one my previous employer is having to deal with.
Basically, it's that redundant encodings can be used to sneak things
past filters if the redundant encodings aren't rejected; if redundant
encodings are allowed, writing (say) a regular expression that will
match exactly all possible encoded forms is a pain, is error-prone, and
is probably significantly slower to check.

Here's a contrived case. A program reads a command, and if it's the
special command 'shazam' it checks the user's authorization; otherwise
it passes on the command unmodified, because all other commands are
safe. If there's a redundant encoding of 'shazam' that the filter
misses, an unauthorized user can bypass the checking if he can arrange
to supply that encoding.
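
To make the point concrete, here is a rough Ada sketch of a decoder that
rejects redundant encodings by checking the decoded value against the
smallest value that really needs a sequence of that length. All names
(Octet_Array, Decode, Encoding_Error, Minimum) are invented for this
example, and it decodes exactly one character per call; a real filter
would of course work on whole strings.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF8_Strict_Decode_Demo is

   --  Hypothetical type for a sequence of octets (not a predefined type).
   type Octet_Array is array (Positive range <>) of Unsigned_8;

   Encoding_Error : exception;

   --  Smallest value that genuinely needs a sequence of each length.
   Minimum : constant array (1 .. 6) of Unsigned_32 :=
     (0, 16#80#, 16#800#, 16#1_0000#, 16#20_0000#, 16#400_0000#);

   --  Decode exactly one character; raise Encoding_Error on malformed
   --  or redundant (overlong) input.
   function Decode (Item : Octet_Array) return Unsigned_32 is
      First  : Unsigned_8;
      Length : Positive;
      Value  : Unsigned_32;
   begin
      if Item'Length = 0 then
         raise Encoding_Error;
      end if;
      First := Item (Item'First);

      if First < 16#80# then
         Length := 1;  Value := Unsigned_32 (First);
      elsif First < 16#C0# then
         raise Encoding_Error;   --  a trailer cannot start a character
      elsif First < 16#E0# then
         Length := 2;  Value := Unsigned_32 (First and 16#1F#);
      elsif First < 16#F0# then
         Length := 3;  Value := Unsigned_32 (First and 16#0F#);
      elsif First < 16#F8# then
         Length := 4;  Value := Unsigned_32 (First and 16#07#);
      elsif First < 16#FC# then
         Length := 5;  Value := Unsigned_32 (First and 16#03#);
      elsif First < 16#FE# then
         Length := 6;  Value := Unsigned_32 (First and 16#01#);
      else
         raise Encoding_Error;   --  11111110 and 11111111 are unused
      end if;

      if Item'Length /= Length then
         raise Encoding_Error;
      end if;

      for I in Item'First + 1 .. Item'Last loop
         if (Item (I) and 16#C0#) /= 16#80# then
            raise Encoding_Error;   --  not of the form 10xxxxxx
         end if;
         Value := Shift_Left (Value, 6) or Unsigned_32 (Item (I) and 16#3F#);
      end loop;

      if Value < Minimum (Length) then
         raise Encoding_Error;   --  a shorter encoding exists
      end if;
      return Value;
   end Decode;

   Value : Unsigned_32;

begin
   Value := Decode ((1 => 16#73#));        --  's' in its only legal form
   Ada.Text_IO.Put_Line ("plain 's' decodes to" & Unsigned_32'Image (Value));

   Value := Decode ((16#C1#, 16#B3#));     --  's' padded out to two octets
   Ada.Text_IO.Put_Line ("overlong form accepted (should not happen)");
exception
   when Encoding_Error =>
      Ada.Text_IO.Put_Line ("overlong form rejected");
end UTF8_Strict_Decode_Demo;
```

The octets C1 B3 carry the bits 00001 110011, i.e. 16#73#, the same 's'
that begins 'shazam'; a decoder without the final check would accept them,
and a byte-wise filter looking for the one-octet form would never see them.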

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002  5:46 PM

This is the easy part of my homework. The identifier character ranges
are defined in terms of multiple character categories (see below), so I
can't get the harder part without a little coding.

This is using Unicode version 3.2.

A "space" is itself a normative category.  It is anything in the range
U+2000 to U+200B, plus 5 other scattered characters.

A "separator" is any space plus the two characters "Line Separator"
U+2028 and "Paragraph Separator" U+2029. These are each in a normative
category containing just 1 value.

A "decimal digit" is itself a normative category. There are 25 ranges of
these, 23 including the digits 0 through 9 and 2 with only the digits 1
through 9. (These two scripts use the ASCII zero rather than encoding a
separate one.) Five of these ranges are above U+FFFF, that is, out of
the BMP (their character descriptions all start with "mathematical").
The digits 1 through 9 in these scripts don't in general look much like
our 1 through 9.

The rules for identifiers say (I'm condensing and interpreting) that the
syntax for identifiers should start with their basic definition and
fiddle it as appropriate to include extra characters (for Ada, that
means underscore). Their basic definition is

   identifier ::= id-start { id-start | id-extend }

id-start is any letter (which come in 5 subcategories) or a "letter
number." There are a lot of letters outside the BMP, including the large
range "CJK Ideograph Extension B."

id-extend is decimal digits plus nonspacing marks, spacing combining
marks, connector punctuation, and formatting codes.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  7:19 PM

I am completely confused. Why exactly are we discussing this? Can you
be clear as to the goals of this discussion?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  2:50 PM

I know I don't count, :-)
but I've had several requests to extend my spam filter to support UTF-8
encodings. Because I'm not asking for any money for the filter, and I
haven't had any significant amount of UTF-8 mail, I haven't done anything
about it yet. But it seems likely that I will need to do this at some point
(I've seen occasional UTF-8 encoded mail, but not enough good mail that
handling it manually is a problem.)

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  4:29 PM

Oh sure, UTF-8 encoded spam is common indeed, but that was not what I was
talking about (unless you have some spam messages written in Ada source code :-)

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  4:59 PM

I think you misunderstand. I have written an anti-spam plugin for the IMS
mailserver that I use. It is written in Ada, of course, and I've had
requests for it to be able to handle UTF-8 encoded mail. For me, it's fine
to treat such mail as all spam, but that is not true for some of the other
users of it. (I've made it available to the community of IMS users, as they
have made many useful plugins available that I have been using for years.)

In order to properly support UTF-8 mail, I'd need at least to convert the
search patterns (in Latin-1, of course) into UTF-8. I'd also need to verify
that the rules that Mike noted are followed (a common trick of spammers is
to violate basic encoding rules, as most decoders don't check. But the
illegal encodings tend to get ignored by filters, because they don't match
exactly. That was one of the prime reasons I wrote the plugin in the first
place, because a lot of spam is now coming encoded in one way or another,
and thus is not picked up by a plain text scan).

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  7:17 PM

Oh! I was confused then, I thought this was something to do with Ada.

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  7:46 PM

Of course it has to do with Ada. You asked "Is anyone using UTF-8 encoding
with Ada?" And I answered that I have an Ada program that needs to process
UTF-8 text (but doesn't yet). And I tried to explain what the program is and
why it needs to process UTF-8 text and what support from Ada would be
valuable.

Perhaps I should have just answered your original question "Yes"? :-)

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  8:09 PM

Sorry, when I meant "using UTF-8 encoding with Ada", I was talking about
language features for wide character representation.

The fact that your program is in Ada does not seem to be particularly
informative. I am completely confused here, what ARG-related language
problem is this thread addressing?

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  8:32 PM

As I recall, one of the facets of UTF-8 support in Ada would be the
equivalent of Ada.Characters.Handling for UTF-8 represented Strings. Those
operations would be valuable for this application, particularly
To_Wide_String (UTF_8_String) or To_UTF_8_String (String). A UTF-8 Text_IO
would also be valuable, although I'd find that overkill for this application
(usually the text has to be decoded to UTF-8 from some 7-bit representation
anyway).

I'm not sure where else UTF-8 would appear in the standard. Source
representation and external file representations are outside of the scope of
the standard. The regular string operations seem to work for most (all?)
operations. Everything else seems to already be covered by the existing wide
character support.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  8:45 PM

Well, harmless I suppose, but I doubt worth the effort. Again, I would
generate packages on the basis of packages that exist, have proved useful
and are actually widely used. It seems a mistake to get into the "here's
a neat idea for a package that would help with something I happen to be
doing".

*************************************************************

From: Michael F. Yoder
Sent: Thursday, October 24, 2002  5:46 PM

>  I am completely confused here, what ARG-related language
>problem is this thread addressing?

Kiyoshi Ishihata stated at the last meeting that there was an interest
in some countries in being able to write programs as much as possible in
native languages, the primary deficit in this regard being that
identifiers are entirely in Latin-1 characters. He didn't specify which
countries to my recollection, but Japan, Russia, China, and India are
obvious cases where the commonly used scripts are disjoint from Latin-1.

The information being supplied is exploratory in nature: the idea is to
find out how hard it would be to extend existing compilers so as to
satisfy all the national groups at once, and whether and to what extent
the ARG should be involved in specifying standards for such extensions.

There was a separate issue involving the fact that ISO 10646-n (I forget
what n is) now has mapped characters outside the BMP. This had to
happen, given that the code now maps some 70,000 Han characters.

*************************************************************

From: Robert Dewar
Sent: Thursday, October 24, 2002  8:54 PM

Well I would just allow arbitrary wide characters in identifiers, why not,
it does not cause any problems. GNAT has implemented an option for this
forever. I would specify that there is no upper/lower case equivalence
in this case, since otherwise you get into a huge mess that is simply not
worth the effort.

*************************************************************

From: Tucker Taft
Sent: Thursday, October 24, 2002  10:10 PM

I suggest you read the ARG minutes when they are available.  Kiyoshi
indicated specifically that they wanted to restrict usage to
characters that "make sense" as identifier characters.  I will admit
I was in your camp that the simplest is to just allow anything.
However, I will leave it to Kiyoshi to explain his reasoning.
He certainly knows more than I do about the requirements.  You
should perhaps discuss it directly with Kiyoshi if you don't agree.

Mike indicated that UTF-8 encoding makes it easy to support even
very wide characters in identifiers, because it provides a canonical
representation, as a stream of bytes.  We asked him to share his
knowledge in this area, so we didn't all have to become experts in
ISO-10646 to evaluate the implementation issues in this area.

*************************************************************

From: Randy Brukardt
Sent: Thursday, October 24, 2002  10:29 PM

Here are my notes on the Wide_Character in identifiers issue, which will be
turned into the minutes.

"What about full source representation of the language in Wide_Character?
Kiyoshi reports that there is a push in SC22 to allow full wide characters
in identifiers.

How do you define which characters are letters? How do you define case
equivalence? Mike says just use "letter" in the character standard. But this
is likely to be very complex in the compiler and in the run-time. Tucker
suggests that anything outside row 00 be treated as a letter. Kiyoshi says that
would not be acceptable to Japan, which is preparing a standard for which
characters are allowed in identifiers."

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  4:11 AM

> I suggest you read the ARG minutes when they are available.  Kiyoshi
> indicated specifically that they wanted to restrict usage to
> characters that "make sense" as identifier characters.  I will admit
> I was in your camp that the simplest is to just allow anything.
> However, I will leave it to Kiyoshi to explain his reasoning.
> He certainly knows more than I do about the requirements.  You
> should perhaps discuss it directly with Kiyoshi if you don't agree.

I would leave such restrictions up to either local coding standards,
enforced e.g. by ASIS tools, or enforced by compiler restrictions.
Getting into what makes sense in different languages is way way out
of scope (I speak as the former chair of the CRG; character issues
are very difficult to deal with). In the context of the CRG work, we
spent ages discussing the issue of whether E and E-acute should be
equivalent in identifiers, and came to the conclusion that the answer
might be different in different languages.

There is no point in adding a huge national dependent mess here. Indeed I
would consider in the ISO standard saying specifically that national bodies
are welcome to devise local sub-standards for identifiers and character
set requirements and leave it at that.

I perfectly well understand where Kiyoshi is coming from. I am sure he feels
as strongly that only certain characters be used as Jean Ichbiah felt about
the E/E-acute issue. But it just is not practical for the international
standard to get into the business of deciding what are and what are not
useful identifier names in all the languages of the world, or even just
for the P members :-)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  4:16 AM

OK, so great, very appropriate, there can be a Japanese National standard
that specifies that for Ada compilers to meet this standard, there must be
a mode in which identifiers are only allowed to contain bla bla characters.
Other countries in the world are free to devise similar national standards
but I fail to see why they should be a matter for an international standard.

What would be marginally useful in the international standard would be to
devise a general framework for those national standards, and make it clear
that it is an acceptable thing for Ada compilers to implement one or more
of these standards. Frankly I think that the standard already does that,
but it would be fine to make it explicit. GNAT for example allows lots
of localization of identifier character sets, e.g. Latin-2, Cyrillic etc.

*************************************************************

From: Pascal Leroy
Sent: Friday, October 25, 2002  6:54 AM

> But it just is not practical for the international
> standard to get into the business of deciding what are and what are not
> useful identifier names in all the languages of the world...

It has certainly never been the intent to have the ARG discuss the
identifier characters for all the languages in the world.  However, there is
an ISO working group in charge of developing and maintaining the ISO 10646
standard, and the intent was to piggyback on the work done there.

10646 defines precisely what is a character (and so yes, E and E-acute are
distinct, as are uppercase A and uppercase alpha, even though they really
look the same), what is a letter, a digit, how the uppercase/lowercase
conversions work, etc.  I see no reason why the Ada standard couldn't use
these definitions.  (And Mike gave us a feeling of what this would look
like, and it doesn't seem unreasonably complicated to me.)

Note that Java does exactly that, and defines letters and digits in a way
which is derived from Unicode (itself a close approximation to 10646).  I
don't see why Ada would lag behind in this area: it would not be a big
implementation effort, and it would improve usability of the language.

I don't buy the notion that national bodies have a role to play here (except
of course that they probably want to influence 10646).  It's already hard to
define one language standard and ensure that it's implemented with a minimum
of consistency, I don't see how users or implementers could live with the
coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".

Pascal

PS: Note that the E vs. E-acute discussion is moot, since this is already
settled by Latin-1 and yes, they are different.

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  7:55 PM

> I don't buy the notion that national bodies have a role to play here (except
> of course that they probably want to influence 10646).  It's already hard to
> define one language standard and ensure that it's implemented with a minimum
> of consistency, I don't see how users or implementers could live with the
> coexistence of "Japanese Ada" and "Hebrew Ada" and "Russian Ada".

Well GNAT implements lots of different localized character sets, and no one
seems to have dropped dead :-)

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002  9:13 AM

> Kiyoshi Ishihata stated at the last meeting that there was an interest
> in some countries in being able to write programs as much as possible in
> native languages, the primary deficit in this regard being that
> identifiers are entirely in Latin-1 characters.

Yes, but it was also mentioned at the meeting that SC22 is trying to get
programming languages to do something-or-other related to this.
I.e. allow 31-bit characters in identifiers, and have some uniformity
across programming languages about which characters are allowed in
identifiers.  I suppose WG9 is supposed to "obey" SC22 on this point?

By the way, let's mention the AI number being discussed in these
messages, so we don't get the "What the heck are you talking about?"
kinds of messages from Robert or others who might have missed part of
the discussion.  ;-)  I believe Pascal raised the issue many months ago,
and it has an AI number, and one can presumably search for that AI
number in the meeting minutes (once Randy publishes them).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  8:32 PM

I tried, I could not find the AI number on this one

Of course if there are uniform rules at the SC22 level, then it is fine
to adopt them in Ada. I just think it is not something we should expend
our own very limited resources on.

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002  8:59 PM

This was discussed as part of AI-285, which started life as an AI about
Latin-9. That discussion took up the entire afternoon of the third day of
the meeting.

These other issues came up since it was felt that better Wide_Character
support would (might?) make it unnecessary for the standard to directly deal
with Latin-9. (Implementations still would have to, in all likelihood.)

There are a lot of notes in this area, and I haven't gotten that far in the
minutes yet. So my summary might be suspect... (And I haven't posted the
mail yet, either, but it's likely that it will all go on AI-285.)

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  9:12 PM

> This was discussed as part of AI-285, which started life as an AI about
> Latin-9. That discussion took up the entire afternoon of the third day of
> the meeting.

Be careful not to be eaten alive by character discussions. It was quite
intentional that we banned discussion of these issues from the main group
in the Ada 9X effort and shoveled them off to the CRG. Spending one of
six sessions on this issue alone to me says that things are already getting
out of control :-) I quite understand how this happens (remember I was chair
of the CRG!)

> These other issues came up since it was felt that better Wide_Character
> support would (might?) make it unnecessary for the standard to directly deal
> with Latin-9. (Implementations still would have to, in all likelihood.)

Well of course in practice Latin-9 is barely interesting, it just introduces
a different name for the Euro character. But for sure most computing with
Ada will be done using latin-9 whatever the Ada standard says :-)

*************************************************************

From: Randy Brukardt
Sent: Friday, October 25, 2002  10:14 PM

Well, it sounds worse than it is. The afternoon session of the last day is
typically short. We didn't get back from lunch until about 2:15, and we
adjourned at 3:28. Still, I probably would have dozed off during this
discussion if I hadn't been taking notes...

*************************************************************

From: Robert A. Duff
Sent: Friday, October 25, 2002  9:19 AM

I agree that the ARG should not spend time thinking about characters.
And we should not add all kinds of verbiage about character sets to the
RM.  But if there is a character-set standard that can be simply
referred to, why not.  Apparently, there *is* a definition of which
31-bit characters are "letters".  I thought the intent was to simply
refer to that definition (which of course changes from year to year).

*************************************************************

From: Robert Dewar
Sent: Friday, October 25, 2002  8:45 PM

Probably that's reasonable, although I worry that this will generate a lot
of busy work in implementations for extraordinarily little gain.

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002  9:58 AM

Yes.  The purpose of Mike Yoder's "homework assignment" was to determine
how difficult it is to write the "Is_Letter" function that the Ada lexer
would need.  And a case conversion routine, I guess.  And how
inefficient these would have to be.  (People at the meeting were
concerned about huge character-set tables having to be in the compiler.)

I'm not at all interested in these character set issues.  If folks can
make an AI that is trivial to implement (efficiently), and invokes all
character-set junk by reference to other standards, then I suppose it's
OK with me.

[ Insert my usual rant about what's important, here.  ;-) ]

*************************************************************

From: Robert A. Duff
Sent: Saturday, October 26, 2002  10:14 AM

I agree with Bob in all respects, including the parenthetical comment.

*************************************************************

From: Pascal Leroy
Sent: Wednesday, November 27, 2002  4:27 AM

During the last meeting we discussed the possibility of allowing any Unicode
character (er, I mean, ISO 10646) in Ada source.  Some people were concerned
that the classification tables and the uppercase translation tables would be
huge and complex to produce.

Mike Y provided some input on this topic a while back, but since I (and
probably other people) prefer to see the real tables, I spent a couple of hours
writing a little Ada program to parse the Unicode database and spit out
aggregates for these tables.  I am attaching to this message three
classification tables (letters, digits, and spaces) as well as the table that
converts to uppercase.

The latter is the largest one, and it only has 419 entries, for a total of 5028
bytes.  And that's with a representation that is not particularly compact: a
more space-efficient representation could be obtained for instance by storing
the ranges as (First, Length) instead of (First, Last).

The tables would change slightly depending on the rules that we choose (e.g.
for the syntax of identifiers) but their size would not be substantially
modified.

This demonstrates two things:

1 - The tables are easy to produce from the Unicode database.
2 - The tables are small.

---

Digits : constant Ranges :=
   (
    (16#30#, 16#39#), -- DIGIT ZERO .. DIGIT NINE
    (16#B2#, 16#B3#), -- SUPERSCRIPT TWO .. SUPERSCRIPT THREE
    (16#B9#, 16#B9#), -- SUPERSCRIPT ONE .. SUPERSCRIPT ONE
    (16#660#, 16#669#), -- ARABIC-INDIC DIGIT ZERO .. ARABIC-INDIC DIGIT NINE
    (16#6F0#, 16#6F9#), -- EXTENDED ARABIC-INDIC DIGIT ZERO .. EXTENDED ARABIC-INDIC DIGIT NINE
    (16#966#, 16#96F#), -- DEVANAGARI DIGIT ZERO .. DEVANAGARI DIGIT NINE
    (16#9E6#, 16#9EF#), -- BENGALI DIGIT ZERO .. BENGALI DIGIT NINE
    (16#A66#, 16#A6F#), -- GURMUKHI DIGIT ZERO .. GURMUKHI DIGIT NINE
    (16#AE6#, 16#AEF#), -- GUJARATI DIGIT ZERO .. GUJARATI DIGIT NINE
    (16#B66#, 16#B6F#), -- ORIYA DIGIT ZERO .. ORIYA DIGIT NINE
    (16#BE7#, 16#BEF#), -- TAMIL DIGIT ONE .. TAMIL DIGIT NINE
    (16#C66#, 16#C6F#), -- TELUGU DIGIT ZERO .. TELUGU DIGIT NINE
    (16#CE6#, 16#CEF#), -- KANNADA DIGIT ZERO .. KANNADA DIGIT NINE
    (16#D66#, 16#D6F#), -- MALAYALAM DIGIT ZERO .. MALAYALAM DIGIT NINE
    (16#E50#, 16#E59#), -- THAI DIGIT ZERO .. THAI DIGIT NINE
    (16#ED0#, 16#ED9#), -- LAO DIGIT ZERO .. LAO DIGIT NINE
    (16#F20#, 16#F29#), -- TIBETAN DIGIT ZERO .. TIBETAN DIGIT NINE
    (16#1040#, 16#1049#), -- MYANMAR DIGIT ZERO .. MYANMAR DIGIT NINE
    (16#1369#, 16#1371#), -- ETHIOPIC DIGIT ONE .. ETHIOPIC DIGIT NINE
    (16#17E0#, 16#17E9#), -- KHMER DIGIT ZERO .. KHMER DIGIT NINE
    (16#1810#, 16#1819#), -- MONGOLIAN DIGIT ZERO .. MONGOLIAN DIGIT NINE
    (16#2070#, 16#2070#), -- SUPERSCRIPT ZERO .. SUPERSCRIPT ZERO
    (16#2074#, 16#2079#), -- SUPERSCRIPT FOUR .. SUPERSCRIPT NINE
    (16#2080#, 16#2089#), -- SUBSCRIPT ZERO .. SUBSCRIPT NINE
    (16#FF10#, 16#FF19#), -- FULLWIDTH DIGIT ZERO .. FULLWIDTH DIGIT NINE
    (16#1D7CE#, 16#1D7FF#) -- MATHEMATICAL BOLD DIGIT ZERO .. MATHEMATICAL MONOSPACE DIGIT NINE
   );
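
Because the entries are sorted and do not overlap, a classification query can
be answered with a binary search. The following is only a sketch under assumed
declarations: Code_Point, Code_Range, the shape of Ranges, and the function
Is_In are hypothetical, and only the first five entries of the digit table are
reproduced.

```ada
with Ada.Text_IO;

procedure Range_Lookup_Sketch is

   --  Hypothetical declarations; the message does not show the real ones.
   type Code_Point is range 0 .. 16#10FFFF#;

   type Code_Range is record
      First : Code_Point;
      Last  : Code_Point;
   end record;

   type Ranges is array (Positive range <>) of Code_Range;

   --  First five entries of the digit table.  Named Digit_Ranges here
   --  because "digits" is a reserved word in Ada.
   Digit_Ranges : constant Ranges :=
     ((16#30#, 16#39#),
      (16#B2#, 16#B3#),
      (16#B9#, 16#B9#),
      (16#660#, 16#669#),
      (16#6F0#, 16#6F9#));

   --  Binary search over a table sorted by First, with disjoint ranges.
   function Is_In (Table : Ranges; C : Code_Point) return Boolean is
      Low  : Integer := Table'First;
      High : Integer := Table'Last;
      Mid  : Integer;
   begin
      while Low <= High loop
         Mid := Low + (High - Low) / 2;
         if C < Table (Mid).First then
            High := Mid - 1;
         elsif C > Table (Mid).Last then
            Low := Mid + 1;
         else
            return True;
         end if;
      end loop;
      return False;
   end Is_In;

begin
   --  Prints TRUE: DIGIT FIVE.
   Ada.Text_IO.Put_Line (Boolean'Image (Is_In (Digit_Ranges, 16#35#)));
   --  Prints TRUE: ARABIC-INDIC DIGIT FIVE.
   Ada.Text_IO.Put_Line (Boolean'Image (Is_In (Digit_Ranges, 16#665#)));
   --  Prints FALSE: LATIN CAPITAL LETTER A is not in the digit table.
   Ada.Text_IO.Put_Line (Boolean'Image (Is_In (Digit_Ranges, 16#41#)));
end Range_Lookup_Sketch;
```

With 419 entries in the largest table, such a search needs at most 9
comparisons steps per query.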

---

Letters : constant Ranges :=
   (
    (16#41#, 16#5A#), -- LATIN CAPITAL LETTER A .. LATIN CAPITAL LETTER Z
    (16#61#, 16#7A#), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
    (16#AA#, 16#AA#), -- FEMININE ORDINAL INDICATOR .. FEMININE ORDINAL INDICATOR
    (16#B5#, 16#B5#), -- MICRO SIGN .. MICRO SIGN
    (16#BA#, 16#BA#), -- MASCULINE ORDINAL INDICATOR .. MASCULINE ORDINAL INDICATOR
    (16#C0#, 16#D6#), -- LATIN CAPITAL LETTER A WITH GRAVE .. LATIN CAPITAL LETTER O WITH DIAERESIS
    (16#D8#, 16#F6#), -- LATIN CAPITAL LETTER O WITH STROKE .. LATIN SMALL LETTER O WITH DIAERESIS
    (16#F8#, 16#2B8#), -- LATIN SMALL LETTER O WITH STROKE .. MODIFIER LETTER SMALL Y
    (16#2BB#, 16#2C1#), -- MODIFIER LETTER TURNED COMMA .. MODIFIER LETTER REVERSED GLOTTAL STOP
    (16#2D0#, 16#2D1#), -- MODIFIER LETTER TRIANGULAR COLON .. MODIFIER LETTER HALF TRIANGULAR COLON
    (16#2E0#, 16#2E4#), -- MODIFIER LETTER SMALL GAMMA .. MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
    (16#2EE#, 16#2EE#), -- MODIFIER LETTER DOUBLE APOSTROPHE .. MODIFIER LETTER DOUBLE APOSTROPHE
    (16#37A#, 16#37A#), -- GREEK YPOGEGRAMMENI .. GREEK YPOGEGRAMMENI
    (16#386#, 16#386#), -- GREEK CAPITAL LETTER ALPHA WITH TONOS .. GREEK CAPITAL LETTER ALPHA WITH TONOS
    (16#388#, 16#3F5#), -- GREEK CAPITAL LETTER EPSILON WITH TONOS .. GREEK LUNATE EPSILON SYMBOL
    (16#400#, 16#481#), -- CYRILLIC CAPITAL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER KOPPA
    (16#48A#, 16#559#), -- CYRILLIC CAPITAL LETTER SHORT I WITH TAIL .. ARMENIAN MODIFIER LETTER LEFT HALF RING
    (16#561#, 16#587#), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LIGATURE ECH YIWN
    (16#5D0#, 16#5F2#), -- HEBREW LETTER ALEF .. HEBREW LIGATURE YIDDISH DOUBLE YOD
    (16#621#, 16#64A#), -- ARABIC LETTER HAMZA .. ARABIC LETTER YEH
    (16#66E#, 16#66F#), -- ARABIC LETTER DOTLESS BEH .. ARABIC LETTER DOTLESS QAF
    (16#671#, 16#6D3#), -- ARABIC LETTER ALEF WASLA .. ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
    (16#6D5#, 16#6D5#), -- ARABIC LETTER AE .. ARABIC LETTER AE
    (16#6E5#, 16#6E6#), -- ARABIC SMALL WAW .. ARABIC SMALL YEH
    (16#6FA#, 16#6FC#), -- ARABIC LETTER SHEEN WITH DOT BELOW .. ARABIC LETTER GHAIN WITH DOT BELOW
    (16#710#, 16#710#), -- SYRIAC LETTER ALAPH .. SYRIAC LETTER ALAPH
    (16#712#, 16#72C#), -- SYRIAC LETTER BETH .. SYRIAC LETTER TAW
    (16#780#, 16#7A5#), -- THAANA LETTER HAA .. THAANA LETTER WAAVU
    (16#7B1#, 16#7B1#), -- THAANA LETTER NAA .. THAANA LETTER NAA
    (16#905#, 16#939#), -- DEVANAGARI LETTER A .. DEVANAGARI LETTER HA
    (16#93D#, 16#93D#), -- DEVANAGARI SIGN AVAGRAHA .. DEVANAGARI SIGN AVAGRAHA
    (16#950#, 16#950#), -- DEVANAGARI OM .. DEVANAGARI OM
    (16#958#, 16#961#), -- DEVANAGARI LETTER QA .. DEVANAGARI LETTER VOCALIC LL
    (16#985#, 16#9B9#), -- BENGALI LETTER A .. BENGALI LETTER HA
    (16#9DC#, 16#9E1#), -- BENGALI LETTER RRA .. BENGALI LETTER VOCALIC LL
    (16#9F0#, 16#9F1#), -- BENGALI LETTER RA WITH MIDDLE DIAGONAL .. BENGALI LETTER RA WITH LOWER DIAGONAL
    (16#A05#, 16#A39#), -- GURMUKHI LETTER A .. GURMUKHI LETTER HA
    (16#A59#, 16#A5E#), -- GURMUKHI LETTER KHHA .. GURMUKHI LETTER FA
    (16#A72#, 16#A74#), -- GURMUKHI IRI .. GURMUKHI EK ONKAR
    (16#A85#, 16#AB9#), -- GUJARATI LETTER A .. GUJARATI LETTER HA
    (16#ABD#, 16#ABD#), -- GUJARATI SIGN AVAGRAHA .. GUJARATI SIGN AVAGRAHA
    (16#AD0#, 16#AE0#), -- GUJARATI OM .. GUJARATI LETTER VOCALIC RR
    (16#B05#, 16#B39#), -- ORIYA LETTER A .. ORIYA LETTER HA
    (16#B3D#, 16#B3D#), -- ORIYA SIGN AVAGRAHA .. ORIYA SIGN AVAGRAHA
    (16#B5C#, 16#B61#), -- ORIYA LETTER RRA .. ORIYA LETTER VOCALIC LL
    (16#B83#, 16#BB9#), -- TAMIL SIGN VISARGA .. TAMIL LETTER HA
    (16#C05#, 16#C39#), -- TELUGU LETTER A .. TELUGU LETTER HA
    (16#C60#, 16#C61#), -- TELUGU LETTER VOCALIC RR .. TELUGU LETTER VOCALIC LL
    (16#C85#, 16#CB9#), -- KANNADA LETTER A .. KANNADA LETTER HA
    (16#CDE#, 16#CE1#), -- KANNADA LETTER FA .. KANNADA LETTER VOCALIC LL
    (16#D05#, 16#D39#), -- MALAYALAM LETTER A .. MALAYALAM LETTER HA
    (16#D60#, 16#D61#), -- MALAYALAM LETTER VOCALIC RR .. MALAYALAM LETTER VOCALIC LL
    (16#D85#, 16#DC6#), -- SINHALA LETTER AYANNA .. SINHALA LETTER FAYANNA
    (16#E01#, 16#E30#), -- THAI CHARACTER KO KAI .. THAI CHARACTER SARA A
    (16#E32#, 16#E33#), -- THAI CHARACTER SARA AA .. THAI CHARACTER SARA AM
    (16#E40#, 16#E46#), -- THAI CHARACTER SARA E .. THAI CHARACTER MAIYAMOK
    (16#E81#, 16#EB0#), -- LAO LETTER KO .. LAO VOWEL SIGN A
    (16#EB2#, 16#EB3#), -- LAO VOWEL SIGN AA .. LAO VOWEL SIGN AM
    (16#EBD#, 16#EC6#), -- LAO SEMIVOWEL SIGN NYO .. LAO KO LA
    (16#EDC#, 16#F00#), -- LAO HO NO .. TIBETAN SYLLABLE OM
    (16#F40#, 16#F6A#), -- TIBETAN LETTER KA .. TIBETAN LETTER FIXED-FORM RA
    (16#F88#, 16#F8B#), -- TIBETAN SIGN LCE TSA CAN .. TIBETAN SIGN GRU MED RGYINGS
    (16#1000#, 16#102A#), -- MYANMAR LETTER KA .. MYANMAR LETTER AU
    (16#1050#, 16#1055#), -- MYANMAR LETTER SHA .. MYANMAR LETTER VOCALIC LL
    (16#10A0#, 16#10F8#), -- GEORGIAN CAPITAL LETTER AN .. GEORGIAN LETTER ELIFI
    (16#1100#, 16#135A#), -- HANGUL CHOSEONG KIYEOK .. ETHIOPIC SYLLABLE FYA
    (16#13A0#, 16#166C#), -- CHEROKEE LETTER A .. CANADIAN SYLLABICS CARRIER TTSA
    (16#166F#, 16#1676#), -- CANADIAN SYLLABICS QAI .. CANADIAN SYLLABICS NNGAA
    (16#1681#, 16#169A#), -- OGHAM LETTER BEITH .. OGHAM LETTER PEITH
    (16#16A0#, 16#16EA#), -- RUNIC LETTER FEHU FEOH FE F .. RUNIC LETTER X
    (16#1700#, 16#1711#), -- TAGALOG LETTER A .. TAGALOG LETTER HA
    (16#1720#, 16#1731#), -- HANUNOO LETTER A .. HANUNOO LETTER HA
    (16#1740#, 16#1751#), -- BUHID LETTER A .. BUHID LETTER HA
    (16#1760#, 16#1770#), -- TAGBANWA LETTER A .. TAGBANWA LETTER SA
    (16#1780#, 16#17B3#), -- KHMER LETTER KA .. KHMER INDEPENDENT VOWEL QAU
    (16#17D7#, 16#17D7#), -- KHMER SIGN LEK TOO .. KHMER SIGN LEK TOO
    (16#17DC#, 16#17DC#), -- KHMER SIGN AVAKRAHASANYA .. KHMER SIGN AVAKRAHASANYA
    (16#1820#, 16#18A8#), -- MONGOLIAN LETTER A .. MONGOLIAN LETTER MANCHU ALI GALI BHA
    (16#1E00#, 16#1FBC#), -- LATIN CAPITAL LETTER A WITH RING BELOW .. GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
    (16#1FBE#, 16#1FBE#), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
    (16#1FC2#, 16#1FCC#), -- GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
    (16#1FD0#, 16#1FDB#), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK CAPITAL LETTER IOTA WITH OXIA
    (16#1FE0#, 16#1FEC#), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK CAPITAL LETTER RHO WITH DASIA
    (16#1FF2#, 16#1FFC#), -- GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI .. GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
    (16#2071#, 16#2071#), -- SUPERSCRIPT LATIN SMALL LETTER I .. SUPERSCRIPT LATIN SMALL LETTER I
    (16#207F#, 16#207F#), -- SUPERSCRIPT LATIN SMALL LETTER N .. SUPERSCRIPT LATIN SMALL LETTER N
    (16#2102#, 16#2102#), -- DOUBLE-STRUCK CAPITAL C .. DOUBLE-STRUCK CAPITAL C
    (16#2107#, 16#2107#), -- EULER CONSTANT .. EULER CONSTANT
    (16#210A#, 16#2113#), -- SCRIPT SMALL G .. SCRIPT SMALL L
    (16#2115#, 16#2115#), -- DOUBLE-STRUCK CAPITAL N .. DOUBLE-STRUCK CAPITAL N
    (16#2119#, 16#211D#), -- DOUBLE-STRUCK CAPITAL P .. DOUBLE-STRUCK CAPITAL R
    (16#2124#, 16#2124#), -- DOUBLE-STRUCK CAPITAL Z .. DOUBLE-STRUCK CAPITAL Z
    (16#2126#, 16#2126#), -- OHM SIGN .. OHM SIGN
    (16#2128#, 16#2128#), -- BLACK-LETTER CAPITAL Z .. BLACK-LETTER CAPITAL Z
    (16#212A#, 16#212D#), -- KELVIN SIGN .. BLACK-LETTER CAPITAL C
    (16#212F#, 16#2131#), -- SCRIPT SMALL E .. SCRIPT CAPITAL F
    (16#2133#, 16#2139#), -- SCRIPT CAPITAL M .. INFORMATION SOURCE
    (16#213D#, 16#213F#), -- DOUBLE-STRUCK SMALL GAMMA .. DOUBLE-STRUCK CAPITAL PI
    (16#2145#, 16#2149#), -- DOUBLE-STRUCK ITALIC CAPITAL D .. DOUBLE-STRUCK ITALIC SMALL J
    (16#3005#, 16#3006#), -- IDEOGRAPHIC ITERATION MARK .. IDEOGRAPHIC CLOSING MARK
    (16#3031#, 16#3035#), -- VERTICAL KANA REPEAT MARK .. VERTICAL KANA REPEAT MARK LOWER HALF
    (16#303B#, 16#303C#), -- VERTICAL IDEOGRAPHIC ITERATION MARK .. MASU MARK
    (16#3041#, 16#3096#), -- HIRAGANA LETTER SMALL A .. HIRAGANA LETTER SMALL KE
    (16#309D#, 16#309F#), -- HIRAGANA ITERATION MARK .. HIRAGANA DIGRAPH YORI
    (16#30A1#, 16#30FA#), -- KATAKANA LETTER SMALL A .. KATAKANA LETTER VO
    (16#30FC#, 16#318E#), -- KATAKANA-HIRAGANA PROLONGED SOUND MARK .. HANGUL LETTER ARAEAE
    (16#31A0#, 16#31FF#), -- BOPOMOFO LETTER BU .. KATAKANA LETTER SMALL RO
    (16#3400#, 16#A48C#), -- <CJK Ideograph Extension A, First> .. YI SYLLABLE YYR
    (16#AC00#, 16#D7A3#), -- <Hangul Syllable, First> .. <Hangul Syllable, Last>
    (16#F900#, 16#FB1D#), -- CJK COMPATIBILITY IDEOGRAPH-F900 .. HEBREW LETTER YOD WITH HIRIQ
    (16#FB1F#, 16#FB28#), -- HEBREW LIGATURE YIDDISH YOD YOD PATAH .. HEBREW LETTER WIDE TAV
    (16#FB2A#, 16#FD3D#), -- HEBREW LETTER SHIN WITH SHIN DOT .. ARABIC LIGATURE ALEF WITH FATHATAN ISOLATED FORM
    (16#FD50#, 16#FDFB#), -- ARABIC LIGATURE TEH WITH JEEM WITH MEEM INITIAL FORM .. ARABIC LIGATURE JALLAJALALOUHOU
    (16#FE70#, 16#FEFC#), -- ARABIC FATHATAN ISOLATED FORM .. ARABIC LIGATURE LAM WITH ALEF FINAL FORM
    (16#FF21#, 16#FF3A#), -- FULLWIDTH LATIN CAPITAL LETTER A .. FULLWIDTH LATIN CAPITAL LETTER Z
    (16#FF41#, 16#FF5A#), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z
    (16#FF66#, 16#FFDC#), -- HALFWIDTH KATAKANA LETTER WO .. HALFWIDTH HANGUL LETTER I
    (16#10300#, 16#1031E#), -- OLD ITALIC LETTER A .. OLD ITALIC LETTER UU
    (16#10330#, 16#10349#), -- GOTHIC LETTER AHSA .. GOTHIC LETTER OTHAL
    (16#10400#, 16#1044D#), -- DESERET CAPITAL LETTER LONG I .. DESERET SMALL LETTER ENG
    (16#1D400#, 16#1D6C0#), -- MATHEMATICAL BOLD CAPITAL A .. MATHEMATICAL BOLD CAPITAL OMEGA
    (16#1D6C2#, 16#1D6DA#), -- MATHEMATICAL BOLD SMALL ALPHA .. MATHEMATICAL BOLD SMALL OMEGA
    (16#1D6DC#, 16#1D6FA#), -- MATHEMATICAL BOLD EPSILON SYMBOL .. MATHEMATICAL ITALIC CAPITAL OMEGA
    (16#1D6FC#, 16#1D714#), -- MATHEMATICAL ITALIC SMALL ALPHA .. MATHEMATICAL ITALIC SMALL OMEGA
    (16#1D716#, 16#1D734#), -- MATHEMATICAL ITALIC EPSILON SYMBOL .. MATHEMATICAL BOLD ITALIC CAPITAL OMEGA
    (16#1D736#, 16#1D74E#), -- MATHEMATICAL BOLD ITALIC SMALL ALPHA .. MATHEMATICAL BOLD ITALIC SMALL OMEGA
    (16#1D750#, 16#1D76E#), -- MATHEMATICAL BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD CAPITAL OMEGA
    (16#1D770#, 16#1D788#), -- MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD SMALL OMEGA
    (16#1D78A#, 16#1D7A8#), -- MATHEMATICAL SANS-SERIF BOLD EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL OMEGA
    (16#1D7AA#, 16#1D7C2#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA .. MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL OMEGA
    (16#1D7C4#, 16#1D7C9#), -- MATHEMATICAL SANS-SERIF BOLD ITALIC EPSILON SYMBOL .. MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL
    (16#20000#, 16#2FA1D#) -- <CJK Ideograph Extension B, First> .. CJK COMPATIBILITY IDEOGRAPH-2FA1D
   );

---

Spaces : constant Ranges :=
   (
    (16#20#, 16#20#), -- SPACE .. SPACE
    (16#A0#, 16#A0#), -- NO-BREAK SPACE .. NO-BREAK SPACE
    (16#1680#, 16#1680#), -- OGHAM SPACE MARK .. OGHAM SPACE MARK
    (16#2000#, 16#200B#), -- EN QUAD .. ZERO WIDTH SPACE
    (16#202F#, 16#202F#), -- NARROW NO-BREAK SPACE .. NARROW NO-BREAK SPACE
    (16#205F#, 16#205F#), -- MEDIUM MATHEMATICAL SPACE .. MEDIUM MATHEMATICAL SPACE
    (16#3000#, 16#3000#) -- IDEOGRAPHIC SPACE .. IDEOGRAPHIC SPACE
   );

---

Uppercase_Mapping : constant Mapping_Ranges :=
   (
    (16#61#, 16#7A#, -32), -- LATIN SMALL LETTER A .. LATIN SMALL LETTER Z
    (16#B5#, 16#B5#, 743), -- MICRO SIGN .. MICRO SIGN
    (16#E0#, 16#F6#, -32), -- LATIN SMALL LETTER A WITH GRAVE .. LATIN SMALL LETTER O WITH DIAERESIS
    (16#F8#, 16#FE#, -32), -- LATIN SMALL LETTER O WITH STROKE .. LATIN SMALL LETTER THORN
    (16#FF#, 16#FF#, 121), -- LATIN SMALL LETTER Y WITH DIAERESIS .. LATIN SMALL LETTER Y WITH DIAERESIS
    (16#101#, 16#101#, -1), -- LATIN SMALL LETTER A WITH MACRON .. LATIN SMALL LETTER A WITH MACRON
    (16#103#, 16#103#, -1), -- LATIN SMALL LETTER A WITH BREVE .. LATIN SMALL LETTER A WITH BREVE
    (16#105#, 16#105#, -1), -- LATIN SMALL LETTER A WITH OGONEK .. LATIN SMALL LETTER A WITH OGONEK
    (16#107#, 16#107#, -1), -- LATIN SMALL LETTER C WITH ACUTE .. LATIN SMALL LETTER C WITH ACUTE
    (16#109#, 16#109#, -1), -- LATIN SMALL LETTER C WITH CIRCUMFLEX .. LATIN SMALL LETTER C WITH CIRCUMFLEX
    (16#10B#, 16#10B#, -1), -- LATIN SMALL LETTER C WITH DOT ABOVE .. LATIN SMALL LETTER C WITH DOT ABOVE
    (16#10D#, 16#10D#, -1), -- LATIN SMALL LETTER C WITH CARON .. LATIN SMALL LETTER C WITH CARON
    (16#10F#, 16#10F#, -1), -- LATIN SMALL LETTER D WITH CARON .. LATIN SMALL LETTER D WITH CARON
    (16#111#, 16#111#, -1), -- LATIN SMALL LETTER D WITH STROKE .. LATIN SMALL LETTER D WITH STROKE
    (16#113#, 16#113#, -1), -- LATIN SMALL LETTER E WITH MACRON .. LATIN SMALL LETTER E WITH MACRON
    (16#115#, 16#115#, -1), -- LATIN SMALL LETTER E WITH BREVE .. LATIN SMALL LETTER E WITH BREVE
    (16#117#, 16#117#, -1), -- LATIN SMALL LETTER E WITH DOT ABOVE .. LATIN SMALL LETTER E WITH DOT ABOVE
    (16#119#, 16#119#, -1), -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK
    (16#11B#, 16#11B#, -1), -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON
    (16#11D#, 16#11D#, -1), -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX
    (16#11F#, 16#11F#, -1), -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE
    (16#121#, 16#121#, -1), -- LATIN SMALL LETTER G WITH DOT ABOVE .. LATIN SMALL LETTER G WITH DOT ABOVE
    (16#123#, 16#123#, -1), -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA
    (16#125#, 16#125#, -1), -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX
    (16#127#, 16#127#, -1), -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE
    (16#129#, 16#129#, -1), -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE
    (16#12B#, 16#12B#, -1), -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON
    (16#12D#, 16#12D#, -1), -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE
    (16#12F#, 16#12F#, -1), -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK
    (16#131#, 16#131#, -232), -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I
    (16#133#, 16#133#, -1), -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ
    (16#135#, 16#135#, -1), -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX
    (16#137#, 16#137#, -1), -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA
    (16#13A#, 16#13A#, -1), -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE
    (16#13C#, 16#13C#, -1), -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA
    (16#13E#, 16#13E#, -1), -- LATIN SMALL LETTER L WITH CARON .. LATIN SMALL LETTER L WITH CARON
    (16#140#, 16#140#, -1), -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT
    (16#142#, 16#142#, -1), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE
    (16#144#, 16#144#, -1), -- LATIN SMALL LETTER N WITH ACUTE .. LATIN SMALL LETTER N WITH ACUTE
    (16#146#, 16#146#, -1), -- LATIN SMALL LETTER N WITH CEDILLA .. LATIN SMALL LETTER N WITH CEDILLA
    (16#148#, 16#148#, -1), -- LATIN SMALL LETTER N WITH CARON .. LATIN SMALL LETTER N WITH CARON
    (16#14B#, 16#14B#, -1), -- LATIN SMALL LETTER ENG .. LATIN SMALL LETTER ENG
    (16#14D#, 16#14D#, -1), -- LATIN SMALL LETTER O WITH MACRON .. LATIN SMALL LETTER O WITH MACRON
    (16#14F#, 16#14F#, -1), -- LATIN SMALL LETTER O WITH BREVE .. LATIN SMALL LETTER O WITH BREVE
    (16#151#, 16#151#, -1), -- LATIN SMALL LETTER O WITH DOUBLE ACUTE .. LATIN SMALL LETTER O WITH DOUBLE ACUTE
    (16#153#, 16#153#, -1), -- LATIN SMALL LIGATURE OE .. LATIN SMALL LIGATURE OE
    (16#155#, 16#155#, -1), -- LATIN SMALL LETTER R WITH ACUTE .. LATIN SMALL LETTER R WITH ACUTE
    (16#157#, 16#157#, -1), -- LATIN SMALL LETTER R WITH CEDILLA .. LATIN SMALL LETTER R WITH CEDILLA
    (16#159#, 16#159#, -1), -- LATIN SMALL LETTER R WITH CARON .. LATIN SMALL LETTER R WITH CARON
    (16#15B#, 16#15B#, -1), -- LATIN SMALL LETTER S WITH ACUTE .. LATIN SMALL LETTER S WITH ACUTE
    (16#15D#, 16#15D#, -1), -- LATIN SMALL LETTER S WITH CIRCUMFLEX .. LATIN SMALL LETTER S WITH CIRCUMFLEX
    (16#15F#, 16#15F#, -1), -- LATIN SMALL LETTER S WITH CEDILLA .. LATIN SMALL LETTER S WITH CEDILLA
    (16#161#, 16#161#, -1), -- LATIN SMALL LETTER S WITH CARON .. LATIN SMALL LETTER S WITH CARON
    (16#163#, 16#163#, -1), -- LATIN SMALL LETTER T WITH CEDILLA .. LATIN SMALL LETTER T WITH CEDILLA
    (16#165#, 16#165#, -1), -- LATIN SMALL LETTER T WITH CARON .. LATIN SMALL LETTER T WITH CARON
    (16#167#, 16#167#, -1), -- LATIN SMALL LETTER T WITH STROKE .. LATIN SMALL LETTER T WITH STROKE
    (16#169#, 16#169#, -1), -- LATIN SMALL LETTER U WITH TILDE .. LATIN SMALL LETTER U WITH TILDE
    (16#16B#, 16#16B#, -1), -- LATIN SMALL LETTER U WITH MACRON .. LATIN SMALL LETTER U WITH MACRON
    (16#16D#, 16#16D#, -1), -- LATIN SMALL LETTER U WITH BREVE .. LATIN SMALL LETTER U WITH BREVE
    (16#16F#, 16#16F#, -1), -- LATIN SMALL LETTER U WITH RING ABOVE .. LATIN SMALL LETTER U WITH RING ABOVE
    (16#171#, 16#171#, -1), -- LATIN SMALL LETTER U WITH DOUBLE ACUTE .. LATIN SMALL LETTER U WITH DOUBLE ACUTE
    (16#173#, 16#173#, -1), -- LATIN SMALL LETTER U WITH OGONEK .. LATIN SMALL LETTER U WITH OGONEK
    (16#175#, 16#175#, -1), -- LATIN SMALL LETTER W WITH CIRCUMFLEX .. LATIN SMALL LETTER W WITH CIRCUMFLEX
    (16#177#, 16#177#, -1), -- LATIN SMALL LETTER Y WITH CIRCUMFLEX .. LATIN SMALL LETTER Y WITH CIRCUMFLEX
    (16#17A#, 16#17A#, -1), -- LATIN SMALL LETTER Z WITH ACUTE .. LATIN SMALL LETTER Z WITH ACUTE
    (16#17C#, 16#17C#, -1), -- LATIN SMALL LETTER Z WITH DOT ABOVE .. LATIN SMALL LETTER Z WITH DOT ABOVE
    (16#17E#, 16#17E#, -1), -- LATIN SMALL LETTER Z WITH CARON .. LATIN SMALL LETTER Z WITH CARON
    (16#17F#, 16#17F#, -300), -- LATIN SMALL LETTER LONG S .. LATIN SMALL LETTER LONG S
    (16#183#, 16#183#, -1), -- LATIN SMALL LETTER B WITH TOPBAR .. LATIN SMALL LETTER B WITH TOPBAR
    (16#185#, 16#185#, -1), -- LATIN SMALL LETTER TONE SIX .. LATIN SMALL LETTER TONE SIX
    (16#188#, 16#188#, -1), -- LATIN SMALL LETTER C WITH HOOK .. LATIN SMALL LETTER C WITH HOOK
    (16#18C#, 16#18C#, -1), -- LATIN SMALL LETTER D WITH TOPBAR .. LATIN SMALL LETTER D WITH TOPBAR
    (16#192#, 16#192#, -1), -- LATIN SMALL LETTER F WITH HOOK .. LATIN SMALL LETTER F WITH HOOK
    (16#195#, 16#195#, 97), -- LATIN SMALL LETTER HV .. LATIN SMALL LETTER HV
    (16#199#, 16#199#, -1), -- LATIN SMALL LETTER K WITH HOOK .. LATIN SMALL LETTER K WITH HOOK
    (16#19E#, 16#19E#, 130), -- LATIN SMALL LETTER N WITH LONG RIGHT LEG .. LATIN SMALL LETTER N WITH LONG RIGHT LEG
    (16#1A1#, 16#1A1#, -1), -- LATIN SMALL LETTER O WITH HORN .. LATIN SMALL LETTER O WITH HORN
    (16#1A3#, 16#1A3#, -1), -- LATIN SMALL LETTER OI .. LATIN SMALL LETTER OI
    (16#1A5#, 16#1A5#, -1), -- LATIN SMALL LETTER P WITH HOOK .. LATIN SMALL LETTER P WITH HOOK
    (16#1A8#, 16#1A8#, -1), -- LATIN SMALL LETTER TONE TWO .. LATIN SMALL LETTER TONE TWO
    (16#1AD#, 16#1AD#, -1), -- LATIN SMALL LETTER T WITH HOOK .. LATIN SMALL LETTER T WITH HOOK
    (16#1B0#, 16#1B0#, -1), -- LATIN SMALL LETTER U WITH HORN .. LATIN SMALL LETTER U WITH HORN
    (16#1B4#, 16#1B4#, -1), -- LATIN SMALL LETTER Y WITH HOOK .. LATIN SMALL LETTER Y WITH HOOK
    (16#1B6#, 16#1B6#, -1), -- LATIN SMALL LETTER Z WITH STROKE .. LATIN SMALL LETTER Z WITH STROKE
    (16#1B9#, 16#1B9#, -1), -- LATIN SMALL LETTER EZH REVERSED .. LATIN SMALL LETTER EZH REVERSED
    (16#1BD#, 16#1BD#, -1), -- LATIN SMALL LETTER TONE FIVE .. LATIN SMALL LETTER TONE FIVE
    (16#1BF#, 16#1BF#, 56), -- LATIN LETTER WYNN .. LATIN LETTER WYNN
    (16#1C5#, 16#1C5#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    (16#1C6#, 16#1C6#, -2), -- LATIN SMALL LETTER DZ WITH CARON .. LATIN SMALL LETTER DZ WITH CARON
    (16#1C8#, 16#1C8#, -1), -- LATIN CAPITAL LETTER L WITH SMALL LETTER J .. LATIN CAPITAL LETTER L WITH SMALL LETTER J
    (16#1C9#, 16#1C9#, -2), -- LATIN SMALL LETTER LJ .. LATIN SMALL LETTER LJ
    (16#1CB#, 16#1CB#, -1), -- LATIN CAPITAL LETTER N WITH SMALL LETTER J .. LATIN CAPITAL LETTER N WITH SMALL LETTER J
    (16#1CC#, 16#1CC#, -2), -- LATIN SMALL LETTER NJ .. LATIN SMALL LETTER NJ
    (16#1CE#, 16#1CE#, -1), -- LATIN SMALL LETTER A WITH CARON .. LATIN SMALL LETTER A WITH CARON
    (16#1D0#, 16#1D0#, -1), -- LATIN SMALL LETTER I WITH CARON .. LATIN SMALL LETTER I WITH CARON
    (16#1D2#, 16#1D2#, -1), -- LATIN SMALL LETTER O WITH CARON .. LATIN SMALL LETTER O WITH CARON
    (16#1D4#, 16#1D4#, -1), -- LATIN SMALL LETTER U WITH CARON .. LATIN SMALL LETTER U WITH CARON
    (16#1D6#, 16#1D6#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
    (16#1D8#, 16#1D8#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
    (16#1DA#, 16#1DA#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND CARON .. LATIN SMALL LETTER U WITH DIAERESIS AND CARON
    (16#1DC#, 16#1DC#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE .. LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE
    (16#1DD#, 16#1DD#, -79), -- LATIN SMALL LETTER TURNED E .. LATIN SMALL LETTER TURNED E
    (16#1DF#, 16#1DF#, -1), -- LATIN SMALL LETTER A WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
    (16#1E1#, 16#1E1#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
    (16#1E3#, 16#1E3#, -1), -- LATIN SMALL LETTER AE WITH MACRON .. LATIN SMALL LETTER AE WITH MACRON
    (16#1E5#, 16#1E5#, -1), -- LATIN SMALL LETTER G WITH STROKE .. LATIN SMALL LETTER G WITH STROKE
    (16#1E7#, 16#1E7#, -1), -- LATIN SMALL LETTER G WITH CARON .. LATIN SMALL LETTER G WITH CARON
    (16#1E9#, 16#1E9#, -1), -- LATIN SMALL LETTER K WITH CARON .. LATIN SMALL LETTER K WITH CARON
    (16#1EB#, 16#1EB#, -1), -- LATIN SMALL LETTER O WITH OGONEK .. LATIN SMALL LETTER O WITH OGONEK
    (16#1ED#, 16#1ED#, -1), -- LATIN SMALL LETTER O WITH OGONEK AND MACRON .. LATIN SMALL LETTER O WITH OGONEK AND MACRON
    (16#1EF#, 16#1EF#, -1), -- LATIN SMALL LETTER EZH WITH CARON .. LATIN SMALL LETTER EZH WITH CARON
    (16#1F2#, 16#1F2#, -1), -- LATIN CAPITAL LETTER D WITH SMALL LETTER Z .. LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    (16#1F3#, 16#1F3#, -2), -- LATIN SMALL LETTER DZ .. LATIN SMALL LETTER DZ
    (16#1F5#, 16#1F5#, -1), -- LATIN SMALL LETTER G WITH ACUTE .. LATIN SMALL LETTER G WITH ACUTE
    (16#1F9#, 16#1F9#, -1), -- LATIN SMALL LETTER N WITH GRAVE .. LATIN SMALL LETTER N WITH GRAVE
    (16#1FB#, 16#1FB#, -1), -- LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE .. LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
    (16#1FD#, 16#1FD#, -1), -- LATIN SMALL LETTER AE WITH ACUTE .. LATIN SMALL LETTER AE WITH ACUTE
    (16#1FF#, 16#1FF#, -1), -- LATIN SMALL LETTER O WITH STROKE AND ACUTE .. LATIN SMALL LETTER O WITH STROKE AND ACUTE
    (16#201#, 16#201#, -1), -- LATIN SMALL LETTER A WITH DOUBLE GRAVE .. LATIN SMALL LETTER A WITH DOUBLE GRAVE
    (16#203#, 16#203#, -1), -- LATIN SMALL LETTER A WITH INVERTED BREVE .. LATIN SMALL LETTER A WITH INVERTED BREVE
    (16#205#, 16#205#, -1), -- LATIN SMALL LETTER E WITH DOUBLE GRAVE .. LATIN SMALL LETTER E WITH DOUBLE GRAVE
    (16#207#, 16#207#, -1), -- LATIN SMALL LETTER E WITH INVERTED BREVE .. LATIN SMALL LETTER E WITH INVERTED BREVE
    (16#209#, 16#209#, -1), -- LATIN SMALL LETTER I WITH DOUBLE GRAVE .. LATIN SMALL LETTER I WITH DOUBLE GRAVE
    (16#20B#, 16#20B#, -1), -- LATIN SMALL LETTER I WITH INVERTED BREVE .. LATIN SMALL LETTER I WITH INVERTED BREVE
    (16#20D#, 16#20D#, -1), -- LATIN SMALL LETTER O WITH DOUBLE GRAVE .. LATIN SMALL LETTER O WITH DOUBLE GRAVE
    (16#20F#, 16#20F#, -1), -- LATIN SMALL LETTER O WITH INVERTED BREVE .. LATIN SMALL LETTER O WITH INVERTED BREVE
    (16#211#, 16#211#, -1), -- LATIN SMALL LETTER R WITH DOUBLE GRAVE .. LATIN SMALL LETTER R WITH DOUBLE GRAVE
    (16#213#, 16#213#, -1), -- LATIN SMALL LETTER R WITH INVERTED BREVE .. LATIN SMALL LETTER R WITH INVERTED BREVE
    (16#215#, 16#215#, -1), -- LATIN SMALL LETTER U WITH DOUBLE GRAVE .. LATIN SMALL LETTER U WITH DOUBLE GRAVE
    (16#217#, 16#217#, -1), -- LATIN SMALL LETTER U WITH INVERTED BREVE .. LATIN SMALL LETTER U WITH INVERTED BREVE
    (16#219#, 16#219#, -1), -- LATIN SMALL LETTER S WITH COMMA BELOW .. LATIN SMALL LETTER S WITH COMMA BELOW
    (16#21B#, 16#21B#, -1), -- LATIN SMALL LETTER T WITH COMMA BELOW .. LATIN SMALL LETTER T WITH COMMA BELOW
    (16#21D#, 16#21D#, -1), -- LATIN SMALL LETTER YOGH .. LATIN SMALL LETTER YOGH
    (16#21F#, 16#21F#, -1), -- LATIN SMALL LETTER H WITH CARON .. LATIN SMALL LETTER H WITH CARON
    (16#223#, 16#223#, -1), -- LATIN SMALL LETTER OU .. LATIN SMALL LETTER OU
    (16#225#, 16#225#, -1), -- LATIN SMALL LETTER Z WITH HOOK .. LATIN SMALL LETTER Z WITH HOOK
    (16#227#, 16#227#, -1), -- LATIN SMALL LETTER A WITH DOT ABOVE .. LATIN SMALL LETTER A WITH DOT ABOVE
    (16#229#, 16#229#, -1), -- LATIN SMALL LETTER E WITH CEDILLA .. LATIN SMALL LETTER E WITH CEDILLA
    (16#22B#, 16#22B#, -1), -- LATIN SMALL LETTER O WITH DIAERESIS AND MACRON .. LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
    (16#22D#, 16#22D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND MACRON .. LATIN SMALL LETTER O WITH TILDE AND MACRON
    (16#22F#, 16#22F#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE .. LATIN SMALL LETTER O WITH DOT ABOVE
    (16#231#, 16#231#, -1), -- LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON .. LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON
    (16#233#, 16#233#, -1), -- LATIN SMALL LETTER Y WITH MACRON .. LATIN SMALL LETTER Y WITH MACRON
    (16#253#, 16#253#, -210), -- LATIN SMALL LETTER B WITH HOOK .. LATIN SMALL LETTER B WITH HOOK
    (16#254#, 16#254#, -206), -- LATIN SMALL LETTER OPEN O .. LATIN SMALL LETTER OPEN O
    (16#256#, 16#257#, -205), -- LATIN SMALL LETTER D WITH TAIL .. LATIN SMALL LETTER D WITH HOOK
    (16#259#, 16#259#, -202), -- LATIN SMALL LETTER SCHWA .. LATIN SMALL LETTER SCHWA
    (16#25B#, 16#25B#, -203), -- LATIN SMALL LETTER OPEN E .. LATIN SMALL LETTER OPEN E
    (16#260#, 16#260#, -205), -- LATIN SMALL LETTER G WITH HOOK .. LATIN SMALL LETTER G WITH HOOK
    (16#263#, 16#263#, -207), -- LATIN SMALL LETTER GAMMA .. LATIN SMALL LETTER GAMMA
    (16#268#, 16#268#, -209), -- LATIN SMALL LETTER I WITH STROKE .. LATIN SMALL LETTER I WITH STROKE
    (16#269#, 16#269#, -211), -- LATIN SMALL LETTER IOTA .. LATIN SMALL LETTER IOTA
    (16#26F#, 16#26F#, -211), -- LATIN SMALL LETTER TURNED M .. LATIN SMALL LETTER TURNED M
    (16#272#, 16#272#, -213), -- LATIN SMALL LETTER N WITH LEFT HOOK .. LATIN SMALL LETTER N WITH LEFT HOOK
    (16#275#, 16#275#, -214), -- LATIN SMALL LETTER BARRED O .. LATIN SMALL LETTER BARRED O
    (16#280#, 16#280#, -218), -- LATIN LETTER SMALL CAPITAL R .. LATIN LETTER SMALL CAPITAL R
    (16#283#, 16#283#, -218), -- LATIN SMALL LETTER ESH .. LATIN SMALL LETTER ESH
    (16#288#, 16#288#, -218), -- LATIN SMALL LETTER T WITH RETROFLEX HOOK .. LATIN SMALL LETTER T WITH RETROFLEX HOOK
    (16#28A#, 16#28B#, -217), -- LATIN SMALL LETTER UPSILON .. LATIN SMALL LETTER V WITH HOOK
    (16#292#, 16#292#, -219), -- LATIN SMALL LETTER EZH .. LATIN SMALL LETTER EZH
    (16#3AC#, 16#3AC#, -38), -- GREEK SMALL LETTER ALPHA WITH TONOS .. GREEK SMALL LETTER ALPHA WITH TONOS
    (16#3AD#, 16#3AF#, -37), -- GREEK SMALL LETTER EPSILON WITH TONOS .. GREEK SMALL LETTER IOTA WITH TONOS
    (16#3B1#, 16#3C1#, -32), -- GREEK SMALL LETTER ALPHA .. GREEK SMALL LETTER RHO
    (16#3C2#, 16#3C2#, -31), -- GREEK SMALL LETTER FINAL SIGMA .. GREEK SMALL LETTER FINAL SIGMA
    (16#3C3#, 16#3CB#, -32), -- GREEK SMALL LETTER SIGMA .. GREEK SMALL LETTER UPSILON WITH DIALYTIKA
    (16#3CC#, 16#3CC#, -64), -- GREEK SMALL LETTER OMICRON WITH TONOS .. GREEK SMALL LETTER OMICRON WITH TONOS
    (16#3CD#, 16#3CE#, -63), -- GREEK SMALL LETTER UPSILON WITH TONOS .. GREEK SMALL LETTER OMEGA WITH TONOS
    (16#3D0#, 16#3D0#, -62), -- GREEK BETA SYMBOL .. GREEK BETA SYMBOL
    (16#3D1#, 16#3D1#, -57), -- GREEK THETA SYMBOL .. GREEK THETA SYMBOL
    (16#3D5#, 16#3D5#, -47), -- GREEK PHI SYMBOL .. GREEK PHI SYMBOL
    (16#3D6#, 16#3D6#, -54), -- GREEK PI SYMBOL .. GREEK PI SYMBOL
    (16#3D9#, 16#3D9#, -1), -- GREEK SMALL LETTER ARCHAIC KOPPA .. GREEK SMALL LETTER ARCHAIC KOPPA
    (16#3DB#, 16#3DB#, -1), -- GREEK SMALL LETTER STIGMA .. GREEK SMALL LETTER STIGMA
    (16#3DD#, 16#3DD#, -1), -- GREEK SMALL LETTER DIGAMMA .. GREEK SMALL LETTER DIGAMMA
    (16#3DF#, 16#3DF#, -1), -- GREEK SMALL LETTER KOPPA .. GREEK SMALL LETTER KOPPA
    (16#3E1#, 16#3E1#, -1), -- GREEK SMALL LETTER SAMPI .. GREEK SMALL LETTER SAMPI
    (16#3E3#, 16#3E3#, -1), -- COPTIC SMALL LETTER SHEI .. COPTIC SMALL LETTER SHEI
    (16#3E5#, 16#3E5#, -1), -- COPTIC SMALL LETTER FEI .. COPTIC SMALL LETTER FEI
    (16#3E7#, 16#3E7#, -1), -- COPTIC SMALL LETTER KHEI .. COPTIC SMALL LETTER KHEI
    (16#3E9#, 16#3E9#, -1), -- COPTIC SMALL LETTER HORI .. COPTIC SMALL LETTER HORI
    (16#3EB#, 16#3EB#, -1), -- COPTIC SMALL LETTER GANGIA .. COPTIC SMALL LETTER GANGIA
    (16#3ED#, 16#3ED#, -1), -- COPTIC SMALL LETTER SHIMA .. COPTIC SMALL LETTER SHIMA
    (16#3EF#, 16#3EF#, -1), -- COPTIC SMALL LETTER DEI .. COPTIC SMALL LETTER DEI
    (16#3F0#, 16#3F0#, -86), -- GREEK KAPPA SYMBOL .. GREEK KAPPA SYMBOL
    (16#3F1#, 16#3F1#, -80), -- GREEK RHO SYMBOL .. GREEK RHO SYMBOL
    (16#3F2#, 16#3F2#, -79), -- GREEK LUNATE SIGMA SYMBOL .. GREEK LUNATE SIGMA SYMBOL
    (16#3F5#, 16#3F5#, -96), -- GREEK LUNATE EPSILON SYMBOL .. GREEK LUNATE EPSILON SYMBOL
    (16#430#, 16#44F#, -32), -- CYRILLIC SMALL LETTER A .. CYRILLIC SMALL LETTER YA
    (16#450#, 16#45F#, -80), -- CYRILLIC SMALL LETTER IE WITH GRAVE .. CYRILLIC SMALL LETTER DZHE
    (16#461#, 16#461#, -1), -- CYRILLIC SMALL LETTER OMEGA .. CYRILLIC SMALL LETTER OMEGA
    (16#463#, 16#463#, -1), -- CYRILLIC SMALL LETTER YAT .. CYRILLIC SMALL LETTER YAT
    (16#465#, 16#465#, -1), -- CYRILLIC SMALL LETTER IOTIFIED E .. CYRILLIC SMALL LETTER IOTIFIED E
    (16#467#, 16#467#, -1), -- CYRILLIC SMALL LETTER LITTLE YUS .. CYRILLIC SMALL LETTER LITTLE YUS
    (16#469#, 16#469#, -1), -- CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS .. CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS
    (16#46B#, 16#46B#, -1), -- CYRILLIC SMALL LETTER BIG YUS .. CYRILLIC SMALL LETTER BIG YUS
    (16#46D#, 16#46D#, -1), -- CYRILLIC SMALL LETTER IOTIFIED BIG YUS .. CYRILLIC SMALL LETTER IOTIFIED BIG YUS
    (16#46F#, 16#46F#, -1), -- CYRILLIC SMALL LETTER KSI .. CYRILLIC SMALL LETTER KSI
    (16#471#, 16#471#, -1), -- CYRILLIC SMALL LETTER PSI .. CYRILLIC SMALL LETTER PSI
    (16#473#, 16#473#, -1), -- CYRILLIC SMALL LETTER FITA .. CYRILLIC SMALL LETTER FITA
    (16#475#, 16#475#, -1), -- CYRILLIC SMALL LETTER IZHITSA .. CYRILLIC SMALL LETTER IZHITSA
    (16#477#, 16#477#, -1), -- CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT .. CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
    (16#479#, 16#479#, -1), -- CYRILLIC SMALL LETTER UK .. CYRILLIC SMALL LETTER UK
    (16#47B#, 16#47B#, -1), -- CYRILLIC SMALL LETTER ROUND OMEGA .. CYRILLIC SMALL LETTER ROUND OMEGA
    (16#47D#, 16#47D#, -1), -- CYRILLIC SMALL LETTER OMEGA WITH TITLO .. CYRILLIC SMALL LETTER OMEGA WITH TITLO
    (16#47F#, 16#47F#, -1), -- CYRILLIC SMALL LETTER OT .. CYRILLIC SMALL LETTER OT
    (16#481#, 16#481#, -1), -- CYRILLIC SMALL LETTER KOPPA .. CYRILLIC SMALL LETTER KOPPA
    (16#48B#, 16#48B#, -1), -- CYRILLIC SMALL LETTER SHORT I WITH TAIL .. CYRILLIC SMALL LETTER SHORT I WITH TAIL
    (16#48D#, 16#48D#, -1), -- CYRILLIC SMALL LETTER SEMISOFT SIGN .. CYRILLIC SMALL LETTER SEMISOFT SIGN
    (16#48F#, 16#48F#, -1), -- CYRILLIC SMALL LETTER ER WITH TICK .. CYRILLIC SMALL LETTER ER WITH TICK
    (16#491#, 16#491#, -1), -- CYRILLIC SMALL LETTER GHE WITH UPTURN .. CYRILLIC SMALL LETTER GHE WITH UPTURN
    (16#493#, 16#493#, -1), -- CYRILLIC SMALL LETTER GHE WITH STROKE .. CYRILLIC SMALL LETTER GHE WITH STROKE
    (16#495#, 16#495#, -1), -- CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
    (16#497#, 16#497#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DESCENDER .. CYRILLIC SMALL LETTER ZHE WITH DESCENDER
    (16#499#, 16#499#, -1), -- CYRILLIC SMALL LETTER ZE WITH DESCENDER .. CYRILLIC SMALL LETTER ZE WITH DESCENDER
    (16#49B#, 16#49B#, -1), -- CYRILLIC SMALL LETTER KA WITH DESCENDER .. CYRILLIC SMALL LETTER KA WITH DESCENDER
    (16#49D#, 16#49D#, -1), -- CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
    (16#49F#, 16#49F#, -1), -- CYRILLIC SMALL LETTER KA WITH STROKE .. CYRILLIC SMALL LETTER KA WITH STROKE
    (16#4A1#, 16#4A1#, -1), -- CYRILLIC SMALL LETTER BASHKIR KA .. CYRILLIC SMALL LETTER BASHKIR KA
    (16#4A3#, 16#4A3#, -1), -- CYRILLIC SMALL LETTER EN WITH DESCENDER .. CYRILLIC SMALL LETTER EN WITH DESCENDER
    (16#4A5#, 16#4A5#, -1), -- CYRILLIC SMALL LIGATURE EN GHE .. CYRILLIC SMALL LIGATURE EN GHE
    (16#4A7#, 16#4A7#, -1), -- CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK .. CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
    (16#4A9#, 16#4A9#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN HA .. CYRILLIC SMALL LETTER ABKHASIAN HA
    (16#4AB#, 16#4AB#, -1), -- CYRILLIC SMALL LETTER ES WITH DESCENDER .. CYRILLIC SMALL LETTER ES WITH DESCENDER
    (16#4AD#, 16#4AD#, -1), -- CYRILLIC SMALL LETTER TE WITH DESCENDER .. CYRILLIC SMALL LETTER TE WITH DESCENDER
    (16#4AF#, 16#4AF#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U .. CYRILLIC SMALL LETTER STRAIGHT U
    (16#4B1#, 16#4B1#, -1), -- CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE .. CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
    (16#4B3#, 16#4B3#, -1), -- CYRILLIC SMALL LETTER HA WITH DESCENDER .. CYRILLIC SMALL LETTER HA WITH DESCENDER
    (16#4B5#, 16#4B5#, -1), -- CYRILLIC SMALL LIGATURE TE TSE .. CYRILLIC SMALL LIGATURE TE TSE
    (16#4B7#, 16#4B7#, -1), -- CYRILLIC SMALL LETTER CHE WITH DESCENDER .. CYRILLIC SMALL LETTER CHE WITH DESCENDER
    (16#4B9#, 16#4B9#, -1), -- CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE .. CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
    (16#4BB#, 16#4BB#, -1), -- CYRILLIC SMALL LETTER SHHA .. CYRILLIC SMALL LETTER SHHA
    (16#4BD#, 16#4BD#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE .. CYRILLIC SMALL LETTER ABKHASIAN CHE
    (16#4BF#, 16#4BF#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER .. CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
    (16#4C2#, 16#4C2#, -1), -- CYRILLIC SMALL LETTER ZHE WITH BREVE .. CYRILLIC SMALL LETTER ZHE WITH BREVE
    (16#4C4#, 16#4C4#, -1), -- CYRILLIC SMALL LETTER KA WITH HOOK .. CYRILLIC SMALL LETTER KA WITH HOOK
    (16#4C6#, 16#4C6#, -1), -- CYRILLIC SMALL LETTER EL WITH TAIL .. CYRILLIC SMALL LETTER EL WITH TAIL
    (16#4C8#, 16#4C8#, -1), -- CYRILLIC SMALL LETTER EN WITH HOOK .. CYRILLIC SMALL LETTER EN WITH HOOK
    (16#4CA#, 16#4CA#, -1), -- CYRILLIC SMALL LETTER EN WITH TAIL .. CYRILLIC SMALL LETTER EN WITH TAIL
    (16#4CC#, 16#4CC#, -1), -- CYRILLIC SMALL LETTER KHAKASSIAN CHE .. CYRILLIC SMALL LETTER KHAKASSIAN CHE
    (16#4CE#, 16#4CE#, -1), -- CYRILLIC SMALL LETTER EM WITH TAIL .. CYRILLIC SMALL LETTER EM WITH TAIL
    (16#4D1#, 16#4D1#, -1), -- CYRILLIC SMALL LETTER A WITH BREVE .. CYRILLIC SMALL LETTER A WITH BREVE
    (16#4D3#, 16#4D3#, -1), -- CYRILLIC SMALL LETTER A WITH DIAERESIS .. CYRILLIC SMALL LETTER A WITH DIAERESIS
    (16#4D5#, 16#4D5#, -1), -- CYRILLIC SMALL LIGATURE A IE .. CYRILLIC SMALL LIGATURE A IE
    (16#4D7#, 16#4D7#, -1), -- CYRILLIC SMALL LETTER IE WITH BREVE .. CYRILLIC SMALL LETTER IE WITH BREVE
    (16#4D9#, 16#4D9#, -1), -- CYRILLIC SMALL LETTER SCHWA .. CYRILLIC SMALL LETTER SCHWA
    (16#4DB#, 16#4DB#, -1), -- CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS .. CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
    (16#4DD#, 16#4DD#, -1), -- CYRILLIC SMALL LETTER ZHE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
    (16#4DF#, 16#4DF#, -1), -- CYRILLIC SMALL LETTER ZE WITH DIAERESIS .. CYRILLIC SMALL LETTER ZE WITH DIAERESIS
    (16#4E1#, 16#4E1#, -1), -- CYRILLIC SMALL LETTER ABKHASIAN DZE .. CYRILLIC SMALL LETTER ABKHASIAN DZE
    (16#4E3#, 16#4E3#, -1), -- CYRILLIC SMALL LETTER I WITH MACRON .. CYRILLIC SMALL LETTER I WITH MACRON
    (16#4E5#, 16#4E5#, -1), -- CYRILLIC SMALL LETTER I WITH DIAERESIS .. CYRILLIC SMALL LETTER I WITH DIAERESIS
    (16#4E7#, 16#4E7#, -1), -- CYRILLIC SMALL LETTER O WITH DIAERESIS .. CYRILLIC SMALL LETTER O WITH DIAERESIS
    (16#4E9#, 16#4E9#, -1), -- CYRILLIC SMALL LETTER BARRED O .. CYRILLIC SMALL LETTER BARRED O
    (16#4EB#, 16#4EB#, -1), -- CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS .. CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
    (16#4ED#, 16#4ED#, -1), -- CYRILLIC SMALL LETTER E WITH DIAERESIS .. CYRILLIC SMALL LETTER E WITH DIAERESIS
    (16#4EF#, 16#4EF#, -1), -- CYRILLIC SMALL LETTER U WITH MACRON .. CYRILLIC SMALL LETTER U WITH MACRON
    (16#4F1#, 16#4F1#, -1), -- CYRILLIC SMALL LETTER U WITH DIAERESIS .. CYRILLIC SMALL LETTER U WITH DIAERESIS
    (16#4F3#, 16#4F3#, -1), -- CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE .. CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
    (16#4F5#, 16#4F5#, -1), -- CYRILLIC SMALL LETTER CHE WITH DIAERESIS .. CYRILLIC SMALL LETTER CHE WITH DIAERESIS
    (16#4F9#, 16#4F9#, -1), -- CYRILLIC SMALL LETTER YERU WITH DIAERESIS .. CYRILLIC SMALL LETTER YERU WITH DIAERESIS
    (16#501#, 16#501#, -1), -- CYRILLIC SMALL LETTER KOMI DE .. CYRILLIC SMALL LETTER KOMI DE
    (16#503#, 16#503#, -1), -- CYRILLIC SMALL LETTER KOMI DJE .. CYRILLIC SMALL LETTER KOMI DJE
    (16#505#, 16#505#, -1), -- CYRILLIC SMALL LETTER KOMI ZJE .. CYRILLIC SMALL LETTER KOMI ZJE
    (16#507#, 16#507#, -1), -- CYRILLIC SMALL LETTER KOMI DZJE .. CYRILLIC SMALL LETTER KOMI DZJE
    (16#509#, 16#509#, -1), -- CYRILLIC SMALL LETTER KOMI LJE .. CYRILLIC SMALL LETTER KOMI LJE
    (16#50B#, 16#50B#, -1), -- CYRILLIC SMALL LETTER KOMI NJE .. CYRILLIC SMALL LETTER KOMI NJE
    (16#50D#, 16#50D#, -1), -- CYRILLIC SMALL LETTER KOMI SJE .. CYRILLIC SMALL LETTER KOMI SJE
    (16#50F#, 16#50F#, -1), -- CYRILLIC SMALL LETTER KOMI TJE .. CYRILLIC SMALL LETTER KOMI TJE
    (16#561#, 16#586#, -48), -- ARMENIAN SMALL LETTER AYB .. ARMENIAN SMALL LETTER FEH
    (16#1E01#, 16#1E01#, -1), -- LATIN SMALL LETTER A WITH RING BELOW .. LATIN SMALL LETTER A WITH RING BELOW
    (16#1E03#, 16#1E03#, -1), -- LATIN SMALL LETTER B WITH DOT ABOVE .. LATIN SMALL LETTER B WITH DOT ABOVE
    (16#1E05#, 16#1E05#, -1), -- LATIN SMALL LETTER B WITH DOT BELOW .. LATIN SMALL LETTER B WITH DOT BELOW
    (16#1E07#, 16#1E07#, -1), -- LATIN SMALL LETTER B WITH LINE BELOW .. LATIN SMALL LETTER B WITH LINE BELOW
    (16#1E09#, 16#1E09#, -1), -- LATIN SMALL LETTER C WITH CEDILLA AND ACUTE .. LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
    (16#1E0B#, 16#1E0B#, -1), -- LATIN SMALL LETTER D WITH DOT ABOVE .. LATIN SMALL LETTER D WITH DOT ABOVE
    (16#1E0D#, 16#1E0D#, -1), -- LATIN SMALL LETTER D WITH DOT BELOW .. LATIN SMALL LETTER D WITH DOT BELOW
    (16#1E0F#, 16#1E0F#, -1), -- LATIN SMALL LETTER D WITH LINE BELOW .. LATIN SMALL LETTER D WITH LINE BELOW
    (16#1E11#, 16#1E11#, -1), -- LATIN SMALL LETTER D WITH CEDILLA .. LATIN SMALL LETTER D WITH CEDILLA
    (16#1E13#, 16#1E13#, -1), -- LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW
    (16#1E15#, 16#1E15#, -1), -- LATIN SMALL LETTER E WITH MACRON AND GRAVE .. LATIN SMALL LETTER E WITH MACRON AND GRAVE
    (16#1E17#, 16#1E17#, -1), -- LATIN SMALL LETTER E WITH MACRON AND ACUTE .. LATIN SMALL LETTER E WITH MACRON AND ACUTE
    (16#1E19#, 16#1E19#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW
    (16#1E1B#, 16#1E1B#, -1), -- LATIN SMALL LETTER E WITH TILDE BELOW .. LATIN SMALL LETTER E WITH TILDE BELOW
    (16#1E1D#, 16#1E1D#, -1), -- LATIN SMALL LETTER E WITH CEDILLA AND BREVE .. LATIN SMALL LETTER E WITH CEDILLA AND BREVE
    (16#1E1F#, 16#1E1F#, -1), -- LATIN SMALL LETTER F WITH DOT ABOVE .. LATIN SMALL LETTER F WITH DOT ABOVE
    (16#1E21#, 16#1E21#, -1), -- LATIN SMALL LETTER G WITH MACRON .. LATIN SMALL LETTER G WITH MACRON
    (16#1E23#, 16#1E23#, -1), -- LATIN SMALL LETTER H WITH DOT ABOVE .. LATIN SMALL LETTER H WITH DOT ABOVE
    (16#1E25#, 16#1E25#, -1), -- LATIN SMALL LETTER H WITH DOT BELOW .. LATIN SMALL LETTER H WITH DOT BELOW
    (16#1E27#, 16#1E27#, -1), -- LATIN SMALL LETTER H WITH DIAERESIS .. LATIN SMALL LETTER H WITH DIAERESIS
    (16#1E29#, 16#1E29#, -1), -- LATIN SMALL LETTER H WITH CEDILLA .. LATIN SMALL LETTER H WITH CEDILLA
    (16#1E2B#, 16#1E2B#, -1), -- LATIN SMALL LETTER H WITH BREVE BELOW .. LATIN SMALL LETTER H WITH BREVE BELOW
    (16#1E2D#, 16#1E2D#, -1), -- LATIN SMALL LETTER I WITH TILDE BELOW .. LATIN SMALL LETTER I WITH TILDE BELOW
    (16#1E2F#, 16#1E2F#, -1), -- LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE .. LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
    (16#1E31#, 16#1E31#, -1), -- LATIN SMALL LETTER K WITH ACUTE .. LATIN SMALL LETTER K WITH ACUTE
    (16#1E33#, 16#1E33#, -1), -- LATIN SMALL LETTER K WITH DOT BELOW .. LATIN SMALL LETTER K WITH DOT BELOW
    (16#1E35#, 16#1E35#, -1), -- LATIN SMALL LETTER K WITH LINE BELOW .. LATIN SMALL LETTER K WITH LINE BELOW
    (16#1E37#, 16#1E37#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW .. LATIN SMALL LETTER L WITH DOT BELOW
    (16#1E39#, 16#1E39#, -1), -- LATIN SMALL LETTER L WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER L WITH DOT BELOW AND MACRON
    (16#1E3B#, 16#1E3B#, -1), -- LATIN SMALL LETTER L WITH LINE BELOW .. LATIN SMALL LETTER L WITH LINE BELOW
    (16#1E3D#, 16#1E3D#, -1), -- LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW
    (16#1E3F#, 16#1E3F#, -1), -- LATIN SMALL LETTER M WITH ACUTE .. LATIN SMALL LETTER M WITH ACUTE
    (16#1E41#, 16#1E41#, -1), -- LATIN SMALL LETTER M WITH DOT ABOVE .. LATIN SMALL LETTER M WITH DOT ABOVE
    (16#1E43#, 16#1E43#, -1), -- LATIN SMALL LETTER M WITH DOT BELOW .. LATIN SMALL LETTER M WITH DOT BELOW
    (16#1E45#, 16#1E45#, -1), -- LATIN SMALL LETTER N WITH DOT ABOVE .. LATIN SMALL LETTER N WITH DOT ABOVE
    (16#1E47#, 16#1E47#, -1), -- LATIN SMALL LETTER N WITH DOT BELOW .. LATIN SMALL LETTER N WITH DOT BELOW
    (16#1E49#, 16#1E49#, -1), -- LATIN SMALL LETTER N WITH LINE BELOW .. LATIN SMALL LETTER N WITH LINE BELOW
    (16#1E4B#, 16#1E4B#, -1), -- LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW
    (16#1E4D#, 16#1E4D#, -1), -- LATIN SMALL LETTER O WITH TILDE AND ACUTE .. LATIN SMALL LETTER O WITH TILDE AND ACUTE
    (16#1E4F#, 16#1E4F#, -1), -- LATIN SMALL LETTER O WITH TILDE AND DIAERESIS .. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS
    (16#1E51#, 16#1E51#, -1), -- LATIN SMALL LETTER O WITH MACRON AND GRAVE .. LATIN SMALL LETTER O WITH MACRON AND GRAVE
    (16#1E53#, 16#1E53#, -1), -- LATIN SMALL LETTER O WITH MACRON AND ACUTE .. LATIN SMALL LETTER O WITH MACRON AND ACUTE
    (16#1E55#, 16#1E55#, -1), -- LATIN SMALL LETTER P WITH ACUTE .. LATIN SMALL LETTER P WITH ACUTE
    (16#1E57#, 16#1E57#, -1), -- LATIN SMALL LETTER P WITH DOT ABOVE .. LATIN SMALL LETTER P WITH DOT ABOVE
    (16#1E59#, 16#1E59#, -1), -- LATIN SMALL LETTER R WITH DOT ABOVE .. LATIN SMALL LETTER R WITH DOT ABOVE
    (16#1E5B#, 16#1E5B#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW .. LATIN SMALL LETTER R WITH DOT BELOW
    (16#1E5D#, 16#1E5D#, -1), -- LATIN SMALL LETTER R WITH DOT BELOW AND MACRON .. LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
    (16#1E5F#, 16#1E5F#, -1), -- LATIN SMALL LETTER R WITH LINE BELOW .. LATIN SMALL LETTER R WITH LINE BELOW
    (16#1E61#, 16#1E61#, -1), -- LATIN SMALL LETTER S WITH DOT ABOVE .. LATIN SMALL LETTER S WITH DOT ABOVE
    (16#1E63#, 16#1E63#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW .. LATIN SMALL LETTER S WITH DOT BELOW
    (16#1E65#, 16#1E65#, -1), -- LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE .. LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE
    (16#1E67#, 16#1E67#, -1), -- LATIN SMALL LETTER S WITH CARON AND DOT ABOVE .. LATIN SMALL LETTER S WITH CARON AND DOT ABOVE
    (16#1E69#, 16#1E69#, -1), -- LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE .. LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
    (16#1E6B#, 16#1E6B#, -1), -- LATIN SMALL LETTER T WITH DOT ABOVE .. LATIN SMALL LETTER T WITH DOT ABOVE
    (16#1E6D#, 16#1E6D#, -1), -- LATIN SMALL LETTER T WITH DOT BELOW .. LATIN SMALL LETTER T WITH DOT BELOW
    (16#1E6F#, 16#1E6F#, -1), -- LATIN SMALL LETTER T WITH LINE BELOW .. LATIN SMALL LETTER T WITH LINE BELOW
    (16#1E71#, 16#1E71#, -1), -- LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW
    (16#1E73#, 16#1E73#, -1), -- LATIN SMALL LETTER U WITH DIAERESIS BELOW .. LATIN SMALL LETTER U WITH DIAERESIS BELOW
    (16#1E75#, 16#1E75#, -1), -- LATIN SMALL LETTER U WITH TILDE BELOW .. LATIN SMALL LETTER U WITH TILDE BELOW
    (16#1E77#, 16#1E77#, -1), -- LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW .. LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW
    (16#1E79#, 16#1E79#, -1), -- LATIN SMALL LETTER U WITH TILDE AND ACUTE .. LATIN SMALL LETTER U WITH TILDE AND ACUTE
    (16#1E7B#, 16#1E7B#, -1), -- LATIN SMALL LETTER U WITH MACRON AND DIAERESIS .. LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
    (16#1E7D#, 16#1E7D#, -1), -- LATIN SMALL LETTER V WITH TILDE .. LATIN SMALL LETTER V WITH TILDE
    (16#1E7F#, 16#1E7F#, -1), -- LATIN SMALL LETTER V WITH DOT BELOW .. LATIN SMALL LETTER V WITH DOT BELOW
    (16#1E81#, 16#1E81#, -1), -- LATIN SMALL LETTER W WITH GRAVE .. LATIN SMALL LETTER W WITH GRAVE
    (16#1E83#, 16#1E83#, -1), -- LATIN SMALL LETTER W WITH ACUTE .. LATIN SMALL LETTER W WITH ACUTE
    (16#1E85#, 16#1E85#, -1), -- LATIN SMALL LETTER W WITH DIAERESIS .. LATIN SMALL LETTER W WITH DIAERESIS
    (16#1E87#, 16#1E87#, -1), -- LATIN SMALL LETTER W WITH DOT ABOVE .. LATIN SMALL LETTER W WITH DOT ABOVE
    (16#1E89#, 16#1E89#, -1), -- LATIN SMALL LETTER W WITH DOT BELOW .. LATIN SMALL LETTER W WITH DOT BELOW
    (16#1E8B#, 16#1E8B#, -1), -- LATIN SMALL LETTER X WITH DOT ABOVE .. LATIN SMALL LETTER X WITH DOT ABOVE
    (16#1E8D#, 16#1E8D#, -1), -- LATIN SMALL LETTER X WITH DIAERESIS .. LATIN SMALL LETTER X WITH DIAERESIS
    (16#1E8F#, 16#1E8F#, -1), -- LATIN SMALL LETTER Y WITH DOT ABOVE .. LATIN SMALL LETTER Y WITH DOT ABOVE
    (16#1E91#, 16#1E91#, -1), -- LATIN SMALL LETTER Z WITH CIRCUMFLEX .. LATIN SMALL LETTER Z WITH CIRCUMFLEX
    (16#1E93#, 16#1E93#, -1), -- LATIN SMALL LETTER Z WITH DOT BELOW .. LATIN SMALL LETTER Z WITH DOT BELOW
    (16#1E95#, 16#1E95#, -1), -- LATIN SMALL LETTER Z WITH LINE BELOW .. LATIN SMALL LETTER Z WITH LINE BELOW
    (16#1E9B#, 16#1E9B#, -59), -- LATIN SMALL LETTER LONG S WITH DOT ABOVE .. LATIN SMALL LETTER LONG S WITH DOT ABOVE
    (16#1EA1#, 16#1EA1#, -1), -- LATIN SMALL LETTER A WITH DOT BELOW .. LATIN SMALL LETTER A WITH DOT BELOW
    (16#1EA3#, 16#1EA3#, -1), -- LATIN SMALL LETTER A WITH HOOK ABOVE .. LATIN SMALL LETTER A WITH HOOK ABOVE
    (16#1EA5#, 16#1EA5#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
    (16#1EA7#, 16#1EA7#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
    (16#1EA9#, 16#1EA9#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
    (16#1EAB#, 16#1EAB#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
    (16#1EAD#, 16#1EAD#, -1), -- LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
    (16#1EAF#, 16#1EAF#, -1), -- LATIN SMALL LETTER A WITH BREVE AND ACUTE .. LATIN SMALL LETTER A WITH BREVE AND ACUTE
    (16#1EB1#, 16#1EB1#, -1), -- LATIN SMALL LETTER A WITH BREVE AND GRAVE .. LATIN SMALL LETTER A WITH BREVE AND GRAVE
    (16#1EB3#, 16#1EB3#, -1), -- LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE .. LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
    (16#1EB5#, 16#1EB5#, -1), -- LATIN SMALL LETTER A WITH BREVE AND TILDE .. LATIN SMALL LETTER A WITH BREVE AND TILDE
    (16#1EB7#, 16#1EB7#, -1), -- LATIN SMALL LETTER A WITH BREVE AND DOT BELOW .. LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
    (16#1EB9#, 16#1EB9#, -1), -- LATIN SMALL LETTER E WITH DOT BELOW .. LATIN SMALL LETTER E WITH DOT BELOW
    (16#1EBB#, 16#1EBB#, -1), -- LATIN SMALL LETTER E WITH HOOK ABOVE .. LATIN SMALL LETTER E WITH HOOK ABOVE
    (16#1EBD#, 16#1EBD#, -1), -- LATIN SMALL LETTER E WITH TILDE .. LATIN SMALL LETTER E WITH TILDE
    (16#1EBF#, 16#1EBF#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    (16#1EC1#, 16#1EC1#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
    (16#1EC3#, 16#1EC3#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
    (16#1EC5#, 16#1EC5#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
    (16#1EC7#, 16#1EC7#, -1), -- LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
    (16#1EC9#, 16#1EC9#, -1), -- LATIN SMALL LETTER I WITH HOOK ABOVE .. LATIN SMALL LETTER I WITH HOOK ABOVE
    (16#1ECB#, 16#1ECB#, -1), -- LATIN SMALL LETTER I WITH DOT BELOW .. LATIN SMALL LETTER I WITH DOT BELOW
    (16#1ECD#, 16#1ECD#, -1), -- LATIN SMALL LETTER O WITH DOT BELOW .. LATIN SMALL LETTER O WITH DOT BELOW
    (16#1ECF#, 16#1ECF#, -1), -- LATIN SMALL LETTER O WITH HOOK ABOVE .. LATIN SMALL LETTER O WITH HOOK ABOVE
    (16#1ED1#, 16#1ED1#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
    (16#1ED3#, 16#1ED3#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
    (16#1ED5#, 16#1ED5#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
    (16#1ED7#, 16#1ED7#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
    (16#1ED9#, 16#1ED9#, -1), -- LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW .. LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
    (16#1EDB#, 16#1EDB#, -1), -- LATIN SMALL LETTER O WITH HORN AND ACUTE .. LATIN SMALL LETTER O WITH HORN AND ACUTE
    (16#1EDD#, 16#1EDD#, -1), -- LATIN SMALL LETTER O WITH HORN AND GRAVE .. LATIN SMALL LETTER O WITH HORN AND GRAVE
    (16#1EDF#, 16#1EDF#, -1), -- LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
    (16#1EE1#, 16#1EE1#, -1), -- LATIN SMALL LETTER O WITH HORN AND TILDE .. LATIN SMALL LETTER O WITH HORN AND TILDE
    (16#1EE3#, 16#1EE3#, -1), -- LATIN SMALL LETTER O WITH HORN AND DOT BELOW .. LATIN SMALL LETTER O WITH HORN AND DOT BELOW
    (16#1EE5#, 16#1EE5#, -1), -- LATIN SMALL LETTER U WITH DOT BELOW .. LATIN SMALL LETTER U WITH DOT BELOW
    (16#1EE7#, 16#1EE7#, -1), -- LATIN SMALL LETTER U WITH HOOK ABOVE .. LATIN SMALL LETTER U WITH HOOK ABOVE
    (16#1EE9#, 16#1EE9#, -1), -- LATIN SMALL LETTER U WITH HORN AND ACUTE .. LATIN SMALL LETTER U WITH HORN AND ACUTE
    (16#1EEB#, 16#1EEB#, -1), -- LATIN SMALL LETTER U WITH HORN AND GRAVE .. LATIN SMALL LETTER U WITH HORN AND GRAVE
    (16#1EED#, 16#1EED#, -1), -- LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE .. LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE
    (16#1EEF#, 16#1EEF#, -1), -- LATIN SMALL LETTER U WITH HORN AND TILDE .. LATIN SMALL LETTER U WITH HORN AND TILDE
    (16#1EF1#, 16#1EF1#, -1), -- LATIN SMALL LETTER U WITH HORN AND DOT BELOW .. LATIN SMALL LETTER U WITH HORN AND DOT BELOW
    (16#1EF3#, 16#1EF3#, -1), -- LATIN SMALL LETTER Y WITH GRAVE .. LATIN SMALL LETTER Y WITH GRAVE
    (16#1EF5#, 16#1EF5#, -1), -- LATIN SMALL LETTER Y WITH DOT BELOW .. LATIN SMALL LETTER Y WITH DOT BELOW
    (16#1EF7#, 16#1EF7#, -1), -- LATIN SMALL LETTER Y WITH HOOK ABOVE .. LATIN SMALL LETTER Y WITH HOOK ABOVE
    (16#1EF9#, 16#1EF9#, -1), -- LATIN SMALL LETTER Y WITH TILDE .. LATIN SMALL LETTER Y WITH TILDE
    (16#1F00#, 16#1F07#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI
    (16#1F10#, 16#1F15#, 8), -- GREEK SMALL LETTER EPSILON WITH PSILI .. GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA
    (16#1F20#, 16#1F27#, 8), -- GREEK SMALL LETTER ETA WITH PSILI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI
    (16#1F30#, 16#1F37#, 8), -- GREEK SMALL LETTER IOTA WITH PSILI .. GREEK SMALL LETTER IOTA WITH DASIA AND PERISPOMENI
    (16#1F40#, 16#1F45#, 8), -- GREEK SMALL LETTER OMICRON WITH PSILI .. GREEK SMALL LETTER OMICRON WITH DASIA AND OXIA
    (16#1F51#, 16#1F51#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA .. GREEK SMALL LETTER UPSILON WITH DASIA
    (16#1F53#, 16#1F53#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND VARIA
    (16#1F55#, 16#1F55#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA .. GREEK SMALL LETTER UPSILON WITH DASIA AND OXIA
    (16#1F57#, 16#1F57#, 8), -- GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI .. GREEK SMALL LETTER UPSILON WITH DASIA AND PERISPOMENI
    (16#1F60#, 16#1F67#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
    (16#1F70#, 16#1F71#, 74), -- GREEK SMALL LETTER ALPHA WITH VARIA .. GREEK SMALL LETTER ALPHA WITH OXIA
    (16#1F72#, 16#1F75#, 86), -- GREEK SMALL LETTER EPSILON WITH VARIA .. GREEK SMALL LETTER ETA WITH OXIA
    (16#1F76#, 16#1F77#, 100), -- GREEK SMALL LETTER IOTA WITH VARIA .. GREEK SMALL LETTER IOTA WITH OXIA
    (16#1F78#, 16#1F79#, 128), -- GREEK SMALL LETTER OMICRON WITH VARIA .. GREEK SMALL LETTER OMICRON WITH OXIA
    (16#1F7A#, 16#1F7B#, 112), -- GREEK SMALL LETTER UPSILON WITH VARIA .. GREEK SMALL LETTER UPSILON WITH OXIA
    (16#1F7C#, 16#1F7D#, 126), -- GREEK SMALL LETTER OMEGA WITH VARIA .. GREEK SMALL LETTER OMEGA WITH OXIA
    (16#1F80#, 16#1F87#, 8), -- GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
    (16#1F90#, 16#1F97#, 8), -- GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
    (16#1FA0#, 16#1FA7#, 8), -- GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
    (16#1FB0#, 16#1FB1#, 8), -- GREEK SMALL LETTER ALPHA WITH VRACHY .. GREEK SMALL LETTER ALPHA WITH MACRON
    (16#1FB3#, 16#1FB3#, 9), -- GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
    (16#1FBE#, 16#1FBE#, -7205), -- GREEK PROSGEGRAMMENI .. GREEK PROSGEGRAMMENI
    (16#1FC3#, 16#1FC3#, 9), -- GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
    (16#1FD0#, 16#1FD1#, 8), -- GREEK SMALL LETTER IOTA WITH VRACHY .. GREEK SMALL LETTER IOTA WITH MACRON
    (16#1FE0#, 16#1FE1#, 8), -- GREEK SMALL LETTER UPSILON WITH VRACHY .. GREEK SMALL LETTER UPSILON WITH MACRON
    (16#1FE5#, 16#1FE5#, 7), -- GREEK SMALL LETTER RHO WITH DASIA .. GREEK SMALL LETTER RHO WITH DASIA
    (16#1FF3#, 16#1FF3#, 9), -- GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI .. GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
    (16#FF41#, 16#FF5A#, -32), -- FULLWIDTH LATIN SMALL LETTER A .. FULLWIDTH LATIN SMALL LETTER Z
    (16#10428#, 16#1044D#, -40) -- DESERET SMALL LETTER LONG I .. DESERET SMALL LETTER ENG
   );
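
Each entry in the table above is a triple (first, last, offset): adding
the offset to any code point in first .. last yields its upper-case
counterpart. The following is only a minimal sketch of how such a table
might be searched; it is not taken from the posted code. The names
Code_Offset, Code_Point, Case_Range, Excerpt and To_Upper are hypothetical,
and Excerpt holds just a handful of the entries listed above. It assumes
the entries are sorted by code point and do not overlap, which is what
makes a binary search applicable.

```ada
with Ada.Text_IO;

procedure To_Upper_Demo is

   --  Hypothetical types for this sketch (not from the posted code).
   --  Code_Offset is wide enough to hold any code point or any offset.
   type Code_Offset is range -16#10FFFF# .. 16#10FFFF#;
   subtype Code_Point is Code_Offset range 0 .. 16#10FFFF#;

   type Case_Range is record
      First  : Code_Point;    --  first lower-case code point of the range
      Last   : Code_Point;    --  last lower-case code point of the range
      Offset : Code_Offset;   --  added to a code point to get its upper case
   end record;

   type Range_Table is array (Positive range <>) of Case_Range;

   --  A few entries copied from the table above, in ascending order.
   Excerpt : constant Range_Table :=
     ((16#430#,   16#44F#,   -32),     --  CYRILLIC SMALL LETTER A .. YA
      (16#561#,   16#586#,   -48),     --  ARMENIAN SMALL LETTER AYB .. FEH
      (16#1F00#,  16#1F07#,  8),       --  GREEK SMALL LETTER ALPHA WITH PSILI ..
      (16#1FBE#,  16#1FBE#,  -7205),   --  GREEK PROSGEGRAMMENI
      (16#10428#, 16#1044D#, -40));    --  DESERET SMALL LETTER LONG I .. ENG

   --  Binary search over the sorted, non-overlapping ranges.
   function To_Upper (C : Code_Point) return Code_Point is
      Low  : Natural := Excerpt'First;
      High : Natural := Excerpt'Last;
      Mid  : Positive;
   begin
      while Low <= High loop
         Mid := Low + (High - Low) / 2;
         if C < Excerpt (Mid).First then
            High := Mid - 1;
         elsif C > Excerpt (Mid).Last then
            Low := Mid + 1;
         else
            return C + Excerpt (Mid).Offset;
         end if;
      end loop;
      return C;   --  no entry found: the code point is returned unchanged
   end To_Upper;

begin
   --  Spot checks against the offsets listed in the table.
   if To_Upper (16#430#) /= 16#410#
     or else To_Upper (16#1F03#) /= 16#1F0B#
     or else To_Upper (16#1FBE#) /= 16#399#
     or else To_Upper (16#10428#) /= 16#10400#
     or else To_Upper (16#41#) /= 16#41#
   then
      raise Program_Error;
   end if;
   Ada.Text_IO.Put_Line ("all checks passed");
end To_Upper_Demo;
```

Storing an offset per range rather than a target per character keeps runs
such as 16#430# .. 16#44F# to a single entry, at the cost of one entry per
character wherever lower-case and upper-case letters alternate.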

*************************************************************

From: Randy Brukardt
Sent: Wednesday, November 27, 2002  11:01 AM

Thanks for doing this. Where are you finding the information that you are
using to do this? A quick search of the net didn't turn up anything
machine-readable...

*************************************************************

From: Michael F. Yoder
Sent: Wednesday, November 27, 2002  12:20 PM

The root link is www.unicode.org and the "latest version" link goes to

http://www.unicode.org/unicode/reports/tr28/

The "this version" link at the top goes to a page with some relevant
stuff. The page with the machine-readable files for V3.2 is:
http://www.unicode.org/Public/UNIDATA/ . The current organization seems
to be harder to navigate than it used to be; I'm unsure why.

N.B. version 3.2 of Unicode claims to be "fully synchronized" with ISO
10646, so it is strongly preferable to earlier versions.

*************************************************************

