Version 1.6 of ai05s/ai05-0127-2.txt


!standard A.19(0)          10-10-15 AI05-0127-2/05
!standard 1.2(1)
!standard 1.2(4/2)
!class Amendment 10-06-01
!status Amendment 2012 10-08-05
!status WG9 Approved 10-10-28
!status ARG Approved 9-0-0 10-06-20
!status work item 10-06-01
!status received 10-06-01
!priority Low
!difficulty Medium
!subject Adding Locale Capabilities
!summary
A package is added to identify the current locale.
!problem
Ada does not provide a portable way to determine the active locale in an environment. Knowing the active locale would facilitate writing applications that tailor the user's experience to match the user's expectations. The means to determine the current locale is operating system specific and non-portable. Should basic localization support be added to the language? (Yes.)
!proposal
Most modern operating systems provide capabilities that facilitate writing applications that tailor the user's experience with an application to match the user's expectations. The existing approaches vary considerably, however, and are non-portable.
For example, POSIX provides library calls, whereas Microsoft Windows provides a completely different set of interfaces. A portable solution is desired for Ada.
There are many areas that are affected by locale settings, such as dates, times, currency, character collation orders, message text, and numeric formatting. The basic need, however, is to be able to determine the current locale (language and country). If an application has this capability, all locale-related differences can be programmed into the application in a portable manner.
This proposal provides a new package Ada.Locales that provides functions to query the identity of the country and language associated with the current locale.
!wording
Add to normative references after 1.2(1):
ISO/IEC 639-3:2007, Terminology and other language and content resources -- Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages.
Add to normative references after 1.2(4/2):
ISO/IEC 3166-1:2006, Information and documentation -- Codes for the representation of names of countries and their subdivisions -- Part 1: Country Codes.
Add a new clause:
A.19 The Package Locales
A locale identifies a geopolitical place or region and its associated language, which can be used to determine other internationalization related characteristics.
Static Semantics
The following language-defined library package exists:
package Ada.Locales is
   pragma Preelaborate(Locales);
   pragma Remote_Types(Locales);

   type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z';

   Language_Unknown : constant Language_Code := "und";
   Country_Unknown : constant Country_Code := "ZZ";

   function Language return Language_Code;
   function Country return Country_Code;

end Ada.Locales;
The active locale is the locale associated with the active partition.
Language_Code is a lower-case string representation of an ISO 639-3 alpha-3 code that identifies a language.
Country_Code is an upper-case string representation of an ISO 3166-1 alpha-2 code that identifies a country.
Function Language returns the code of the language associated with the active locale. If the Language_Code associated with the active locale cannot be determined from the environment then Language returns Language_Unknown.
Function Country returns the code of the country associated with the active locale. If the Country_Code associated with the active locale cannot be determined from the environment then Country returns Country_Unknown.
!discussion
ISO 3166-1 defines three sets of codes; alpha-2, alpha-3, and numeric-3. These three codes cover an identical number of country names.
The alpha-2 code is a two letter code, alpha-3 is a three letter code, and numeric-3 is a three digit numeric code.
e.g.
   Country          alpha-2   alpha-3   numeric-3
   ----------------------------------------------
   AFGHANISTAN      AF        AFG       004
   CANADA           CA        CAN       124
   FRANCE           FR        FRA       250
   GERMANY          DE        DEU       276
   ITALY            IT        ITA       380
   SPAIN            ES        ESP       724
   UNITED KINGDOM   GB        GBR       826
   UNITED STATES    US        USA       840
Numeric codes are used mostly for countries where non-Latin scripts are used.
The Country function returns an upper-case string that represents the country of the current locale. The ISO 3166-1 standard is case insensitive for country codes, but recommends upper case for code usage, which is why the Country function limits the return result to upper case only. This simplifies client usage if clients know they can expect the return values to be consistently in upper case. Alpha-2 codes were chosen instead of alpha-3 codes because existing locale capabilities in POSIX and Apple OSX follow BCP 47 RFC 4646, which excludes the use of alpha-3 codes. Since Microsoft's locale id scheme does not follow ISO 3166-1, the Microsoft scheme does not impact this decision. Going with alpha-2 code format possibly allows for simpler implementations in POSIX and OSX environments since the alpha-2 code can be extracted directly from the environment without requiring a mapping.
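The point about extracting the alpha-2 code directly from the environment can be illustrated with a minimal sketch. It assumes the environment supplies a POSIX-style locale name such as "en_CA.UTF-8"; the function name Country_From_POSIX_Name is hypothetical and is not part of this proposal, and a real implementation might obtain and validate the name differently.

```ada
with Ada.Locales;

--  Hypothetical helper (not part of the proposal): derives a Country_Code
--  from a POSIX-style locale name of the assumed form
--  language[_TERRITORY][.codeset][@modifier].
function Country_From_POSIX_Name
  (Name : String) return Ada.Locales.Country_Code
is
   use Ada.Locales;
begin
   for I in Name'Range loop
      --  The territory, when present, follows the first underscore.
      if Name (I) = '_' then
         if Name'Last - I >= 2
           and then Name (I + 1) in 'A' .. 'Z'
           and then Name (I + 2) in 'A' .. 'Z'
           and then (I + 2 = Name'Last
                     or else Name (I + 3) = '.'
                     or else Name (I + 3) = '@')
         then
            --  Exactly two upper-case letters: usable without a mapping.
            return Country_Code'(Name (I + 1), Name (I + 2));
         else
            --  Anything else (for example a numeric region) is unknown.
            return Country_Unknown;
         end if;
      end if;
   end loop;

   --  No territory part at all, as in "C" or "en".
   return Country_Unknown;
end Country_From_POSIX_Name;
```

With this sketch, a name such as "fr_CA.UTF-8" would yield "CA", while "C", "en", or a name carrying a numeric region would yield Country_Unknown.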
Consideration was given to whether specific locale capabilities could be provided, such as accessing numeric formatting, date formatting, currency formatting, or collating sequence locale specific information. This was ruled out because it would be difficult to get this right, and would require a high level of effort, when there does not seem to be a high level of demand for these capabilities. A simple capability of determining the locale is all that is needed to provide portability, as application programmers can program specific locale differences as needed once the current locale has been determined.
A user application could relatively easily define a translation lookup facility that accepted the current locale, and an application message id to look up a locale specific translation. Such a facility could also look up localization features such as those provided by the OS for numeric, date formatting and currency formatting and collating sequences.
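A translation lookup facility of the kind described above might look roughly like the following sketch. The package Messages, the type Message_Id, and the message texts are all hypothetical; falling back to English is an assumption made for the illustration, not something this proposal requires.

```ada
with Ada.Locales;

package Messages is
   --  Hypothetical message identifiers for an application.
   type Message_Id is (Greeting, Farewell);

   --  Returns the text for Id in the language Lang, which defaults to
   --  the language of the active locale.  Falls back to English when no
   --  translation is available (an assumption of this sketch).
   function Text
     (Id   : Message_Id;
      Lang : Ada.Locales.Language_Code := Ada.Locales.Language)
      return String;
end Messages;

package body Messages is
   use type Ada.Locales.Language_Code;

   function Text
     (Id   : Message_Id;
      Lang : Ada.Locales.Language_Code := Ada.Locales.Language)
      return String is
   begin
      if Lang = "fra" then
         case Id is
            when Greeting => return "Bonjour";
            when Farewell => return "Au revoir";
         end case;
      else
         --  English, and the fallback for every other language code,
         --  including Language_Unknown.
         case Id is
            when Greeting => return "Hello";
            when Farewell => return "Goodbye";
         end case;
      end if;
   end Text;
end Messages;
```

A call such as Messages.Text (Messages.Greeting) would then select the text from the active locale, while passing Lang explicitly makes the lookup easy to exercise without changing the environment.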
ISO 639 has 5 code lists, three of which are relevant:
   Part 1, the alpha-2 code
   Part 2, the alpha-3 code
      ISO 639-2/T contains alpha-3 codes for the same languages as defined in ISO 639-1
      ISO 639-2/B contains alpha-3 codes that are mostly the same as ISO 639-2/T, but with some codes derived from English names rather than native names of the languages
   Part 3, the alpha-3 code for comprehensive coverage of languages.
e.g.
   Language   639-1   639-2/T   639-2/B   639-3
   ------------------------------------------------------------
   English    en      eng       eng       eng
   French     fr      fra       fre       fra
   German     de      deu       ger       deu
   Chinese    zh      zho       chi       zho + one of 13 subcodes (e.g. cmn for Mandarin)
The Language function returns a lower-case 639-3 alpha-3 string that represents the language of the current locale. The ISO 639-3 standard is case insensitive for language codes, but recommends lower case for code usage, which is why the Language function limits the return result to lower case only. This simplifies client usage if clients know they can expect the return values to be consistently in lower case.
The decision to go with 639-3 alpha-3 codes was driven by the fact that 639-1 codes only cover the major languages in use. ISO 639-2 defines codes for many more languages than 639-1, and generally covers all languages that have significant bodies of literature.
The selection of 639-2/T over 639-2/B is driven by the fact that POSIX and Apple OSX follow BCP 47 (RFC 4646), which states that when there is a choice between the "T" code and the "B" code, the T code is the recommended choice.
ISO 639-3 is a superset of ISO 639-2 that uses the 639-2/T codes instead of the 639-2/B codes when there are two codes for a language in 639-2. 639-3 also deals better with certain cases that are problematic in 639-2. For example, Chinese is a macro language which has many dialects. In 639-2/T and 639-3 this appears as the code "zho". However, if the current locale is more specifically defined to be Mandarin Chinese, 639-3 provides a code "cmn" for this purpose. 639-2 does not break down Chinese any further than the macro language. Thus we selected the more detailed ISO 639-3 codes.
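One consequence for clients is that a language may be reported either by its macro language code or by a more specific code. The following sketch, in which the function name Is_Chinese is hypothetical, tests for both "zho" and "cmn"; whether a given implementation reports one or the other is an assumption about the environment, and other individual Chinese codes are not covered here.

```ada
with Ada.Locales;

--  Hypothetical client helper: True if the active locale's language is
--  reported as the Chinese macro language or as Mandarin.
function Is_Chinese return Boolean is
   use type Ada.Locales.Language_Code;

   Lang : constant Ada.Locales.Language_Code := Ada.Locales.Language;
begin
   return Lang = "zho" or else Lang = "cmn";
end Is_Chinese;
```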
Language_Unknown is defined as "und" because ISO 639 defines that code to be used in situations in which a language or languages must be indicated but the language cannot be identified.
Country_Unknown is defined as "ZZ" because ISO 3166-1 specifies that it is one of a set of codes in the standard that are user-assigned. User-assigned code elements are codes at the disposal of users who need to add further names of countries, territories, or other geographical entities to their in-house application of ISO 3166-1, and the ISO 3166/MA will never use these codes in the updating process of the standard. The following codes can be user-assigned:
* Alpha-2: AA, QM to QZ, XA to XZ, and ZZ
* Alpha-3: AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ
One such user-assigned coding is by the Unicode Common Locale Data Repository, which assigns ZZ to represent "Unknown or Invalid Territory".
Since there is no specific code defined for an unknown country, since this code is already used for similar purposes, and since it is the last user-assigned alpha-2 code and thus less likely to be used for other purposes, "ZZ" seemed like the correct choice.
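A client would typically test for the two "unknown" constants before acting on the locale. The following is a minimal sketch; the procedure name Show_Locale is hypothetical. The components are concatenated individually rather than converting the codes to String, on the assumption that the differing component subtypes rule out a direct array conversion.

```ada
with Ada.Locales;
with Ada.Text_IO;

--  Hypothetical demonstration: prints the active locale, or "unknown"
--  where the environment does not provide the information.
procedure Show_Locale is
   use type Ada.Locales.Country_Code;
   use type Ada.Locales.Language_Code;

   Lang : constant Ada.Locales.Language_Code := Ada.Locales.Language;
   Ctry : constant Ada.Locales.Country_Code  := Ada.Locales.Country;
begin
   if Lang = Ada.Locales.Language_Unknown then
      Ada.Text_IO.Put_Line ("Language: unknown");
   else
      Ada.Text_IO.Put_Line ("Language: " & Lang (1) & Lang (2) & Lang (3));
   end if;

   if Ctry = Ada.Locales.Country_Unknown then
      Ada.Text_IO.Put_Line ("Country: unknown");
   else
      Ada.Text_IO.Put_Line ("Country: " & Ctry (1) & Ctry (2));
   end if;
end Show_Locale;
```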
Consideration was given to whether the package should deal with macro-geographic regions. The BCP 47 RFC indicates that country codes can be in numeric-3 format if the region identified is larger than a country, such as a continent. Microsoft does not have any locales based on macro-geographic regions. It is dubious that these locales are used much if at all. The numeric-3 codes in this case are outside of ISO 3166-1 because they do not represent countries. Trying to build support for this into the Ada package would be messy, and in these cases it is the Language that is the most important distinguisher rather than region. If the OS provides a numeric-3 format for macro-geographic region, it makes sense to return Country_Unknown for the Country function, since the Country truly is unknown.
Consideration was given to whether this new package should be a child of System or a child of Ada. It was decided that this package should be a child of Ada, because the package does not provide any implementation-defined definitions, and provides a portable way to access operating system facilities similar to Ada.Directories.
!example
package Canadian_Point_Of_Sale_System is
   type Dollars is delta 0.01 digits 8 range 0.0 .. 999_999.99;

   function To_String (Amount : Dollars) return String;
end Canadian_Point_Of_Sale_System;

with Ada.Locales;
with Ada.Text_IO.Editing;
with Ada.Text_IO;

package body Canadian_Point_Of_Sale_System is

   function To_String (Amount : Dollars) return String is
      package Canadian_Decimal_Output is
        new Ada.Text_IO.Editing.Decimal_Output (Num => Dollars);

      Separator, Radix : Character;
      use type Ada.Locales.Country_Code;
      use type Ada.Locales.Language_Code;
   begin

      if Ada.Locales.Country /= "CA" then
         raise Program_Error;
      end if;

      if Ada.Locales.Language = "eng" then
         Separator := ',';
         Radix := '.';
      elsif Ada.Locales.Language = "fra" then
         Separator := '.';
         Radix := ',';
      else
         raise Program_Error;
      end if;

      return Canadian_Decimal_Output.Image
        (Item       => Amount,
         Pic        => Ada.Text_IO.Editing.To_Picture
                         (Pic_String => "$ZZZZ_ZZ9.99"),
         Currency   => "$",
         Separator  => Separator,
         Radix_Mark => Radix);

   end To_String;

end Canadian_Point_Of_Sale_System;
!corrigendum 1.2(1)
Insert after the paragraph:
The following standards contain provisions which, through reference in this text, constitute provisions of this International Standard. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this International Standard are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. Members of IEC and ISO maintain registers of currently valid International Standards.
the new paragraph:
ISO/IEC 639-3:2007, Terminology and other language and content resources — Codes for the representation of names of languages — Part 3: Alpha-3 code for comprehensive coverage of languages.
!corrigendum 1.2(4/2)
Insert after the paragraph:
ISO/IEC 1989:2002, Information technology — Programming languages — COBOL.
the new paragraph:
ISO/IEC 3166-1:2006, Information and documentation — Codes for the representation of names of countries and their subdivisions — Part 1: Country Codes.
!corrigendum A.19(0)
Insert new clause:
A locale identifies a geopolitical place or region and its associated language, which can be used to determine other internationalization related characteristics.
Static Semantics
The following language-defined library package exists:
package Ada.Locales is
   pragma Preelaborate(Locales);
   pragma Remote_Types(Locales);

   type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z';

   Language_Unknown : constant Language_Code := "und";
   Country_Unknown : constant Country_Code := "ZZ";

   function Language return Language_Code;
   function Country return Country_Code;

end Ada.Locales;
The active locale is the locale associated with the active partition.
Language_Code is a lower-case string representation of an ISO 639-3 alpha-3 code that identifies a language.
Country_Code is an upper-case string representation of an ISO 3166-1 alpha-2 code that identifies a country.
Function Language returns the code of the language associated with the active locale. If the Language_Code associated with the active locale cannot be determined from the environment then Language returns Language_Unknown.
Function Country returns the code of the country associated with the active locale. If the Country_Code associated with the active locale cannot be determined from the environment then Country returns Country_Unknown.
!ACATS test
ACATS C-Tests are needed to test this package.
!appendix

From: Brad Moore
Date: Tuesday, June 1, 2010  1:15 AM

Here is a much simplified version of AI05-0127, my homework. [This is version
/01 of this AI - Editor.]

I've eliminated all locale functionality other than a capability to determine
the active language and country.

The idea is that once you have a portable means to determine the current locale,
the application programmer can program all locale related differences needed in
a portable manner.

Rather than return string, I thought it was better to return Language_Code and
Country_Code which are types derived from String.

My thinking was it is better to have distinct types for these rather than
subtypes of String to provide stronger type safety.

In the ARG meeting notes from Burlington, the suggestion was to move the package
Ada.Locales to System.Locales. I have moved the new package to be a child of
System, but modified the package name from Locales to Locale. (Plural to
singular)

It reads better in the code.

  if System.Locale.Language = "en" then
    ...
  end if;

****************************************************************

From: Jean-Pierre Rosen
Date: Tuesday, June 1, 2010  2:30 AM

Small nit: the specification says Country_Unknown and Language_Unknown, but the
discussion talks about Unknown_Country and Unknown_Language

****************************************************************

From: Brad Moore
Date: Tuesday, June 1, 2010  10:10 AM

Yes, the specification was my intent, it should be Country_Unknown and Language_Unknown throughout.

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  10:56 AM

> The idea is that once you have a portable means to determine the
> current locale, the application programmer can program all locale
> related differences needed in a portable manner.

I don't really see the need for this AI.

For one thing, it doesn't really provide portability, since the country names
and language names are impl-def.  Not totally impl-def; they have to follow one
of several standards (two-letter names, three-letter names, etc).

    The nice thing about standards is that you have so many to choose from.
        -- Somebody Famous.
           (This saying is attributed to at least Andrew S. Tanenbaum, Admiral
           Grace Hopper, and Ken Olsen, by various web sites.  And I seem to
           recall hearing some Comp Sci professor at CMU saying it, circa 1978.
           Which leads me to say, "The nice thing about the world wide web is
           that there's so much misinformation to choose from.")

And the supposed reason for using strings is to allow implementations to upgrade
to new versions of the relevant locale standards.  I'm not sure how to write
portable code using such a moving target.

If this stuff really is properly standardized, then we can use an enumeration
type.  The fact that we're using strings seems to indicate otherwise.

I don't like Unknown_Country/lang being impl-def.  Shouldn't we at least insist
that it be distinct from defined country names?  For that matter, why not nail
it down (say it's "unknown country code" or something).

According to the ARG minutes from Burlington (Feb 2010), 2 people voted against
keeping this alive.  I don't really remember, but I suspect I was one of them.
The last 3 messages in the !appendix show Pascal Leroy, Bob Duff, and Robert
Dewar, all suggesting to drop this AI (but note that that was a previous
much-more-ambitious version).  I haven't changed my mind -- I don't think even
this much-simpler version is worth the trouble.

If I were writing a program that needs l10n / i18n, I think I'd ignore this
package, and go straight to the O.S. facilities.  There really aren't that many
-- windows, plus misc versions of Unix that probably support Posix.  Embedded
real-time kernels can probably be ignored.

In the !example, variables are left uninitialized if you're not in Canada (or if
the impl chooses the numeric encoding of that country).  I don't understand what
"Dollars" are doing in a supposedly i18n app.  I guess I don't really understand
the example.

I think I prefer Locales over Locale (no big deal -- I'm just used to plurals
for package names).

****************************************************************

From: Robert Dewar
Date: Tuesday, June 1, 2010  11:02 AM

I agree with everything Bob says, and I would recommend dropping this AI.

****************************************************************

From: Robert Dewar
Date: Tuesday, June 1, 2010  6:08 AM

...
> Rather than return string, I thought it was better to return
> Language_Code and Country_Code which are types derived from String.

I think that types derived from String tend to be a nuisance, because various
utility functions do not apply without junk conversions.

> My thinking was it is better to have distinct types for these rather
> than subtypes of String to provide stronger type safety.

I disagree

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  11:10 AM

> I think that types derived from String tend to be a nuisance, because
> various utility functions do not apply without junk conversions.

But there are cases where distinct types are helpful, and I think this is one of
them.  See here for another example:

    http://www.adacore.com/2010/04/05/gem-83/

In C, you can say:

    printf (input_data); // a security hole, if privileged program!

when you should have said

    printf ("%s", input_data);

The idea of template-oriented formatting is a good one.
In fact, we use the same idea in GNAT for error messages, and also in IAC (the
CORBA IDL-to-Ada compiler). So does CodePeer (last time I checked).

But it works best if the "template" type is distinct from the "string that could
come from input data" type (namely String).

I recently fixed a bunch of bugs of this nature in IAC.
And to make sure they STAY fixed, I changed the type from String to a template
type derived from String.

> > My thinking was it is better to have distinct types for these rather
> > than subtypes of String to provide stronger type safety.
>
> I disagree

In this particular case, I agree with Brad's decision.
As I said in my previous message, these types are really more like enums than
strings.  Having country codes as a separate type allows you to keep track of
which strings have been verified to really be country codes, versus other
strings that could contain arbitrary text.

Note: In Ada 2012, I might use subtype predicates instead!  ;-)

Anyway, if we're going to have this AI, shouldn't there be an Is_Valid_Country
function?  And/or a conversion function String-->Country that checks?

****************************************************************

From: Brad Moore
Date: Tuesday, June 1, 2010  11:17 AM

I could go either way regarding derived types vs subtypes.

On the one hand I thought there might not be much need for applying utility
functions on the return codes for these functions, and the stronger types might
catch some errors (eg. erroneously passing a language code into a function that
accepts a country code to determine the currency symbol)

On the other hand, I agree that the junk conversions you mention can be an
annoyance. I am happy to go with the consensus on this, but considering your
comment, I am starting to think subtypes are the way to go.

I presume though that it is preferable to return the Language_Code and
Country_Code subtypes rather than just return string?

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  11:25 AM

> On the other hand, I agree that the junk conversions you mention can
> be an annoyance. I am happy to go with the consensus on this, but
> considering your comment, I am starting to think subtypes are the way
> to go.

Don't give in so easily.  ;-)

But I suppose if we drop this AI, as Robert and I suggest, we can leave the
type-vs-subtype question moot.

****************************************************************

From: Brad Moore
Date: Tuesday, June 1, 2010  11:17 AM

> For one thing, it doesn't really provide portability, since the
> country names and language names are impl-def.  Not totally impl-def;
> they have to follow one of several standards (two-letter names,
> three-letter names, etc).
>
>     The nice thing about standards is that you have so many to choose
> from. -- Somebody Famous.
>       (This saying is attributed to at least Andrew S. Tanenbaum,
>       Admiral Grace Hopper, and Ken Olsen, by various web sites.  And
>       I seem to recall hearing some Comp Sci professor at CMU saying
>       it, circa 1978. Which leads me to say, "The nice thing about
>       the world wide web is that there's so much misinformation to
>       choose from.")

It's not quite that bad. Really, there is only one standard for country names,
and one standard for language names (ISO 3166-1 and ISO 639).

Each standard provides several formats for the codes. I think my mistake was to
try to get away with not specifying which of the formats was used by the Ada
package. I now think it would have been better to say that the alpha-2 formats
are always returned, since those are the ones used by Microsoft, POSIX, and Java
today w.r.t locale identification.

This would at least address your portability comment, I think, since the country
names and language names would then no longer be implementation defined.

> And the supposed reason for using strings is to allow implementations
> to upgrade to new versions of the relevant locale standards.  I'm not
> sure how to write portable code using such a moving target.
>
> If this stuff really is properly standardized, then we can use an
> enumeration type.  The fact that we're using strings seems to indicate
> otherwise.

I originally considered defining an enumeration that mapped to the codes defined
by ISO, but I came to the conclusion that two character codes in the form of a
string are better suited for this purpose. Over time, as new countries form, and
new languages evolve, the ISO country and language standards will need to be
revised. Adding new values to an enumeration will cause incompatibilities
that can be avoided if we stick to returning a string based value that maps to
the two-character codes.

If I am writing an application for my current locale, say in Canada where
English and French are the official languages, it would be nice to know that
introducing a new country name for some newly formed country on the other side
of the planet will not break any enumeration case statements in my application.

> I don't like Unknown_Country/lang being impl-def.  Shouldn't we at
> least insist that it be distinct from defined country names?  For that
> matter, why not nail it down (say it's "unknown country code" or
> something).

My intent was to define these as a constant, such as "  ", (two spaces) which
does not (and would not) map to any character codes defined by ISO. Nailing it
down to a constant value sounds good to me. The point is, these are the only
cases where the values returned are not defined in the ISO standard.

> If I were writing a program that needs l10n / i18n, I think I'd ignore
> this package, and go straight to the O.S. facilities.  There really
> aren't that many -- windows, plus misc versions of Unix that probably
> support Posix.  Embedded real-time kernels can probably be ignored.

I do have some real experience with this issue. A major system we developed for
our Canadian customer required that all text displayed in all applications
running on the data terminal be displayed in either English or French depending
on the locale settings of the terminal.

The applications originally were developed for a Unix platform, but eventually
were also ported to Windows. This is one of the few areas where the code was not
portable, so our source tree ended up providing and maintaining multiple
implementations of a package.

Admittedly, it was not a huge problem to work around, but it is messier than
having one source. This complicates project make files. We were even considering
bringing in some preprocessor solution for this one issue, which we ended up
avoiding thankfully. To those developers on our team coming from a C/C++
environment, it was difficult to convince them that Ada's approach of not
providing a preprocessor was a good one, even though I believe that was a good
choice, for other technical reasons. On an aside, I was just bit last week by
some C/C++ code where a system include file had redefined an enumeration literal
I was trying to define to some other string. That's pretty scary stuff if you
can't trust that the source code you see displayed in the editor is what the
compiler sees.

In my experience, given the choice between an Ada standard package, and going
straight to O.S. facilities, I would choose the Ada package almost always,
unless the O.S. facilities provided features that were not present in the Ada
package.

> In the !example, variables are left uninitialized if you're not in
> Canada (or if the impl chooses the numeric encoding of that country).
> I don't understand what "Dollars" are doing in a supposedly i18n app.
> I guess I don't really understand the example.

The example is not a comprehensive one. I was thinking of the application we
provided for the military. The application is only going to be run in a Canadian
context, which is why I didn't test for other countries. I should just have
checked to ensure that the country is Canada, and raised program error
otherwise.

In Canada, both English and French use Dollars.

The example also shows how the locale capability can be used with
Ada.Text_IO.Editing.Decimal_Output which is an existing package that can be used
to address locale formatting of numeric and currency values. It's rather odd
that we never provided a means to facilitate using locale to select the radix,
separator, and currency inputs.

I can probably come up with a better example. In fact, I think I would like to
resubmit this AI, with a version that only uses alpha-2 codes. Before we decide
to torch this AI, it would be good to have a version that at least addresses
some of these comments. It shouldn't take me long to update.

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  4:51 PM

> In my experience, given the choice between an Ada standard package,
> and going straight to O.S. facilities, I would choose the Ada package
> almost always, unless the O.S. facilities provided features that were
> not present in the Ada package.

I guess that "unless" is the key point.  There are approximately 3 operating
systems to worry about: Windows, Linux/Unix/Posix, any other? There are hundreds
of countries/languages.  If I want portability across operating systems and
portability across countries, I'm thinking I'd rather write 2 or 3 OS-dependent
versions of things, rather than hundreds.  The current (thankfully simplified!)
version of the AI gives a somewhat-portable way to query the country. But the OS
gives much more -- for example, collating sequences.

Would you rather do:

    if Country = "xx" then
        collating order for xx goes here
    elsif Country = "yy" then
        collating order for yy goes here
    ... 100 more elsif's.

Or:

    if this is windows then
        use windows-specific stuff to get this locale's collating sequence
    elsif this is linux then
        use posix stuff
    else
        is there anything else?

I think I'm choosing 2 or 3 elsifs over 100 elsifs.

Of course, your example is different -- you had just 2 locales
(English- and French-speaking parts of Canada), so I understand that's somewhat
simpler.

> > In the !example, variables are left uninitialized if you're not in
> > Canada (or if the impl chooses the numeric encoding of that
> > country).  I don't understand what "Dollars" are doing in a
> > supposedly i18n app.  I guess I don't really understand the example.
>
> The example is not a comprehensive one. I was thinking of the
> application we provided for the military. The application is only
> going to be run in a Canadian context, which is why I didn't test for
> other countries. I should just have checked to ensure that the country
> is Canada, and raised program error otherwise.

Right.  Or for a program that could run outside Canada, you'd default to some
locale if it's not one of the ones you've specifically coded for.

> In Canada, both English and French use Dollars.

I know -- I've been to both English- and French-speaking parts.
It looks like monopoly money, with all those colors, but hey, who am I to judge.
;-)

> The example also shows how the locale capability can be used with
> Ada.Text_IO.Editing.Decimal_Output which is an existing package that
> can be used to address locale formatting of numeric and currency values.
> It's rather odd that we never provided a means to facilitate using
> locale to select the radix, separator, and currency inputs.
>
> I can probably come up with a better example. In fact, I think I would
> like to resubmit this AI, with a version that only uses alpha-2 codes.
> Before we decide to torch this AI, it would be good to have a version
> that at least addresses some of these comments. It shouldn't take me
> long to update.

Well, maybe you should wait to see what others think.

I have never done any serious i18n work, so you should take what I say with a
grain of salt.  I read a book about it some years ago, and it seemed like
operating systems had some fairly sophisticated stuff.  Unfortunately not
portable across operating systems.  But portable across locales!

****************************************************************

From: Brad Moore
Date: Wednesday, June 2, 2010  12:53 AM

...
> I think I'm choosing 2 or 3 elsifs over 100 elsifs.

I agree that if one were to write an application that everyone in the world
could use, you have serious locale needs and probably would want to glean all
the OS capabilities you can by writing non-portable calls to the OS. I think you
would be hard pressed, though, to find many real-world examples of such an
application.

By the way, if you can think of a new application that everyone in the world
would want to use, please let me know. :-) (I suppose a web browser is an
example of one such existing application)

One of the significant development costs we encountered in the Canadian
applications was in the area of language translation. It's fine having OS hooks
to determine decimal points, and currency symbols, but no amount of OS
capabilities is going to do the hard work of determining which text to display
to the user in GUIs, reports, help text, etc. Getting text from different
languages to fit on the same places on a GUI window can be quite a challenge in
itself. Writing an application that displays the correct text in every language
would be a monumental task. Good luck just finding the translators needed for
all those languages.

I suspect in practice, a good majority of i18n applications involve a handful of
languages at most, attempting to cover the main population of the users. For
example, bank machines in my area mostly have two languages, some have 4 or 5
(e.g. Chinese, English, French, Spanish, Japanese). Instruction manuals for
equipment I have seen may have 3 - 5 languages depending on where the equipment
is sold internationally. German might be one of the languages added to the list.

Any websites I have seen typically only support up to a handful of languages.

I don't recall ever encountering Swahili in my travels (not that I would know
Swahili if I saw it). Someone writing an application would not bother
translating to Swahili unless there was a reasonable chance that speakers of
that language would be relatively common users of the application.

So I think maybe I'm choosing up to a handful of portable elsifs over your 3
non-portable else ifs.

Incidentally, I suspect collating sequence is one of the more esoteric of the
locale-based areas. I don't think we made any locale-based use of collating
sequences in our applications that I can recall.

The most important use of the locale we found is to select which text we wanted
to display to the user.

> > In fact, I think I
> > would like to resubmit this AI, with a version that only uses
> > alpha-2 codes.
>
> Well, maybe you should wait to see what others think.
>

Will do.

****************************************************************

From: Jean-Pierre Rosen
Date: Wednesday, June 2, 2010   2:49 AM

> I agree that if one were to write an application that everyone in the
> world could use, you have serious locale needs and probably would want
> to glean all the OS capabilities you can by writing non-portable calls
> to the OS. I think you would be hard pressed though to find many real
> world example of such an application however.

Do not forget games. "Battle for Wesnoth" is available in 49 languages (see
http://www.wesnoth.org/gettext/). According to one of the main developers of
the game, Jeremy Rosen ;-), Gettext is the way to go to handle that many
languages. But out of scope for us, I guess.

****************************************************************

From: Bob Duff
Date: Wednesday, June 2, 2010   7:34 AM

> I think you would be hard pressed though to find many real world
> example of such an application however.

Indeed.  I have never worked on any project that had anything to do with i18n,
so I've got zero first-hand experience. I designed the message-printing stuff in
CodePeer to support it, but as far as I know, there's still only one version of
the messages (in English).  AdaCore has customers all over, but GNAT and
everything else we sell gives messages only in English.

My bank's machine asks me whether I want to use English or Spanish.

****************************************************************

From: Brad Moore
Date: Thursday, June 3, 2010   9:09 AM

> Do not forget games. "Battle for Wesnoth" is available in 49 languages
> (see http://www.wesnoth.org/gettext/). According to one of the main
> developers of the game, Jeremy Rosen ;-), Gettext is the way to go to
> handle that many languages. But out of scope for us, I guess.

Thinking back on the Canadian application I've been mentioning, it also was
scalable to any number of languages.

For that we employed a simple flat-file database, indexed by application-defined
message id strings.

We had a function that looked something like:

     function Lookup
        (Message_Id : String;
         Locale_Id : Integer) return String;

The message id hashed into a file to get the variable length translation record,
and each translation record contained a set of translations for however many
languages were supported in the system. For example we had two languages, so
English might map to 0, and French would map to 1. If we wanted to add Japanese
next, we would give it 2, and so on.

We used this mechanism for all locale related differences. We even used it for
strings returning field lengths for reports, since the spacing for reports and
tabular display on a GUI depended on the length of the text of column headers
and so on.

Once you had taken the time to enter all the translations for all the help
messages, GUI labels, report headers, and so on, adding the few extra
translations that the OS provides (such as decimal radix point, currency symbol,
probably even collating sequence) into this lookup database would be negligible
compared to the work involved in doing all the other translations.

Whenever we changed the translations for a new release of the application, we
would run a utility to reindex the message file.

In this process, we even incorporated an ASIS program that I had written to
extract enumerations from the Ada source code. If a programmer changed an
enumeration in the source code, this tool would detect whether or not the
enumeration identifier mapped to the message identifier in the database file.
This ensured that there was full coverage of translations for all enumeration
values in the Ada code, as well as ordering of enumeration values matched, and
the enumeration literal names matched the message id names in the translation
file. The ASIS application would only be run during the reindexing process.

This system worked really well, and was fast enough that lookups could be done
on the fly as windows were being presented to the user.

All this system needed was a way to figure out the current locale, which is what
this AI is hoping to address. Everything else already was portable, and didn't
require any changes when we ported from Unix to Windows.

I had considered whether this AI should also provide a message lookup facility
like the one I described above, but thought that would be too much. It is not
that difficult to create a persistent hash table to implement the lookup
function. In certain environments that have relational databases, this could be
even easier. If the number of translations is small enough, someone could
implement this using one of the standard containers such as
Ada.Containers.Hashed_Maps. (Or create their own Persistent_Hashed_Maps
container) Another alternative is to use the gettext utility that Jean-Pierre
mentioned.
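
As a rough illustration of the in-memory variant mentioned above, here is a
minimal sketch of such a lookup function built on a standard container. It is
not the code from the original application: the table contents, the Key helper,
the fallback to the message id, and the locale numbering (0 for English, 1 for
French) are all assumptions made for this example. Because String is an
indefinite type, the sketch uses Ada.Containers.Indefinite_Hashed_Maps rather
than Ada.Containers.Hashed_Maps.

```ada
with Ada.Containers.Indefinite_Hashed_Maps;
with Ada.Strings.Hash;
with Ada.Text_IO;

procedure Lookup_Demo is

   --  Hypothetical locale numbering, as in the description above.
   English : constant := 0;
   French  : constant := 1;

   package Message_Maps is new Ada.Containers.Indefinite_Hashed_Maps
     (Key_Type        => String,
      Element_Type    => String,
      Hash            => Ada.Strings.Hash,
      Equivalent_Keys => "=");

   Table : Message_Maps.Map;

   --  Combine message id and locale id into a single map key.
   function Key (Message_Id : String; Locale_Id : Integer) return String is
   begin
      return Message_Id & Integer'Image (Locale_Id);
   end Key;

   function Lookup
     (Message_Id : String;
      Locale_Id  : Integer) return String
   is
      Position : constant Message_Maps.Cursor :=
        Table.Find (Key (Message_Id, Locale_Id));
   begin
      if Message_Maps.Has_Element (Position) then
         return Message_Maps.Element (Position);
      else
         --  Assumed fallback: show the id when no translation exists.
         return Message_Id;
      end if;
   end Lookup;

begin
   Table.Insert (Key ("GREETING", English), "Hello");
   Table.Insert (Key ("GREETING", French), "Bonjour");

   Ada.Text_IO.Put_Line (Lookup ("GREETING", French));
   Ada.Text_IO.Put_Line (Lookup ("FAREWELL", French));
end Lookup_Demo;
```

A persistent version would replace the map with the hashed file described
earlier; the Lookup profile itself would not need to change.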

So, Bob Duff's earlier comment about having to write too many if else statements to
support too many languages doesn't seem to apply if you have such a message
lookup facility.

****************************************************************

From: Tucker Taft
Date: Thursday, June 3, 2010   9:24 AM

> ...So, Bob Duff's earlier comment having to write too many if else
> statements to support too many languages doesn't seem to apply if you
> have such a message lookup facility.

I agree that all programs that support internationalization use some kind of
table lookup, rather than explicit "if...else" statements.  Having an easy way
to get the current locale seems useful.

I agree with the desire to make this as portable as possible, so we should
choose one of the representations, and it sounds like you recommend the
2-character one, which makes sense to me.  If that is true, we should probably
use String(1..2) explicitly (or a suitably-named subtype or type) rather than
simply String.  Having to manipulate arbitrary-length strings seems like
unnecessary overhead if we are standardizing on 2-character locale names.

****************************************************************

From: Bob Duff
Date: Thursday, June 3, 2010   9:40 AM

> All this system needed was a way to figure out the current locale,
> which is what this AI is hoping to address.

But you could use the same argument about any platform-dependent
issue:

    - I once wrote a program that needed to query the virtual memory
      page size.

    - Lots of people want to write programs that spawn subprocesses.

    - Just yesterday, we had an internal discussion at AdaCore,
      where we decided we wanted a way to query the number of
      processors on the current machine.

    - Etc.

Should we add portable ways to do the above in Ada?  Well, maybe.
We're moving slowly in that direction (e.g. adding Ada.Directories).

But why is querying the current locale more important than any other
OS-dependent thing?  If you don't have it, it's no big deal -- you write a
package with multiple bodies (one for windows, one for unix, ...).

> So, Bob Duff's earlier comment having to write too many if else
> statements to support too many languages doesn't seem to apply if you
> have such a message lookup facility.

Well, sure, you've moved the work.  It seems like the bulk of the work for the
kind of project you described is in implementing the database and the ASIS tool,
and hiring a native French (or whatever) speaker to translate the messages.
Writing the query-locale primitive seems trivial by comparison.

I'm not strongly opposed to this AI -- I just can't get too excited about having
this feature, and of course any feature, even a small one like this, has a cost.
But as I've admitted several times, I am biased by having spent my whole life
writing English-only software.

****************************************************************

From: Tucker Taft
Date: Thursday, June 3, 2010  10:07 AM

> ... But why is querying the current locale more important than any
> other OS-dependent thing?  If you don't have it, it's no big deal --
> you write a package with multiple bodies (one for windows, one for
> unix, ...)

The argument for this kind of package is that without it, you can't easily share
anything built on top of it.  So if someone builds a nice higher-level
internationalization capability, each such capability needs to invent its own
way to get the locale.  If you want to use two of these, say one that does nice
error messages and one that handles currency well, you can't easily mix and
match.

As software becomes more globalized, this kind of thing seems increasingly
important for an internationally-standardized language. Admittedly it is just a
start, but if it can be made even simpler (e.g. by eliminating the three or four
different formats), then it puts a useful "stake" in the ground for further
portable packages to be built upon.

If we can just agree on the package spec, we don't really care so much on
whether the package is implemented by the vendor, because as we know the
implementation of such a package is generally trivial on most existing O/Ss.  It
is the common spec that provides the real value.

****************************************************************

From: Bob Duff
Date: Thursday, June 3, 2010   4:01 PM

...
> But you could use the same argument about any platform-dependent
> issue:
>
>     - I once wrote a program that needed to query the virtual memory
>       page size.

Me too. But I don't think this is a common need (and when you do it, the result
isn't portable even to different versions of the same OS - I ended up tuning
each such program to the machine that I intended to run it on).

>     - Lots of people want to write programs that spawn subprocesses.

We've tried previously to standardize that, but we couldn't even figure out a
way to describe it portably. If anyone has a good idea, I think we would surely
consider it again.

>     - Just yesterday, we had an internal discussion at AdaCore,
>       where we decided we wanted a way to query the number of
>       processors on the current machine.

That's part of the CPU proposal, of course, which is now in AI05-0171-1.
Specifically, function Number_Of_CPUs. Should be part of Ada 2012.

>     - Etc.

Hard to comment on that.

> Should we add portable ways to do the above in Ada?  Well, maybe.
> We're moving slowly in that direction (e.g. adding Ada.Directories).

The answer is yes, but it is hard enough in some cases (spawn) that we haven't
done it.

> But why is querying the current locale more important than any other
> OS-dependent thing?  If you don't have it, it's no big deal -- you
> write a package with multiple bodies (one for windows, one for unix,
> ...).
...
> I'm not strongly opposed to this AI -- I just can't get too excited
> about having this feature, and of course any feature, even a small one
> like this, has a cost.  But as I've admitted several times, I am
> biased by having spent my whole life writing English-only software.

Well, for me, I think that all software should be written in English and people
that don't know English should keep using their abacuses. :-) But I don't expect
to get much support for *that* position.

And this seems pretty trivial. The main issue is whether to insist on strings of
a particular length or just make it String. Based on the various comments, I
thought we had decided on String (because the most recent standards in this
area use 3 character and longer strings sometimes), just to keep the future
flexibility available. (If we insist on 2-character strings, what happens when
Windows or Linux implements the 3-character names from i18n??)

****************************************************************

From: Brad Moore
Date: Friday, June 4, 2010   1:05 PM

> I agree with the desire to make this as portable as possible, so we
> should choose one of the representations, and it sounds like you
> recommend the 2-character one, which makes sense to me.  If that is
> true, we should probably use String(1..2) explicitly (or a
> suitably-named subtype or type) rather than simply String.  Having to
> manipulate arbitrary-length strings seems like unnecessary overhead if
> we are standardizing on 2-character locale names.

I have been giving this more thought, and think that for language names, it
makes more sense to go with the 3-character codes defined in ISO 639-2, rather
than the 2-character codes defined in ISO 639-1.

ISO 639-2 covers all the languages in ISO 639-1, but adds quite a number of
other languages.

ISO 639-1 is intended to cover all the major languages in the world. ISO 639-2
is intended to cover all the languages in the world with significant bodies of
literature, and also includes codes for language groups (although those probably
aren't relevant to this AI). It covers most of the languages of the world.

ISO 639-3 adds to the ISO 639-2 code set. It is intended to be a comprehensive
list of all languages, including extinct, ancient, historic, and constructed
languages. It also is a 3 character code.

See http://www.loc.gov/standards/iso639-2/php/code_list.php

The following is also an excellent site where you can browse ISO 639-1, -2, and
-3 codes:

http://www.sil.org/iso639-3/codes.asp

A quote from ISO's FAQ on the ISO 639 site.

Q => "Why do some languages have both ISO 639-1 and 639-2 codes associated with
them while others have only ISO 639-2 codes?"

A => "  ...  However, because of the inadequacy of the alpha-two codes to
represent all of the languages in the world (it can only accommodate 676 codes)
and to assure backwards compatibility with existing usage compliant with RFC
4646 (and its predecessors), new language codes may be considered for inclusion
in both parts or in ISO 639-2 only."

Assuming that we decide to go with ISO 639-2, the question then becomes: should
we use the T codes or the B codes of ISO 639-2?

The B codes are the codes that match the English pronunciation of the language
name, whereas the T codes match the native name of the language.

e.g. For French, the B code is fre while the T code is fra.
     For German, the B code is ger while the T code is deu.

Most languages only have one code in ISO 639-2. For those languages you'd get
the same code for both.

The Wikipedia site below suggests that the T codes are generally preferred,
http://en.wikipedia.org/wiki/ISO_639-2

I suggest we go with ISO 639-2/T.

For country names defined in ISO 3166-1, the set of countries of alpha-2 codes
is identical to the set of countries with alpha-3 codes. In that case, there is
not much reason for recommending the alpha-2 codes vs the alpha-3 codes. Alpha-2
codes are used in domain name suffixes. Alpha-3 codes are used in places such as
passport identification. They are a bit more readable than the 2-character
codes. To be consistent, I suppose I could argue that we should use 3 character
codes since we would want to use 3 character codes for language names.

A further note of discussion.
For locales, Microsoft uses its own concept called Locale Id.
Microsoft defines a locale as either a language, or a language combined with a
country.

A Windows locale id is a 16 bit code.

See
http://www.science.co.il/language/Locale-Codes.asp?s=decimal

Which shows a mapping from locale id to ISO 3166-1 country name (but not ISO 639
language name).

I believe ISO 639-2 would cover all the languages supported by Microsoft (and
much more).

I had a quick scan through the list and counted roughly 65 languages supported
by Microsoft, some with multiple variants based on country, e.g. English (United
States) and English (Canada).

(You'd think we speak the same language, but we say our "Z"'s differently. Not
to mention we also tend to favour British spellings on things, but sometimes
prefer the American spelling just to keep things confusing)

Microsoft's approach with locale id suggests to me that there is even more
reason for providing a portable means in Ada to get the locale. General purpose
translation lookup facilities such as mentioned in the previous email, would
benefit from having a portable way to get locale names (language and country) on
a Windows platform.

An implementation of this AI on Windows could do the translation from Windows
locale id into ISO 639-2 and ISO 3166-1, which I don't think would be hard to
do. It should be a simple mapping.
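
To give an idea of how small such a mapping could be, here is a sketch covering
only four Windows locale ids. The type and function names are invented for this
example and are not part of the proposal. The choice of alpha-3 country codes,
"und" (the ISO 639-2 code for an undetermined language) and "ZZZ" (a
user-assignable ISO 3166-1 alpha-3 code) for unmapped ids is an assumption.

```ada
with Ada.Text_IO;

procedure Lcid_Demo is

   --  A Windows locale id is a 16 bit code.
   type Windows_Locale_Id is mod 2 ** 16;

   subtype Language_Code is String (1 .. 3);
   subtype Country_Code  is String (1 .. 3);

   --  Map a locale id to an ISO 639-2/T language code.
   function Language (Id : Windows_Locale_Id) return Language_Code is
   begin
      case Id is
         when 16#0409# | 16#1009# => return "eng";  --  en-US, en-CA
         when 16#040C# | 16#0C0C# => return "fra";  --  fr-FR, fr-CA
         when others              => return "und";  --  not in this table
      end case;
   end Language;

   --  Map a locale id to an ISO 3166-1 alpha-3 country code.
   function Country (Id : Windows_Locale_Id) return Country_Code is
   begin
      case Id is
         when 16#0409#            => return "USA";
         when 16#1009# | 16#0C0C# => return "CAN";
         when 16#040C#            => return "FRA";
         when others              => return "ZZZ";  --  not in this table
      end case;
   end Country;

begin
   --  French (Canada)
   Ada.Text_IO.Put_Line (Language (16#0C0C#) & "_" & Country (16#0C0C#));
end Lcid_Demo;
```

A real implementation would need a complete table, but the shape would stay the
same: one case statement (or lookup table) per function.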

Assuming we decide to have the package return 3-character subtypes rather than
String, which would be preferred?

1)
   type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code is array (1 .. 3) of Character range 'a' .. 'z';

(or leave off the constraint. The ISO standards recommend lower case,  and the
codes are case insensitive)

2)
   type Language_Code is new String (1 .. 3);
   type Country_Code is new String (1 .. 3);

3)
   subtype Language_Code is String (1 .. 3);
   subtype Country_Code is String (1 .. 3);


4) Have the functions return
        String (1 .. 3)

Any other suggestions?

I'm leaning toward either 1) or 2).

****************************************************************

From: Tucker Taft
Date: Friday, June 4, 2010   2:20 PM

All of this makes sense.  As far as T vs. B, what do most operating systems
provide? If they only provide B, then we should go with that.  If they provide
both, then "T" seems like the way to go.  If some provide only "T" and some
provide only "B", then we would have to say it is implementation defined whether
the "T" or "B" version is returned, and the user would have to have both as keys
in their mapping from locale ID to message contents.

Having to worry about both upper and lower case is a pain.  I would go with
upper case only if we want to save people the trouble of doing case-insensitive
lookups, since they seem to be used in upper case in many contexts, and Ada
tends to favor all upper case for things like Enum'Image.

By the way, are the characters used in the 3-character code guaranteed to be
Latin-1, or do we need to use Wide_Character for the character codes?

And I agree with making them a distinct type, if they are restricted to being
exactly three characters.

****************************************************************

From: Brad Moore
Date: Friday, June 4, 2010   5:50 PM

In Mac OS X, locales are identified as per BCP 47 (RFC 4646)
http://www.rfc-editor.org/rfc/bcp/bcp47.txt

Locales in POSIX are identified by system environment variables, notably the
LANG environment variable.

From Wikipedia,
"On Unix, Linux and other POSIX-type platforms, locale identifiers are defined
similar to the BCP 47 definition of language tags, but the locale variant
modifier is defined differently, and the character set is included as a part of
the identifier. It is defined in this format:
[language[_territory][.codeset][@modifier]]

(For example, Australian English using the UTF-8 encoding is
en_AU.UTF-8.) "
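
As a sketch of how an implementation might pull the language and territory out
of a POSIX-style identifier in the format quoted above, consider the following.
The function names are invented for this example. It only splits the string and
does not validate the parts against any ISO list.

```ada
with Ada.Text_IO;

procedure Parse_Locale_Demo is

   --  Language part: everything before the first '_', '.' or '@'
   --  (the whole string if none of these is present).
   function Language_Part (Locale : String) return String is
   begin
      for I in Locale'Range loop
         if Locale (I) = '_' or Locale (I) = '.' or Locale (I) = '@' then
            return Locale (Locale'First .. I - 1);
         end if;
      end loop;
      return Locale;
   end Language_Part;

   --  Territory part: the text between '_' and the next '.' or '@',
   --  or "" when the identifier has no territory.
   function Territory_Part (Locale : String) return String is
      Lang  : constant String   := Language_Part (Locale);
      Start : constant Positive := Locale'First + Lang'Length;
   begin
      if Start > Locale'Last or else Locale (Start) /= '_' then
         return "";
      end if;
      for I in Start + 1 .. Locale'Last loop
         if Locale (I) = '.' or Locale (I) = '@' then
            return Locale (Start + 1 .. I - 1);
         end if;
      end loop;
      return Locale (Start + 1 .. Locale'Last);
   end Territory_Part;

begin
   Ada.Text_IO.Put_Line (Language_Part ("en_AU.UTF-8"));
   Ada.Text_IO.Put_Line (Territory_Part ("en_AU.UTF-8"));
end Parse_Locale_Demo;
```

With the Wikipedia example, the two calls yield "en" and "AU" respectively.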

BCP 47 (RFC 4646) identifies a format that starts with an ISO 639 code (either
alpha-2 or alpha-3) followed by other optional parts separated by hyphens. The
next optional part is the extended language subtag, which consists of up to
three alpha-3 codes (separated by hyphens) from ISO 639.

This mostly is not used, but some locales do use this extended language tag.

Examples of language tags including extlang subtags are:

    * zh-yue (Cantonese Chinese)
    * ar-afb (Gulf Arabic)

With this possibility, I would say we need to go back to language codes as being
defined as variable length string types.

Following this is a 4 character ISO 15924 code identifying the script associated
with the language.

Then comes the region identifier, which is either an alpha-2 or numeric-3 ISO
3166-1 code.

Note that alpha-3 is not used here to identify the country. I believe this is
how the syntax differentiates between optional extlang ISO 639 language codes
and the region code. If it's alpha-3 it's an ISO 639 code, otherwise, it's an
ISO 3166-1 code. This suggests that we should be using 3166-1 alpha-2 for the
country codes.
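
The length-based distinction described above can be sketched as follows for the
subtags that come after the primary language subtag. This is a simplification
for illustration only: the type and function names are invented, and BCP 47
variant, extension and private-use subtags are all lumped together as Other.

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Subtag_Demo is

   type Subtag_Kind is (Extended_Language, Script, Region, Other);

   function All_Letters (S : String) return Boolean is
   begin
      for I in S'Range loop
         if not Ada.Characters.Handling.Is_Letter (S (I)) then
            return False;
         end if;
      end loop;
      return True;
   end All_Letters;

   function All_Digits (S : String) return Boolean is
   begin
      for I in S'Range loop
         if not Ada.Characters.Handling.Is_Digit (S (I)) then
            return False;
         end if;
      end loop;
      return True;
   end All_Digits;

   --  Classify one subtag that follows the primary language subtag.
   function Classify (Subtag : String) return Subtag_Kind is
   begin
      if Subtag'Length = 3 and then All_Letters (Subtag) then
         return Extended_Language;  --  alpha-3: ISO 639
      elsif Subtag'Length = 4 and then All_Letters (Subtag) then
         return Script;             --  alpha-4: ISO 15924
      elsif (Subtag'Length = 2 and then All_Letters (Subtag))
        or else (Subtag'Length = 3 and then All_Digits (Subtag))
      then
         return Region;             --  alpha-2 or numeric-3
      else
         return Other;
      end if;
   end Classify;

begin
   Ada.Text_IO.Put_Line (Subtag_Kind'Image (Classify ("yue")));
   Ada.Text_IO.Put_Line (Subtag_Kind'Image (Classify ("Cyrl")));
   Ada.Text_IO.Put_Line (Subtag_Kind'Image (Classify ("CA")));
   Ada.Text_IO.Put_Line (Subtag_Kind'Image (Classify ("419")));
end Subtag_Demo;
```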

If we see a numeric-3 code, the implementation could convert that to an alpha-2
code. Or, alternatively we could say that the Country code function can return
either a 2 or 3 character field.

Some Key excerpts from BCP 47...

The RFC states "the language tags described in this document are sequences of
characters from the US-ASCII [ISO646] (7 bit ASCII) repertoire."

This answers your question about whether we need to worry about wide characters
and so on. The answer is no.

"At all times, language tags and their subtags, including private use
   and extensions, are to be treated as case insensitive:"

"This format generally corresponds to
   the common conventions for the various ISO standards from which the
   subtags are derived.

   These conventions include:

   o  [ISO639-1] recommends that language codes be written in lowercase
      ('mn' Mongolian).

   o  [ISO15924] recommends that script codes use lowercase with the
      initial letter capitalized ('Cyrl' Cyrillic).

   o  [ISO3166-1] recommends that country codes be capitalized ('MN'
      Mongolia).
"

"When languages have both an ISO 639-1 two-character code and a three-
   character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only
   the ISO 639-1 two-character code is defined in the IANA registry.
"

This suggests ISO 639-1 (alpha-2) is used when it can be, otherwise use ISO
639-2 (or higher) to return an alpha-3 code.

On my Linux machine, the LANG variable is set to "en_CA.utf8".  -- utf8
identifies the codeset.

"When a language has no ISO 639-1 two-character code and the ISO
   639-2/T (Terminology) code and the ISO 639-2/B (Bibliographic) code
   for that language differ, only the Terminology code is defined in the
   IANA registry."

This suggests the answer to your question about "T" vs "B" is that they tend to
use "T".

Users of the Ada.Locales.Language function would write their applications
according to the spec we provide.

Based on all this, if we wanted to define as precise a type as possible to
describe the return values in Ada, we might end up with something like:

   ISO_639_Max_Length : constant := 3;
   subtype ISO_639_Index is
     Positive range 1 .. 4 * ISO_639_Max_Length;

   type Language_Code is
      array (ISO_639_Index range <>) of Character range 'a' .. 'z';
   -- Add an Ada 2012 invariant that says
   -- Language_Code'Length > 1 and
   -- (Language_Code'Length mod 3 = 0 or Language_Code'Length mod 3 = 2)
   -- This allows for extended language subtags

   type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z';

Though the ISO standards are case insensitive, we could force return values from
our package to be only upper or lower case.

Or we could eliminate the constraint and use simpler string types, even though
we would always return upper or lower case.

****************************************************************

From: Brad Moore
Date: Friday, June 4, 2010   6:57 PM

- More on extended language subtags from BCP 47.

"  Although the ABNF production 'extlang' permits up to three
       extended language tags in the language tag, extended language
       subtags MUST NOT include another extended language subtag in
       their 'Prefix'.  That is, the second and third extended language
       subtag positions in a language tag are permanently reserved and
       tags that include those subtags in that position are, and will
       always remain, invalid.

   For example, the macrolanguage Chinese ('zh') encompasses a number of
   languages.  For compatibility reasons, each of these languages has
   both a primary and extended language subtag in the registry.  A few
   selected examples of these include Gan Chinese ('gan'), Cantonese
   Chinese ('yue'), and Mandarin Chinese ('cmn').  Each is encompassed
   by the macrolanguage 'zh' (Chinese).  Therefore, they each have the
   prefix "zh" in their registry records.  Thus, Gan Chinese is
   represented with tags beginning "zh-gan" or "gan", Cantonese with
   tags beginning either "yue" or "zh-yue", and Mandarin Chinese with
   "zh-cmn" or "cmn".  The language subtag 'zh' can still be used
   without an extended language subtag to label a resource as some
   unspecified variety of Chinese, while the primary language subtag
   ('gan', 'yue', 'cmn') is preferred to using the extended language
   form ("zh-gan", "zh-yue", "zh-cmn")."

This suggests we might be able to stick with just returning a single
alpha-2 or alpha-3 code for Language. If the locale has an extended subtag,
return that instead of the primary language subtag.

- Regarding numeric region (country) codes.

The numeric codes identify macro-geographic (continental) regions or
sub-regions. If the region has an ISO 3166-1 code defined for it, that is what
must be registered. The numeric code is only used for regions larger than a
country.

To give an idea of what these numeric codes are;

I found the list of numeric regions at

http://rishida.net/utils/subtags/index.php?list=7&submit=List

001 World
002 Africa
005 South America
009 Oceania
011 Western Africa
013 Central America
014 Eastern Africa
015 Northern Africa
017 Middle Africa
018 Southern Africa
019 Americas
021 Northern America
029 Caribbean
030 Eastern Asia
034 Southern Asia
035 South-Eastern Asia
039 Southern Europe
053 Australia and New Zealand
054 Melanesia
057 Micronesia
061 Polynesia
142 Asia
143 Central Asia
145 Western Asia
150 Europe
151 Eastern Europe
154 Northern Europe
155 Western Europe
419 Latin America and the Caribbean

The key thing is, the rules in BCP 47 are set up in such a way that for each
locale there is only one way to define it according to the IANA registries. For
a given locale, it basically uses the shortest codes defined in ISO 3166-1 and
ISO 639. If we just return this minimal code then clients shouldn't have to
worry about checking for all the variants between alpha-2, alpha-3, and
numeric-3.

So my once again revised view of return types now is that Country and Language
would both return a single alpha-2 or alpha-3 code that uniquely identifies the
locale.

****************************************************************

From: Tucker Taft
Date: Saturday, June 5, 2010   10:42 AM

> ... So my once again revised view of return types now is that Country
> and Language would both return a single alpha-2 or
> alpha-3 code that uniquely identifies the locale.

TMI!

We lose portability if we try to accommodate everything.  I think we should pick
one, and have the body map to that, presuming that is possible.  Why allow both
alpha-2 and alpha-3?  How is the "portable" program supposed to deal with that?

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010   12:53 PM

> TMI!

Sorry for the data overload. Once I started poking around in BCP 47, I ran into
issues I hadn't considered, and was getting quite frazzled with the approach of
having a package implementation that blindly returned whatever the OS gave us.

> We lose portability if we try to accommodate everything.  I think we
> should pick one, and have the body map to that, presuming that is
> possible.  Why allow both alpha-2 and alpha-3?  How is the "portable"
> program supposed to deal with that?

Yes! If we have the implementation map the OS string to alpha-3, I think it
makes things a whole lot simpler for us in the long run. I was worried that we
wanted to allow returning the OS string value. While that might be trivial to
implement as it doesn't require any mapping, you then have to deal with
specifying what actually is returned, including special IANA rules regarding
whether codes are registered or not, and how they are registered, and so on.

As you mention, then clients also have a harder time trying to figure out
whether to expect alpha-2 or alpha-3, or numeric-3.

The BCP 47 description of macro-geographic regions also threw me for a loop,
and I was worried about how to deal with that.

Microsoft doesn't have locales based on macro-geographic regions, it's either a
language, or a language specialized by a country.

I suspect macro-geographic regions are seldom used, if at all.
Since those aren't countries, if that's what the OS gives us, I think it would
be fair to return Country_Unknown for those cases, if they exist. Likely it is
the Language that is important, not the country/region name in those cases.

So it seems clear to me that Language should return a 3 character ISO 639-2/T
code.

It is less clear whether we should go with alpha-2 or alpha-3 ISO 3166-1 codes
for Country. We should pick one or the other and stick with it. Alpha-2 might
require less implementation work for POSIX/OSX since BCP 47 does not allow
alpha-3 country names. We could use a simple parsing of the OS value in that
case. Windows would require a mapping either way.

On the other hand, if we are using alpha-3 for Language, it might make sense to
use alpha-3 for country also. Alpha-3 is generally more readable than alpha-2,
and does not have the code space limitations of the 2 character coding scheme.
Based on this, I think it makes sense to go with alpha-3 for Country name.

I think we have converged a lot closer to a solution than the last writeup of
AI-0127. I'm thinking I should submit an updated revision so people can get a
better understanding of where we are at.

****************************************************************

From: Tucker Taft
Date: Saturday, June 5, 2010   1:19 PM

I would follow POSIX if it has already standardized this to some degree.  And
yes, we need a simplified version of this incorporating your latest (and
hopefully final ;-) thinking on this!

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010   2:44 PM

OK, based on that I will go with alpha-2 for Country codes.

One more question.

I am thinking of going with;

   type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z';

rather than;

   type Language_Code is new String (1 .. 3);
   type Country_Code is new String (1 .. 2);

since it is a more precise definition, and better portrays to the user what they
can expect as a return value.

That raises the question of how to define Country_Unknown and Language_Unknown,
since defining as a 2 or 3 spaces code would not match this definition.

For country codes, ISO 3166-1 does define some reserved codes in two categories;
   - reserved codes
   - user defined codes.

Reserved codes are codes that have become obsolete. The ISO 3166/MA, when
justified, reserves these codes which it undertakes not to use for other than
specified purposes during a limited or indeterminate period of time.

User-assigned code elements are codes at the disposal of users who need to add
further names of countries, territories, or other geographical entities to their
in-house application of ISO 3166-1, and the ISO 3166/MA will never use these
codes in the updating process of the standard. The following codes can be
user-assigned:

    * Alpha-2: AA, QM to QZ, XA to XZ, and ZZ
    * Alpha-3: AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ

Of these two categories, if we wanted to define a constant for Country_Unknown,
I think we would want to select a code from the user-assigned code group.

According to Wikipedia, one such user-assigned coding is by the Unicode Common
Locale Data Repository, which assigns ZZ to represent "Unknown or Invalid
Territory"

Assuming this is the way we want to go,
I would propose we use "ZZ" also for that purpose in the definition of
Country_Unknown.

I like the idea of using a value defined within the standard rather than
defining our own constant such as "  ". We can then say that the values returned
by the Country function are always ISO 3166-1 codes.

For language codes, ISO-639 defines "und" (for undetermined) which is used in
situations in which a language or languages must be indicated but the language
cannot be identified.

The Language_Unknown constant should be set as that code.
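To make the proposal above concrete, here is a minimal, self-contained sketch
(not part of the original message) of the two constrained array types together
with the "und" and "ZZ" constants. The procedure name is hypothetical, and the
declarations are local copies for illustration rather than the final package
text.

```ada
with Ada.Text_IO;

procedure Locale_Types_Sketch is

   --  The more precise definitions proposed above: every component is
   --  constrained to the letters the ISO codes may contain.
   type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code  is array (1 .. 2) of Character range 'A' .. 'Z';

   --  Both constants are ordinary string literals that satisfy the
   --  component constraints.
   Language_Unknown : constant Language_Code := "und";
   Country_Unknown  : constant Country_Code  := "ZZ";

   --  By contrast, a blank code does not fit these definitions:
   --     Bad : constant Country_Code := "  ";
   --  would fail the component range check (Constraint_Error).

begin
   --  The values are written component by component, so no conversion
   --  to String is needed.
   for I in Language_Unknown'Range loop
      Ada.Text_IO.Put (Language_Unknown (I));
   end loop;
   Ada.Text_IO.Put ('_');
   for I in Country_Unknown'Range loop
      Ada.Text_IO.Put (Country_Unknown (I));
   end loop;
   Ada.Text_IO.New_Line;
end Locale_Types_Sketch;
```

One consequence of this design choice is that the component subtypes differ
from those of String, so these types are string types in their own right
rather than something that can be freely mixed with String.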

****************************************************************

From: Tucker Taft
Date: Saturday, June 5, 2010   4:12 PM

XX and xxx would seem to be natural choices.
Defining the arrays as limited to the specified range of characters makes sense,
since just looking at the spec will eliminate a lot of questions.

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010   5:11 PM

The reserved XXX code for alpha-3 only applies to Country codes and ISO 3166-1.
It is not reserved as far as I know in ISO 639.

ISO 639 specifically defines "und" which is not a special reserved code or user
assigned code, it is a regular code just like all the others. We should
definitely be using "und" for Language_Unknown.

Since we wouldn't use xxx for language, there is no benefit for a matching code
for country. Since ZZ is the last user defined Country code, and because it
already has uses for the purpose of representing an unknown country, I think we
should use the ZZ code.

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010   3:58 PM

One other question.

Does it really make sense for this package to be a child of System?

System packages seem to be packages that define things such as implementation
defined constants, such as the Storage_Element definition.

This locale package does not have any implementation defined definitions, and
feels to me more like a portable library such as Ada.Directories.

I think it should be a child of Ada, rather than a child of System.

****************************************************************

From: Tucker Taft
Date: Saturday, June 5, 2010   4:13 PM

Agreed.  Make it a child of Ada,
or conceivably a child of Interfaces.

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010   5:55 PM

I have an updated version attached that [Version /02 - Editor.]
  - eliminates implementation defined constants
  - eliminates implementation defined return values
  - Limits the codes returned to always be 3 characters for language
    codes as defined by ISO 639-2/T and 2 characters for country codes
    as defined by ISO 3166-1.
  - Changes the types to specify lower case constraints for language
    codes and upper case constraints for country codes.
  - Moves the package from a child of System to a child of Ada.

****************************************************************

From: Randy Brukardt
Date: Saturday, June 12, 2010   9:48 PM

A couple of questions for the next version (*AFTER* the meeting):

The package name changed from Locale to Locales in this version. Was that
intended?

The package requires ISO 639-2/T names. Why are the other parts of that standard
referenced in the "Normative References" section? If we're not using them, they
ought not be there. (BTW, I put those references in the required numeric order.)

****************************************************************

From: Brad Moore
Date: Friday, June 25, 2010   7:05 PM

> A couple of questions for the next version (*AFTER* the meeting):
>
> The package name changed from Locale to Locales in this version.
> Was that
> intended?

Yes, I should have mentioned that as another change (it actually was the
name of the original version of the package).

The reasoning behind that change is that when the package was a child of System,
it seemed to make more sense to use singular, because it read better, since
"System" acted as an adjective to describe "Locale". When moved as a child of
Ada, this no longer was the case. "Ada" does not have a locale. Many of the
other child packages of Ada are plural as well (Directories, Assertions,
Containers, etc). Under "Ada", it seems to be a better choice to go with the
plural form, since "Locales" is more like a subject area, much like "Assertions"
is a subject area if one wants to go about inserting an assertion in one's code.

> The package requires ISO 639-2/T names. Why are the other parts of that
> standard referenced in the "Normative References" section? If we're not
> using them, they ought not be there. (BTW, I put those references in the
> required numeric order.)


I actually had removed 639-1 from my last submission, I see you added it back
in. I removed 639-1 because that describes alpha-2 codes which we definitely
weren't using.

I am thinking we should be able to get rid of the 639-2 reference also. The
639-3 standard is a superset of 639-2. It uses the 639-2/T codes instead of the
639-2/B codes, when there are two codes for a language in 639-2.

The 639-3 codes are also alpha-3 codes, as are 639-2 codes.

639-3 also better deals with certain cases that are problematic in 639-2.
For example, Chinese is a macro language.
In 639-2/T and 639-3 this appears as the code "zho".

However, if the current locale is more specifically defined to be Mandarin
Chinese, 639-3 provides a code "cmn" for this purpose. 639-2 does not break down
Chinese any further than the macro language.

****************************************************************

From: Randy Brukardt
Date: Thursday, August 5, 2010  9:57 PM

The wording for this package has no introduction. Compare to A.16
(Directories) or A.17 (Environment_Variables).

The first sentence of the wording could be moved to be an introduction:

A locale identifies a geopolitical place or region, its associated character
sets, date and time formats, currency formats, and other internationalization
related characteristics.

"locale" probably ought to be in italics here.

But this wording bothers me, as it seems to promise a lot more than this
package actually is going to deliver (country and language codes). There is
nothing about character sets, time formats, or currency formats here! I'd be
happier if we promised less:

A locale identifies a geopolitical place or region and its associated language,
which can be used to determine other internationalization related
characteristics.

Anybody have better wording or another idea??

---

Another problem with this wording: there never is anything that says what
Language and Country return (other than in the case where they don't know). We
define what a Language_Code and a Country_Code is, but that wording seems to
assume the only way to get such a value is from these functions (which is
obviously False, given the constants and string literals in the package). Surely
the *type* has nothing to do with the active locale!

So, for example, the Country_Code should be defined as follows:

Country_Code is an upper-case string representation of an ISO 3166-1 alpha-2
code that identifies a country.

And the function Country as (with the second sentence being the existing
wording):

Function Country returns the code of the country associated with the active
locale. If the Country_Code associated with the active locale cannot be
determined from the environment then Country returns Country_Unknown.

Language and Language_Code should be defined in the same way.
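For illustration only, client code relying on the semantics described above
might look like the following sketch. It assumes the package ends up as
discussed in this thread, that is, Ada.Locales exporting Language_Code,
Language, and Language_Unknown; the procedure name and the greeting strings
are hypothetical.

```ada
with Ada.Locales;
with Ada.Text_IO;

procedure Greet_Sketch is
   --  Makes "=" on Language_Code directly visible.
   use type Ada.Locales.Language_Code;

   Lang : constant Ada.Locales.Language_Code := Ada.Locales.Language;
begin
   if Lang = Ada.Locales.Language_Unknown then
      --  The environment did not let the implementation determine
      --  a language, so fall back to a default.
      Ada.Text_IO.Put_Line ("Hello (language undetermined)");
   elsif Lang = "fra" then
      Ada.Text_IO.Put_Line ("Bonjour");
   elsif Lang = "deu" then
      Ada.Text_IO.Put_Line ("Guten Tag");
   else
      Ada.Text_IO.Put_Line ("Hello");
   end if;
end Greet_Sketch;
```

Because the "unknown" case is an ordinary value of the type rather than an
exception, the fallback is just another branch of the same comparison chain.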

---

Finally a trivial glitch: the runtime semantics of library packages are always
defined in the Static Semantics section (don't ask me why); there shouldn't be a
Dynamic Semantics section.

****************************************************************

From: Brad Moore
Date: Friday, August 6, 2010   3:02 PM

> The first sentence of the wording could be moved to be an introduction:
>

Agree

> A locale identifies a geopolitical place or region and its associated
> language, which can be used to determine other internationalization
> related characteristics.
>
> Anybody have better wording or another idea??
>

I am fine with the wording you have suggested, unless someone comes up with
something better.

> So, for example, the Country_Code should be defined as follows:
>
> Country_Code is an upper-case string representation of an ISO 3166-1
> alpha-2 code that identifies a country.
>
> And the function Country as (with the second sentence being the
> existing
> wording):
>
> Function Country returns the code of the country associated with the
> active locale. If the Country_Code associated with the active locale
> cannot be determined from the environment then Country returns Country_Unknown.
>
> Language and Language_Code should be defined in the same way.

Agree

One other thing I have been thinking about. I think it would be nice if this
package could be a remote types package.

For example, I can imagine a server application that receives client requests
from clients using different locales. The client could include its
country/language code in the request, and the server could respond with a
response suitable for the client's locale.
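A rough, non-distributed sketch of that scenario follows (it is not part of
the original message). The Request record, the Greeting function, and the
sample codes are hypothetical, and the two code types are local stand-ins for
the ones proposed in this thread. In a real distributed program the request
would travel between partitions instead of being built locally.

```ada
with Ada.Text_IO;

procedure Locale_Request_Sketch is

   --  Local stand-ins for the types proposed in this thread.
   type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code  is array (1 .. 2) of Character range 'A' .. 'Z';

   --  Hypothetical request: the client reports its own locale codes.
   type Request is record
      Language : Language_Code;
      Country  : Country_Code;
   end record;

   --  Hypothetical server-side routine that tailors its reply to the
   --  locale carried in the request.
   function Greeting (R : Request) return String is
   begin
      if R.Language = "fra" then
         return "Bonjour";
      elsif R.Language = "eng" and then R.Country = "GB" then
         return "Good day";
      else
         return "Hello";
      end if;
   end Greeting;

begin
   Ada.Text_IO.Put_Line (Greeting ((Language => "fra", Country => "CA")));
   Ada.Text_IO.Put_Line (Greeting ((Language => "eng", Country => "GB")));
   Ada.Text_IO.Put_Line (Greeting ((Language => "und", Country => "ZZ")));
end Locale_Request_Sketch;
```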

****************************************************************

From: Randy Brukardt
Date: Saturday, August 7, 2010  12:17 AM

...
> One other thing I have been thinking about. I think it would be nice
> if this package could be a remote types package.

That seems harmless, given that the package only exports two visible string
types. How could it *not* work as a remote types package?? It surely meets all
of the requirements of E.2.2. I'll add the missing pragma to the specification.

****************************************************************

