Version 1.1 of ai05s/ai05-0127-2.txt


!standard 13.7.3(0)          10-06-01 AI05-0127-2/01
!standard 1.2(2)
!standard 1.2(4/2)
!class Amendment 10-06-01
!status work item 10-06-01
!status received 10-06-01
!priority Low
!difficulty Medium
!subject Adding Locale Capabilities
!summary
A package is needed to identify the current locale.
!problem
Ada does not provide a portable way to determine the active locale in an environment. Knowing the active locale would facilitate writing applications that tailor the user's experience to match the user's expectations. The means to determine the current locale is operating-system specific and non-portable. Should basic localization support be added to the language? (Yes.)
!proposal
Most modern operating systems provide capabilities that facilitate writing applications that tailor the user's experience to match the user's expectations. The existing approaches vary considerably, however, and are non-portable.
For example, POSIX provides library calls, whereas Microsoft Windows provides a completely different set of interfaces. A portable solution is desired for Ada.
There are many areas that are affected by locale settings, such as dates, times, currency, character collation orders, message text, and numeric formatting. The basic need, however, is to be able to determine the current locale (language and country). If an application has this capability, all locale-related differences can be programmed into the application in a portable manner.
This proposal provides a new package: System.Locale
   package System.Locale is
      type Language_Code is new String;
      type Country_Code is new String;

      Country_Unknown  : constant Country_Code  := implementation-defined;
      Language_Unknown : constant Language_Code := implementation-defined;

      function Language return Language_Code;
      function Country return Country_Code;
   end System.Locale;
If the country associated with the current locale can be determined from the environment, the Country function returns a code as defined by ISO 3166-1; otherwise Country_Unknown is returned. ISO 3166-1 defines three sets of codes: alpha-2, alpha-3, and numeric-3. These three codes cover an identical number of country names.
The alpha-2 code is a two letter code, alpha-3 is a three letter code, and numeric-3 is a 3 digit numeric code.
e.g.
   Country          alpha-2   alpha-3   numeric-3
   ----------------------------------------------
   AFGHANISTAN      af        afg       004
   CANADA           ca        can       124
   FRANCE           fr        fra       250
   GERMANY          de        deu       276
   ITALY            it        ita       380
   SPAIN            es        esp       724
   UNITED KINGDOM   gb        gbr       826
   UNITED STATES    us        usa       840
Numeric codes are used mostly for countries where non-Latin scripts are used. The Country function returns a lower-case string that represents the country of the current locale. Whether it returns an alpha-2, alpha-3, or numeric-3 code is implementation defined, though it is recommended that the returned value be the value most appropriate for the environment, which typically is the alpha-2 code. These are the same codes used on the Internet for top-level domain names, e.g. google.ca
If the language associated with the current locale can be determined from the environment, the Language function returns a code as defined by ISO 639. Otherwise Language returns Language_Unknown. ISO 639 has 5 code lists, three of which are relevant.
   Part 1, the alpha-2 code
   Part 2, the alpha-3 code
      ISO 639-2/T contains alpha-3 codes for the same languages as defined in ISO 639-1
      ISO 639-2/B contains alpha-3 codes that are mostly the same as ISO 639-2/T,
      but with some codes derived from English names rather than native names of the languages
   Part 3, the alpha-3 code for comprehensive coverage of languages.
e.g.
   Language   639-1   639-2/T   639-2/B   639-3
   ------------------------------------------------------------
   English    en      eng       eng       eng
   French     fr      fra       fre       fra
   German     de      deu       ger       deu
   Chinese    zh      zho       chi       zho + one of 13 subcodes (e.g. cmn for Mandarin)
The Language function returns a lower-case string that represents the language of the current locale. Whether it returns a 639-1, 639-2/T, 639-2/B, or 639-3 code is implementation defined, though it is recommended that the returned value be the value most appropriate for the environment, which typically is the alpha-2 code.
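Because the form of the returned code is implementation defined, code that is meant to be portable has to accept more than one spelling of the same language. The following is only a minimal sketch of one way to cope with that, not part of the proposal. Language_Code is redeclared locally as a stand-in for the proposed System.Locale type, To_Alpha_2 is a hypothetical helper covering only the languages in the table above, and the two-space value of Language_Unknown is an assumption.

```ada
with Ada.Text_IO;

procedure Normalize_Demo is

   --  Local stand-in for the proposed System.Locale.Language_Code.
   type Language_Code is new String;

   --  Assumed value; the proposal leaves it implementation defined.
   Language_Unknown : constant Language_Code := "  ";

   --  Hypothetical helper: map the alpha-3 spellings from the table
   --  above onto their ISO 639-1 alpha-2 equivalents.
   function To_Alpha_2 (Code : Language_Code) return Language_Code is
   begin
      if Code'Length = 2 then
         return Code;               --  already an alpha-2 code
      elsif Code = "eng" then
         return "en";
      elsif Code = "fra" or else Code = "fre" then
         return "fr";
      elsif Code = "deu" or else Code = "ger" then
         return "de";
      elsif Code = "zho" or else Code = "chi" then
         return "zh";
      else
         return Language_Unknown;   --  not covered by this sketch
      end if;
   end To_Alpha_2;

begin
   Ada.Text_IO.Put_Line (String (To_Alpha_2 ("fre")));   --  prints "fr"
   Ada.Text_IO.Put_Line (String (To_Alpha_2 ("en")));    --  prints "en"
end Normalize_Demo;
```

An application would call such a helper once on the result of Language and compare only alpha-2 codes afterwards.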
!wording
Add to normative references after 1.2(2):
ISO/IEC 639-1:2002, Terminology and other language and content resources -- Codes for the representation of names of languages -- Part 1: Alpha-2 code
ISO/IEC 639-2:1998, Terminology and other language and content resources -- Codes for the representation of names of languages -- Part 2: Alpha-3 code
ISO/IEC 639-3:2007, Terminology and other language and content resources -- Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages
Add to normative references after 1.2(4/2):
ISO/IEC 3166-1:2006, Information and documentation -- Codes for the representation of names of countries and their subdivisions -- Part 1: Country Codes
Add a new clause:
13.7.3 The Package System.Locale
Static Semantics
The following language-defined library package exists:
   package System.Locale is
      pragma Preelaborate;

      type Language_Code is new String;
      type Country_Code is new String;

      Country_Unknown  : constant Country_Code  := implementation-defined;
      Language_Unknown : constant Language_Code := implementation-defined;

      function Language return Language_Code;
      function Country return Country_Code;
   end System.Locale;
A locale identifies a geopolitical place or region, its associated character sets, date and time formats, currency formats, and other internationalization-related characteristics. The active locale is the locale associated with the active partition.
Language_Code is a lower-case string representation of an ISO 639 code that identifies the name of a language associated with the active locale.
Country_Code is a lower-case string representation of an ISO 3166-1 code that identifies the name of a country associated with the active locale.
Dynamic Semantics
If the Country_Code associated with the active locale cannot be determined from the environment then Country returns Country_Unknown.
If the Language_Code associated with the active locale cannot be determined from the environment then Language returns Language_Unknown.
Implementation Advice
Codes returned should reflect the target environment semantics as closely as is reasonable. For example, in most environments, it makes sense to return an alpha-2 code instead of an alpha-3 code as defined by ISO 639-1, ISO 639-2, ISO 639-3 and ISO 3166-1, since those are commonly used, have the least variation, and have the highest portability for locale-based capabilities.
!discussion
Consideration was given to whether specific locale capabilities could be provided, such as accessing numeric formatting, date formatting, currency formatting, or collating sequence locale specific information. This was ruled out because it would be difficult to get this right, and would require a high level of effort, when there does not seem to be a high level of demand for these capabilities. A simple capability of determining the locale is all that is needed to provide portability, as application programmers can program specific locale differences as needed once the current locale has been determined.
Consideration was also given to whether the returned codes should be specific lengths. For example, country codes are typically two-character codes in Windows and POSIX environments. There are cases, though, where a three-character code may be more appropriate. Numeric codes may be used for locales where non-Latin scripts are used. It was decided that the result types for these functions should be string types, to provide the greatest flexibility. If the world switches to three-character codes over time, it will not impact the specification of this package.
Originally the package was a child of Ada, however it was decided that this package should be a child of System, because locale capabilities are system dependent.
The package name was originally plural, as in System.Locales. Since there is only one active locale, usage of this package reads better if the package name is singular. e.g.,
   if System.Locale.Language = "ca" then
      ...
   end if;
!example
with System.Locale;
with Ada.Text_IO.Editing;

procedure P is
   Fill, Separator, Radix : Character;
   Currency : constant String := "$";
   Pic : constant Ada.Text_IO.Editing.Picture :=
     Ada.Text_IO.Editing.To_Picture
       (Pic_String      => "$ZZZZ_ZZ9.99",
        Blank_When_Zero => False);

   type Dollars is delta 0.01 digits 8 range 0.0 .. 999_999.99;
begin
   if System.Locale.Country = "ca" or System.Locale.Country = "can" then
      if System.Locale.Language = "en"
        or System.Locale.Language = "eng"
      then
         Fill      := ' ';
         Separator := ',';
         Radix     := '.';
      elsif System.Locale.Language = "fr"
        or System.Locale.Language = "fre"
      then
         Fill      := ' ';
         Separator := '.';
         Radix     := ',';
      end if;
   end if;

   declare
      package Canadian_Cash is
        new Ada.Text_IO.Editing.Decimal_Output (Num => Dollars);
      Cost : constant Dollars := 256_778.99;
   begin
      Canadian_Cash.Put
        (Item       => Cost,
         Pic        => Pic,
         Currency   => Currency,
         Fill       => Fill,
         Separator  => Separator,
         Radix_Mark => Radix);
   end;
end P;
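The example above leaves Fill, Separator, and Radix unset when the locale is not Canadian English or Canadian French. One possible way to avoid that, sketched here under stated assumptions rather than proposed as wording, is to select the formatting characters through a function that falls back to a default. The types are local stand-ins for the proposed System.Locale types, and Number_Format, Default_Format, and Format_For are hypothetical names introduced only for this illustration.

```ada
with Ada.Text_IO;

procedure Defaults_Demo is

   --  Local stand-ins for the proposed System.Locale types.
   type Language_Code is new String;
   type Country_Code  is new String;

   --  Hypothetical record bundling the three formatting characters
   --  that the example passes to Decimal_Output.Put.
   type Number_Format is record
      Fill, Separator, Radix : Character;
   end record;

   --  Assumed fallback: the English conventions from the example.
   Default_Format : constant Number_Format :=
     (Fill => ' ', Separator => ',', Radix => '.');

   function Format_For
     (Country  : Country_Code;
      Language : Language_Code) return Number_Format is
   begin
      if (Country = "ca" or else Country = "can")
        and then (Language = "fr"
                  or else Language = "fre"
                  or else Language = "fra")
      then
         --  Canadian French conventions, as in the example.
         return (Fill => ' ', Separator => '.', Radix => ',');
      else
         --  Every other locale, including an unknown one.
         return Default_Format;
      end if;
   end Format_For;

   F : constant Number_Format := Format_For ("ca", "fr");
begin
   --  Prints "Separator: .  Radix: ,"
   Ada.Text_IO.Put_Line ("Separator: " & F.Separator & "  Radix: " & F.Radix);
end Defaults_Demo;
```

With the proposed package available, the call would become Format_For (System.Locale.Country, System.Locale.Language), and the three components would be passed on to Put.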
--!corrigendum 13.7.3(0)
!ACATS test
ACATS C-Tests are needed to test this package.
!appendix

From: Brad Moore
Date: Tuesday, June 1, 2010  1:15 AM

Here is a much simplified version of AI05-0127, my homework. [This is version
/01 of this AI - Editor.]

I've eliminated all locale functionality other than a capability to determine
the active language and country.

The idea is that once you have a portable means to determine the current locale,
the application programmer can program all locale related differences needed in
a portable manner.

Rather than return string, I thought it was better to return Language_Code and
Country_Code which are types derived from String.

My thinking was it is better to have distinct types for these rather than
subtypes of String to provide stronger type safety.

In the ARG meeting notes from Burlington, the suggestion was to move the package
Ada.Locales to System.Locales. I have moved the new package to be a child of
System, but modified the package name from Locales to Locale. (Plural to
singular)

It reads better in the code.

  if System.Locale.Language = "en" then
    ...
  end if;

****************************************************************

From: Jean-Pierre Rosen
Date: Tuesday, June 1, 2010  2:30 AM

Small nit: the specification says Country_Unknown and Language_Unknown, but the
discussion talks about Unknown_Country and Unknown_Language

****************************************************************

From: Brad Moore
Date: Tuesday, June 1, 2010  10:10 AM

Yes, the specification was my intent, it should be Country_Unknown and Language_Unknown throughout.

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  10:56 AM

> The idea is that once you have a portable means to determine the
> current locale, the application programmer can program all locale
> related differences needed in a portable manner.

I don't really see the need for this AI.

For one thing, it doesn't really provide portability, since the country names
and language names are impl-def.  Not totally impl-def; they have to follow one
of several standards (two-letter names, three-letter names, etc).

    The nice thing about standards is that you have so many to choose from.
        -- Somebody Famous.
           (This saying is attributed to at least Andrew S. Tanenbaum, Admiral
           Grace Hopper, and Ken Olsen, by various web sites.  And I seem to
           recall hearing some Comp Sci professor at CMU saying it, circa 1978.
           Which leads me to say, "The nice thing about the world wide web is
           that there's so much misinformation to choose from.")

And the supposed reason for using strings is to allow implementations to upgrade
to new versions of the relevant locale standards.  I'm not sure how to write
portable code using such a moving target.

If this stuff really is properly standardized, then we can use an enumeration
type.  The fact that we're using strings seems to indicate otherwise.

I don't like Unknown_Country/lang being impl-def.  Shouldn't we at least insist
that it be distinct from defined country names?  For that matter, why not nail
it down (say it's "unknown country code" or something).

According to the ARG minutes from Burlington (Feb 2010), 2 people voted against
keeping this alive.  I don't really remember, but I suspect I was one of them.
The last 3 messages in the !appendix show Pascal Leroy, Bob Duff, and Robert
Dewar, all suggesting to drop this AI (but note that that was a previous
much-more-ambitious version).  I haven't changed my mind -- I don't think even
this much-simpler version is worth the trouble.

If I were writing a program that needs l10n / i18n, I think I'd ignore this
package, and go straight to the O.S. facilities.  There really aren't that many
-- windows, plus misc versions of Unix that probably support Posix.  Embedded
real-time kernels can probably be ignored.

In the !example, variables are left uninitialized if you're not in Canada (or if
the impl chooses the numeric encoding of that country).  I don't understand what
"Dollars" are doing in a supposedly i18n app.  I guess I don't really understand
the example.

I think I prefer Locales over Locale (no big deal -- I'm just used to plurals
for package names).

****************************************************************

From: Robert Dewar
Date: Tuesday, June 1, 2010  11:02 AM

I agree with everything Bob says, and I would recommend dropping this AI.

****************************************************************

From: Robert Dewar
Date: Tuesday, June 1, 2010  6:08 AM

...
> Rather than return string, I thought it was better to return
> Language_Code and Country_Code which are types derived from String.

I think that types derived from String tend to be a nuisance, because various
utility functions do not apply without junk conversions.

> My thinking was it is better to have distinct types for these rather
> than subtypes of String to provide stronger type safety.

I disagree

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  11:10 AM

> I think that types derived from String tend to be a nuisance, because
> various utility functions do not apply without junk conversions.

But there are cases where distinct types are helpful, and I think this is one of
them.  See here for another example:

    http://www.adacore.com/2010/04/05/gem-83/

In C, you can say:

    printf (input_data); // a security hole, if privileged program!

when you should have said

    printf ("%s", input_data);

The idea of template-oriented formatting is a good one.
In fact, we use the same idea in GNAT for error messages, and also in IAC (the
CORBA IDL-to-Ada compiler). So does CodePeer (last time I checked).

But it works best if the "template" type is distinct from the "string that could
come from input data" type (namely String).

I recently fixed a bunch of bugs of this nature in IAC.
And to make sure they STAY fixed, I changed the type from String to a template
type derived from String.

> > My thinking was it is better to have distinct types for these rather
> > than subtypes of String to provide stronger type safety.
>
> I disagree

In this particular case, I agree with Brad's decision.
As I said in my previous message, these types are really more like enums than
strings.  Having country codes as a separate type allows you to keep track of
which strings have been verified to really be country codes, versus other
strings that could contain arbitrary text.

Note: In Ada 2012, I might use subtype predicates instead!  ;-)

Anyway, if we're going to have this AI, shouldn't there be an Is_Valid_Country
function?  And/or a conversion function String-->Country that checks?
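
[The following is an illustrative sketch of the checked conversion suggested above, not part of the original message. Country_Code is a local stand-in for the proposed type; Has_Country_Code_Shape and To_Country_Code are hypothetical names. The check covers only the shape of a code (two or three lower-case letters, or three digits), not membership in ISO 3166-1. - Editor.]

```ada
with Ada.Text_IO;

procedure Valid_Demo is

   --  Local stand-in for the proposed System.Locale.Country_Code.
   type Country_Code is new String;

   --  Hypothetical shape check: two or three lower-case letters, or
   --  three digits.  It does NOT look the code up in ISO 3166-1.
   function Has_Country_Code_Shape (S : String) return Boolean is
      All_Lower  : Boolean := True;
      All_Digits : Boolean := True;
   begin
      for I in S'Range loop
         if S (I) not in 'a' .. 'z' then
            All_Lower := False;
         end if;
         if S (I) not in '0' .. '9' then
            All_Digits := False;
         end if;
      end loop;
      return (S'Length in 2 .. 3 and then All_Lower)
        or else (S'Length = 3 and then All_Digits);
   end Has_Country_Code_Shape;

   --  Hypothetical checked conversion String --> Country_Code.
   function To_Country_Code (S : String) return Country_Code is
   begin
      if not Has_Country_Code_Shape (S) then
         raise Constraint_Error;
      end if;
      return Country_Code (S);
   end To_Country_Code;

   Canada : constant Country_Code := To_Country_Code ("ca");
begin
   Ada.Text_IO.Put_Line (String (Canada));                  --  prints "ca"
   Ada.Text_IO.Put_Line
     (Boolean'Image (Has_Country_Code_Shape ("Canada")));   --  prints "FALSE"
end Valid_Demo;
```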

****************************************************************

From: Brad Moore
Date: Tuesday, June 1, 2010  11:17 AM

I could go either way regarding derived types vs subtypes.

On the one hand I thought there might not be much need for applying utility
functions on the return codes for these functions, and the stronger types might
catch some errors (eg. erroneously passing a language code into a function that
accepts a country code to determine the currency symbol)

On the other hand, I agree that the junk conversions you mention can be an
annoyance. I am happy to go with the consensus on this, but considering your
comment, I am starting to think subtypes are the way to go.

I presume though that it is preferable to return the Language_Code and
Country_Code subtypes rather than just return string?
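
[The following is an illustrative sketch of the trade-off discussed in this thread, not part of the original message. The two types are local stand-ins for the proposed ones, and Currency_Symbol is a hypothetical consumer. - Editor.]

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Derived_Demo is

   --  Local stand-ins for the proposed System.Locale types.
   type Language_Code is new String;
   type Country_Code  is new String;

   --  Hypothetical function that accepts only country codes.
   function Currency_Symbol (Country : Country_Code) return String is
   begin
      if Country = "ca" or else Country = "us" then
         return "$";
      else
         return "?";
      end if;
   end Currency_Symbol;

   Lang : constant Language_Code := "fr";
   Ctry : constant Country_Code  := "ca";
begin
   --  Benefit of distinct types: Currency_Symbol (Lang) would be
   --  rejected at compile time, because Lang is not a Country_Code.

   --  Cost of distinct types: utilities declared for String need an
   --  explicit conversion.
   Ada.Text_IO.Put_Line
     (Ada.Characters.Handling.To_Upper (String (Ctry))
      & " " & Currency_Symbol (Ctry));        --  prints "CA $"
   Ada.Text_IO.Put_Line (String (Lang));      --  prints "fr"
end Derived_Demo;
```

With subtypes of String instead, the conversions disappear, but so does the compile-time rejection of the mixed-up call.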

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  11:25 AM

> On the other hand, I agree that the junk conversions you mention can
> be an annoyance. I am happy to go with the consensus on this, but
> considering your comment, I am starting to think subtypes are the way
> to go.

Don't give in so easily.  ;-)

But I suppose if we drop this AI, as Robert and I suggest, we can leave the
type-vs-subtype question moot.

****************************************************************

From: Brad Moore
Date: Tuesday, June 1, 2010  11:17 AM

> For one thing, it doesn't really provide portability, since the
> country names and language names are impl-def.  Not totally impl-def;
> they have to follow one of several standards (two-letter names,
> three-letter names, etc).
>
>     The nice thing about standards is that you have so many to choose
> from. -- Somebody Famous.
>       (This saying is attributed to at least Andrew S. Tanenbaum,
>       Admiral Grace Hopper, and Ken Olsen, by various web sites.  And
>       I seem to recall hearing some Comp Sci professor at CMU saying
>       it, circa 1978. Which leads me to say, "The nice thing about
>       the world wide web is that there's so much misinformation to
>       choose from.")

It's not quite that bad. Really, there is only one standard for country names,
and one standard for language names (ISO 3166-1 and ISO 639).

Each standard provides several formats for the codes. I think my mistake was to
try to get away with not specifying which of the formats was used by the Ada
package. I now think it would have been better to say that the alpha-2 formats
are always returned, since those are the ones used by Microsoft, POSIX, and Java
today w.r.t locale identification.

This would at least address your portability comment, I think, since the country
names and language names would then no longer be implementation defined.

> And the supposed reason for using strings is to allow implementations
> to upgrade to new versions of the relevant locale standards.  I'm not
> sure how to write portable code using such a moving target.
>
> If this stuff really is properly standardized, then we can use an
> enumeration type.  The fact that we're using strings seems to indicate
> otherwise.

I originally considered defining an enumeration that mapped to the codes defined
by ISO, but I came to the conclusion that two character codes in the form of a
string are better suited for this purpose. Over time, as new countries form, and
new languages evolve, the ISO country and language standards will need to be
revised. Adding new values to an enumeration will cause incompatibilities
that can be avoided if we stick to returning a string based value that maps to
the two-character codes.

If I am writing an application for my current locale, say in Canada where
English and French are the official languages, it would be nice to know that
introducing a new country name for some newly formed country on the other side
of the planet will not break any enumeration case statements in my application.

> I don't like Unknown_Country/lang being impl-def.  Shouldn't we at
> least insist that it be distinct from defined country names?  For that
> matter, why not nail it down (say it's "unknown country code" or
> something).

My intent was to define these as a constant, such as "  ", (two spaces) which
does not (and would not) map to any character codes defined by ISO. Nailing it
down to a constant value sounds good to me. The point is, these are the only
cases where the values returned are not defined in the ISO standard.

> If I were writing a program that needs l10n / i18n, I think I'd ignore
> this package, and go straight to the O.S. facilities.  There really
> aren't that many -- windows, plus misc versions of Unix that probably
> support Posix.  Embedded real-time kernels can probably be ignored.

I do have some real experience with this issue. A major system we developed for
our Canadian customer required that all text displayed in all applications
running on the data terminal be displayed in either English or French depending
on the locale settings of the terminal.

The applications originally were developed for a Unix platform, but eventually
were also ported to Windows. This is one of the few areas where the code was not
portable, so our source tree ended up providing and maintaining multiple
implementations of a package.

Admittedly, it was not a huge problem to work around, but it is messier than
having one source. This complicates project make files. We were even considering
bringing in some preprocessor solution for this one issue, which we ended up
avoiding thankfully. To those developers on our team coming from a C/C++
environment, it was difficult to convince them that Ada's approach of not
providing a preprocessor was a good one, even though I believe that was a good
choice, for other technical reasons. On an aside, I was just bit last week by
some C/C++ code where a system include file had redefined an enumeration literal
I was trying to define to some other string. That's pretty scary stuff if you
can't trust that the source code you see displayed in the editor is what the
compiler sees.

In my experience, given the choice between an Ada standard package, and going
straight to O.S. facilities, I would choose the Ada package almost always,
unless the O.S. facilities provided features that were not present in the Ada
package.

> In the !example, variables are left uninitialized if you're not in
> Canada (or if the impl chooses the numeric encoding of that country).
> I don't understand what "Dollars" are doing in a supposedly i18n app.
> I guess I don't really understand the example.

The example is not a comprehensive one. I was thinking of the application we
provided for the military. The application is only going to be run in a Canadian
context, which is why I didn't test for other countries. I should just have
checked to ensure that the country is Canada, and raised program error
otherwise.

In Canada, both English and French use Dollars.

The example also shows how the locale capability can be used with
Ada.Text_IO.Editing.Decimal_Output which is an existing package that can be used
to address locale formatting of numeric and currency values. It's rather odd
that we never provided a means to facilitate using locale to select the radix,
separator, and currency inputs.

I can probably come up with a better example. In fact, I think I would like to
resubmit this AI, with a version that only uses alpha-2 codes. Before we decide
to torch this AI, it would be good to have a version that at least addresses
some of these comments. It shouldn't take me long to update.

****************************************************************

From: Bob Duff
Date: Tuesday, June 1, 2010  4:51 PM

> In my experience, given the choice between an Ada standard package,
> and going straight to O.S. facilities, I would choose the Ada package
> almost always, unless the O.S. facilities provided features that were
> not present in the Ada package.

I guess that "unless" is the key point.  There are approximately 3 operating
systems to worry about: Windows, Linux/Unix/Posix, any other? There are hundreds
of countries/languages.  If I want portability across operating systems and
portability across countries, I'm thinking I'd rather write 2 or 3 OS-dependent
versions of things, rather than hundreds.  The current (thankfully simplified!)
version of the AI gives a somewhat-portable way to query the country. But the OS
gives much more -- for example, collating sequences.

Would you rather do:

    if Country = "xx" then
        collating order for xx goes here
    elsif Country = "yy" then
        collating order for yy goes here
    ... 100 more elsif's.

Or:

    if this is windows then
        use windows-specific stuff to get this locale's collating sequence
    elsif this is linux then
        use posix stuff
    else
        is there anything else?

I think I'm choosing 2 or 3 elsifs over 100 elsifs.

Of course, your example is different -- you had just 2 locales
(English- and French-speaking parts of Canada), so I understand that's somewhat
simpler.

> > In the !example, variables are left uninitialized if you're not in
> > Canada (or if the impl chooses the numeric encoding of that
> > country).  I don't understand what "Dollars" are doing in a
> > supposedly i18n app.  I guess I don't really understand the example.
>
> The example is not a comprehensive one. I was thinking of the
> application we provided for the military. The application is only
> going to be run in a Canadian context, which is why I didn't test for
> other countries. I should just have checked to ensure that the country
> is Canada, and raised program error otherwise.

Right.  Or for a program that could run outside Canada, you'd default to some
locale if it's not one of the ones you've specifically coded for.

> In Canada, both English and French use Dollars.

I know -- I've been to both English- and French-speaking parts.
It looks like monopoly money, with all those colors, but hey, who am I to judge.
;-)

> The example also shows how the locale capability can be used with
> Ada.Text_IO.Editing.Decimal_Output which is an existing package that
> can be used to address locale formatting of numeric and currency values.
> It's rather odd that we never provided a means to facilitate using
> locale to select the radix, separator, and currency inputs.
>
> I can probably come up with a better example. In fact, I think I would
> like to resubmit this AI, with a version that only uses alpha-2 codes.
> Before we decide to torch this AI, it would be good to have a version
> that at least addresses some of these comments. It shouldn't take me
> long to update.

Well, maybe you should wait to see what others think.

I have never done any serious i18n work, so you should take what I say with a
grain of salt.  I read a book about it some years ago, and it seemed like
operating systems had some fairly sophisticated stuff.  Unfortunately not
portable across operating systems.  But portable across locales!

****************************************************************

From: Brad Moore
Date: Wednesday, June 2, 2010  12:53 AM

...
> I think I'm choosing 2 or 3 elsifs over 100 elsifs.

I agree that if one were to write an application that everyone in the world
could use, you have serious locale needs and probably would want to glean all
the OS capabilities you can by writing non-portable calls to the OS. I think you
would be hard pressed, though, to find many real-world examples of such an
application.

By the way, if you can think of a new application that everyone in the world
would want to use, please let me know. :-) (I suppose a web browser is an
example of one such existing application)

One of the significant development costs we encountered in the Canadian
applications was in the area of language translation. It's fine having OS hooks
to determine decimal points, and currency symbols, but no amount of OS
capabilities is going to do the hard work of determining which text to display
to the user in GUI's, reports, help text, etc. Getting text from different
languages to fit on the same places on a GUI window can be quite a challenge in
itself. Writing an application that displays the correct text in every language
would be a monumental task. Good luck just finding the translators needed for
all those languages.

I suspect in practice, a good majority of i18n applications involve a handful of
languages at most, attempting to cover the main population of the users. For
example, bank machines in my area mostly have two languages, some have 4 or 5.
(eg. Chinese, English, French, Spanish, Japanese). Instruction manuals for
equipment I have seen may have 3 - 5 languages depending on where the equipment
is sold internationally. German might be one of the languages added to the list.

Any websites I have seen typically only support up to a handful of languages.

I don't recall ever encountering Swahili in my travels (not that I would know
Swahili if I saw it). Someone writing an application would not bother
translating to Swahili unless there was a reasonable chance that someone
speaking that language would be a relatively common user of the application.

So I think maybe I'm choosing up to a handful of portable elsifs over your 3
non-portable elsifs.

Incidentally, I suspect collating sequence is one of the more esoteric of the
locale-based areas. I don't think we used any locale-based collating sequences in
our applications that I can recall.

The most important use of the locale we found is to select which text we wanted
to display to the user.

> > In fact, I think I
> > would like to resubmit this AI, with a version that only uses
> > alpha-2 codes.
>
> Well, maybe you should wait to see what others think.
>

Will do.

****************************************************************

From: Jean-Pierre Rosen
Date: Wednesday, June 2, 2010   2:49 AM

> I agree that if one were to write an application that everyone in the
> world could use, you have serious locale needs and probably would want
> to glean all the OS capabilities you can by writing non-portable calls
> to the OS. I think you would be hard pressed, though, to find many
> real-world examples of such an application.

Do not forget games. "Battle for Wesnoth" is available in 49 languages (see
http://www.wesnoth.org/gettext/). According to one of the main developers of
the game, Jeremy Rosen ;-), Gettext is the way to go to handle that many
languages. But out of scope for us, I guess.

****************************************************************

From: Bob Duff
Date: Wednesday, June 2, 2010   7:34 AM

> I think you would be hard pressed, though, to find many real-world
> examples of such an application.

Indeed.  I have never worked on any project that had anything to do with i18n,
so I've got zero first-hand experience. I designed the message-printing stuff in
CodePeer to support it, but as far as I know, there's still only one version of
the messages (in English).  AdaCore has customers all over, but GNAT and
everything else we sell gives messages only in English.

My bank's machine asks me whether I want to use English or Spanish.

****************************************************************

