!standard A.19(0)                                  10-08-06  AI05-0127-2/04
!standard 1.2(1)
!standard 1.2(4/2)
!class Amendment 10-06-01
!status Amendment 2012 10-08-05
!status ARG Approved 9-0-0  10-06-20
!status work item 10-06-01
!status received 10-06-01
!priority Low
!difficulty Medium
!subject Adding Locale Capabilities

!summary

A package is needed to identify the current locale.

!problem

Ada does not provide a portable way to determine the active locale in an environment. Knowing the active locale would facilitate writing applications that tailor the user's experience to match the user's expectations. The means to determine the current locale are operating-system specific and non-portable.

Should basic localization support be added to the language? (Yes.)

!proposal

Most modern operating systems provide capabilities that facilitate writing applications that tailor the user's experience with an application to match the user's expectations. The existing approaches vary considerably, however, and are non-portable. For example, POSIX provides its own library calls, whereas Microsoft Windows provides a completely different set of interfaces. A portable solution is desired for Ada.

There are many areas that are affected by locale settings, such as dates, times, currency, character collation orders, message text, and numeric formatting. The basic need, however, is to be able to determine the current locale (language and country). If an application has this capability, all locale-related differences can be programmed into the application in a portable manner.

This proposal provides a new package Ada.Locales that provides functions to query the identity of the country and language associated with the current locale.
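As a rough illustration of the intended use, an application might select its message text from the language code alone. This is a sketch only: it assumes the Ada.Locales package given in the !wording of this AI, and the procedure name Greet and the greeting strings are made up for the illustration.

```ada
with Ada.Locales;
with Ada.Text_IO;

--  Hypothetical example: choose a greeting from the active language.
procedure Greet is
   --  Make "=" visible for the language code type.
   use type Ada.Locales.Language_Code;

   Lang : constant Ada.Locales.Language_Code := Ada.Locales.Language;
begin
   if Lang = "fra" then
      Ada.Text_IO.Put_Line ("Bonjour");
   else
      --  Covers "eng", Language_Unknown, and any language this
      --  application has no translation for.
      Ada.Text_IO.Put_Line ("Hello");
   end if;
end Greet;
```

Falling back to a default language, rather than raising an exception, keeps such a program usable when the environment does not reveal its locale.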
!wording

Add to normative references after 1.2(1):

ISO/IEC 639-3:2007, Terminology and other language and content resources - Codes for the representation of names of languages - Part 3: Alpha-3 code for comprehensive coverage of languages

Add to normative references after 1.2(4/2):

ISO/IEC 3166-1:2006, Information and documentation - Codes for the representation of names of countries and their subdivisions - Part 1: Country Codes

Add a new clause:

A.19 The Package Locales

A locale identifies a geopolitical place or region and its associated language, which can be used to determine other internationalization-related characteristics.

Static Semantics

The following language-defined library package exists:

   package Ada.Locales is
      pragma Preelaborate(Locales);
      pragma Remote_Types(Locales);

      type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
      type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z';

      Language_Unknown : constant Language_Code := "und";
      Country_Unknown : constant Country_Code := "ZZ";

      function Language return Language_Code;
      function Country return Country_Code;

   end Ada.Locales;

The active locale is the locale associated with the active partition.

Language_Code is a lower-case string representation of an ISO 639-3 alpha-3 code that identifies a language.

Country_Code is an upper-case string representation of an ISO 3166-1 alpha-2 code that identifies a country.

Function Language returns the code of the language associated with the active locale. If the Language_Code associated with the active locale cannot be determined from the environment then Language returns Language_Unknown.

Function Country returns the code of the country associated with the active locale. If the Country_Code associated with the active locale cannot be determined from the environment then Country returns Country_Unknown.

!discussion

ISO 3166-1 defines three sets of codes: alpha-2, alpha-3, and numeric-3. These three codes cover an identical number of country names.
The alpha-2 code is a two-letter code, alpha-3 is a three-letter code, and numeric-3 is a three-digit numeric code.

e.g.

   Country           alpha-2   alpha-3   numeric-3
   --------------------------------------------------
   AFGHANISTAN       AF        AFG       004
   CANADA            CA        CAN       124
   FRANCE            FR        FRA       250
   GERMANY           DE        DEU       276
   ITALY             IT        ITA       380
   SPAIN             ES        ESP       724
   UNITED KINGDOM    GB        GBR       826
   UNITED STATES     US        USA       840

Numeric codes are used mostly for countries where non-Latin scripts are used.

The Country function returns an upper-case string that represents the country of the current locale. The ISO 3166-1 standard is case insensitive for country codes, but recommends upper case for code usage, which is why the Country function limits the return result to upper case only. This simplifies client usage if clients know they can expect the return values to be consistently in upper case.

Alpha-2 codes were chosen instead of alpha-3 codes because existing locale capabilities in POSIX and Apple OSX follow BCP 47 (RFC 4646), which excludes the use of alpha-3 codes. Since Microsoft's locale id scheme does not follow ISO 3166-1, the Microsoft scheme does not impact this decision. Using the alpha-2 code format may allow simpler implementations in POSIX and OSX environments, since the alpha-2 code can be extracted directly from the environment without requiring a mapping.

Consideration was given to whether specific locale capabilities could be provided, such as access to locale-specific numeric formatting, date formatting, currency formatting, or collating sequence information. This was ruled out because it would be difficult to get right and would require a high level of effort, when there does not seem to be a high level of demand for these capabilities. A simple capability of determining the locale is all that is needed to provide portability, as application programmers can program specific locale differences as needed once the current locale has been determined.
A user application could relatively easily define a translation lookup facility that accepts the current locale and an application message id, and looks up a locale-specific translation. Such a facility could also look up localization features such as those provided by the OS for numeric, date, and currency formatting and for collating sequences.

ISO 639 has 5 code lists, three of which are relevant.

   Part 1, the alpha-2 code
   Part 2, the alpha-3 code
      ISO 639-2/T contains alpha-3 codes for the same languages as defined in ISO 639-1
      ISO 639-2/B contains alpha-3 codes that are mostly the same as ISO 639-2/T but with some codes derived from English names rather than native names of the languages
   Part 3, the alpha-3 code for comprehensive coverage of languages.

e.g.

   Language   639-1   639-2/T   639-2/B   639-3
   ------------------------------------------------------------
   English    en      eng       eng       eng
   French     fr      fra       fre       fra
   German     de      deu       ger       deu
   Chinese    zh      zho       chi       zho+one of 13 subcodes (eg cmn for mandarin)

The Language function returns a lower-case 639-3 alpha-3 string that represents the language of the current locale. The ISO 639-3 standard is case insensitive for language codes, but recommends lower case for code usage, which is why the Language function limits the return result to lower case only. This simplifies client usage if clients know they can expect the return values to be consistently in lower case.

The decision to go with 639-3 alpha-3 codes was driven by the fact that 639-1 codes only cover the major languages in use. ISO 639-2 defines codes for many more languages than 639-1; it generally covers all languages that have significant bodies of literature, and thus most languages.

The selection of 639-2/T over 639-2/B is driven by the fact that POSIX and Apple OSX follow BCP 47 (RFC 4646), which states that when there is a choice between the "T" code and the "B" code, the T code is the recommended choice.
ISO 639-3 is a superset of ISO 639-2 that uses the 639-2/T codes instead of the 639-2/B codes when there are two codes for a language in 639-2. 639-3 also deals better with certain cases that are problematic in 639-2. For example, Chinese is a macro language which has many dialects. In 639-2/T and 639-3 this appears as the code "zho". However, if the current locale is more specifically defined to be Mandarin Chinese, 639-3 provides a code "cmn" for this purpose. 639-2 does not break down Chinese any further than the macro language. Thus we selected the more detailed ISO 639-3 codes.

Language_Unknown is defined to be "und" because ISO 639 defines that code to be used in situations in which a language or languages must be indicated but the language cannot be identified.

Country_Unknown is defined to be "ZZ" because ISO 3166-1 specifies that it is one of a set of codes in the standard that are user assigned. User-assigned code elements are codes at the disposal of users who need to add further names of countries, territories, or other geographical entities to their in-house application of ISO 3166-1, and the ISO 3166/MA will never use these codes in the updating process of the standard. The following codes can be user-assigned:

   * Alpha-2: AA, QM to QZ, XA to XZ, and ZZ
   * Alpha-3: AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ

One such user-assigned coding is by the Unicode Common Locale Data Repository, which assigns ZZ to represent "Unknown or Invalid Territory".

Since there is no specific code defined for an unknown country, since this code is already used for similar purposes, and since it is the last user-assigned alpha-2 code and so less likely to be used for other purposes, "ZZ" seemed like the correct choice.

Consideration was given to whether the package should deal with macro-geographic regions.
The BCP 47 RFC indicates that country codes can be in numeric-3 format if the region identified is larger than a country, such as a continent. Microsoft does not have any locales based on macro-geographic regions. It is doubtful that such locales are used much, if at all. The numeric-3 codes in this case are outside of ISO 3166-1 because they do not represent countries. Trying to build support for this into the Ada package would be messy, and in these cases it is the language rather than the region that is the most important distinguisher. If the OS provides a numeric-3 format for a macro-geographic region, it makes sense for the Country function to return Country_Unknown, since the country truly is unknown.

Consideration was given to whether this new package should be a child of System or a child of Ada. It was decided that this package should be a child of Ada, because the package does not provide any implementation-defined definitions, and provides a portable way to access operating system facilities similar to Ada.Directories.

!example

package Canadian_Point_Of_Sale_System is

   type Dollars is delta 0.01 digits 8 range 0.0 ..
      999_999.99;

   function To_String (Amount : Dollars) return String;

end Canadian_Point_Of_Sale_System;

with Ada.Locales;
with Ada.Text_IO.Editing;
with Ada.Text_IO;

package body Canadian_Point_Of_Sale_System is

   function To_String (Amount : Dollars) return String is

      package Canadian_Decimal_Output is
        new Ada.Text_IO.Editing.Decimal_Output (Num => Dollars);

      Separator, Radix : Character;

      use type Ada.Locales.Country_Code;
      use type Ada.Locales.Language_Code;

   begin
      --  This system is only meant to run in a Canadian locale.
      if Ada.Locales.Country /= "CA" then
         raise Program_Error;
      end if;

      --  English and French differ in the digit separator and radix mark.
      if Ada.Locales.Language = "eng" then
         Separator := ',';
         Radix     := '.';
      elsif Ada.Locales.Language = "fra" then
         Separator := '.';
         Radix     := ',';
      else
         raise Program_Error;
      end if;

      return Canadian_Decimal_Output.Image
        (Item       => Amount,
         Pic        => Ada.Text_IO.Editing.To_Picture
                         (Pic_String => "$ZZZZ_ZZ9.99"),
         Currency   => "$",
         Separator  => Separator,
         Radix_Mark => Radix);
   end To_String;

end Canadian_Point_Of_Sale_System;

!corrigendum 1.2(1)

@dinsa
The following standards contain provisions which, through reference in this text, constitute provisions of this International Standard. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this International Standard are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. Members of IEC and ISO maintain registers of currently valid International Standards.
@dinst
ISO/IEC 639-3:2007, @i

!corrigendum 1.2(4/2)

@dinsa
ISO/IEC 1989:2002, @i
@dinst
ISO/IEC 3166-1:2006, @i

!corrigendum A.19(0)

@dinsc
A @b identifies a geopolitical place or region and its associated language, which can be used to determine other internationalization related characteristics.

@s8<@i>

The following language-defined library package exists:

@xcode<@b Ada.Locales @b @b Preelaborate(Locales); @b Remote_Types(Locales); @b Language_Code @b (1 .. 3) @b Character @b 'a' .. 'z'; @b Country_Code @b (1 ..
2) @b Character @b 'A' .. 'Z'; Language_Unknown : @b Language_Code := "und"; Country_Unknown : @b Country_Code := "ZZ"; @b Language @b Language_Code; @b Country @b Country_Code; @b Ada.Locales;>

The @i is the locale associated with the active partition.

Language_Code is a lower-case string representation of an ISO 639-3 alpha-3 code that identifies a language.

Country_Code is an upper-case string representation of an ISO 3166-1 alpha-2 code that identifies a country.

Function Language returns the code of the language associated with the active locale. If the Language_Code associated with the active locale cannot be determined from the environment then Language returns Language_Unknown.

Function Country returns the code of the country associated with the active locale. If the Country_Code associated with the active locale cannot be determined from the environment then Country returns Country_Unknown.

!ACATS test

ACATS C-Tests are needed to test this package.

!appendix

From: Brad Moore
Date: Tuesday, June 1, 2010  1:15 AM

Here is a much simplified version of AI05-0127, my homework. [This is version /01 of this AI - Editor.]

I've eliminated all locale functionality other than a capability to determine the active language and country. The idea is that once you have a portable means to determine the current locale, the application programmer can program all locale related differences needed in a portable manner.

Rather than return string, I thought it was better to return Language_Code and Country_Code which are types derived from String. My thinking was it is better to have distinct types for these rather than subtypes of String to provide stronger type safety.

In the ARG meeting notes from Burlington, the suggestion was to move the package Ada.Locales to System.Locales. I have moved the new package to be a child of System, but modified the package name from Locales to Locale (plural to singular). It reads better in the code.

   if System.Locale.Language = "en" then
      ...
end if; **************************************************************** From: Jean-Pierre Rosen Date: Tuesday, June 1, 2010 2:30 AM Small nit: the specification says Country_Unknown and Language_Unknown, but the discussion talks about Unknown_Country and Unknown_Language **************************************************************** From: Brad Moore Date: Tuesday, June 1, 2010 10:10 AM Yes, the specification was my intent, it should be Country_Unknown and Language_Unknown throughout. **************************************************************** From: Bob Duff Date: Tuesday, June 1, 2010 10:56 AM > The idea is that once you have a portable means to determine the > current locale, the application programmer can program all locale > related differences needed in a portable manner. I don't really see the need for this AI. For one thing, it doesn't really provide portability, since the country names and language names are impl-def. Not totally impl-def; they have to follow one of several standards (two-letter names, three-letter names, etc). The nice thing about standards is that you have so many to choose from. -- Somebody Famous. (This saying is attributed to at least Andrew S. Tanenbaum, Admiral Grace Hooper, and Ken Olsen, by various web sites. And I seem to recall hearing some Comp Sci professor at CMU saying it, circa 1978. Which leads me to say, "The nice thing about the world wide web is that there's so much misinformation to choose from.") And the supposed reason for using strings is to allow implementations to upgrade to new versions of the relevant locale standards. I'm not sure how to write portable code using such a moving target. If this stuff really is properly standardized, then we can use an enumeration type. The fact that we're using strings seems to indicate otherwise. I don't like Unknown_Country/lang being impl-def. Shouldn't we at least insist that it be distinct from defined country names? 
For that matter, why not nail it down (say it's "unknown country code" or something). According to the ARG minutes from Burlington (Feb 2010), 2 people voted against keeping this alive. I don't really remember, but I suspect I was one of them. The last 3 messages in the !appendix show Pascal Leroy, Bob Duff, and Robert Dewar, all suggesting to drop this AI (but note that that was a previous much-more-ambitious version). I haven't changed my mind -- I don't think even this much-simpler version is worth the trouble. If I were writing a program that needs l10n / i18n, I think I'd ignore this package, and go straight to the O.S. facilities. There really aren't that many -- windows, plus misc vesions of Unix that probably support Posix. Embedded real-time kernels can probably be ignored. In the !example, variables are left uninitialized if you're not in Canada (or if the impl chooses the numeric encoding of that country). I don't understand what "Dollars" are doing in a supposedly i18n app. I guess I don't really understand the example. I think I prefer Locales over Locale (no big deal -- I'm just used to plurals for package names). **************************************************************** From: Robert Dewar Date: Tuesday, June 1, 2010 11:02 AM I agree with everything Bob says, and I would recommend dropping this AI. **************************************************************** From: Robert Dewar Date: Tuesday, June 1, 2010 6:08 AM ... > Rather than return string, I thought it was better to return > Language_Code and Country_Code which are types derived from String. I think that types derived from String tend to be a nuisance, because various utility functions do not apply without junk conversions. > My thinking was it is better to have distinct types for these rather > than subtypes of String to provide stronger type safety. 
I disagree **************************************************************** From: Bob Duff Date: Tuesday, June 1, 2010 11:10 AM > I think that types derived from String tend to be a nuisance, because > various utility functions do not apply without junk conversions. But there are cases where distinct types are helpful, and I think this is one of them. See here for another example: http://www.adacore.com/2010/04/05/gem-83/ In C, you can say: printf (input_data); // a security hole, if privileged program! when you should have said printf ("%s", input_data); The idea of template-oriented formatting is a good one. In fact, we use the same idea in GNAT for error messages, and also in IAC (the CORBA IDL-to-Ada compiler). So does CodePeer (last time I checked). But it works best if the "template" type is distinct from the "string that could come from input data" type (namely String). I recently fixed a bunch of bugs of this nature in IAC. And to make sure they STAY fixed, I changed the type from String to a template type derived from String. > > My thinking was it is better to have distinct types for these rather > > than subtypes of String to provide stronger type safety. > > I disagree In this particular case, I agree with Brad's decision. As I said in my previous message, these types are really more like enums than strings. Having country codes as a separate type allows you to keep track of which strings have been verified to really be country codes, versus other strings that could contain arbitrary text. Note: In Ada 2012, I might use subtype predicates instead! ;-) Anyway, if we're going to have this AI, shouldn't there be an Is_Valid_Country function? And/or a conversion function String-->Country that checks? **************************************************************** From: Brad Moore Date: Tuesday, June 1, 2010 11:17 AM I could go either way regarding derived types vs subtypes. 
On the one hand I thought there might not be much need for applying utility functions on the return codes for these functions, and the stronger types might catch some errors (eg. erroneously passing a language code into a function that accepts a country code to determine the currency symbol) On the other hand, I agree that the junk conversions you mention can be an annoyance. I am happy to go with the consensus on this, but considering your comment, I am starting to think subtypes are the way to go. I presume though that it is preferable to return the Language_Code and Country_Code subtypes rather than just return string? **************************************************************** From: Bob Duff Date: Tuesday, June 1, 2010 11:25 AM > On the other hand, I agree that the junk conversions you mention can > be an annoyance. I am happy to go with the consensus on this, but > considering your comment, I am starting to think subtypes are the way > to go. Don't give in so easily. ;-) But I suppose if we drop this AI, as Robert and I suggest, we can leave the type-vs-subtype question moot. **************************************************************** From: Brad Moore Date: Tuesday, June 1, 2010 11:17 AM > For one thing, it doesn't really provide portability, since the > country names and language names are impl-def. Not totally impl-def; > they have to follow one of several standards (two-letter names, > three-letter names, etc). > > The nice thing about standards is that you have so many to choose > from. -- Somebody Famous. > (This saying is attributed to at least Andrew S. Tanenbaum, > Admiral Grace Hooper, and Ken Olsen, by various web sites. And > I seem to recall hearing some Comp Sci professor at CMU saying > it, circa 1978. Which leads me to say, "The nice thing about > the world wide web is that there's so much misinformation to > choose from.") It's not quite that bad. 
Really, there is only one standard for country names, and one standard for language names (ISO 3166-1 and ISO 639). Each standard provides several formats for the codes. I think my mistake was to try to get away with not specifying which of the formats was used by the Ada package. I now think it would have been better to say that the alpha-2 formats are always returned, since those are the ones used by Microsoft, POSIX, and Java today w.r.t. locale identification. This would at least address your portability comment, I think, since the country names and language names would then be implementation defined.

> And the supposed reason for using strings is to allow implementations
> to upgrade to new versions of the relevant locale standards. I'm not
> sure how to write portable code using such a moving target.
>
> If this stuff really is properly standardized, then we can use an
> enumeration type. The fact that we're using strings seems to indicate
> otherwise.

I originally considered defining an enumeration that mapped to the codes defined by ISO, but I came to the conclusion that two-character codes in the form of a string are better suited for this purpose. Over time, as new countries form and new languages evolve, the ISO country and language standards will need to be revised. Adding new values to an enumeration will cause incompatibilities that can be avoided if we stick to returning a string-based value that maps to the two-character codes. If I am writing an application for my current locale, say in Canada where English and French are the official languages, it would be nice to know that introducing a new country name for some newly formed country on the other side of the planet will not break any enumeration case statements in my application.

> I don't like Unknown_Country/lang being impl-def. Shouldn't we at
> least insist that it be distinct from defined country names?
For that > matter, why not nail it down (say it's "unknown country code" or > something). My intent was to define these as a constant, such as " ", (two spaces) which does not (and would not) map to any character codes defined by ISO. Nailing it down to a constant value sounds good to me. The point is, these are the only cases where the values returned are not defined in the ISO standard. > If I were writing a program that needs l10n / i18n, I think I'd ignore > this package, and go straight to the O.S. facilities. There really > aren't that many -- windows, plus misc vesions of Unix that probably > support Posix. Embedded real-time kernels can probably be ignored. I do have some real experience with this issue. A major system we developed for our Canadian customer required that all text displayed in all applications running on the data terminal be displayed in either English or French depending on the locale settings of the terminal. The applications originally were developed for a Unix platform, but eventually were also ported to Windows. This is one of the few areas where the code was not portable, so our source tree ended up providing and maintaining multiple implementations of a package. Admittedly, it was not a huge problem to work around, but it is messier than having one source. This complicates project make files. We were even considering bringing in some preprocessor solution for this one issue, which we ended up avoiding thankfully. To those developers on our team coming from a C/C++ environment, it was difficult to convince them that Ada's approach of not providing a preprocessor was a good one, even though I believe that was a good choice, for other technical reasons. On an aside, I was just bit last week by some C/C++ code where a system include file had redefined an enumeration literal I was trying to define to some other string. 
That's pretty scary stuff if you can't trust that the source code you see displayed in the editor is not what the compiler sees. In my experience, given the choice between an Ada standard package, and going straight to O.S. facilities, I would choose the Ada package almost always, unless the O.S. facilities provided features that were not present in the Ada package. > In the !example, variables are left uninitialized if you're not in > Canada (or if the impl chooses the numeric encoding of that country). > I don't understand what "Dollars" are doing in a supposedly i18n app. > I guess I don't really understand the example. The example is not a comprehensive one. I was thinking of the application we provided for the military. The application is only going to be run in a Canadian context, which is why I didn't test for other countries. I should just have checked to ensure that the country is Canada, and raised program error otherwise. In Canada, both English and French use Dollars. The example also shows how the locale capability can be used with Ada.Text_IO.Editing.Decimal_Output which is an existing package that can be used to address locale formatting of numeric and currency values. It's rather odd that we never provided a means to facilitate using locale to select the radix, separator, and currency inputs. I can probably come up with a better example. In fact, I think I would like to resubmit this AI, with a version that only uses alpha-2 codes. Before we decide to torch this AI, it would be good to have a version that at least addresses some of these comments. It shouldn't take me long to update. **************************************************************** From: Bob Duff Date: Tuesday, June 1, 2010 4:51 PM > In my experience, given the choice between an Ada standard package, > and going straight to O.S. facilities, I would choose the Ada package > almost always, unless the O.S. facilities provided features that were > not present in the Ada package. 
I guess that "unless" is the key point. There are approximately 3 operating systems to worry about: Windows, Linux/Unix/Posix, any other? There are hundreds of countries/languages. If I want portability across operating systems and portability acrosss countries, I'm thinking I'd rather write 2 or 3 OS-dependent versions of things, rather than hundreds. The current (thankfully simplified!) version of the AI gives a somewhat-portable way to query the country. But the OS gives much more -- for example, collating sequences. Would you rather do: if Country = "xx" then collating order for xx goes here elsif Country = "yy" then collating order for yy goes here ... 100 more elsif's. Or: if this is windows then use windows-specific stuff to get this locale's collating sequence elsif this is linux then use posix stuff else is there anything else? I think I'm choosing 2 or 3 elsifs over 100 elsifs. Of course, your example is different -- you had just 2 locales (English- and French-speaking parts of Canada), so I understand that's somewhat simpler. > > In the !example, variables are left uninitialized if you're not in > > Canada (or if the impl chooses the numeric encoding of that > > country). I don't understand what "Dollars" are doing in a > > supposedly i18n app. I guess I don't really understand the example. > > The example is not a comprehensive one. I was thinking of the > application we provided for the military. The application is only > going to be run in a Canadian context, which is why I didn't test for > other countries. I should just have checked to ensure that the country > is Canada, and raised program error otherwise. Right. Or for a program that could run outside Canada, you'd default to some locale if it's not one of the ones you've specifically coded for. > In Canada, both English and French use Dollars. I know -- I've been to both English- and French-speaking parts. It looks like monopoly money, with all those colors, but hey, who am I to judge. 
;-) > The example also shows how the locale capability can be used with > Ada.Text_IO.Editing.Decimal_Output which is an existing package that > can be used to address locale formatting of numeric and currency values. > It's rather odd that we never provided a means to facilitate using > locale to select the radix, separator, and currency inputs. > > I can probably come up with a better example. In fact, I think I would > like to resubmit this AI, with a version that only uses alpha-2 codes. > Before we decide to torch this AI, it would be good to have a version > that at least addresses some of these comments. It shouldn't take me > long to update. Well, maybe you should wait to see what others think. I have never done any serious i18n work, so you should take what I say with a grain of salt. I read a book about it some years ago, and it seemed like operating systems had some fairly sophisticated stuff. Unfortunately not portable across operating systems. But portable across locales! **************************************************************** From: Brad Moore Date: Wednesday, June 2, 2010 12:53 AM ... > I think I'm choosing 2 or 3 elsifs over 100 elsifs. I agree that if one were to write an application that everyone in the world could use, you have serious locale needs and probably would want to glean all the OS capabilities you can by writing non-portable calls to the OS. I think you would be hard pressed though to find many real world example of such an application however. By the way, if you can think of a new application that everyone in the world would want to use, please let me know. :-) (I suppose a web browser is an example of one such existing application) One of the significant development costs we encountered in the Canadian applications was in the area of language translation. 
It's fine having OS hooks to determine decimal points and currency symbols, but no amount of OS capabilities is going to do the hard work of determining which text to display to the user in GUIs, reports, help text, etc. Getting text from different languages to fit in the same places on a GUI window can be quite a challenge in itself. Writing an application that displays the correct text in every language would be a monumental task. Good luck just finding the translators needed for all those languages.

I suspect that in practice, a good majority of i18n applications involve a handful of languages at most, attempting to cover the main population of the users. For example, bank machines in my area mostly have two languages; some have 4 or 5 (e.g. Chinese, English, French, Spanish, Japanese). Instruction manuals for equipment I have seen may have 3 - 5 languages depending on where the equipment is sold internationally. German might be one of the languages added to the list. Any websites I have seen typically only support up to a handful of languages. I don't recall ever encountering Swahili in my travels (not that I would know Swahili if I saw it). Someone writing an application would not bother translating to Swahili unless there was a reasonable chance that someone speaking that language would be a relatively common user of the application.

So I think maybe I'm choosing up to a handful of portable elsifs over your 3 non-portable elsifs.

Incidentally, I suspect collating sequence is one of the more esoteric of the locale-based areas. I don't think we made any locale-based use of collating sequences in our applications that I can recall. The most important use of the locale we found is to select which text we wanted to display to the user.

> > In fact, I think I
> > would like to resubmit this AI, with a version that only uses
> > alpha-2 codes.
>
> Well, maybe you should wait to see what others think.
>

Will do.
**************************************************************** From: Jean-Pierre Rosen Date: Wednesday, June 2, 2010 2:49 AM > I agree that if one were to write an application that everyone in the > world could use, you have serious locale needs and probably would want > to glean all the OS capabilities you can by writing non-portable calls > to the OS. I think you would be hard pressed though to find many real > world example of such an application however. Do not forget games. "Battle for Wesnoth" is available in 49 languages (see http://www.wesnoth.org/gettext/). According to one of the main developers of the game, Jeremy Rosen ;-), Gettext is the way to go to handle that many languages. But out of scope for us, I guess. **************************************************************** From: Bob Duff Date: Wednesday, June 2, 2010 7:34 AM > I think you would be hard pressed though to find many real world > example of such an application however. Indeed. I have never worked on any project that had anything to do with i18n, so I've got zero first-hand experience. I designed the message-printing stuff in CodePeer to support it, but as far as I know, there's still only one version of the messages (in English). AdaCore has customers all over, but GNAT and everything else we sell gives messages only in English. My bank's machine asks me whether I want to use English or Spanish. **************************************************************** From: Brad Moore Date: Thursday, June 3, 2010 9:09 AM > Do not forget games. "Battle for Wesnoth" is available in 49 languages > (see http://www.wesnoth.org/gettext/). According to one of the main > developers of the game, Jeremy Rosen ;-), Gettext is the way to go to > handle that many languages. But out of scope for us, I guess. Thinking back on the Canadian application I've been mentioning, it also was scalable to any number of languages.
For that we employed a simple flat-file database, indexed by application defined message id strings, accessed through a function that looked something like: function Lookup (Message_Id : String; Locale_Id : Integer) return String; The message id hashed into a file to get the variable length translation record, and each translation record contained a set of translations for however many languages were supported in the system. For example we had two languages, so English might map to 0, and French would map to 1. If we wanted to add Japanese next, we would give it 2, and so on. We used this mechanism for all locale related differences. We even used it for strings returning field lengths for reports, since the spacing for reports and tabular display on a GUI depended on the length of the text of column headers and so on. Once you had taken the time to enter all the translations for all the help messages, GUI labels, report headers, and so on, adding the few extra handful of translations that the OS provides (such as decimal radix point, currency symbol, probably even collating sequence) into this lookup database would be negligible compared to the work involved in doing all the other translations. Whenever we changed the translations for a new release of the application, we would run a utility to reindex the message file. In this process, we even incorporated an ASIS program that I had written to extract enumerations from the Ada source code. If a programmer changed an enumeration in the source code, this tool would detect whether or not the enumeration identifier mapped to the message identifier in the database file. This ensured that there was full coverage of translations for all enumeration values in the Ada code, that the ordering of enumeration values matched, and that the enumeration literal names matched the message id names in the translation file. The ASIS application would only be run during the reindexing process.
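As a rough illustration of the lookup function described above, here is a minimal in-memory sketch. It is not the original implementation: the original used a flat file on disk, whereas this stand-in uses Ada.Containers.Indefinite_Hashed_Maps, and the names Lookup_Demo, Message_Maps, Table, Key, English, and French, as well as the fallback of returning the message id itself, are assumptions made only for the example.

```ada
with Ada.Containers.Indefinite_Hashed_Maps;
with Ada.Strings.Hash;
with Ada.Text_IO;

procedure Lookup_Demo is

   --  Hypothetical locale numbering, following the email:
   --  English maps to 0 and French maps to 1.
   English : constant Integer := 0;
   French  : constant Integer := 1;

   --  In-memory stand-in for the flat-file database (an assumption).
   package Message_Maps is new Ada.Containers.Indefinite_Hashed_Maps
     (Key_Type        => String,
      Element_Type    => String,
      Hash            => Ada.Strings.Hash,
      Equivalent_Keys => "=");

   Table : Message_Maps.Map;

   --  Combine the message id and the locale id into a single map key.
   function Key (Message_Id : String; Locale_Id : Integer) return String is
   begin
      return Message_Id & Integer'Image (Locale_Id);
   end Key;

   --  Same profile as the Lookup function quoted in the email.
   function Lookup (Message_Id : String; Locale_Id : Integer) return String is
      K : constant String := Key (Message_Id, Locale_Id);
   begin
      if Table.Contains (K) then
         return Table.Element (K);
      else
         --  Assumed fallback: show the message id when no translation exists.
         return Message_Id;
      end if;
   end Lookup;

begin
   Table.Insert (Key ("GREETING", English), "Hello");
   Table.Insert (Key ("GREETING", French), "Bonjour");

   Ada.Text_IO.Put_Line (Lookup ("GREETING", French));
   Ada.Text_IO.Put_Line (Lookup ("FAREWELL", French));
end Lookup_Demo;
```

The only input such a table needs from the environment is the locale id, which is the piece this AI sets out to provide portably.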
This system worked really well, and was fast enough that lookups could be done on the fly as windows were being presented to the user. All this system needed was a way to figure out the current locale, which is what this AI is hoping to address. Everything else already was portable, and didn't require any changes when we ported from Unix to Windows. I had considered whether this AI should also provide a message lookup facility like the one I described above, but thought that would be too much. It is not that difficult to create a persistent hash table to implement the lookup function. In certain environments that have relational databases, this could be even easier. If the number of translations is small enough, someone could implement this using one of the standard containers such as Ada.Containers.Hashed_Maps. (Or create their own Persistent_Hashed_Maps container) Another alternative is to use the gettext utility that Jean-Pierre mentioned. So, Bob Duff's earlier comment having to write too many if else statements to support too many languages doesn't seem to apply if you have such a message lookup facility. **************************************************************** From: Tucker Taft Date: Thursday, June 3, 2010 9:24 AM > ...So, Bob Duff's earlier comment having to write too many if else > statements to support too many languages doesn't seem to apply if you > have such a message lookup facility. I agree that all programs that support internationalization use some kind of table lookup, rather than explicit "if...else" statements. Having an easy way to get the current locale seems useful. I agree with the desire to make this as portable as possible, so we should choose one of the representations, and it sounds like you recommend the 2-character one, which makes sense to me. If that is true, we should probably use String(1..2) explicitly (or a suitably-named subtype or type) rather than simply String. 
Having to manipulate arbitrary-length strings seems like unnecessary overhead if we are standardizing on 2-character locale names. **************************************************************** From: Bob Duff Date: Thursday, June 3, 2010 9:40 AM > All this system needed was a way to figure out the current locale, > which is what this AI is hoping to address. But you could use the same argument about any platform-dependent issue: - I once wrote a program that needed to query the virtual memory page size. - Lots of people want to write programs that spawn subprocesses. - Just yesterday, we had an internal discussion at AdaCore, where we decided we wanted a way to query the number of processors on the current machine. - Etc. Should we add portable ways to do the above in Ada? Well, maybe. We're moving slowly in that direction (e.g. adding Ada.Directories). But why is querying the current locale more important than any other OS-dependent thing? If you don't have it, it's no big deal -- you write a package with multiple bodies (one for windows, one for unix, ...). > So, Bob Duff's earlier comment having to write too many if else > statements to support too many languages doesn't seem to apply if you > have such a message lookup facility. Well, sure, you've moved the work. It seems like the bulk of the work for the kind of project you described is in implementing the database and the ASIS tool, and hiring a native French (or whatever) speaker to translate the messages. Writing the query-locale primitive seems trivial by comparison. I'm not strongly opposed to this AI -- I just can't get too excited about having this feature, and of course any feature, even a small one like this, has a cost. But as I've admitted several times, I am biased by having spent my whole life writing English-only software. **************************************************************** From: Tucker Taft Date: Thursday, June 3, 2010 10:07 AM > ...
But why is querying the current locale more important than any > other OS-dependent thing? If you don't have it, it's no big deal -- > you write a package with multiple bodies (one for windows, one for > unix, ...) The argument for this kind of package is that without it, you can't easily share anything built on top of it. So if someone builds a nice higher-level internationalization capability, each such capability needs to invent its own way to get the locale. If you want to use two of these, say one that does nice error messages and one that handles currency well, you can't easily mix and match. As software becomes more globalized, this kind of thing seems increasingly important for an internationally-standardized language. Admittedly it is just a start, but if it can be made even simpler (e.g. by eliminating the three or four different formats), then it puts a useful "stake" in the ground for further portable packages to be built upon. If we can just agree on the package spec, we don't really care so much on whether the package is implemented by the vendor, because as we know the implementation of such a package is generally trivial on most existing O/Ss. It is the common spec that provides the real value. **************************************************************** From: Bob Duff Date: Thursday, June 3, 2010 4:01 PM ... > But you could use the same argument about any platform-dependent > issue: > > - I once wrote a program that needed to query the virtual memory > page size. Me too. But I don't think this is a common need (and when you do it, the result isn't portable even to different versions of the same OS - I ended up tuning each such program to the machine that I intended to run it on). > - Lots of people want to write programs that spawn subprocesses. We've tried previously to standardize that, but we couldn't even figure out a way to describe it portably. If anyone has a good idea, I think we would surely consider it again. 
> - Just yesterday, we had an internal discussion at AdaCore, > where we decided we wanted a way to query the number of > processors on the current machine. That's part of the CPU proposal, of course, which is now in AI05-0171-1. Specifically, function Number_Of_CPUs. Should be part of Ada 2012. > - Etc. Hard to comment on that. > Should we add portable ways to do the above in Ada? Well, maybe. > We're moving slowly in that direction (e.g. adding Ada.Directories). The answer is yes, but it is hard enough in some cases (spawn) that we haven't done it. > But why is querying the current locale more important than any other > OS-dependent thing? If you don't have it, it's no big deal -- you > write a package with multiple bodies (one for windows, one for unix, > ...). ... > I'm not strongly opposed to this AI -- I just can't get too excited > about having this feature, and of course any feature, even a small one > like this, has a cost. But as I've admitted several times, I am > biased by having spent my whole life writing English-only software. Well, for me, I think that all software should be written in English and people that don't know English should keep using their abacuses. :-) But I don't expect to get much support for *that* position. And this seems pretty trivial. The main issue is whether to insist on strings of a particular length or just make it String. Based on the various comments, I thought we have decided on String (because the most recent standards in this area use 3 character and longer strings sometimes), just to keep the future flexibility available. (If we insist on 2-character strings, what happens when Windows or Linux implements the 3-character names from i18n??) 
**************************************************************** From: Brad Moore Date: Friday, June 4, 2010 1:05 PM > I agree with the desire to make this as portable as possible, so we > should choose one of the representations, and it sounds like you > recommend the 2-character one, which makes sense to me. If that is > true, we should probably use String(1..2) explicitly (or a > suitably-named subtype or type) rather than simply String. Having to > manipulate arbitrary-length strings seems like unnecessary overhead if > we are standardizing on 2-character locale names. I have been giving this more thought, and think that for language names, it makes more sense to go with the 3-character codes defined in ISO 639-2, rather than the 2-character codes defined in ISO 639-1. ISO 639-2 covers all the languages in ISO 639-1, but adds quite a number of other languages. ISO 639-1 is intended to cover all the major languages in the world. ISO 639-2 is intended to cover all the languages in the world with significant bodies of literature, and also includes codes for language groups (although those probably aren't relevant to this AI). It covers most of the languages of the world. ISO 639-3 adds to the ISO 639-2 code set. It is intended to be a comprehensive list of all languages, including extinct, ancient, historic, and constructed languages. It also is a 3 character code. See http://www.loc.gov/standards/iso639-2/php/code_list.php The following site is also excellent for browsing ISO 639-1, -2, and -3 codes: http://www.sil.org/iso639-3/codes.asp A quote from ISO's FAQ on the ISO 639 site. Q => "Why do some languages have both ISO 639-1 and 639-2 codes associated with them while others have only ISO 639-2 codes?" A => " ...
However, because of the inadequacy of the alpha-two codes to represent all of the languages in the world (it can only accommodate 676 codes) and to assure backwards compatibility with existing usage compliant with RFC 4646 (and its predecessors), new language codes may be considered for inclusion in both parts or in ISO 639-2 only." Assuming that we decide to go with ISO 639-2, the question then becomes should we use the T codes or the B codes of ISO 639-2? The B codes are the codes that match the English pronunciation of the language name, whereas the T codes match the native name of the language. E.g., for French, the B code is fre while the T code is fra. For German, the B code is ger while the T code is deu. Most languages only have one code in ISO 639-2. For those languages you'd get the same code for both. The Wikipedia site below suggests that the T codes are generally preferred: http://en.wikipedia.org/wiki/ISO_639-2 I suggest we go with ISO 639-2/T. For country names defined in ISO 3166-1, the set of countries with alpha-2 codes is identical to the set of countries with alpha-3 codes. In that case, there is not much reason for recommending the alpha-2 codes vs the alpha-3 codes. Alpha-2 codes are used in domain name suffixes. Alpha-3 codes are used in places such as passport identification. They are a bit more readable than the 2 character codes. To be consistent, I suppose I could argue that we should use 3 character codes since we would want to use 3 character codes for language names. A further note of discussion. For locales, Microsoft uses its own concept called Locale Id. Microsoft defines a locale as either a language, or a language combined with a country. A Windows locale id is a 16 bit code. See http://www.science.co.il/language/Locale-Codes.asp?s=decimal which shows a mapping from locale id to ISO 3166-1 country name (but not ISO 639 language name). I believe ISO 639-2 would cover all the languages supported by Microsoft (and much more).
I had a quick scan through the list and counted roughly 65 languages supported by Microsoft, some with multiple variants based on country, e.g. English (United States) and English (Canada). (You'd think we speak the same language, but we say our "Z"'s differently. Not to mention we also tend to favour British spellings on things, but sometimes prefer the American spelling just to keep things confusing.) Microsoft's approach with locale id suggests to me that there is even more reason for providing a portable means in Ada to get the locale. General purpose translation lookup facilities such as mentioned in the previous email, would benefit from having a portable way to get locale names (language and country) on a Windows platform. An implementation of this AI on Windows could do the translation from Windows locale id into ISO 639-2 and ISO 3166-1, which I don't think would be hard to do. It should be a simple mapping. Assuming we decide to have the package return 3 character subtypes, rather than string, which would be preferred?

1) type Language_Code is array (1 .. 3) of Character range 'a' .. 'z';
   type Country_Code is array (1 .. 3) of Character range 'a' .. 'z';
   (or leave off the constraint. The ISO standards recommend lower case, and the codes are case insensitive)
2) type Language_Code is new String (1 .. 3);
   type Country_Code is new String (1 .. 3);
3) subtype Language_Code is String (1 .. 3);
   subtype Country_Code is String (1 .. 3);
4) Have the functions return String (1 .. 3)

Any other suggestions? I'm leaning toward either 1) or 2). **************************************************************** From: Tucker Taft Date: Friday, June 4, 2010 2:20 PM All of this makes sense. As far as T vs. B, what do most operating systems provide? If they only provide B, then we should go with that. If they provide both, then "T" seems like the way to go.
If some provide only "T" and some provide only "B", then we would have to say it is implementation defined whether the "T" or "B" version is returned, and the user would have to have both as keys in their mapping from locale ID to message contents. Having to worry about both upper and lower case is a pain. I would go with upper case only if we want to save people the trouble of doing case-insensitive lookups, since they seem to be used in upper case in many contexts, and Ada tends to favor all upper case for things like Enum'Image. By the way, are the characters used in the 3-character code guaranteed to be Latin-1, or do we need to use Wide_Character for the character codes? And I agree with making them a distinct type, if they are restricted to being exactly three characters. **************************************************************** From: Brad Moore Date: Friday, June 4, 2010 5:50 PM In Mac OS X, locales are identified as per BCP 47 (RFC 4646) http://www.rfc-editor.org/rfc/bcp/bcp47.txt Locales in POSIX are identified by system environment variables, notably the LANG environment variable. From Wikipedia, "On Unix, Linux and other POSIX-type platforms, locale identifiers are defined similar to the BCP 47 definition of language tags, but the locale variant modifier is defined differently, and the character set is included as a part of the identifier. It is defined in this format: [language[_territory][.codeset][@modifier]] (For example, Australian English using the UTF-8 encoding is en_AU.UTF-8.) " BCP 47 (RFC 4646) identifies a format that starts with an ISO 639 code (either alpha-2 or alpha-3) followed by other optional parts separated by hyphens. The next optional part is the extended language tag, which consists of up to 3 alpha-3 codes (separated by hyphens) from ISO 639. This mostly is not used, but some locales do use this extended language tag.
Examples of language tags including extlang subtags are: * zh-yue (Cantonese Chinese) * ar-afb (Gulf Arabic) With this possibility, I would say we need to go back to language codes as being defined as variable length string types. Following this is a 4 character ISO 15924 code identifying the script associated with the language. Then comes the region identifier, which is either an alpha-2 or numeric-3 ISO 3166-1 code. Note that alpha-3 is not used here to identify the country. I believe this is how the syntax differentiates between optional extlang ISO 639 language codes and the region code. If it's alpha-3 it's an ISO 639 code, otherwise, it's an ISO 3166-1 code. This suggests that we should be using 3166-1 alpha-2 for the country codes. If we see a numeric-3 code, the implementation could convert that to an alpha-2 code. Or, alternatively we could say that the Country code function can return either a 2 or 3 character field. Some Key excerpts from BCP 47... The RFC states "the language tags described in this document are sequences of characters from the US-ASCII [ISO646] (7 bit ASCII) repertoire." This answers your question about whether we need to worry about wide characters and so on. The answer is no. "At all times, language tags and their subtags, including private use and extensions, are to be treated as case insensitive:" "This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived. These conventions include: o [ISO639-1] recommends that language codes be written in lowercase ('mn' Mongolian). o [ISO15924] recommends that script codes use lowercase with the initial letter capitalized ('Cyrl' Cyrillic). o [ISO3166-1] recommends that country codes be capitalized ('MN' Mongolia). " "When languages have both an ISO 639-1 two-character code and a three- character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only the ISO 639-1 two-character code is defined in the IANA registry. 
" This suggests ISO 639-1 (alpha-2) is used when it can be, otherwise use ISO 639-2 (or higher) to return an alpha-3 code. On my linux machine, the LANG variable is set to "en_CA.utf8". -- utf8 identifies the script. "When a language has no ISO 639-1 two-character code and the ISO 639-2/T (Terminology) code and the ISO 639-2/B (Bibliographic) code for that language differ, only the Terminology code is defined in the IANA registry." This suggests the answer to your question about "T" vs "B" is that they tend to use "T". User's of the Ada.Locale.Language function would write their applications according to the spec we provide. Based on all this, if we wanted to define as precise as possible type to describe the return values in Ada we might end up with something like; ISO_639_Max_Length : constant := 3; subtype ISO_639_Index is Positive range 1 .. 4 * ISO_639_Max_Length; type Language_Code is array (ISO_639_Index range <>) of Character range 'a' .. 'z'; -- Add an Ada 2012 invariant that says -- Language_Code'Length > 1 and -- (Language_Code'Length mod 3 = 0 or Language_Code'Length mod 3 = 2) -- This allows for extended language subtags type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z'; Though the ISO standards are case insensitive, we could force return values from our package to be only upper or lower case. Or we could eliminate the constraint and use simpler string types, even though we would always return upper or lower case. **************************************************************** From: Brad Moore Date: Friday, June 4, 2010 6:57 PM - More on extended language subtags from BCP 47. " Although the ABNF production 'extlang' permits up to three extended language tags in the language tag, extended language subtags MUST NOT include another extended language subtag in their 'Prefix'. 
That is, the second and third extended language subtag positions in a language tag are permanently reserved and tags that include those subtags in that position are, and will always remain, invalid. For example, the macrolanguage Chinese ('zh') encompasses a number of languages. For compatibility reasons, each of these languages has both a primary and extended language subtag in the registry. A few selected examples of these include Gan Chinese ('gan'), Cantonese Chinese ('yue'), and Mandarin Chinese ('cmn'). Each is encompassed by the macrolanguage 'zh' (Chinese). Therefore, they each have the prefix "zh" in their registry records. Thus, Gan Chinese is represented with tags beginning "zh-gan" or "gan", Cantonese with tags beginning either "yue" or "zh-yue", and Mandarin Chinese with "zh-cmn" or "cmn". The language subtag 'zh' can still be used without an extended language subtag to label a resource as some unspecified variety of Chinese, while the primary language subtag ('gan', 'yue', 'cmn') is preferred to using the extended language form ("zh-gan", "zh-yue", "zh-cmn")." This suggests we might be able to stick with just returning a single alpha-2 or alpha-3 code for Language. If the locale has an extended subtag, return that instead of the primary language subtag. - Regarding numeric region (country) codes. The numeric codes identify macro-geographic (continental) regions or sub-regions. If the region has an ISO 3166-1 code defined for it, that is what must be registered. The numeric code is only used for regions larger than a country.
To give an idea of what these numeric codes are, I found the list of numeric regions at http://rishida.net/utils/subtags/index.php?list=7&submit=List

001 World
002 Africa
005 South America
009 Oceania
011 Western Africa
013 Central America
014 Eastern Africa
015 Northern Africa
017 Middle Africa
018 Southern Africa
019 Americas
021 Northern America
029 Caribbean
030 Eastern Asia
034 Southern Asia
035 South-Eastern Asia
039 Southern Europe
053 Australia and New Zealand
054 Melanesia
057 Micronesia
061 Polynesia
142 Asia
143 Central Asia
145 Western Asia
150 Europe
151 Eastern Europe
154 Northern Europe
155 Western Europe
419 Latin America and the Caribbean

The key thing is that the rules in BCP 47 are set up in such a way that for each locale there is only one way to define it according to the IANA registries. For a given locale, it basically uses the shortest codes defined in ISO 3166-1 and ISO 639. If we just return this minimal code then clients shouldn't have to worry about checking for all the variants between alpha-2, alpha-3, and numeric-3. So my once again revised view of return types now is that Country and Language would both return a single alpha-2 or alpha-3 code that uniquely identifies the locale. **************************************************************** From: Tucker Taft Date: Saturday, June 5, 2010 10:42 AM > ... So my once again revised view of return types now is that Country > and Language would both return a single alpha-2 or > alpha-3 code that uniquely identifies the locale. TMI! We lose portability if we try to accommodate everything. I think we should pick one, and have the body map to that, presuming that is possible. Why allow both alpha-2 and alpha-3? How is the "portable" program supposed to deal with that? **************************************************************** From: Brad Moore Date: Saturday, June 5, 2010 12:53 PM > TMI! Sorry for the data overload.
Once I started poking around in BCP 47, I ran into issues I hadn't considered, and was getting quite frazzled with the approach of having a package implementation that blindly returned whatever the OS gave us. > We lose portability if we try to accommodate everything. I think we > should pick one, and have the body map to that, presuming that is > possible. Why allow both alpha-2 and alpha-3? How is the "portable" > program supposed to deal with that? Yes! If we have the implementation map the OS string to alpha-3, I think it makes things a whole lot simpler for us in the long run. I was worried that we wanted to allow returning the OS string value. While that might be trivial to implement as it doesn't require any mapping, you then have to deal with specifying what actually is returned, including special IANA rules regarding whether codes are registered or not, and how they are registered, and so on. As you mention, clients then also have a harder time trying to figure out whether to expect alpha-2 or alpha-3, or numeric-3. The BCP 47 description of macro-geographic regions also threw me for a loop, and I was worried about how to deal with that. Microsoft doesn't have locales based on macro-geographic regions; a locale is either a language, or a language specialized by a country. I suspect macro-geographic regions are seldom used, if at all. Since those aren't countries, if that's what the OS gives us, I think it would be fair to return Country_Unknown for those cases, if they exist. Likely it is the Language that is important, not the country/region name in those cases. So it seems clear to me that Language should return a 3 character ISO 639-2/T code. It is less clear whether we should go with alpha-2 or alpha-3 ISO 3166-1 codes for Country. We should pick one or the other and stick with it. Alpha-2 might require less implementation work for POSIX/OSX since BCP 47 does not allow alpha-3 country names. We could use a simple parsing of the OS value in that case.
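To make "a simple parsing of the OS value" concrete, here is a small sketch that pulls the two-letter territory out of a POSIX-style locale name of the form [language[_territory][.codeset][@modifier]] quoted earlier in the thread. It is only an illustration under assumptions: the names Parse_Lang_Demo, Upper, and Territory are hypothetical, the input strings are hard-coded rather than read from the environment, and "ZZ" is used as the fallback because that is the value the final wording gives for Country_Unknown.

```ada
with Ada.Text_IO;

procedure Parse_Lang_Demo is

   subtype Upper is Character range 'A' .. 'Z';

   --  Returns the two-letter territory of a POSIX-style locale name,
   --  or "ZZ" when no well-formed alpha-2 territory is present.
   function Territory (Locale : String) return String is
   begin
      for I in Locale'Range loop
         if Locale (I) = '_' then
            --  Accept exactly two upper case letters, followed by the
            --  end of the string, a codeset ('.'), or a modifier ('@').
            if I + 2 <= Locale'Last
              and then Locale (I + 1) in Upper
              and then Locale (I + 2) in Upper
              and then (I + 2 = Locale'Last
                        or else Locale (I + 3) = '.'
                        or else Locale (I + 3) = '@')
            then
               return Locale (I + 1 .. I + 2);
            else
               return "ZZ";
            end if;
         elsif Locale (I) = '.' or else Locale (I) = '@' then
            --  Codeset or modifier reached before any territory.
            return "ZZ";
         end if;
      end loop;
      return "ZZ";
   end Territory;

begin
   Ada.Text_IO.Put_Line (Territory ("en_CA.utf8"));  --  territory is CA
   Ada.Text_IO.Put_Line (Territory ("fr"));          --  no territory given
end Parse_Lang_Demo;
```

A numeric-3 region such as "419" fails the upper case test and so falls back to "ZZ", in line with the suggestion above to return Country_Unknown for macro-geographic regions.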
Windows would require a mapping either way. On the other hand, if we are using alpha-3 for Language, it might make sense to use alpha-3 for country also. Alpha-3 is generally more readable than alpha-2, and does not have the code space limitations of the 2 character coding scheme. Based on this, I think it makes sense to go with alpha-3 for Country name. I think we have converged a lot closer to a solution than the last writeup of AI-0127. I'm thinking I should submit an updated revision so people can get a better understanding of where we are at. **************************************************************** From: Tucker Taft Date: Saturday, June 5, 2010 1:19 PM I would follow POSIX if it has already standardized this to some degree. And yes, we need a simplified version of this incorporating your latest (and hopefully final ;-) thinking on this! **************************************************************** From: Brad Moore Date: Saturday, June 5, 2010 2:44 PM OK, based on that I will go with alpha-2 for Country codes. One more question. I am thinking of going with; type Language_Code is array (1 .. 3) of Character range 'a' .. 'z'; type Country_Code is array (1 .. 2) of Character range 'A' .. 'Z'; rather than; type Language_Code is new String (1 .. 3); type Country_Code is new String (1 .. 2); since it is a more precise definition, and better portrays to the user what they can expect as a return value. That raises the question of how to define Country_Unknown and Language_Unknown, since defining as a 2 or 3 spaces code would not match this definition. For country codes, ISO 3166-1 does define some reserved codes in two categories; - reserved codes - user defined codes. Reserved codes are codes that have become obsolete. The ISO 3166/MA, when justified, reserves these codes which it undertakes not to use for other than specified purposes during a limited or indeterminate period of time. 
User-assigned code elements are codes at the disposal of users who need to add further names of countries, territories, or other geographical entities to their in-house application of ISO 3166-1, and the ISO 3166/MA will never use these codes in the updating process of the standard. The following codes can be user-assigned: * Alpha-2: AA, QM to QZ, XA to XZ, and ZZ * Alpha-3: AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ Of these two categories, if we wanted to define a constant for Country_Unknown, I think we would want to select a code from the user-assigned code group. According to Wikipedia, one such user-assigned coding is by the Unicode Common Locale Data Repository, which assigns ZZ to represent "Unknown or Invalid Territory" Assuming this is the way we want to go, I would propose we use "ZZ" also for that purpose in the definition of Country_Unknown. I like the idea of using a value defined within the standard rather than defining our own constant such as " ". We can then say that the values returned by the Country function are always ISO 3166-1 codes. For language codes, ISO-639 defines "und" (for undetermined) which is used in situations in which a language or languages must be indicated but the language cannot be identified. The Language_Unknown constant should be set as that code. **************************************************************** From: Tucker Taft Date: Saturday, June 5, 2010 4:12 PM XX and xxx would seem to be natural choices. Defining the arrays as limited to the specified range of characters makes sense, since just looking at the spec will eliminate a lot of questions. **************************************************************** From: Brad Moore Date: Saturday, June 5, 2010 5:11 PM The reserved XXX code for alpha-3 only applies to Country codes and ISO 3166-1. It is not reserved as far as I know in ISO 639. 
ISO 639 specifically defines "und", which is not a special reserved code or user-assigned code; it is a regular code just like all the others. We should definitely be using "und" for Language_Unknown. Since we wouldn't use xxx for language, there is no benefit for a matching code for country. Since ZZ is the last user defined Country code, and because it already has uses for the purpose of representing an unknown country, I think we should use the ZZ code.

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010 3:58 PM

One other question. Does it really make sense for this package to be a child of System?

System packages seem to be packages that define things such as implementation defined constants, such as the Storage_Element definition. This locale package does not have any implementation defined definitions, and feels to me more like a portable library such as Ada.Directories. I think it should be a child of Ada, rather than a child of System.

****************************************************************

From: Tucker Taft
Date: Saturday, June 5, 2010 4:13 PM

Agreed. Make it a child of Ada, or conceivably a child of Interfaces.

****************************************************************

From: Brad Moore
Date: Saturday, June 5, 2010 5:55 PM

I have an updated version attached that [Version /02 - Editor.]

   - eliminates implementation defined constants
   - eliminates implementation defined return values
   - Limits the codes returned to always be 3 characters for language codes as defined by ISO 639-2/T and 2 characters for country codes as defined by ISO 3166-1.
   - Changes the types to specify lower case constraints for language codes and upper case constraints for country codes.
   - Moves the package from a child of System to a child of Ada.
****************************************************************

From: Randy Brukardt
Date: Saturday, June 12, 2010 9:48 PM

A couple of questions for the next version (*AFTER* the meeting):

The package name changed from Locale to Locales in this version. Was that intended?

The package requires ISO 639-2/T names. Why are the other parts of that standard referenced in the "Normative References" section? If we're not using them, they ought not be there. (BTW, I put those references in the required numeric order.)

****************************************************************

From: Brad Moore
Date: Friday, June 25, 2010 7:05 PM

> A couple of questions for the next version (*AFTER* the meeting):
>
> The package name changed from Locale to Locales in this version.
> Was that
> intended?

Yes, I should have mentioned that as another change (which actually was the name of the original version of the package). The reasoning behind that change is that when the package was a child of System, it seemed to make more sense to use singular, because it read better, since "System" acted as an adjective to describe "Locale". When moved as a child of Ada, this no longer was the case. "Ada" does not have a locale. Many of the other child packages of Ada are plural as well (Directories, Assertions, Containers, etc). Under "Ada", it seems to be a better choice to go with the plural form, since "Locales" is more like a subject area, much like "Assertions" is a subject area if one wants to go about inserting an assertion in one's code.

> The package requires ISO 639-2/T names. Why are the other parts of that
> standard referenced in the "Normative References" section? If we're not
> using them, they ought not be there. (BTW, I put those references in the
> required numeric order.)

I actually had removed 639-1 from my last submission; I see you added it back in. I removed 639-1 because that describes alpha-2 codes which we definitely weren't using.
I am thinking we should be able to get rid of the 639-2 reference also. The 639-3 standard is a superset of 639-2. It uses the 639-2/T codes instead of the 639-2/B codes, when there are two codes for a language in 639-2. The 639-3 codes are also alpha-3 codes, as are 639-2 codes. 639-3 also better deals with certain cases that are problematic in 639-2.

For example, Chinese is a macro language. In 639-2/T and 639-3 this appears as the code "zho". However, if the current locale is more specifically defined to be Mandarin Chinese, 639-3 provides a code "cmn" for this purpose. 639-2 does not break down Chinese any further than the macro language.

****************************************************************

From: Randy Brukardt
Date: Thursday, August 5, 2010 9:57 PM

The wording for this package has no introduction. Compare to A.16 (Directories) or A.17 (Environment_Variables).

The first sentence of the wording could be moved to be an introduction:

   A locale identifies a geopolitical place or region, its associated character sets, data and time formats, currency formats, and other internationalization related characteristics.

"locale" probably ought to be in italics here.

But this wording bothers me, as it seems to promise a lot more than this package actually is going to deliver (country and language codes). There is nothing about character sets, time formats, or currency formats here! I'd be happier if we promised less:

   A locale identifies a geopolitical place or region and its associated language, which can be used to determine other internationalization related characteristics.

Anybody have better wording or another idea??

---

Another problem with this wording: there never is anything that says what Language and Country return (other than in the case where they don't know).
We define what a Language_Code and a Country_Code is, but that wording seems to assume the only way to get such a value is from these functions (which is obviously False, given the constants and string literals in the package). Surely the *type* has nothing to do with the active locale!

So, for example, the Country_Code should be defined as follows:

   Country_Code is an upper-case string representation of an ISO 3166-1 alpha-2 code that identifies a country.

And the function Country as (with the second sentence being the existing wording):

   Function Country returns the code of the country associated with the active locale. If the Country_Code associated with the active locale cannot be determined from the environment then Country returns Country_Unknown.

Language and Language_Code should be defined in the same way.

---

Finally a trivial glitch: the runtime semantics of library packages are always defined in the Static Semantics section (don't ask me why); there shouldn't be a Dynamic Semantics section.

****************************************************************

From: Brad Moore
Date: Friday, August 6, 2010 3:02 PM

> The first sentence of the wording could be moved to be an introduction:
>

Agree

> A locale identifies a geopolitical place or region and its associated
> language, which can be used to determine other internationalization
> related characteristics.
>
> Anybody have better wording or another idea??
>

I am fine with the wording you have suggested, unless someone comes up with something better.

> So, for example, the Country_Code should be defined as follows:
>
> Country_Code is an upper-case string representation of an ISO 3166-1
> alpha-2 code that identifies a country.
>
> And the function Country as (with the second sentence being the
> existing
> wording):
>
> Function Country returns the code of the country associated with the
> active locale.
If the Country_Code associated with the active locale
> cannot be determined from the environment then Country returns Country_Unknown.
>
> Language and Language_Code should be defined in the same way.

Agree

One other thing I have been thinking about. I think it would be nice if this package could be a remote types package. For example, I can imagine a server application that receives requests from clients using different locales. The client could include its country/language code in the request, and the server could respond with a response suitable for the client's locale.

****************************************************************

From: Randy Brukardt
Date: Saturday, August 7, 2010 12:17 AM

...

> One other thing I have been thinking about. I think it would be nice
> if this package could be a remote types package.

That seems harmless, given that the package only exports two visible string types. How could it *not* work as a remote types package?? It surely meets all of the requirements of E.2.2. I'll add the missing pragma to the specification.

****************************************************************
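
For illustration only, here is a minimal sketch of how a client might query the package as specified in the !wording section. The procedure name Show_Locale is hypothetical and is not part of the proposal. The codes are printed character by character rather than converted to String, on the assumption that such a conversion may not be legal when the component subtype is constrained as in the specification above.

```ada
with Ada.Locales;
with Ada.Text_IO;

--  Hypothetical client of Ada.Locales; not part of the proposal.
procedure Show_Locale is
   --  Make "=" for the two code types directly visible.
   use type Ada.Locales.Language_Code;
   use type Ada.Locales.Country_Code;

   Lang : constant Ada.Locales.Language_Code := Ada.Locales.Language;
   Ctry : constant Ada.Locales.Country_Code  := Ada.Locales.Country;
begin
   if Lang = Ada.Locales.Language_Unknown then
      Ada.Text_IO.Put_Line ("Language could not be determined");
   else
      Ada.Text_IO.Put ("Language: ");
      --  Each component is a Character in 'a' .. 'z'.
      for I in Lang'Range loop
         Ada.Text_IO.Put (Lang (I));
      end loop;
      Ada.Text_IO.New_Line;
   end if;

   if Ctry = Ada.Locales.Country_Unknown then
      Ada.Text_IO.Put_Line ("Country could not be determined");
   else
      Ada.Text_IO.Put ("Country: ");
      --  Each component is a Character in 'A' .. 'Z'.
      for I in Ctry'Range loop
         Ada.Text_IO.Put (Ctry (I));
      end loop;
      Ada.Text_IO.New_Line;
   end if;
end Show_Locale;
```

Because Language_Unknown ("und") and Country_Unknown ("ZZ") are ordinary values of the code types, the "unknown" case is tested with plain equality and needs no exception handling.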