Version 1.1 of ai05s/ai05-0266-1.txt


!standard 1.2(8/2)          11-11-01 AI05-0266-1/01
!standard A.3.5(0)
!class Amendment 11-11-01
!status work item 11-11-01
!status received 11-09-29
!priority Low
!difficulty Easy
!subject Use the latest version of ISO/IEC 10646
!summary
(1) Ada 2012 should depend on the 2011 version of 10646.
(2) Ada.Wide_Characters.Handling should have a statement that the behavior of its functions depends on the character set standard used.
!proposal
In March of 2011, a new version of the character set standard, ISO/IEC 10646:2011, was issued. Ada 2012 should use the most recent version of the character set standard (as it does with other standards).
[Editor's note: What about C and C++? I think there have been more recent versions of those standards, too.]
Related to that is the runtime behavior of Ada.Wide_Characters.Handling. We do not want this package to be tied to a particular character set standard forever. So we should include a statement that the behavior for particular characters depends on the character set standard in use. Future versions of Ada will use newer standards, so programs that depend on the specific behavior for particular characters probably should not depend on this package (or on Ada.Wide_Wide_Characters.Handling).
!wording
In 1.2(8/2), change "2003" to "2011".
Add at the end of A.3.5:
The results of these functions depend on the character set standard used by a particular version of Ada. Future Ada standards will typically use newer character set standards, and these functions will change their results to reflect those standards. If a program requires behavior specifically of a particular character set standard, this package should not be used.
[Editor's note: I don't have a great way to word this, hopefully someone will have a better idea.]
!discussion
10646:2011 adds an Annex U about identifier syntax. But all it says is to go read the Unicode documents! We need to reconsider exactly which characters are allowed in identifiers in order to meet this standard, but we'll do that in a separate AI (as this topic is not as clear-cut as the others here, and we already have such an AI to deal with another, related problem).
Using 10646:2011 changes the details of "Simple Locale-independent Case Folding", "Simple Uppercase Mapping", and "Simple Lowercase Mapping", used by various rules in the standard. The first of these is defined to be "stable" (always changed compatibly), so there should not be any incompatibilities or inconsistencies caused by its change. Changes to "Simple Uppercase Mapping" might change the 'Image of identifiers containing obscure characters, and could make an enumeration type containing such obscure characters illegal -- but as the changes are all in unusual characters, this is unlikely to be a problem in practice. (Following the Ada 2012 rules exactly is likely to have the same level of incompatibility.)
Using 10646:2011 will categorize more characters as letters, so that they would be allowed in identifiers. But we will consider adopting the Unicode 6.0 recommendations (as referenced in 10646:2011) for identifiers, which would also subtly change the characters allowed, with an early Binding Interpretation. So we will not consider any such effects here.
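The effect described above, where the same character is categorized differently depending on which edition of the character set standard is consulted, can be seen with a small sketch. It is written in Python rather than Ada, purely as an analogue: Python's standard unicodedata module ships both its current Unicode database and a frozen Unicode 3.2 database (unicodedata.ucd_3_2_0). Those two versions stand in for 10646:2003 and 10646:2011; they are not the versions Ada refers to. The helper names (category_by_version, is_letter) and the choice of U+1E9E, which was added in Unicode 5.1, are assumptions made for the illustration.

```python
import unicodedata

# LATIN CAPITAL LETTER SHARP S, added in Unicode 5.1 (an illustrative choice).
CAPITAL_SHARP_S = "\u1e9e"


def category_by_version(ch: str) -> dict:
    """Return the general category of `ch` under two Unicode database versions.

    The keys are version strings: the frozen "3.2.0" database and whatever
    version the running Python interpreter was built with.
    """
    return {
        "3.2.0": unicodedata.ucd_3_2_0.category(ch),
        unicodedata.unidata_version: unicodedata.category(ch),
    }


def is_letter(category: str) -> bool:
    """True when a general category is one of the letter categories (L*)."""
    return category.startswith("L")


if __name__ == "__main__":
    # The character is unassigned ("Cn") in the older database and an
    # uppercase letter in the newer one, so a letter test gives different
    # answers depending on the database consulted.
    for version, category in category_by_version(CAPITAL_SHARP_S).items():
        print(version, category, is_letter(category))
```

A program that needs the classification of one specific edition has to pin that edition explicitly, as the sketch does with ucd_3_2_0; that is the kind of dependence the proposed wording warns against placing on Ada.Wide_Characters.Handling.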
!ACATS test
An ACATS C-Test to check that some characters added by 10646:2011 are supported and properly categorized.
Any tests involving identifiers should be postponed until the AI on identifiers is decided.
!ASIS
No ASIS effect. (??)
!appendix

From: Randy Brukardt
Sent: Thursday, September 29, 2011  11:20 PM

While researching a question from Erhard's review, I happened to notice that a
new edition of ISO 10646, the character set standard, was issued this year.
(It's dated March 15th.)

Ada 2005 relied on ISO 10646:2003, which corresponds to Unicode 4.0.

ISO 10646:2011 corresponds to Unicode 6.0 - which has nearly 2100 additional
characters over Unicode 5.2. (No idea how many have been added compared to
10646:2003, but it would seem to be a lot.)

Which set is used would affect the exact characters used in identifiers (many of
the new characters could be used in identifiers, and there are a few characters
which have been reclassified such that they would not be usable in identifiers).
It also would affect the results from the new packages
Ada.Wide_Characters.Handling and Ada.Wide_Wide_Characters.Handling. Presumably
(although I haven't checked this), there would be changes in case mapping as
well.

Generally, Ada has relied on the most recent version of other standards. If we
follow this, we should change to using 10646:2011. But note that doing so would
present a (very mild) incompatibility, in that there would exist identifiers
legal in Ada 2005 that would not be legal in Ada 2012. Given the fact that the
identifier rules in Ada 2005 were very screwed up, I suspect that this would be
unnoticeable outside of the incompatibility documentation in the Standard.

Should we make this change? Let's discuss this a bit, and then I'll send out a
Letter Ballot to get a definitive answer.

****************************************************************

From: Jean-Pierre Rosen
Sent: Thursday, September 29, 2011  11:36 PM

> Generally, Ada has relied on the most recent version of other
> standards. If we follow this, we should change to using 10646:2011.
> But note that doing so would present a (very mild) incompatibility, in
> that there would exist identifiers legal in Ada 2005 that would not be legal
> in Ada 2012.

Anybody who used those letters in identifiers will get the trouble they deserve.
Even in French, I always advise against using accented letters - which are
pretty stable.

> Given the
> fact that the identifier rules in Ada 2005 were very screwed up, I
> suspect that this would be unnoticeable outside of the incompatibility
> documentation in the Standard.

Not doing the change would even involve the risk of ISO frowning at us - and
corresponding delay in the standard.

****************************************************************

From: Randy Brukardt
Sent: Thursday, September 29, 2011  11:59 PM

> ISO 10646:2011 corresponds to Unicode 6.0 - which has nearly 2100
> additional characters over Unicode 5.2. (No idea how many have been
> added compared to 10646:2003, but it would seem to be a lot.)

Looking over this new standard, some things jump out at me:

(1) There seem to be a lot more references to Unicode. Apparently, the aversion
    to that has worn off somewhat.

(2) There is now an Annex (U) discussing identifiers. But all it says is to go
    read the Unicode document on the subject (giving a link)!

(3) Aside: I did just skim the Unicode document on identifiers. They've added
    some additional character properties specifically for identifiers. These are
    supposedly stable, in that newer Unicode versions will never take characters
    out of these categories. These would probably be better to base Ada on,
    however this would allow quite a few additional characters in identifiers
    (and would require more rewriting of the Standard). But the win is that it
    would avoid future incompatibilities. (One also could imagine adding
    functions to Wide_Characters.Handling to return these properties, thus giving
    a decent way to process identifiers using those libraries.)  The document
    also suggests a different algorithm for applying normalization than Ada 2005
    does (probably because the Unicode document has changed a lot) -- we have an
    upcoming Ada 2012 BI on that issue [based on a question posted on
    Ada-Comment]. Probably should leave the question of changing the characters
    allowed until that BI.

(4) Annex C and D (referred to in our A.4.11) have been folded into the
    normative standard (although placeholders remain).

(5) Still no real information about case mapping or the like. We still have to
    reference the "documents mentioned in the note of Section 1".

****************************************************************

From: Robert Dewar
Sent: Friday, September 30, 2011  3:59 AM

> Generally, Ada has relied on the most recent version of other
> standards. If we follow this, we should change to using 10646:2011.
> But note that doing so would present a (very mild) incompatibility, in
> that there would exist identifiers legal in Ada 2005 that would not be
> legal in Ada 2012. Given the fact that the identifier rules in Ada
> 2005 were very screwed up, I suspect that this would be unnoticeable
> outside of the incompatibility documentation in the Standard.
>
> Should we make this change? Let's discuss this a bit, and then I'll
> send out a Letter Ballot to get a definitive answer.

My first reaction was why not, go ahead with the change, no one uses this stuff
anyway.

Then I got to thinking that this will require several days work to research what
has changed, rerun the utilities to generate tables, rebuild the units using
these tables etc etc etc, all 100% totally useless work solely for the sake of a
reference that no one cares about.

Still I suppose we should make the change. Probably the best thing is to make
the change quietly, and then I don't think GNAT will even bother to do anything
about it till someone complains, which will be never.

****************************************************************

From: Robert Dewar
Sent: Friday, September 30, 2011  4:01 AM

> (5) Still no real information about case mapping or the like. We still
> have to reference the "documents mentioned in the note of Section 1".

I regard case mapping for extended characters as an abomination. It is not
possible to do it "right" in a locale-independent way, and doing it at all is a
huge mistake.

****************************************************************

From: Tucker Taft
Sent: Friday, September 30, 2011  11:07 AM

> While researching a question from Erhard's review, I happened to
> notice that a new edition of ISO 10646, the character set standard,
> was issued this year. (It's dated March 15th.)

I would say go for the latest.
Better now than later, especially if there are already Ada 2012 changes in this
area.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, October 11, 2011  3:43 PM

As previously noted, we need to decide whether to change to the latest version of
the character set standard. For most purposes, this is not a problem, but there
is an incompatibility as some Ada 2005 identifiers would not be legal in Ada
2012 -- these would use *very* obscure characters. (But given that the rules for
identifiers are very screwed up in Ada 2005, this incompatibility is much
smaller than the potential one caused by applying the BI on identifiers). Also
note that this will have an effect on the results from the functions in
Wide_Characters.Handling for obscure characters.

Following is a Letter Ballot on this topic; please respond ASAP (but no later
than Monday, October 17th):


   The character set standard used in Ada 2012 should be:

        _____  ISO/IEC 10646:2003 (that is, no change - corresponds to Unicode
               4.0).


        _____  ISO/IEC 10646:2011 (that is, the current standard - corresponds
               to Unicode 6.0). If choosing this option, please select from one
               of the following:

              _____  Keep the identifier rules as currently defined, with no
                     plans to change them.

              _____  Keep the identifier rules as currently defined, but plan to
                     issue a BI in the future [if this is appropriate after
                     study] to change to use the recommended XId_Start and
                     XId_Continue classes to define the characters that can be
                     used. (These are defined to be stable, like case folding,
                     but unlike the letter classes we currently use.) We'd
                     probably also want to add functions matching these
                     classifications to (Wide_)Wide_Characters.Handling so that
                     identifier processing can be usefully written in Ada code
                     (that's not possible now as the currently used classes
                     aren't stable and thus will change from Ada version to Ada
                     version).

              _____  Change the identifier rules now to use the XId_Start and
                     XId_Continue classes. (Probably would delay the Standard -
                     we'll need to consider the effect of potentially including
                     non-letters in identifiers on 'Image, among other things.)
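
As an illustration of what identifier rules built on the XId_Start and XId_Continue classes look like in practice, here is a minimal sketch in Python rather than Ada. Python's own language reference defines identifiers in terms of those two Unicode properties (with the underscore additionally allowed as a start character), and str.isidentifier() exposes that rule. The helper name classify_identifiers and the sample strings are assumptions for the illustration; nothing here reflects the rules Ada would adopt.

```python
def classify_identifiers(candidates):
    """Map each candidate string to whether Python accepts it as an identifier.

    str.isidentifier() applies Python's XID_Start / XID_Continue based rule,
    so this is only an analogue of an Ada rule built on the same classes.
    """
    return {text: text.isidentifier() for text in candidates}


if __name__ == "__main__":
    samples = [
        "x2",               # letter, then digit
        "2x",               # digit cannot start an identifier
        "\u00e9t\u00e9",    # accented Latin letters
        "a-b",              # hyphen-minus is not an identifier character
    ]
    for text, accepted in classify_identifiers(samples).items():
        print(repr(text), accepted)
```

Because the properties are defined to be stable, a string accepted under one Unicode version stays accepted under later versions, which is what would make library functions matching these classes usable for identifier processing.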

****************************************************************

From: Randy Brukardt
Sent: Tuesday, October 11, 2011  11:56 PM

> Following is a Letter Ballot on this topic; please respond ASAP (but
> no later than Monday, October 17th):
>
>
>    The character set standard used in Ada 2012 should be:
>
>         _____  ISO/IEC 10646:2003 (that is, no change - corresponds to
>                Unicode 4.0).
>
>
>         __X___  ISO/IEC 10646:2011 (that is, the current standard -
>                 corresponds to Unicode 6.0).
>                       If choosing this option, please select from one
> of the following:
>
>               _____  Keep the identifier rules as currently defined,
> with no plans to change them.
>
>               ___Y__  Keep the identifier rules as currently defined,
> but plan to issue a BI in the future [if this is appropriate after
> study] to change to use the recommended XId_Start and XId_Continue
> classes to define the characters that can be used. (These are defined
> to be stable, like case folding, but unlike the letter classes we
> currently use.) We'd probably also want to add functions matching
> these classifications to (Wide_)Wide_Characters.Handling so that
> identifier processing can be usefully written in Ada code (that's not
> possible now as the currently used classes aren't stable and thus will
> change from Ada version to Ada version).
>
>               _____  Change the identifier rules now to use the
> XId_Start and XId_Continue classes. (Probably would delay the Standard
> - we'll need to consider the effect of potentially including
> non-letters in identifiers on 'Image, among other things.)

****************************************************************

From: Robert Dewar
Sent: Wednesday, October 12, 2011  5:25 AM

> Following is a Letter Ballot on this topic; please respond ASAP (but
> no later than Monday, October 17th):
>
>
>     The character set standard used in Ada 2012 should be:
>
>          __X___  ISO/IEC 10646:2003 (that is, no change - corresponds
>                  to Unicode 4.0).

just because I think this stuff is so little used (not sure it is used at all),
and it is not worth doing major implementation work to make a change that will
affect no one.

OTOH, if we do the change, I don't think GNAT will bother to follow unless some
real user complains, which will likely be never :-)

****************************************************************

From: Tucker Taft
Sent: Wednesday, October 12, 2011  7:27 AM

I'll go with Randy's recommendation (see below).

...
>          _X____  ISO/IEC 10646:2011 (that is, the current standard -
>                  corresponds to Unicode 6.0).
>                        If choosing this option, please select from one
>                        of the following:
>
>                _____  Keep the identifier rules as currently defined,
>                       with no plans to change them.
>
>                __X___  Keep the identifier rules as currently defined,
>                        but plan to issue a BI in the future [if this is
>                        appropriate ...

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, October 12, 2011  7:41 AM

> Following is a Letter Ballot on this topic; please respond ASAP (but
> no later than Monday, October 17th):
>
>
>    The character set standard used in Ada 2012 should be:
>
>         _____  ISO/IEC 10646:2003 (that is, no change - corresponds to
>                Unicode 4.0).
>
>
>         _X____  ISO/IEC 10646:2011 (that is, the current standard -
>                 corresponds to Unicode 6.0).
>                       If choosing this option, please select from one
>                       of the following:
>
>               _____  Keep the identifier rules as currently defined,
>                      with no plans to change them.
>
>               ___X__  Keep the identifier rules as currently defined,
> but plan to issue a BI in the future [if this is appropriate after
> study] to change to use the recommended XId_Start and XId_Continue
> classes to define the characters that can be used. (These are defined
> to be stable, like case folding, but unlike the letter classes we
> currently use.) We'd probably also want to add functions matching
> these classifications to (Wide_)Wide_Characters.Handling so that
> identifier processing can be usefully written in Ada code (that's not
> possible now as the currently used classes aren't stable and thus will
> change from Ada version to Ada version).
>
>               _____  Change the identifier rules now to use the
> XId_Start and XId_Continue classes. (Probably would delay the Standard
> - we'll need to consider the effect of potentially including
> non-letters in identifiers on 'Image, among other things.)

****************************************************************

From: Bob Duff
Sent: Wednesday, October 12, 2011  8:44 AM

> As previously noted, we need to decide whether to change to the latest
> version of the character set standard. For most purposes, this is not
> a problem, but there is an incompatibility as some Ada 2005
> identifiers would not be legal in Ada 2012 -- these would use *very*
> obscure characters. (But given that the rules for identifiers are very
> screwed up in Ada 2005, this incompatibility is much smaller than the
> potential one caused by applying the BI on identifiers).

I don't understand how the rules are "very screwed up"
(and I don't really want to -- I'm sure you've explained it before -- no need to
do so again).

But whatever the screwup is, I'm guessing implementations don't obey it, so when
talking about [in]compatibility, we should be talking about what implementations
actually do.

Anyway, my vote is for:

>         __X__  ISO/IEC 10646:2003 (that is, no change - corresponds to
>                Unicode 4.0).

because any change is going to require a lot of not-very-useful work for
implementations.

****************************************************************

From: Robert Dewar
Sent: Wednesday, October 12, 2011  8:51 AM

> But whatever the screwup is, I'm guessing implementations don't obey
> it, so when talking about [in]compatibility, we should be talking
> about what implementations actually do.

GNAT follows exactly the 2005 rules. I don't really agree they are "very screwed
up", but I know of no discrepancies between the 2005 standard and what GNAT
does. The issue is the categories and the way they are used.

> Anyway, my vote is for:
>
>>          __X__  ISO/IEC 10646:2003 (that is, no change - corresponds
>> to Unicode 4.0).
>
> because any change is going to require a lot of not-very-useful work
> for implementations.

Well, pretend to require, you can't really require implementations to do
anything :-)

****************************************************************

From: Bob Duff
Sent: Wednesday, October 12, 2011  9:12 AM

> Well, pretend to require, you can't really require implementations to
> do anything :-)

Very good point!  We (language designers) have a tendency to forget that.

****************************************************************

From: Robert Dewar
Sent: Wednesday, October 12, 2011  9:23 AM

And I think after some debacles (like the leap second nonsense) implementors are
less likely to automatically jump to implement everything :-)

****************************************************************

From: Erhard Ploedereder
Sent: Wednesday, October 12, 2011  12:07 PM

> Following is a Letter Ballot on this topic; please respond ASAP (but
> no later than Monday, October 17th):

I'll abstain on this ballot out of sheer ignorance of the issues.

****************************************************************

From: Tullio Vardanega
Sent: Wednesday, October 12, 2011  1:04 PM

So do I.

>> Following is a Letter Ballot on this topic; please respond ASAP (but
>> no later than Monday, October 17th):
> I'll abstain on this ballot out of sheer ignorance of the issues.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, October 12, 2011  7:10 PM

> > But whatever the screwup is, I'm guessing implementations don't obey
> > it, so when talking about [in]compatibility, we should be talking
> > about what implementations actually do.
>
> GNAT follows exactly the 2005 rules.

I very highly doubt this.

> I don't really agree
> they are "very screwed up", but I know of no discrepancies between the
> 2005 standard and what GNAT does. The issue is the categories and the
> way they are used.

The categories are only the tip of the iceberg. Does GNAT:

(1) allow "other-format" characters (like soft hyphens) in identifiers?
Original Ada 2005 did (later repealed for Ada 2012).

(2) use full case folding for identifier equivalence checks? That means that
"aß" is the same as "ass" and "ASS". (Also now changed for Ada 2012 for
compatibility reasons with Ada 95, but it was fully intended to be the case for
Ada 2005.)

(3) return a full case-folded string from 'Image (as specified in the Ada 2005
standard), even when this would change the length and typically put the string
into lower case? (This was just a bug in Ada 2005, but there is no easy fix and
extensive changes were needed.)

My understanding from previous discussions is that GNAT does none of these.
That's probably a good thing [(3) is clearly a case of Robert's rule of the
standard saying something silly; (1) was repealed a long time ago; and (2)
caused an unintentional incompatibility], but it surely is not the same as
"follows exactly the Ada 2005 rules". It's much closer to "following the Ada
2005 rules as we wish they would be". ;-)
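
Points (2) and (3) above turn on the fact that full case folding is not length-preserving. A minimal sketch, in Python rather than Ada and offered only as an illustration: str.casefold() implements Unicode full case folding, under which the sharp s (U+00DF) folds to "ss". str.lower() is used here merely as a length-preserving stand-in for comparison; it is not an implementation of the Simple Case Folding that the Ada rules refer to. The function names are assumptions.

```python
def same_under_full_folding(left: str, right: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return left.casefold() == right.casefold()


def same_under_length_preserving_lowercase(left: str, right: str) -> bool:
    """Compare after lowercasing; a stand-in, not Simple Case Folding itself."""
    return left.lower() == right.lower()


if __name__ == "__main__":
    sharp_s_word = "a\u00df"  # "a" followed by LATIN SMALL LETTER SHARP S

    # Full folding expands the sharp s, so all three spellings are equivalent.
    print(same_under_full_folding(sharp_s_word, "ass"))
    print(same_under_full_folding(sharp_s_word, "ASS"))

    # The folded form is longer than the original string.
    print(len(sharp_s_word), len(sharp_s_word.casefold()))

    # A length-preserving mapping keeps the spellings distinct.
    print(same_under_length_preserving_lowercase(sharp_s_word, "ass"))
```

This is why an 'Image defined through full case folding could return a string of a different length than the identifier as written, while a rule based on a one-to-one (simple) mapping cannot.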

Back to the topic: the only major change from using 10646:2011 instead of
10646:2003 would be that a few obscure characters would change category, and
presumably the equivalence ("case folding") and case conversion tables also have
some changes in obscure cases. Any other changes to identifiers would need to be
discussed in the future because we need to consider all of the impacts (and we
already have an open soon-to-be AI on "normalization", which probably will
demand more changes to the rules anyway).

It should be noted that the "official" Ada rules have been coming closer to what
you want, but that the ARG remains committed to following the Unicode
recommendations as closely as makes sense for Ada. That almost certainly means
that there are going to be some rules that you don't like.

I've said before, and I'll be happy to say again that I don't think characters
outside of Latin-1 should be allowed in identifiers, period, but we did not feel
that we had a choice in this matter given the directives on internationalization
of programming languages. As such, we have to make the best fit of those
recommendations with Ada.

****************************************************************

From: Robert Dewar
Sent: Wednesday, October 12, 2011  8:56 PM

> The categories are only the tip of the iceberg. Does GNAT:
>
> (1) allow "other-format" characters (like soft hyphens) in identifiers?
> Original Ada 2005 did (later repealed for Ada 2012).

yes, then changed

> (2) use full case folding for identifier equivalence checks? That
> means that "aß" is the same as "ass" and "ASS". (Also now changed for
> Ada 2012 for compatibility reasons with Ada 95, but it was fully
> intended to be the case for Ada 2005.)

No, but I always thought this was an absurd misreading of Ada 2005, no informed
person can have intended that reading

> (3) return a full case-folded string from 'Image (as specified in the Ada
> 2005 standard), even when this would change the length and typically
> put the string into lower case? (This was just a bug in Ada 2005, but
> there is no easy fix and extensive changes were needed.)

It can't change the length

> My understanding from previous discussions is that GNAT does none of these.
> That's probably a good thing [(3) is clearly a case of Robert's rule of
> the standard saying something silly; (1) was repealed a long time ago;
> and (2) caused an unintentional incompatibility], but it surely is not
> the same as "follows exactly the Ada 2005 rules". It's much closer to
> "following the Ada 2005 rules as we wish they would be". ;-)

The rules were badly written, and have to be interpreted with lavish use of
Robert's rule

> It should be noted that the "official" Ada rules have been coming
> closer to what you want, but that the ARG remains committed to
> following the Unicode recommendations as closely as makes sense for
> Ada. That almost certainly means that there are going to be some rules
> that you don't like.
>
> I've said before, and I'll be happy to say again that I don't think
> characters outside of Latin-1 should be allowed in identifiers,
> period, but we did not feel that we had a choice in this matter given
> the directives on internationalization of programming languages. As
> such, we have to make the best fit of those recommendations with Ada.

Fine, but why bother with changing them then if all you are doing is meeting
directives, rather than doing something useful. I don't see that there were any
directives mandating case folding, which remains a plain error in thinking.

****************************************************************

From: Brad Moore
Sent: Wednesday, October 12, 2011  9:42 PM

> > Well, pretend to require, you can't really require
> > implementations to do anything :-)
>
> Very good point!  We (language designers) have a tendency to
> forget that.

If that's the case then would it not be better to at least have the RM
mention the more up to date standard, so that implementations can
go with that version if they have the time and energy to implement it?

If instead they find nobody cares or notices that the newer standard isn't
implemented, then leaving their implementation as is isn't a problem either, is
it?

I'm trying to decide how to respond to the ballot. My feeling is that Randy's
response is the best response, but I also am sympathetic to implementation
burden, if real users aren't likely to notice one way or the other.

****************************************************************

From: John Barnes
Sent: Thursday, October 13, 2011  7:53 AM

I agree with Erhard. I am going to abstain as well.

> I'll abstain on this ballot out of sheer ignorance of the issues.

****************************************************************

From: Gary Dismukes
Sent: Thursday, October 13, 2011  1:34 PM

> I agree with Erhard. I am going to abstain as well.

Count me in the list of abstainers.  I don't understand the issues well enough.
(If forced to vote I'd go for no change.)

****************************************************************

From: Steve Baird
Sent: Thursday, October 13, 2011  1:57 PM

>> I agree with Erhard. I am going to abstain as well.
>
> Count me in the list of abstainers.  I don't understand the issues
> well enough.  (If forced to vote I'd go for no change.)
>

Ditto.

****************************************************************

From: Ed Schonberg
Sent: Thursday, October 13, 2011  2:19 PM

I abstain as well, and for the same reasons.

****************************************************************

From: Tucker Taft
Sent: Thursday, October 13, 2011  2:31 PM

You guys are a bunch of wimps... ;-)

****************************************************************

From: Jean-Pierre Rosen
Sent: Thursday, October 13, 2011  2:40 PM

Well, let me comment why I didn't abstain. If we believe in standards, and if we
believe that the guys who design 10646 know better than us, we have to follow.
The only freedom we have is in trying to do so in a manner that is not too
disruptive.

****************************************************************

From: Robert Dewar
Sent: Thursday, October 13, 2011  4:00 PM

perhaps we have to follow, but not to race, we have not had enough time to study
this change, let's leave it for Ada 2020, and perhaps issue a BI that allows
implementations to change before then, just as we did for 8-bit characters.

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 13, 2011  4:26 PM

The only problem with that is that it would change the run-time behavior of
[Wide_]Wide_Characters.Handling (since a few characters change classifications).
It seems like a bad idea to have different implementations having different
interpretations of the correct behavior of these functions.

OTOH, the identifier syntax changes definitely need study before we adopt them
(or not). No one can reasonably implement what Ada 2005 actually says, so
implementations will inevitably differ subtly on this in any case, and there
seems to be little evidence that programmers are using this, so deferring the
change there is better.

ISO 10646:2011 has an Annex (annex U) that specifically says that identifiers in
programming languages should follow the Unicode recommendations (giving a link,
not including them). But there is a lot of wiggle room in those Unicode
recommendations.

One alternative to "fix" the run-time issue with [Wide_]Wide_Characters.Handling
would be to say that it is implementation-defined exactly which character set
standard it depends on. Or some other statement that programmers should expect
there will be changes in character classifications, case conversions, and the
like in future standards, so we can ignore the "compatibility" issue in the
future. (After all, for most applications, it won't make any difference, or it
would be *better* for the package to use the most recent character set standard
- or at least one that applies to the target system; tying it for all time to
any particular character set standard [which we know is going to change] is
rather silly.)

****************************************************************

From: Robert Dewar
Sent: Thursday, October 13, 2011  4:42 PM

> The only problem with that is that it would change the run-time
> behavior of [Wide_]Wide_Characters.Handling (since a few characters
> change classifications). It seems like a bad idea to have different
> implementations having different interpretations of the correct behavior of
> these functions.

This is more of a theoretical concern than an actual one. And changing the
standard is not going to have any immediate effect on GNAT in the immediate
future anyway (we have already frozen the feature set for the 2012 releases of
GNAT).

And of course your suggestion leads to different implementations having even
more different interpretations (how many other Ada 2012 compilers do you expect
to see in the near future?), since it is much more likely that the two different
implementations involved will be an Ada 2005 one and an Ada 2012 one.

Furthermore, we did a much bigger incompatible change with 7 to 8-bit characters
and it caused very little trouble.

> One alternative to "fix" the run-time issue with
> [Wide_]Wide_Characters.Handling would be to say that it is
> implementation-defined exactly which character set standard it depends on.

What on earth would that achieve

> Or some other statement that programmers should expect there will be
> changes in character classifications, case conversions, and the like
> in future standards, so we can ignore the "compatibility" issue in the
> future. (After all, for most applications, it won't make any
> difference, or it would be
> *better* for the package to use the most recent character set standard
> - or at least one that applies to the target system; tying it for all
> time to any particular character set standard [which we know is going
> to change] is rather silly.)

That's merely a formalistic argument, no programmer will change their behavior
on the basis of such a statement in the RM.

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 13, 2011  7:28 PM

...
> > The only problem with that is that it would change the run-time
> > behavior of [Wide_]Wide_Characters.Handling (since a few characters
> > change classifications). It seems like a bad idea to have different
> > implementations having different interpretations of the correct
> > behavior of these functions.
>
> This is more of a theoretical concern than an actual one. And changing
> the standard is not going to have any immediate effect on GNAT in the
> immediate future anyway (we have already frozen the feature set for
> the
> 2012 releases of GNAT).

It's not that theoretical: Ada.Wide_Wide_Characters.Handling is easy to
implement and probably will be supported by a number of Ada compilers in the
near future. And it has nothing to do with "features": the package exists in any
case, the question is exactly what it should return.

> And of course your suggestion leads to different implementations
> having even more different interpretations (how many other Ada 2012
> compilers do you expect to see in the near future?), since it is much
> more likely that the two different implementations involved will be an
> Ada 2005 one and an Ada 2012 one.

No Ada 2005 implementation has Ada.Wide_Wide_Characters.Handling -- it's an Ada
2012 package. If it does have it, it's formally an implementation-defined
package and thus irrelevant.

Let me say again, I am *not* talking in any way about identifiers or their
syntax. They have absolutely nothing to do with the package
Ada.Wide_Wide_Characters.Handling.

> Furthermore, we did a much bigger incompatible change with 7 to 8-bit
> characters and it caused very little trouble.

I don't see how this has anything whatsoever to do with the case in point.

...
> > Or some other statement that programmers should expect there will be
> > changes in character classifications, case conversions, and the like
> > in future standards, so we can ignore the "compatibility" issue in
> > the future. (After all, for most applications, it won't make any
> > difference, or it would be
> > *better* for the package to use the most recent character set
> > standard
> > - or at least one that applies to the target system; tying it for
> > all time to any particular character set standard [which we know is
> > going to change] is rather silly.)
>
> That's merely a formalistic argument, no programmer will change their
> behavior on the basis of such a statement in the RM.

Probably not, and that's OK -- the primary thing is to warn programmers that the
behavior of these functions on currently undefined code points is likely to
change in future versions of Ada. As with any case of a "bounded error", neither
the compiler implementer nor programmers are likely to pay much attention to the
rule -- but at least they were warned in print.

Anyway, let me ask you specifically what you think this package should do for
new/changed characters. I'm specifically talking about the behavior of functions
in Wide_Wide_Characters.Handling like Is_Letter and Is_Upper when they are
passed a character with a code position corresponding to a new character defined
in 10646:2011 (or some later version):

   (1) Wide_Wide_Characters.Handling returns values based on 10646:2003 forever.
       Very compatible, but also very out of date in the future (Ada 2012 is
       expected to last until 2020, at which point 10646:2003 will be 17 years
       old and probably will have been replaced at least one more time).

   (2) Wide_Wide_Characters.Handling returns values based on 10646:2003 for Ada
       2012, updated to use some newer standard down the road. Updating to use
       some newer standard will be run-time incompatible - a few characters that
       are letters in 2003 are not letters in 2011.

       (2a) Do the above, but indicate to users of the package that the results
            may change in the future as character sets evolve.

   (3) Wide_Wide_Characters.Handling returns values based on 10646:2011 forever.
       Also very compatible, but will also get out of date.

   (4) Wide_Wide_Characters.Handling returns values based on 10646:2003 for Ada
       2012, updated to use some newer standard down the road. Similar to (2)
       above.

       (4a) Do the above, and also something similar to (2a).

   (5) Wide_Wide_Characters.Handling returns values based on an
       implementation-defined character set standard. Lets Robert do whatever he
       wants. :-)

We have to make *some* choice of these options: users need to know what they can
count on, we need to know how far ACATS and implementer internal tests can go,
etc. Ignoring the question results in (1) or (5), depending on who's doing the
interpreting.

My personal preference is (4a), followed by (2a). But I think we need some
statement in the Standard so down the road we do not feel compelled to keep
exact run-time compatibility as we do for Ada.Characters.Handling. Else Ada will
be stuck sooner or later with an obsolete character set standard.

I agree with you that it's too late now to adopt the 10646:2011 identifier
recommendations, but that is a very separate issue from the one of run-time
character classifications. I'm primarily interested in the latter now.

****************************************************************

From: Robert Dewar
Sent: Thursday, October 13, 2011  9:03 PM

> Let me say again, I am *not* talking in any way about identifiers or
> their syntax. They have absolutely nothing to do with the package
> Ada.Wide_Wide_Characters.Handling.

OK, got it, was confused

>> Furthermore, we did a much bigger incompatible change with 7 to 8-bit
>> characters and it caused very little trouble.
>
> I don't see how this has anything whatsoever to do with the case in point.

It was a case where we made a big change between versions of the standard.

>     (4) Wide_Wide_Characters.Handling returns values based on
> 10646:2003 for Ada 2012, updated to use some newer standard down the
> road. Similar to (2) above.
>
>         (4a) Do the above, and also something similar to (2a).

This (4a) is the one I would choose

> My personal preference is (4a), followed by (2a). But I think we need
> some statement in the Standard so down the road we do not feel
> compelled to keep exact run-time compatibility as we do for
> Ada.Characters.Handling. Else Ada will be stuck sooner or later with an
> obsolete character set standard.

Well I chose 4a before reading it was your first choice.

> I agree with you that it's too late now to adopt the 10646:2011
> identifier recommendations, but that is a very separate issue from the
> one of run-time character classifications. I'm primarily interested in the
> latter now.

So it looks like 4a might be viable as a consensus decision here?

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 13, 2011  9:19 PM

...
> >     (4) Wide_Wide_Characters.Handling returns values based on
> > 10646:2003 for Ada 2012, updated to use some newer standard down the
> > road. Similar to (2) above.
> >
> >         (4a) Do the above, and also something similar to (2a).
>
> This (4a) is the one I would choose

Sorry, I botched this item, I put the wrong year on the Standard. As written,
this is identical to (2) and (2a). I meant (4) to be the one that uses
10646:2011, (2) is the one that uses 10646:2003. I suspect that you meant (2a),
but I'd like a clarification from you.

...
> So it looks like 4a might be viable as a consensus decision here?

Except that I screwed up the choices. Please consider (4) as using 10646:2011,
and vote again.

****************************************************************

From: Robert Dewar
Sent: Friday, October 14, 2011  9:28 AM

>> So it looks like 4a might be viable as a consensus decision here?
>
> Except that I screwed up the choices. Please consider (4) as using
> 10646:2011, and vote again.

Now I am confused, can you send a new email with the newly updated choices
clear, so I am not trying to create a virtual result from synchronizing old
emails?

****************************************************************

From: Randy Brukardt
Sent: Friday, October 14, 2011  3:16 PM

Sorry about the confusion. I created (4) and (4a) with cut-and-paste and
insufficiently updated them. Here is the complete list:

What is the behavior of functions in Wide_Wide_Characters.Handling like
Is_Letter and Is_Upper when they are passed a character with a code position
corresponding to a new character defined in 10646:2011 (or some later version):

   (1) Wide_Wide_Characters.Handling returns values based on 10646:2003 forever.
       Very compatible, but also very out of date in the future (Ada 2012 is
       expected to last until 2020, at which point 10646:2003 will be 17 years
       old and probably will have been replaced at least one more time).

   (2) Wide_Wide_Characters.Handling returns values based on 10646:2003 for Ada
       2012, updated to use some newer standard down the road. Updating to use
       some newer standard will be run-time incompatible - a few characters that
       are letters in 2003 are not letters in 2011 (but these are unlikely
       corner cases, not the commonly used letters) and similarly for other
       classifications.

       (2a) Do the above, but indicate to users of the package that the results
            may change in the future as character sets evolve.

   (3) Wide_Wide_Characters.Handling returns values based on 10646:2011 forever.
       Also very compatible, but will also get out of date.

   (4) Wide_Wide_Characters.Handling returns values based on 10646:2011 for Ada
       2012, and will be updated to use newer standards down the road. Similar to
       (2) above, but using the 2011 character standard now. Future changes
       probably would be run-time incompatible, but most likely in unlikely
       corner cases.

       (4a) Do the above, and also something similar to (2a).

   (5) Wide_Wide_Characters.Handling returns values based on an
       implementation-defined character set standard. Lets Robert do whatever he
       wants. :-)

We have to make *some* choice of these options: users need to know what they can
count on, we need to know how far ACATS and implementer internal tests can go,
etc. Ignoring the question results in (1) or (5), depending on who's doing the
interpreting.

My personal preference is (4a) [because I can't think of any good reason not to
use the "current" classifications here - it is explicitly not necessarily the
same as used for identifiers], followed by (2a). But I think we need some
statement in the Standard so down the road we do not feel compelled to keep
exact run-time compatibility as we do for Ada.Characters.Handling. Else Ada will
be stuck sooner or later with an obsolete character set standard.
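
[Editor's note: The run-time effect discussed above, where a code position that
is unassigned under one character set standard becomes a letter under a later
one, can be illustrated outside Ada. The sketch below is only an analogy and
makes several assumptions: it uses Python's standard unicodedata module, whose
frozen Unicode 3.2.0 database stands in for "the older standard"; is_letter is
a hypothetical helper that approximates a letter test by general category and
is not the definition of Is_Letter in Ada.Wide_Wide_Characters.Handling; and
U+1E9E is chosen only because it was added to Unicode after version 3.2.

```python
import unicodedata

# U+1E9E LATIN CAPITAL LETTER SHARP S: unassigned in Unicode 3.2,
# assigned as an uppercase letter in later versions.
ch = "\u1e9e"

old_db = unicodedata.ucd_3_2_0  # frozen Unicode 3.2.0 database shipped with Python
new_db = unicodedata            # database version this Python was built with


def is_letter(db, c):
    """Hypothetical helper: True if c has a letter general category (L*) in db."""
    return db.category(c).startswith("L")


old_result = is_letter(old_db, ch)
new_result = is_letter(new_db, ch)

# Same code position, same test, different answer depending on the database.
print("Unicode", old_db.unidata_version, "->", old_db.category(ch), old_result)
print("Unicode", new_db.unidata_version, "->", new_db.category(ch), new_result)
```

A program that must give the same answer regardless of which standard the
run-time library tracks would have to carry its own classification table,
which is the situation options (2a) and (4a) are meant to warn about.]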

****************************************************************

From: Robert Dewar
Sent: Friday, October 14, 2011  3:21 PM

Right, Robert's vote is for 2a, which is basically status quo with an indication
that updates may occur based on subsequent versions of the standard.

****************************************************************

From: Jean-Pierre Rosen
Sent: Saturday, October 15, 2011  12:39 AM

Since we restart from scratch, let me cast again my vote for 4a, precisely
because I don't fully understand the issue.

Character set issues are very complex. I assume that the people at 10646 are
very aware of compatibility issues, and that every non-upward compatible change
is the result of a carefully evaluated trade-off. At some point, you have to
trust the knowledge of other people. What would you say, if someone from the
10646 committee came to our meeting and told us that we got the accessibility
rules completely wrong ;-)?

****************************************************************

From: Robert Dewar
Sent: Saturday, October 15, 2011  7:22 AM

> Since we restart from scratch, let me cast again my vote for 4a,
> precisely because I don't fully understand the issue.

The reason I prefer 2a to 4a is that it will reflect reality. There is no way
that anyone at AdaCore will do other than 2a in the short term in the absence of
any customer demand. When there is customer demand to update to a new version,
we can do so at that point, which would then be totally consistent with the idea
of 2a.

To me it is just too late to be making the immediate change to 4a, when no one
has investigated the implications or the impact of any upward incompatibility.

BTW, the assumption that the appropriate standards committee has properly
considered the compatibility issues is dubious. In practice standards committees
often get more concerned with doing things right, than maintaining compatibility
(*)

(*) look at what we did in Ada with limited returns for Ada 2005, which severely
impacted the ability of many to move from Ada 95 to Ada 2005.

****************************************************************

From: Robert Dewar
Sent: Saturday, October 15, 2011  7:24 AM

By the way, I really think Randy's suggestion here of specifically allowing for
updating the standard used is an excellent one, MUCH better than just closing
our eyes and mandating the status quo till the next version.

****************************************************************

From: Tucker Taft
Sent: Saturday, October 15, 2011  10:39 AM

I'll go for 4a as well.  I understand that GNAT has already implemented
something else, but Ada 2012 doesn't even exist yet, so it is premature to have
its content depend on what has or has not already been implemented by particular
implementations.

For those looking at the Ada 2012 standard when it comes out in 2012 (or 2013?),
it makes no sense to me to tie it to an already out-of-date standard.

As usual, implementations will do what they do based on market demands.  If no
one cares about these details anyway, we might as well get the words of the
standard right, even if the reality is not going to match the words on day one.

****************************************************************

From: Robert Dewar
Sent: Saturday, October 15, 2011  3:42 PM

Well my argument was about what will or will not be implemented in GNAT, not
what already has been implemented in GNAT. From one point of view I don't care
too much between 2a and 4a, since it won't make any difference to implementation
plans in practice.

I still don't like that no one on the ARG has carefully examined the two
versions of the standard to understand what level of compatibility problems
arise. It seems unwise to just adopt a standard without carefully examining it.
If we want to adopt this new standard right away, we should have at least one
person carefully examine the two standards and write up a document describing
the differences from a programmer point of view.

Aren't we sort of obligated to be very careful when it comes to introducing
non-upwards compatible changes, and at least document these changes carefully?

****************************************************************

From: Randy Brukardt
Sent: Saturday, October 15, 2011  6243 PM

...
> I still don't like that no one on the ARG has carefully examined the
> two versions of the standard to understand what level of compatibility
> problems arise. It seems unwise to just adopt a standard without
> carefully examining it.

I'm not sure what level of care you are actually requiring; I spent more than an
hour reading it before I sent my original messages in this thread. And I
summarized the changes that I saw in the original messages. I didn't do a
character-by-character comparison, but that would be rather silly (I have to
presume that the summaries of changes are accurate). Note that this character
set standard is freely available, anyone can download it and read it as I did.

> ... If we
> want to adopt this new standard right away, we should have at least
> one person carefully examine the two standards and write up a document
> describing the differences from a programmer point of view.

I agree that we need a lot of care before adopting new identifier syntax, but
*NO ONE* has suggested that at this point. (That will be an open issue for
discussion at a near future ARG meeting - probably not the next one because I
doubt I'll have time to write it up.)

But for other things, this would be silly, because it is the same list that
happens for any character change in Ada: some characters change categories. Some
characters that were not previously graphic characters become them, and so on.
We already have all of those changes documented as incompatibilities in Ada 2005
(because it did change character set standards). This would cause the same sorts
of changes (only in obscure, rarely used characters). The Unicode change
documents describe the changed characters in detail -- are you saying that we
have to copy all of that and send it to you so that you can see those exact
details? It's all online for anyone that cares to look.

> Aren't we sort of obligated to be very careful when it comes to
> introducing non-upwards compatible changes, and at least document
> these changes carefully?

Currently, we're talking specifically about package
Wide_Wide_Characters.Handling, which is new in Ada 2012. There is no possibility
of creating a "non-upwards compatible" change in a new package! I realize that
GNAT has an implementation-defined equivalent, but no one is going to
accidentally change from the GNAT-only package to a language-defined one without
intending it.

If we adopt the new standard globally, there will be more changes, the main one
being that more characters would be considered graphic characters by 'Image
(changing their image from the "Hex00xxxx" form), that's technically
incompatible but hardly very interesting. (Especially as we now allow 'Value to
take the hex form for all characters, so there would be no incompatibility in
'Value unless an implementer wanted to introduce it.) Identifiers would allow
more characters (as some of the new characters are "letters") - which has no
compatibility issues, and a handful of previously allowed characters would be
banned (which would be incompatible, but again these are rarely used characters,
and probably should never have been allowed in the first place). I'd rather not
make any identifier changes, but that would be hard to do using the new standard
(which changes classifications of a few characters).

There would be more changes if we adopted the identifier recommendations, but as
I've said there is no way I would recommend that for Ada 2012 -- it's just too
much change at too late a date. The other changes are pretty minimal, however.

Since Ada 2012 is not going to be frozen until after the upcoming ARG meeting,
we can discuss this during the next meeting. And it seems pretty obvious that we
ought to.

****************************************************************

From: Robert Dewar
Sent: Saturday, October 15, 2011  6:35 PM

> Currently, we're talking specifically about package
> Wide_Wide_Characters.Handling, which is new in Ada 2012. There is no
> possibility of creating a "non-upwards compatible" change in a new package!
> I realize that GNAT has an implementation-defined equivalent, but no
> one is going to accidentally change from the GNAT-only package to a
> language-defined one without intending it.

OK, that's fair enough

> Since Ada 2012 is not going to be frozen until after the upcoming ARG
> meeting, we can discuss this during the next meeting. And it seems
> pretty obvious that we ought to.

Also fair enough, I don't actually think it makes too much difference if we
choose 2a or 4a, I don't think it will make any difference to anyone
(programmers, or implementors, or reviewers or anyone else :-)) So if it really
makes people have a better warm feeling to have the standard say 4a, then no
problem as far as I am concerned.

****************************************************************

