Version 1.5 of ai12s/ai12-0088-1.txt

Unformatted version of ai12s/ai12-0088-1.txt version 1.5
Other versions for file ai12s/ai12-0088-1.txt

!standard A.4.11(54/3)          13-11-12 AI05-0088-1/02
!standard A.4.11(55/3)
!class binding interpretation 13-10-31
!status work item 13-10-31
!status received 13-09-26
!priority Low
!difficulty Easy
!qualifier Omission
!subject UTF_Encoding.Conversions and overlong characters on input
!summary
An overlong encoding in the input to Encode and Convert should raise Encoding_Error.
!question
A.4.11(50/3) says that "For UTF-8, no overlong encoding is returned". It does not say what happens if there is a character with overlong encoding on input.
Overlong encodings used to be tolerated by 10646, but are now considered invalid. Therefore, an overlong input should raise Encoding_Error per A.4.11(54/3). Should this be enforced? (Yes.)
!recommendation
(See Summary.)
!wording
Modify A.4.11(54/3):
* By a {Convert or} Decode function when a UTF encoded string contains an
invalid encoding sequence{@redundant[, including an overlong encoding for UTF-8]}.
Modify A.4.11(55/3):
* By a {Convert or} Decode function when the expected encoding is UTF-16BE
or UTF-16LE and the input string has an odd length.
!discussion
This is not quite compatible, in that an overlong encoding on input used to be tolerated, but it had to be converted to a proper encoding on output. The only known Ada 2012 implementation at this writing fails to do this conversion, so the output violates A.4.11(50/3).
Note that under no circumstances is a UTF-8 to UTF-8 conversion a no-op even with the original Ada 2012 rules: there must be a check for an overlong encoding, with those encodings removed. It's probably easier to raise Encoding_Error than fixing the overlong encodings.
Also note that A.4.11(54/3) and A.4.11(55/3) are written to apply only to Decode functions, but should also apply to Convert functions (both take encoded input).
!ASIS
No changes needed.
!ACATS test
An ACATS C-Test is needed to check that the handling of overlong encodings is correct.
!appendix

From: Jean-Pierre Rosen
Sent: Thursday, September 26, 2013  10:46 AM

An issue that came up while writing the test for A.4.11:

A.4.11(50/3) says that "For UTF-8, no overlong encoding is returned". It
does not say what happens if there is a character with overlong encoding
on input.

If I believe Wikipedia, overlong encodings used to be tolerated by
10646, but are now considered invalid. Therefore, an overlong input
should raise Encoding_Error per A.4.11(54/3) (Note also that
A.4.11(54/3) and A.4.11(55/3) apply only to Encode functions, but should
also apply to Convert functions).

As currently written, the test for conversions passes if:
- an overlong character on input raises Encoding_Error, or
- an overlong character on input is accepted and produces the valid
corresponding encoding on output.

Note that this means (in both cases) that converting from UTF-8 to UTF-8
is /not/ a no-op; Gnat fails on that (it does not raise Encoding_Error
and produces an overlong character on output).

So:
1) Is my interpretation correct? Or should the test require the raising
of Encoding_Error?

2) Should I turn this into an AI? (I fear I know the answer to this
question)

****************************************************************

From: Jeff Cousins
Sent: Monday, September 30, 2013  8:04 AM

I've had a word with the UK NB expert on character sets and he agreed with
Wikipedia.

****************************************************************

From: Jean-Pierre Rosen
Sent: Monday, November 4, 2013  6:45 AM

Some times ago, I raised a question about overlong encoding in convert
functions. Jeff confirmed that it was now considered invalid, therefore my test
should be changed to require the raising of Encoding_Error (change line 203 from
Report.Comment to Report.Failed).

Or do you want some discussion first (i.e. that I put that into an AI)?

There is also the issue that  that A.4.11(54/3) and A.4.11(55/3) apply only to
Encode functions, but should also apply to Convert functions. Same AI, or a
different one (or swept under the carpet)?

****************************************************************

From: Randy Brukardt
Sent: Monday, November 4, 2013  12:12 PM

You apparently didn't read the preliminary agenda I sent out:

AI12-0088-1/01   2013-10-31 --  UTF_Encoding.Conversions and overlong
characters on input

You may want to check that everything is included (I believe so, but I do make a
mistake every year or so. :-)

****************************************************************

From: Jeff Cousins
Sent: Monday, November 4, 2013  12:15 PM

Randy has suggested a Binding Interpretation AI which seems right to me.
But isn't it Decode and Convert rather than Encode, i.e. something with a UTF-8
input rather than output?

****************************************************************

From: Randy Brukardt
Sent: Monday, November 4, 2013  12:39 PM

Right. The paragraphs J-P referenced apply to *Decode*, not Encode, but they do
need to apply to Convert as well.

****************************************************************

From: Jean-Pierre Rosen
Sent: Tuesday, November 5, 2013   2:35 AM

> You apparently didn't read the preliminary agenda I sent out:
>
> AI12-0088-1/01   2013-10-31 --  UTF_Encoding.Conversions and overlong
> characters on input

Hmmm yes, effect of local overload + lack of motivation since I'm not going...

> You may want to check that everything is included (I believe so, but I
> do make a mistake every year or so. :-)

Just a nit. !Question says:

Overlong encodings used to be tolerated by 10646, but are now considered
invalid. Therefore, an overlong input should raise Encoding_Error per
A.4.11(54/3). Should this be changed? (Yes.)

As written, "should this be changed" seems to apply to "an overlong input should
raise Encoding_Error"; suggestion: "should this be enforced?"

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 5, 2013  8:26 AM

> Overlong encodings used to be tolerated by 10646, but are now
> considered invalid. Therefore, an overlong input should raise
> Encoding_Error per A.4.11(54/3). Should this be changed? (Yes.)

Seems to me that since Ada implementations are free to use any version of 10646
they choose, that it is up to them to make sure that the correct rules are
chosen. If the test is changed, it should be conditioned by a test of the
version of 10646 in use, that's why we have the query!

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 5, 2013  9:35 AM

I am dubious in general about trying to keep up with 10646 in the ACATS tests.
In the abstract it makes sense, but in practice I think it is 100% useless.
Certainly GNAT has no intention of moving to a more recent 10646 standard (no
one is using this stuff significantly, and so there is just no customer demand
basis for messing around. We give the right version identifier for the version
of 10646 that we use, and if any tests expect changes past that version, we will
just have to waste time protesting them (or not, we might just remove these
tests).

****************************************************************

From: Jean-Pierre Rosen
Sent: Tuesday, November 5, 2013  10:03 AM

> I am dubious in general about trying to keep up with 10646 in the
> ACATS tests. In the abstract it makes sense, but in practice I think
> it is 100% useless. Certainly GNAT has no intention of moving to a
> more recent 10646 standard (no one is using this stuff significantly,
> and so there is just no customer demand basis for messing around. We
> give the right version identifier for the version of 10646 that we
> use, and if any tests expect changes past that version, we will just
> have to waste time protesting them (or not, we might just remove these
> tests).

You comment seems to be far away from the initial issue, which is to decide
whether overlong encoding is "invalid" or not. If it is, Encoding_Error is
raised by Convert. If it is not (which you might argue), then you have to remove
them when converting from UTF8 to UTF8, since the RM is very clear that no
overlong encoding is returned, ever.

This has nothing to do with unstability of character sets...

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 5, 2013  10:15 AM

> This has nothing to do with unstability of character sets...

OK, a previous message said that overlong encodings being considered invalid was
an artifact of a later version of 10646, if that is incorrect, I withdraw my
comment.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, November 5, 2013  12:05 PM

That's not the right question. The question is whether there is any need for
portability issues related to overlong encodings. With the character set choice,
there certainly are good reasons for not requiring constant change. But the
issue of overlong encodings isn't going to change from standard to standard -
they're always going to exist. We might as well require exactly one semantics
related to them.

In any case, the RM prohibits *returning* overlong encodings, so Convert (UTF8
-> UTF8) is never a no-op (at the least, the implementation has to convert any
overlong encodings). It would seem easier to *detect* overlong encodings on
input rather than converting them (J-P reports that GNAT does neither, which is
certainly wrong), but YMMV.

I think we simply need to choose either allowing overlong encodings on input
(thus requiring conversion) or requiring them to be detected and raising an
exception. Either is fine, allowing variation for no reason is not fine.

****************************************************************

From: Jeff Cousins
Sent: Wednesday, November 6, 2013  3:33 AM

Overlong UTF-8 encodings became invalid at Unicode v3.1, so if GNAT is v4.0 for
the purposes of classifying characters then it really ought to reject overlong
encodings for consistency.

On the other hand, it might be wise to add a note to A.4.11, similar to that in
A.3.5 63/3,  pointing out that what is regarded as valid is dependent on Unicode
version.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, November 12, 2013  7:40 PM

This really is about what *our* standard considers invalid (or not). There is no
reason that I can see for allowing variation on this point, no matter what
Unicode version is used. After all, we don't allow our routines to *produce*
overlong encodings, so there doesn't seem to be much need to allow them on
input, either. But if we *do* allow them, then they should be allowed regardless
of the Unicode version.

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 12, 2013  9:45 PM

The answer to this is simple: but it depends on the answer to this
question: Do overlong encodings appear in practice? If we don't know that the
answer to this is NO, then we should allow overlong encodings IMO.

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 12, 2013  9:49 PM

I definitely vote for

a) never generating overlong encodings on output. This seems right, since no
one can possibly reasonably be insisting on seeing overlong encodings, and you
want the output of these routines to be deterministic in any case.

b) allowing overlong encodings on input. This seems right, since there may
well be data sets that have overlong encodings, and it will be a pain if they
are rejected.

The only purpose of forbidding overlong encodings is if you specifically want
to be able to verify that there are no overlong encodings. If we need this
functionality, it should be separate from basic input.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, November 12, 2013  10:35 PM

I think this is how the language works today (although J-P reports GNAT gets
it wrong for UTF-8 => UTF-8 convert). Perhaps we should just forget the any
change here?

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013  9:20 AM

> I think this is how the language works today (although J-P reports
> GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should just
> forget the any change here?

Interesting, I guess GNAT treats it as a noop, which would seem wrong, but we
don't seem to have a filed bug report as far as I can see.

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, November 13, 2013  9:32 AM

> I think this is how the language works today (although J-P reports
> GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should just
> forget the any change here?

Forget what? There is an ACATS test, and it is necessary to know whether raising
Encoding_Error should be allowed/required/forbidden.

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013  9:41 AM

But for sure it is wrong if GNAT does nothing on a
UTF-8 to UTF-8 conversion. With the current approach of allowing overlong
encodings on input, but not on output, it should be doing that conversion. And
if we (unwisely IMO) change to disallow overlong encodings on input, then an
exception should be raised. So for sure there is a GNAT bug here (not exactly a
critical one :-))

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, November 13, 2013  9:43 AM

>> I think this is how the language works today (although J-P reports
>> GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should
>> just forget the any change here?
>
> Interesting, I guess GNAT treats it as a noop, which would seem wrong,
> but we don't seem to have a filed bug report as far as I can see.

Right. I discovered the issue, then checked the RM. Since the situation was
unclear from a language point of view, I refered the issue to the ARG. But I
sent the test program to Ed.

****************************************************************

From: Ed Schonberg
Sent: Wednesday, November 13, 2013  9:49 PM

As I think I reported to JPR, this test (CXA4b01)  is handed oorrectly in the
development version of GNAT.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, November 13, 2013  11:29 AM

> Forget what? There is an ACATS test, and it is necessary to know
> whether raising Encoding_Error should be allowed/required/forbidden.

Forget changing the text of the Standard. (I realize that you identified another
hole in the wording which we will fix, I'm only talking about overlong
encodings.) There is nothing in the current text that says anything about
Encoding_Error being raised for overlong encodings, ergo, it does not happen.
Perhaps we should add an AARM note to reinforce that overlong encodings are not
invalid, but no text changes.

I've come to agree with Robert's point, it being one of capability. If we allow
overlong encodings, and they never happen in practice, then the rule has no
effect on users (there is a bit of work for implementers). If we *disallow*
overlong encodings, and they do happen in practice, the user is stuck. They
can't process the text in question, and there is no workaround using the
language-defined libraries. They'll have to use some third-party library or
roll-their-own. There doesn't seem to be any reason to put users into such a
position, even if it is rare (and not likely using the latest character set
definition). Thus I think we should allow such encodings on input - and I see no
advantage to allowing variation here (the idea of an overlong encoding isn't
going to ever change in future character set standards, as it is a basic
consequence of the UTF-8 encoding).

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013  11:38 AM

If it seems a valuable capability to be able to check for overlong encodings, we
can always add that as a new feature in the future.

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013  11:41 AM

> I've come to agree with Robert's point, it being one of capability. If
> we allow overlong encodings, and they never happen in practice, then
> the rule has no effect on users (there is a bit of work for implementers).

Regarding bit of work for implementors:

Actually I don't think so, when you identify a particular encoding, you extract
the corresponding code, and it takes an extra test to ensure that this code is
in the expected range for the encoding sequence. So the natural thing is to
allow overlong encodings and it takes a bit of work to check for them :-)

****************************************************************

From: Jeff Cousins
Sent: Thursday, November 14, 2013  4:25 PM

I think overlong encodings are used in the C world where zeros of data are
encoded in overlong form so that if a zero appears in the byte stream it is
taken as a string terminator.

Does anyone know anyone who actually uses the UTF routines to ask what they'd
prefer?

****************************************************************

Questions? Ask the ACAA Technical Agent