!standard A.4.11(54/3) 13-12-06 AI12-0088-1/03
!standard A.4.11(55/3)
!class binding interpretation 13-10-31
!status Corrigendum 2014 13-12-06
!status ARG Approved 5-0-5 13-11-15
!status work item 13-10-31
!status received 13-09-26
!priority Low
!difficulty Easy
!qualifier Omission
!subject UTF_Encoding.Conversions and overlong characters on input

!summary

An overlong encoding in the input to Decode and Convert should not raise Encoding_Error.

!question

A.4.11(50/3) says that "For UTF-8, no overlong encoding is returned". It does not say what happens if there is a character with overlong encoding on input.

Overlong encodings used to be tolerated by 10646, but are now considered invalid. Therefore, an overlong input should raise Encoding_Error per A.4.11(54/3). Should this be enforced? (No.)

!recommendation

(See Summary.)

!wording

Modify A.4.11(54/3):

   * By a {Convert or} Decode function when a UTF encoded string contains an invalid encoding sequence.

      AARM To Be Honest: An overlong encoding is not invalid for the purposes of this check, and this does not depend on the character set version in use. Some recent character set standards declare overlong encodings invalid; it would be unfriendly to users for Convert or Decode to do this.

Modify A.4.11(55/3):

   * By a {Convert or} Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.

!discussion

You cannot optimize away a conversion from UTF-8 to UTF-8 because there must be a check for an overlong encoding, in order to remove those encodings from the result. The only known Ada 2012 implementation at this writing fails to do this conversion, so the output violates A.4.11(50/3).

We considered declaring that overlong encodings are invalid (so far as this package is concerned), but that's unfriendly to users.
If an overlong encoding appears in some text that they are processing, an Encoding_Error exception would prevent getting any results, even though there is a well-defined meaning to the text. It's better to accept the overlong encoding and convert it to the standard form.

One can detect overlong encodings by doing a Convert from UTF-8 to UTF-8; if the length changes, the input contains an overlong encoding.

Note that A.4.11(54/3) and A.4.11(55/3) are written to apply only to Decode functions, but should also apply to Convert functions (both take encoded input).

!corrigendum A.4.11(54/3)

@drepl
@xbullet
@dby
@xbullet

!corrigendum A.4.11(55/3)

@drepl
@xbullet
@dby
@xbullet

!ASIS

No changes needed.

!ACATS test

An ACATS C-Test is needed to check that the handling of overlong encodings is correct.

!appendix

From: Jean-Pierre Rosen
Sent: Thursday, September 26, 2013 10:46 AM

An issue that came up while writing the test for A.4.11:

A.4.11(50/3) says that "For UTF-8, no overlong encoding is returned". It does not say what happens if there is a character with overlong encoding on input.

If I believe Wikipedia, overlong encodings used to be tolerated by 10646, but are now considered invalid. Therefore, an overlong input should raise Encoding_Error per A.4.11(54/3) (Note also that A.4.11(54/3) and A.4.11(55/3) apply only to Encode functions, but should also apply to Convert functions).

As currently written, the test for conversions passes if:
- an overlong character on input raises Encoding_Error, or
- an overlong character on input is accepted and produces the valid corresponding encoding on output.

Note that this means (in both cases) that converting from UTF-8 to UTF-8 is /not/ a no-op; Gnat fails on that (it does not raise Encoding_Error and produces an overlong character on output).

So:
1) Is my interpretation correct? Or should the test require the raising of Encoding_Error?
2) Should I turn this into an AI?
(I fear I know the answer to this question)

****************************************************************

From: Jeff Cousins
Sent: Monday, September 30, 2013 8:04 AM

I've had a word with the UK NB expert on character sets and he agreed with Wikipedia.

****************************************************************

From: Jean-Pierre Rosen
Sent: Monday, November 4, 2013 6:45 AM

Some time ago, I raised a question about overlong encoding in convert functions. Jeff confirmed that it was now considered invalid, therefore my test should be changed to require the raising of Encoding_Error (change line 203 from Report.Comment to Report.Failed).

Or do you want some discussion first (i.e. that I put that into an AI)?

There is also the issue that A.4.11(54/3) and A.4.11(55/3) apply only to Encode functions, but should also apply to Convert functions. Same AI, or a different one (or swept under the carpet)?

****************************************************************

From: Randy Brukardt
Sent: Monday, November 4, 2013 12:12 PM

You apparently didn't read the preliminary agenda I sent out:

AI12-0088-1/01 2013-10-31 -- UTF_Encoding.Conversions and overlong characters on input

You may want to check that everything is included (I believe so, but I do make a mistake every year or so. :-)

****************************************************************

From: Jeff Cousins
Sent: Monday, November 4, 2013 12:15 PM

Randy has suggested a Binding Interpretation AI which seems right to me. But isn't it Decode and Convert rather than Encode, i.e. something with a UTF-8 input rather than output?

****************************************************************

From: Randy Brukardt
Sent: Monday, November 4, 2013 12:39 PM

Right. The paragraphs J-P referenced apply to *Decode*, not Encode, but they do need to apply to Convert as well.
****************************************************************

From: Jean-Pierre Rosen
Sent: Tuesday, November 5, 2013 2:35 AM

> You apparently didn't read the preliminary agenda I sent out:
>
> AI12-0088-1/01 2013-10-31 -- UTF_Encoding.Conversions and overlong
> characters on input

Hmmm yes, effect of local overload + lack of motivation since I'm not going...

> You may want to check that everything is included (I believe so, but I
> do make a mistake every year or so. :-)

Just a nit. !Question says:

   Overlong encodings used to be tolerated by 10646, but are now considered invalid. Therefore, an overlong input should raise Encoding_Error per A.4.11(54/3). Should this be changed? (Yes.)

As written, "should this be changed" seems to apply to "an overlong input should raise Encoding_Error"; suggestion: "should this be enforced?"

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 5, 2013 8:26 AM

> Overlong encodings used to be tolerated by 10646, but are now
> considered invalid. Therefore, an overlong input should raise
> Encoding_Error per A.4.11(54/3). Should this be changed? (Yes.)

Seems to me that since Ada implementations are free to use any version of 10646 they choose, that it is up to them to make sure that the correct rules are chosen. If the test is changed, it should be conditioned by a test of the version of 10646 in use, that's why we have the query!

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 5, 2013 9:35 AM

I am dubious in general about trying to keep up with 10646 in the ACATS tests. In the abstract it makes sense, but in practice I think it is 100% useless. Certainly GNAT has no intention of moving to a more recent 10646 standard (no one is using this stuff significantly, and so there is just no customer demand basis for messing around.
We give the right version identifier for the version of 10646 that we use, and if any tests expect changes past that version, we will just have to waste time protesting them (or not, we might just remove these tests).

****************************************************************

From: Jean-Pierre Rosen
Sent: Tuesday, November 5, 2013 10:03 AM

> I am dubious in general about trying to keep up with 10646 in the
> ACATS tests. In the abstract it makes sense, but in practice I think
> it is 100% useless. Certainly GNAT has no intention of moving to a
> more recent 10646 standard (no one is using this stuff significantly,
> and so there is just no customer demand basis for messing around. We
> give the right version identifier for the version of 10646 that we
> use, and if any tests expect changes past that version, we will just
> have to waste time protesting them (or not, we might just remove these
> tests).

Your comment seems to be far away from the initial issue, which is to decide whether overlong encoding is "invalid" or not. If it is, Encoding_Error is raised by Convert. If it is not (which you might argue), then you have to remove them when converting from UTF8 to UTF8, since the RM is very clear that no overlong encoding is returned, ever.

This has nothing to do with instability of character sets...

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 5, 2013 10:15 AM

> This has nothing to do with instability of character sets...

OK, a previous message said that overlong encodings being considered invalid was an artifact of a later version of 10646, if that is incorrect, I withdraw my comment.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, November 5, 2013 12:05 PM

That's not the right question. The question is whether there is any need for portability issues related to overlong encodings.
With the character set choice, there certainly are good reasons for not requiring constant change. But the issue of overlong encodings isn't going to change from standard to standard - they're always going to exist. We might as well require exactly one semantics related to them.

In any case, the RM prohibits *returning* overlong encodings, so Convert (UTF8 -> UTF8) is never a no-op (at the least, the implementation has to convert any overlong encodings). It would seem easier to *detect* overlong encodings on input rather than converting them (J-P reports that GNAT does neither, which is certainly wrong), but YMMV. I think we simply need to choose either allowing overlong encodings on input (thus requiring conversion) or requiring them to be detected and raising an exception. Either is fine, allowing variation for no reason is not fine.

****************************************************************

From: Jeff Cousins
Sent: Wednesday, November 6, 2013 3:33 AM

Overlong UTF-8 encodings became invalid at Unicode v3.1, so if GNAT is v4.0 for the purposes of classifying characters then it really ought to reject overlong encodings for consistency. On the other hand, it might be wise to add a note to A.4.11, similar to that in A.3.5 63/3, pointing out that what is regarded as valid is dependent on Unicode version.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, November 12, 2013 7:40 PM

This really is about what *our* standard considers invalid (or not). There is no reason that I can see for allowing variation on this point, no matter what Unicode version is used. After all, we don't allow our routines to *produce* overlong encodings, so there doesn't seem to be much need to allow them on input, either. But if we *do* allow them, then they should be allowed regardless of the Unicode version.
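
For illustration only, the point made in this thread that a UTF-8 to UTF-8 conversion cannot be a no-op can be sketched outside Ada. The Python below is a minimal model under stated assumptions, not the Ada library and not any compiler's implementation: the names decode_lenient, convert_utf8_to_utf8 and contains_overlong are hypothetical, and BOM handling is ignored. It accepts overlong forms on input, always emits the shortest form on output, and uses the length comparison described in the !discussion section to detect overlong input.

```python
def decode_lenient(data: bytes) -> list:
    """Decode UTF-8 bytes to code points, accepting overlong forms.

    Hypothetical helper: it still rejects genuinely malformed input
    (bad lead or continuation bytes, truncation, out-of-range values).
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            value, extra = lead, 0
        elif 0xC0 <= lead <= 0xDF:
            value, extra = lead & 0x1F, 1
        elif 0xE0 <= lead <= 0xEF:
            value, extra = lead & 0x0F, 2
        elif 0xF0 <= lead <= 0xF7:
            value, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in data[i + 1 : i + 1 + extra]:
            if byte & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte after offset {i}")
            value = (value << 6) | (byte & 0x3F)
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"code point out of range at offset {i}")
        # Note: no check that 'value' needed this many bytes, so an
        # overlong form is accepted and yields the intended code point.
        code_points.append(value)
        i += 1 + extra
    return code_points


def convert_utf8_to_utf8(data: bytes) -> bytes:
    """Re-encode in shortest form; overlong input sequences shrink."""
    return "".join(map(chr, decode_lenient(data))).encode("utf-8")


def contains_overlong(data: bytes) -> bool:
    """Length test: the output is shorter exactly when input was overlong."""
    return len(convert_utf8_to_utf8(data)) != len(data)
```

With this model, the two-byte sequence C0 AF (an overlong form of "/") converts to the single byte 2F, so the conversion changes the data and the length test reports it, while well-formed input passes through unchanged.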
****************************************************************

From: Robert Dewar
Sent: Tuesday, November 12, 2013 9:45 PM

The answer to this is simple: but it depends on the answer to this question:

Do overlong encodings appear in practice?

If we don't know that the answer to this is NO, then we should allow overlong encodings IMO.

****************************************************************

From: Robert Dewar
Sent: Tuesday, November 12, 2013 9:49 PM

I definitely vote for

a) never generating overlong encodings on output. This seems right, since no one can possibly reasonably be insisting on seeing overlong encodings, and you want the output of these routines to be deterministic in any case.

b) allowing overlong encodings on input. This seems right, since there may well be data sets that have overlong encodings, and it will be a pain if they are rejected.

The only purpose of forbidding overlong encodings is if you specifically want to be able to verify that there are no overlong encodings. If we need this functionality, it should be separate from basic input.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, November 12, 2013 10:35 PM

I think this is how the language works today (although J-P reports GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should just forget any change here?

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013 9:20 AM

> I think this is how the language works today (although J-P reports
> GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should just
> forget any change here?

Interesting, I guess GNAT treats it as a noop, which would seem wrong, but we don't seem to have a filed bug report as far as I can see.
****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, November 13, 2013 9:32 AM

> I think this is how the language works today (although J-P reports
> GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should just
> forget any change here?

Forget what? There is an ACATS test, and it is necessary to know whether raising Encoding_Error should be allowed/required/forbidden.

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013 9:41 AM

But for sure it is wrong if GNAT does nothing on a UTF-8 to UTF-8 conversion. With the current approach of allowing overlong encodings on input, but not on output, it should be doing that conversion. And if we (unwisely IMO) change to disallow overlong encodings on input, then an exception should be raised. So for sure there is a GNAT bug here (not exactly a critical one :-))

****************************************************************

From: Jean-Pierre Rosen
Sent: Wednesday, November 13, 2013 9:43 AM

>> I think this is how the language works today (although J-P reports
>> GNAT gets it wrong for UTF-8 => UTF-8 convert). Perhaps we should
>> just forget any change here?
>
> Interesting, I guess GNAT treats it as a noop, which would seem wrong,
> but we don't seem to have a filed bug report as far as I can see.

Right. I discovered the issue, then checked the RM. Since the situation was unclear from a language point of view, I referred the issue to the ARG. But I sent the test program to Ed.

****************************************************************

From: Ed Schonberg
Sent: Wednesday, November 13, 2013 9:49 PM

As I think I reported to JPR, this test (CXA4b01) is handled correctly in the development version of GNAT.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, November 13, 2013 11:29 AM

> Forget what?
> There is an ACATS test, and it is necessary to know
> whether raising Encoding_Error should be allowed/required/forbidden.

Forget changing the text of the Standard. (I realize that you identified another hole in the wording which we will fix, I'm only talking about overlong encodings.)

There is nothing in the current text that says anything about Encoding_Error being raised for overlong encodings, ergo, it does not happen. Perhaps we should add an AARM note to reinforce that overlong encodings are not invalid, but no text changes.

I've come to agree with Robert's point, it being one of capability. If we allow overlong encodings, and they never happen in practice, then the rule has no effect on users (there is a bit of work for implementers). If we *disallow* overlong encodings, and they do happen in practice, the user is stuck. They can't process the text in question, and there is no workaround using the language-defined libraries. They'll have to use some third-party library or roll-their-own. There doesn't seem to be any reason to put users into such a position, even if it is rare (and not likely using the latest character set definition).

Thus I think we should allow such encodings on input - and I see no advantage to allowing variation here (the idea of an overlong encoding isn't going to ever change in future character set standards, as it is a basic consequence of the UTF-8 encoding).

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013 11:38 AM

If it seems a valuable capability to be able to check for overlong encodings, we can always add that as a new feature in the future.

****************************************************************

From: Robert Dewar
Sent: Wednesday, November 13, 2013 11:41 AM

> I've come to agree with Robert's point, it being one of capability.
> If we allow overlong encodings, and they never happen in practice, then
> the rule has no effect on users (there is a bit of work for implementers).

Regarding bit of work for implementors:

Actually I don't think so, when you identify a particular encoding, you extract the corresponding code, and it takes an extra test to ensure that this code is in the expected range for the encoding sequence. So the natural thing is to allow overlong encodings and it takes a bit of work to check for them :-)

****************************************************************

From: Jeff Cousins
Sent: Thursday, November 14, 2013 4:25 PM

I think overlong encodings are used in the C world where zeros of data are encoded in overlong form so that if a zero appears in the byte stream it is taken as a string terminator.

Does anyone know anyone who actually uses the UTF routines to ask what they'd prefer?

****************************************************************
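
The observation above that a decoder naturally accepts overlong forms, and that rejecting them takes one extra range test, can be illustrated with a short Python sketch. This is a hedged model only: decode_one, MIN_VALUE and the reject_overlong flag are hypothetical names, not part of any Ada or Python library. The example input C0 80, a two-byte overlong form of U+0000, is assumed to be the kind of embedded-zero encoding the last message describes.

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_VALUE = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(seq: bytes, reject_overlong: bool = False) -> int:
    """Decode exactly one UTF-8 sequence to its code point.

    Hypothetical helper for illustration; 'seq' must hold one complete
    sequence and nothing else.
    """
    lead = seq[0]
    if lead < 0x80:
        value, length = lead, 1
    elif 0xC0 <= lead <= 0xDF:
        value, length = lead & 0x1F, 2
    elif 0xE0 <= lead <= 0xEF:
        value, length = lead & 0x0F, 3
    elif 0xF0 <= lead <= 0xF7:
        value, length = lead & 0x07, 4
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("sequence length does not match lead byte")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    # The only difference between the lenient and the strict decoder is
    # this one additional comparison against the minimum for the length.
    if reject_overlong and value < MIN_VALUE[length]:
        raise ValueError("overlong encoding")
    return value
```

Called as decode_one(b"\xc0\x80"), the sketch returns 0; with reject_overlong=True the same input raises ValueError, which corresponds to the behaviour this AI decided not to require.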