!standard 2.2(14)                                     06-01-06  AC95-00127/01
!class Amendment 06-01-06
!status received no action 06-01-06
!status received 05-12-06
!subject Line length, wide characters, and source representation
!summary
!appendix

From: Robert Dewar
Date: Tuesday, December 6, 2005 5:26 AM

Can someone help on this ... what exactly does the RM mean by line length
(when it recommends that at least 200 be allowed)? Is a tab (HT) one
character? Is a wide character represented in three bytes one character?

I would think the answers are yes and yes ... but I was not really able to
prove it from the RM.

****************************************************************

From: Robert A. Duff
Date: Tuesday, December 6, 2005 6:10 AM

I also think the answers are yes and yes. I'm also not _sure_ I can prove
that from the RM, but it seems likely that "line length" should be
interpreted to mean number of characters, and that we're talking about
(source) characters as defined in 2.1. So HT is intended to be one
character, and a character's representation (e.g. as three bytes) is
irrelevant -- it's still one character.

What else could "line length" mean? Some notion of "number of bytes" is not
present in the RM.

But I do not claim to understand all the wide-wide-ever-so-wide business,
nor the FFFE/FFFF business, and other weirdness in chap 2.

I don't much like the 200-character kludge, but that's neither here nor
there.

At SofCheck, we ran into a mildly amusing (or mildly annoying) anomaly a few
times. We told GNAT to complain if lines were longer than 80 characters. We
normally eschewed tabs, but tabs crept into some sources, and GNAT didn't
complain, but then when some tool turned tabs into spaces, GNAT _did_
complain. So some regression-tested code failed to regression test!

****************************************************************

From: Tucker Taft
Date: Tuesday, December 6, 2005 7:20 PM

Section 2.2 seems pretty clear to me that we are talking characters (not
bytes), and it is independent of representation. Character_Tabulation is a
character that is a format effector, and is the one format effector that
counts as one of the 200 characters, since the others are all considered
line terminators which are specifically not counted as part of the 200.

It seems you could choose to interpret the physical HT character as
representing the appropriate number of logical space characters rather than
a single Character_Tabulation character, and your customers might thank you
for it (or might not, if you don't agree about where the tab stops are).

****************************************************************

From: Robert Dewar
Date: Tuesday, December 6, 2005 8:45 PM

> Section 2.2 seems pretty clear to me that we are talking
> characters (not bytes), and it is independent of representation.
> Character_Tabulation is a character that is a format effector,
> and is the one format effector that counts as one of the 200
> characters, since the others are all considered line terminators
> which are specifically not counted as part of the 200.

Right, which is indeed yes and yes to my questions, though I am not so quick
to agree it is pretty clear. After all, the RM has nothing to say per se
about how the source is represented.

> It seems you could choose to interpret the physical HT character
> as representing the appropriate number of logical space characters
> rather than a single Character_Tabulation character, and your
> customers might thank you for it (or might not, if you don't
> agree about where the tab stops are).
Well once you allow that interpretation, I think you allow a lot of other
possible "deviations" from the simple model above, so I conclude that a
physical HT character had better represent a tabulation character.

****************************************************************

From: Robert Dewar
Date: Tuesday, December 6, 2005 8:52 PM

> What else could "line length" mean? Some notion of "number of bytes"
> is not present in the RM.

Line length could mean the length of the line logically including the effect
of HT as blanks. And Tuck after all considers this a possible interpretation
(what you say in this case is that the hard HT in the source is simply a
representation of multiple blanks), but I think that's wrong, because if an
HT represents a sequence of blanks, how do you represent HT itself ...

> At SofCheck, we ran into a mildly amusing (or mildly annoying) anomaly a few
> times. We told GNAT to complain if lines were longer than 80 characters. We
> normally eschewed tabs, but tabs crept into some sources, and GNAT didn't
> complain, but then when some tool turned tabs into spaces, GNAT _did_
> complain. So some regression-tested code failed to regression test!

I strongly recommend not allowing tabs to creep into sources (GNAT style
option -gnatyh).

****************************************************************

From: Robert Dewar
Date: Tuesday, December 6, 2005 9:02 PM

Actually I come to realize that the 200 character limit of the RM is
probably different in nature from a style check that limits lines to a
certain length. Why? Because a style check is really about limiting the
visual width of the source (I can't see why else you would impose a
particular limit like 79, the GNAT standard, or 80).

Well, in visual width terms, wide characters occupy only one column with
most representations, assuming we are on a display which handles the
appropriate wide character representation. Except ... brackets notation for
wide characters, e.g. ["1234"], is *precisely* intended for environments
where you do not want to rely on such graphic interpretations, and I think
in a style check mode ["1234"] should count as 8 characters, whereas the
corresponding UTF-8 sequence would count only as 1 character.

Similarly, I think tabs should count as the equivalent number of spaces.
Note that Bob Duff's story about blankifying tabs is a convincing argument
for this treatment.

To satisfy the 200 character (in the different yes/yes sense) limit of the
RM, all we have to do is have a default style check limit that is large
enough (in GNAT, the default line limit is 32767 in the style check sense,
as described above, which is easily enough to guarantee 200 in the RM
sense).

Now, question: is it really important for portability to have a mode in
which the compiler enforces 200 in the character count sense?

A bit of an obscure mess I am afraid ...

****************************************************************

From: Tucker Taft
Date: Tuesday, December 6, 2005 9:53 PM

We have always supported exactly 200 characters as the max, to ensure
portability. I can't imagine any good reason to use much more than 100 --
the code definitely becomes harder to read at some point. With static
string concatenation, there is no need to have particularly long string
literals.

I think your idea of making the GNAT limit be a "visual" limit would be
great, because as you say, that is the point.
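****************************************************************

[Not part of the original thread: a minimal sketch, in Ada, of the "visual
width" counting discussed above, assuming 8-column tab stops. Each element
of the Wide_String is taken to be one already-decoded source character, and
the name Visual_Width is purely illustrative -- it is not an existing GNAT
routine.]

   function Visual_Width (Line : Wide_String) return Natural is
      --  Width of Line in display columns: HT advances to the next
      --  8-column tab stop, every other character counts as one column.
      HT  : constant Wide_Character := Wide_Character'Val (9);
      Col : Natural := 0;
   begin
      for J in Line'Range loop
         if Line (J) = HT then
            Col := Col + (8 - Col mod 8);
         else
            Col := Col + 1;
         end if;
      end loop;
      return Col;
   end Visual_Width;

Under this counting a line of 25 tabs is 200 columns wide, although in the
RM's character count it is only 25 characters.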
****************************************************************

From: Robert Dewar
Date: Tuesday, December 6, 2005 10:32 PM

> We have always supported exactly 200 characters as the max,
> to ensure portability. I can't imagine any good reason to
> use much more than 100 -- the code definitely becomes
> harder to read at some point. With static string concatenation,
> there is no need to have particularly long string literals.

The time that long lines are useful is for automatically generated stuff;
sometimes we have encountered tools generating very long lines.

> I think your idea of making the GNAT limit be a "visual"
> limit would be great, because as you say, that is the point.

OK, that seems right to me too ...

****************************************************************

From: Tucker Taft
Date: Tuesday, December 6, 2005 10:45 PM

> The time that long lines are useful is for automatically generated
> stuff, sometimes we have encountered tools generating very long
> lines.

Good point, though perhaps judicious use of "fold -s" (or a slightly smarter
one that knows about string literals) might handle that situation. In any
case it seems to me that the "default mode" is safe at limiting lines to 200
characters, while an "extended line" mode could be used for the odd
machine-generated source file.

****************************************************************

From: Jean-Pierre Rosen
Date: Wednesday, December 7, 2005 3:15 AM

> Now, question, is it really important for portability
> to have a mode in which the compiler enforces 200 in
> the character count sense.
>
> A bit of an obscure mess I am afraid ...

To me, the important thing is the "at least" part, which implies that
identifiers up to 200 characters long are guaranteed portable.

And for identifiers, 200 characters are close to infinity...

****************************************************************

From: Robert Dewar
Date: Wednesday, December 7, 2005 6:23 AM

> To me, the important thing is the "at least" part, which implies that
> identifiers up to 200 characters long are guaranteed portable.
>
> And for identifiers, 200 characters are close to infinity...

Not for the ACATS tests, where identifiers mysteriously expand to meet your
actual max line length (and now for the ACATS tests we have the interesting
question of what you say is your max line length, given these two different
interpretations).

I would say the ACATS test is confused here, and should just test
identifiers up to 200 characters. In fact the language would be improved by
saying max identifier length is 200, period. No point in allowing junk
non-portable identifiers longer than this, and even less point in having the
ACATS enforce this allowance if your compiler allows longer source lines.

****************************************************************

From: Pascal Leroy
Date: Wednesday, December 7, 2005 4:15 AM

> And Tuck after all
> considers this a possible interpretation (what you say in
> this case is that the hard HT in the source is simply a
> representation of multiple blanks), but I think that's wrong,
> because if an HT represents a sequence of blanks, how do you
> represent HT itself

Bogus question: you just choose a convention, perhaps using some unassigned
code point, or some otherwise impossible encoding. That would be distinctly
unhelpful to users but it would be legitimate, so surely it doesn't
invalidate Tuck's interpretation (with which I happen to agree).
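****************************************************************

[Not part of the original thread: a small sketch of the static string
concatenation Tucker mentions above, which lets a long value be written
without any long source line. The package and constant names are invented
for illustration.]

   package Long_Literal_Demo is
      --  The concatenation of string literals is itself a static string
      --  expression, so the value below can be built from pieces that each
      --  fit comfortably within an 80-column (or 200-character) line.
      Banner : constant String :=
        "This message is deliberately longer than any single source line "
        & "in this file, yet each individual literal stays well inside "
        & "an ordinary 80-column layout.";
   end Long_Literal_Demo;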
**************************************************************** From: Robert Dewar Date: Wednesday, December 7, 2005 6:25 AM Well I take Tuck's interpretation to say that a line of 200 tabs could be interpreted as being longer than 200 characters, and therefore rejected. I agree this is allowable on legalistic terms, but it would seem a very bad choice to me in terms of portability. Are you really saying you think this is an OK approach? **************************************************************** From: Pascal Leroy Date: Wednesday, December 7, 2005 6:52 AM > Well I take Tuck's interpretation to say that a line of 200 > tabs could be interpreted as being longer than 200 > characters, and therefore rejected. That's my interpretation too. > I agree this is allowable on legalistic terms, but it would > seem a very bad choice to me in terms of portability. I agree with both parts of this sentence. > Are you really saying you think this is an OK approach? As you say, this is legitimate but unhelpful. At any rate we don't want to be in the business of legislating the effect of tabs, so I'd rather leave these questions unanswered. A user who writes a program with (large numbers of) tabs is treading on thin ice anyway. In particular it is very much unclear what is the effect of a tab in a string literal (illegal? some number of spaces? how many?). **************************************************************** From: Robert Dewar Date: Wednesday, December 7, 2005 7:02 AM Right but don't talk in terms of extreme usage, just think of someone using an editor that routinely substitutes tabs for spaces, and now has a line longer than 200 but does not realize it. It would be nice if that program were clearly portable even with the tab characters. Sometimes, I think we have gone too far with this insistence on not specifying the representation of sources. At the very least, the RM should have an implementation advice section saying how sources are to be represented in "normal" environments. **************************************************************** From: Pascal Leroy Date: Wednesday, December 7, 2005 8:06 AM > Right but don't talk in terms of extreme usage, just think of > someone using an editor that routinely substitutes tabs for > spaces, and now has a line longer than 200 but does not > realize it. It would be nice if that program were clearly > portable even with the tab characters. Like with every non-portability that can be detected statically, it would be nice if a warning were emitted in this situation. > Sometimes, I think we have gone too far with this insistence > on not specifying the representation of sources. At the very > least, the RM should have an implementation advice section > saying how sources are to be represented in "normal" environments. I have some sympathy with this viewpoint. Long long ago we had an environment where we didn't even have a source text, but we discovered the hard way that it cannot be made to work right in all cases. So in practice there are source texts, and there is only a rather small set of sensible choices for the source representation. On the other hand, I am not sure that I want to get into the catfight of arguing about the meaning of "tabs, line terminators and everything". Not to mention the difficulty of agreeing on a representation of Unicode characters. We only support UTF-8, so an IA that would recommend anything else would not be acceptable to me. I guess I am arguing in favor of letting that sleeping dog lie... 
**************************************************************** From: Robert Dewar Date: Wednesday, December 7, 2005 8:13 AM > I have some sympathy with this viewpoint. Long long ago we had an > environment where we didn't even have a source text, but we discovered the > hard way that it cannot be made to work right in all cases. So in > practice there are source texts, and there is only a rather small set of > sensible choices for the source representation. Agreed. > On the other hand, I am not sure that I want to get into the catfight of > arguing about the meaning of "tabs, line terminators and everything". Not > to mention the difficulty of agreeing on a representation of Unicode > characters. We only support UTF-8, so an IA that would recommend anything > else would not be acceptable to me. I actually think that all compilers handle "tabs, line terminators and everything" pretty consistently, and don't think it would be a catfight. As for wide characters, I think it would be fine to mandate that compilers (or compilation systems, preprocessors etc are fine) handle UTF-8 and brackets notation period. I think you want to include brackets notation, since this is a historically portable form used by the ACATS tests, but certainly changing the ACATS tests to UTF-8 is discussable and should be discussed. > I guess I am arguing in favor of letting that sleeping dog lie... Well unfortunately the line length limit issue is not just theoretical for us, and this dog just woke up :-) **************************************************************** From: Robert A. Duff Date: Wednesday, December 7, 2005 8:28 AM > Well I take Tuck's interpretation to say that a line of 200 tabs could > be interpreted as being longer than 200 characters, and therefore > rejected. Right. Earlier, I said a tab counts as one character. And that's true, but if a tab represents spaces, then it's not a tab. Some other sequence of bytes might represent tab. Note that several rules can be made meaningless by this reasoning. For example, somebody might think it would be nice to allow string literals to cross line boundaries. So define LF and CR as representations of nothing, and NUL as a representation of LF, and NUL-NUL as a representation of NUL, and ... I can't get too hot and bothered by such possibilities, since even without them, it would be trivial to write a conversion program, if I really want to use some weird representation (maybe = represents := and == represents =). The ARG obviously has nothing to say about text-file conversion utilities! > I agree this is allowable on legalistic terms, but it would seem a very > bad choice to me in terms of portability. I agree. > Are you really saying you think this is an OK approach? Depends what you mean by "OK". As you said in the previous paragraph, it is allowed, but .... However, this is the ARG -- we're not in the business of advising implementers about wise choices of source representation. (Too bad, perhaps -- in this day and age it's probably feasible to define a portable source representation.) I do not think an implementation should fail validation because it unwisely chooses a weird source representation. And GNAT supports _way_ more than 200, however you count, so there's no danger there. 
I would prefer the GNAT line-length style check to count tabs as (up to) 8 blanks, for the reason I illustrated with my little anecdote yesterday -- namely, it eases working on a project where some folks are continually inserting tabs, while automatic tools periodically remove them. But this has little to do with the 200-character limit, because nobody sets that option to 200 -- they set it to 79, or 80, or perhaps 100. I don't think it's important to have a mode that guarantees portability _from_ GNAT in this regard. But wait! Wouldn't my preference above be an upward incompatible change to GNAT? P.S. All tab characters should be loaded into a rocket ship and launched into the center of the Sun, never to be seen or heard from again. How much human intellectual effort has been wasted on such nonsense?! **************************************************************** From: Gary Dismukes Date: Wednesday, December 7, 2005 12:56 PM > P.S. All tab characters should be loaded into a rocket ship and launched into > the center of the Sun, never to be seen or heard from again. How much human > intellectual effort has been wasted on such nonsense?! Hear, hear! (Though I'd suggest a black hole:-) I despise tabs. **************************************************************** From: Robert Dewar Date: Wednesday, December 7, 2005 2:50 PM > Note that several rules can be made meaningless by this reasoning. For example, > somebody might think it would be nice to allow string literals to cross line > boundaries. So define LF and CR as representations of nothing, and NUL as a > representation of LF, and NUL-NUL as a representation of NUL, and ... > > I can't get too hot and bothered by such possibilities Exactly (though come to think of it, I am surprised that Ada 2005 did not fix the problem of being able to write long string literals in a mechanical way -- as we know concatenation does not always work). > I would prefer the GNAT line-length style check to count tabs as (up to) 8 > blanks, for the reason I illustrated with my little anecdote yesterday -- > namely, it eases working on a project where some folks are continually inserting > tabs, while automatic tools periodically remove them. But this has little to do > with the 200-character limit, because nobody sets that option to 200 -- they set > it to 79, or 80, or perhaps 100. Right I agree > I don't think it's important to have a mode that guarantees portability _from_ > GNAT in this regard. > > But wait! Wouldn't my preference above be an upward incompatible change to > GNAT? More like fixing a bug I think, very few people use tabs in Ada code in our experience. > P.S. All tab characters should be loaded into a rocket ship and launched into > the center of the Sun, never to be seen or heard from again. How much human > intellectual effort has been wasted on such nonsense?! When I wrote operating systems for Honeywell, I took the view that tabs were simply a way of compressing blanks in files, and it was a characteristic of a file whether it allowed this compression. Reading the file expanded the blanks, and writing the file put them back in. Anything else is insanity :-) **************************************************************** From: Randy Brukardt Date: Wednesday, December 7, 2005 3:15 PM ... > When I wrote operating systems for Honeywell, I took the view that tabs were > simply a way of compressing blanks in files, and it was a characteristic of > a file whether it allowed this compression. 
Reading the file expanded the > blanks, and writing the file put them back in.

Indeed, the reason that all of the Janus/Ada files have tabs is that we
needed that compression back in the old days -- else the source files were
much larger and they didn't fit on a floppy disk. Since the machines in
question *only* had floppy disks, it was important to get an entire pass on
one disk (else the program couldn't be linked). I suppose the reason for
using them is a bit OBE, but it takes me a long time to change... :-)

My DOS editor works this way, converting to blanks on reading and
reconverting to tabs on writing.

> Anything else is insanity :-)

Alas, I can't use that mode in my DOS editor, because it does this in Ada
strings -- which we know is wrong. So I have to turn that conversion off.
That apparently makes me insane (since I still insist on the tabs in the
source files), which probably confirms what some of you think anyway. :-) :-)

****************************************************************

From: Robert Dewar
Date: Wednesday, December 7, 2005 3:16 PM

> Alas, I can't use that mode in my DOS editor, because it does this in Ada
> strings -- which we know is wrong. So I have to turn that conversion off.

Yes, you can: if tabs in files are only ways of compressing blanks, and get
expanded back to blanks on reading the file, it is just fine to do this
within strings.

****************************************************************

From: Robert A. Duff
Date: Wednesday, December 7, 2005 4:35 PM

> When I wrote operating systems for Honeywell, I took the view that tabs were
> simply a way of compressing blanks in files, and it was a characteristic of
> a file whether it allowed this compression. Reading the file expanded the
> blanks, and writing the file put them back in.

So no software outside the OS / file system sees the tabs? I could live with
that. But then why not use a much more space-efficient compression
algorithm? And make it work for binary files, too.

****************************************************************

From: Robert Dewar
Date: Wednesday, December 7, 2005 5:46 PM

Indeed, that would be even better, but in those days, elaborate compression
was out of the question given available compute power (we were dealing with
processors about 1/8000 of the power of current PCs).

****************************************************************

From: Pascal Leroy
Date: Wednesday, December 7, 2005 8:35 AM

> I think you want to include
> brackets notation, since this is a historically portable form
> used by the ACATS tests, but certainly changing the ACATS
> tests to UTF-8 is discussable and should be discussed.

Just for the record, we don't plan to support the bracket notation as a
first-class format in our compiler (or any other tool). It's fine with us if
the ACATS uses that notation: we'll just convert the files to UTF-8 using
some preprocessor. But native support for the bracket notation would have
absolutely zero benefit for our customers, and it would be quite a bit of
work.

****************************************************************

From: Robert Dewar
Date: Wednesday, December 7, 2005 10:14 AM

Fair enough, a preprocessor solution is fine. We find it useful in GNAT to
support the brackets notation natively, because it means that we can easily
deal with wide characters without needing special fonts and fiddling.
This is useful in tests, and in the runtime library:

   package Ada.Numerics is
      pragma Pure;

      Argument_Error : exception;

      Pi : constant :=
             3.14159_26535_89793_23846_26433_83279_50288_41971_69399_37511;

      ["03C0"] : constant := Pi;
      --  This is the greek letter Pi (for Ada 2005 AI-388). Note that it is
      --  conforming to have this present even in Ada 95 mode, because there is
      --  no way for a normal mode Ada 95 program to reference this identifier.

      e : constant :=
             2.71828_18284_59045_23536_02874_71352_66249_77572_47093_69996;

   end Ada.Numerics;

We really do NOT want to go to wide character encodings in the standard
run-time library :-)

****************************************************************

From: Randy Brukardt
Date: Wednesday, December 7, 2005 12:59 PM

> > to mention the difficulty of agreeing on a representation of Unicode
> > characters. We only support UTF-8, so an IA that would recommend anything
> > else would not be acceptable to me.
>
> I actually think that all compilers handle "tabs, line terminators and
> everything" pretty consistently, and don't think it would be a catfight.

I agree. I would be stunned if there were Ada compilers that couldn't
natively compile the ACATS tests, which are formatted in MS-DOS ASCII format
(that is, both CR and LF ending lines), with each control character mapped
as it is implied in the RM. Remember that the ACATS even has exhaustive
tests checking that you can't put control characters in string and character
literals. I don't remember any real disputes on these tests. (We used to
split them because the ^Z ended the file, but that's all I know of.)

So, as a practical matter, Ada does have a de facto, defined source
representation (excluding wide character issues). It's annoying that the
standard doesn't recognize the truth of this.

> As for wide characters, I think it would be fine to mandate that compilers
> (or compilation systems, preprocessors etc are fine) handle UTF-8 and
> brackets notation period. I think you want to include brackets notation,
> since this is a historically portable form used by the ACATS tests, but
> certainly changing the ACATS tests to UTF-8 is discussable and should be
> discussed.

My preference is to use UTF-8 representations for all wide and wide wide
character tests. But that certainly is open for discussion. And if we indeed
do that, I would think that we again would have a de facto default source
representation. And again it would be annoying that the standard doesn't
recognize that.

****************************************************************

From: Robert Dewar
Date: Wednesday, December 7, 2005 2:56 PM

> My preference is to use UTF-8 representations for all wide and wide wide
> character tests. But that certainly is open for discussion.

So let me argue against this. There are three problems with this proposal:

1. Very often you will be reading ACATS source texts in environments where
UTF-8 sequences display as gobbledygook (this is becoming a technical term
for wide stuff not properly displayed :-) That's distinctly unhelpful.

2. Even if you are in an environment where wide characters are displayed,
they may not be displayed using the approved ISO 10646 graphics.

3. Even if you are in an environment where wide characters are displayed
using the approved ISO 10646 graphics, you may be unfamiliar with e.g. TAMIL
SIGN VISARGA, so you will still have difficulty understanding the test,
whereas ["0B83"] is unambiguously interpretable in all three cases.
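****************************************************************

[Not part of the original thread: a side-by-side look at the two source
forms being compared. The package name Pi_Demo is invented, and the
bracketed identifier relies on GNAT's brackets encoding, which is not
something the RM requires.]

   with Ada.Numerics;
   package Pi_Demo is
      --  In brackets encoding the identifier below is spelled with eight
      --  plain Latin-1 characters; in a UTF-8 source it would be the single
      --  Greek letter pi, encoded as the two bytes 16#CF# 16#80#. In the
      --  RM's count it is one character either way; in the "visual" style
      --  check discussed earlier, the bracketed form would count as eight.
      ["03C0"] : constant := Ada.Numerics.Pi;
   end Pi_Demo;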
****************************************************************

From: Pascal Leroy
Date: Thursday, December 8, 2005 1:46 AM

Robert opined:

> 1. Very often you will be reading ACATS source texts in
> environments where UTF-8 sequences display as gobbledygook
> (this is becoming a technical term for wide stuff not
> properly displayed :-) That's distinctly unhelpful.
>
> 2. Even if you are in an environment where wide characters
> are displayed, they may not be displayed using the approved
> ISO 10646 graphics.
>
> 3. Even if you are in an environment where wide characters
> are displayed using the approved ISO 10646 graphics, you may
> be unfamiliar with e.g. TAMIL SIGN VISARGA, so you will still
> have difficulty understanding the test, whereas ["0B83"] is
> unambiguously interpretable in all three cases.

I cannot get excited about (3). I don't believe that ["0B83"] is giving you
any useful information regarding the character. I for one would certainly
have to look at the Unicode properties file to find out if it's a letter, a
number, etc. So hopefully the ACATS tests for wide-wide characters will
include comments like "TAMIL SIGN VISARGA is a letter_other, and this test
checks that it can appear anywhere in an identifier".

I agree that (1) and (2) are annoyances. On the other hand, as Randy points
out, the ACATS gives us an opportunity to establish a de facto source
representation, and that's A Good Thing for the community. As we all know,
the ACATS is a powerful way to establish commonality beyond the scope of the
RM, if only because filing a petition is such a pain.

We have to keep in mind that the ACATS is only read by compiler writers and
validation people, and they are used to reading various forms of
gobbledygook anyway. So yeah, it's going to make their life a bit more
complicated, but shrug.

All in all, I guess I am siding with Randy here.

****************************************************************
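[Not part of the original thread: a sketch of the kind of "trivial
conversion program" mentioned earlier, treating tabs purely as compressed
blanks in the Honeywell style and expanding them to 8-column tab stops. The
procedure name and the choice of tab stop are illustrative only.]

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Expand_Tabs is
      --  Copies standard input to standard output, replacing each HT by
      --  spaces up to the next 8-column tab stop.
      Tab : constant Character := Character'Val (9);
   begin
      while not End_Of_File loop
         declare
            Line : constant String := Get_Line;
            Col  : Natural := 0;
         begin
            for J in Line'Range loop
               if Line (J) = Tab then
                  for K in 1 .. 8 - Col mod 8 loop
                     Put (' ');
                     Col := Col + 1;
                  end loop;
               else
                  Put (Line (J));
                  Col := Col + 1;
               end if;
            end loop;
            New_Line;
         end;
      end loop;
   end Expand_Tabs;

****************************************************************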