Version 1.1 of acs/ac-00127.txt

!standard 2.2(14)          06-01-06 AC95-00127/01
!class Amendment 06-01-06
!status received no action 06-01-06
!status received 05-12-06
!subject Line length, wide characters, and source representation
!summary
!appendix

From: Robert Dewar
Date: Tuesday, December  6, 2005  5:26 AM

Can someone help on this ... what exactly does the RM mean
by line length (when it recommends that at least 200 be
allowed)?

Is a tab (HT) one character?

Is a wide character represented in three bytes one character?

I would think the answers are yes and yes ... but I was not
really able to prove it from the RM.

****************************************************************

From: Robert A. Duff
Date: Tuesday, December  6, 2005  6:10 AM

I also think the answers are yes and yes.
I'm also not _sure_ I can prove that from the RM, but it seems
likely that "line length" should be interpreted to mean
number of characters, and that we're talking about (source) characters
as defined in 2.1.  So HT is intended to be one character,
and a character's representation (e.g. as three bytes) is
irrelevant -- it's still one character.
What else could "line length" mean?  Some notion of "number of bytes"
is not present in the RM.

But I do not claim to understand all the wide-wide-ever-so-wide
business, nor the FFFE/FFFF business, and other weirdness in chap 2.

I don't much like the 200-character kludge, but that's neither
here nor there.

At SofCheck, we ran into a mildly amusing (or mildly annoying) anomaly a few
times.  We told GNAT to complain if lines were longer than 80 characters.  We
normally eschewed tabs, but tabs crept into some sources, and GNAT didn't
complain, but then when some tool turned tabs into spaces, GNAT _did_
complain.  So some regression-tested code failed to regression test!

****************************************************************

From: Tucker Taft
Date: Tuesday, December  6, 2005  7:20 PM

Section 2.2 seems pretty clear to me that we are talking
characters (not bytes), and it is independent of representation.
Character_Tabulation is a character that is a format effector,
and is the one format effector that counts as one of the 200
characters, since the others are all considered line terminators
which are specifically not counted as part of the 200.
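
For illustration only, here is a minimal sketch of a checker in that
character-counting sense.  The names are made up for this discussion, it
is not any compiler's actual code, and it assumes the Ada 2005 form of
Text_IO.Get_Line that returns a String:

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Check_Line_Lengths is
      Max_Length  : constant := 200;
      Line_Number : Natural  := 0;
   begin
      while not End_Of_File loop
         declare
            Line : constant String := Get_Line;
         begin
            Line_Number := Line_Number + 1;
            --  Line'Length counts characters as defined in 2.1: an HT in
            --  the line contributes exactly 1, however an editor might
            --  later display it.  For a multibyte encoding such as UTF-8,
            --  the count would have to be made on decoded characters
            --  rather than on the raw bytes.
            if Line'Length > Max_Length then
               Put_Line ("line" & Natural'Image (Line_Number) & " has"
                         & Natural'Image (Line'Length) & " characters");
            end if;
         end;
      end loop;
   end Check_Line_Lengths;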

It seems you could choose to interpret the physical HT character
as representing the appropriate number of logical space characters
rather than a single Character_Tabulation character, and your
customers might thank you for it (or might not, if you don't
agree about where the tab stops are).

****************************************************************

From: Robert Dewar
Date: Tuesday, December  6, 2005  8:45 PM

> Section 2.2 seems pretty clear to me that we are talking
> characters (not bytes), and it is independent of representation.
> Character_Tabulation is a character that is a format effector,
> and is the one format effector that counts as one of the 200
> characters, since the others are all considered line terminators
> which are specifically not counted as part of the 200.

Right, which is indeed yes and yes to my questions, though I
am not so quick to agree it is pretty clear. After all, the
RM has nothing to say per se about how the source is represented.

> It seems you could choose to interpret the physical HT character
> as representing the appropriate number of logical space characters
> rather than a single Character_Tabulation character, and your
> customers might thank you for it (or might not, if you don't
> agree about where the tab stops are).

Well, once you allow that interpretation, I think you allow a
lot of other possible "deviations" from the simple model above,
so I conclude that a physical HT character had better
represent a tabulation character.

****************************************************************

From: Robert Dewar
Date: Tuesday, December  6, 2005  8:52 PM

> What else could "line length" mean?  Some notion of "number of bytes"
> is not present in the RM.

Line length could mean the length of the line logically, including
the effect of HT as blanks. And Tuck after all considers this a
possible interpretation (what you say in this case is that the
hard HT in the source is simply a representation of multiple
blanks), but I think that's wrong, because if an HT represents
a sequence of blanks, how do you represent HT itself?

...
> At SofCheck, we ran into a mildly amusing (or mildly annoying) anomaly a few
> times.  We told GNAT to complain if lines were longer than 80 characters.  We
> normally eschewed tabs, but tabs crept into some sources, and GNAT didn't
> complain, but then when some tool turned tabs into spaces, GNAT _did_
> complain.  So some regression-tested code failed to regression test!

I strongly recommend not allowing tabs to creep into sources
(GNAT style option -gnatyh)

****************************************************************

From: Robert Dewar
Date: Tuesday, December  6, 2005  9:02 PM

Actually I have come to realize that the 200-character limit
of the RM is probably different in nature from a style
check that limits lines to a certain length.

Why? Because a style check is really about limiting the
visual width of the source (I can't see why else you would
impose a particular limit like 79, the GNAT standard, or
80).

Well, in visual width terms, wide characters occupy only
one character position with most representations, assuming we are
on a display which handles the appropriate wide character
representation.

Except ... brackets notation for wide characters, e.g.
["1234"] is *precisely* intended for environments where
you do not want to rely on such graphic interpretations,
and I think in a style check mode ["1234"] should count
as 8 characters, whereas the corresponding UTF-8 sequence
would count only as 1 character.

Similarly, I think tabs should count as spaces. Note that
Bob Duff's story about blankifying tabs is a convincing
argument for this treatment.
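
As a sketch of what such a visual count might look like (illustrative
only, not GNAT's actual code; it assumes UTF-8 as the wide character
encoding and a tab stop every 8 columns):

   function Visual_Width (Line : String) return Natural is
      Width : Natural  := 0;
      J     : Positive := Line'First;
   begin
      while J <= Line'Last loop
         if Line (J) = ASCII.HT then
            --  A tab advances to the next multiple of 8
            Width := Width + (8 - Width mod 8);
            J := J + 1;
         elsif Character'Pos (Line (J)) >= 16#C0# then
            --  Lead byte of a UTF-8 sequence: count one column for the
            --  whole sequence and skip its continuation bytes
            Width := Width + 1;
            J := J + 1;
            while J <= Line'Last
              and then Character'Pos (Line (J)) in 16#80# .. 16#BF#
            loop
               J := J + 1;
            end loop;
         else
            --  Ordinary characters, including all eight characters of a
            --  brackets sequence such as ["1234"], count one column each
            Width := Width + 1;
            J := J + 1;
         end if;
      end loop;
      return Width;
   end Visual_Width;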

To satisfy the 200-character limit of the RM (in the
different yes/yes sense), all we have to do is to have
a default style check that is large enough (in GNAT,
the default line limit is 32767, in the style check
sense described above), which is easily enough to
guarantee 200 in the RM sense.

Now, question: is it really important for portability
to have a mode in which the compiler enforces 200 in
the character count sense?

A bit of an obscure mess I am afraid ...

****************************************************************

From: Tucker Taft
Date: Tuesday, December  6, 2005  9:53 PM

We have always supported exactly 200 characters as the max,
to ensure portability.  I can't imagine any good reason to
use much more than 100 -- the code definitely becomes
harder to read at some point.  With static string concatenation,
there is no need to have particularly long string literals.
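
For example, a long message can be written as (the text here is made up,
purely for illustration):

   Message : constant String :=
     "a literal that would otherwise need one very long line "
     & "can be split across several shorter lines "
     & "and joined with static concatenation";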

I think your idea of making the GNAT limit be a "visual"
limit would be great, because as you say, that is the point.

****************************************************************

From: Robert Dewar
Date: Tuesday, December  6, 2005  10:32 PM

> We have always supported exactly 200 characters as the max,
> to ensure portability.  I can't imagine any good reason to
> use much more than 100 -- the code definitely becomes
> harder to read at some point.  With static string concatenation,
> there is no need to have particularly long string literals.

The time that long lines are useful is for automatically generated
stuff; we have sometimes encountered tools generating very long
lines.

> I think your idea of making the GNAT limit be a "visual"
> limit would be great, because as you say, that is the point.

OK, that seems right to me too ...

****************************************************************

From: Tucker Taft
Date: Tuesday, December  6, 2005  10:45 PM

> The time that long lines are useful is for automatically generated
> stuff; we have sometimes encountered tools generating very long
> lines.

Good point, though perhaps judicious use of "fold -s" (or
a slightly smarter one that knows about string literals)
might handle that situation.  In any case it seems to me that
the "default mode" is safe at limiting lines to 200 characters,
while an "extended line" mode could be used for the odd
machine-generated source file.

****************************************************************

From: Jean-Pierre Rosen
Date: Wednesday, December  7, 2005  3:15 AM

> Now, question: is it really important for portability
> to have a mode in which the compiler enforces 200 in
> the character count sense?
>
> A bit of an obscure mess I am afraid ...

To me, the important thing is the "at least" part, which implies that
identifiers up to 200 characters long are guaranteed portable.

And for identifiers, 200 characters are close to infinity...

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  6:23 AM

> To me, the important thing is the "at least" part, which implies that
> identifiers up to 200 characters long are guaranteed portable.
>
> And for identifiers, 200 characters are close to infinity...

Not for the ACATS tests, where identifiers mysteriously expand
to meet your actual max line length (and now for the ACATS tests
we have the interesting question of what you say your max
line length is, given these two different interpretations). I would
say the ACATS test is confused here, and should just test
identifiers up to 200 characters.

In fact the language would be improved by saying the max identifier
length is 200, period. No point in allowing junk non-portable
identifiers longer than this, and even less point in having the
ACATS enforce this allowance if your compiler allows longer
source lines.

****************************************************************

From: Pascal Leroy
Date: Wednesday, December  7, 2005  4:15 AM

> And Tuck after all
> considers this a possible interpretation (what you say in
> this case is that the hard HT in the source is simply a
> representation of multiple blanks), but I think that's wrong,
> because if an HT represents a sequence of blanks, how do you
> represent HT itself?

Bogus question, you just choose a convention, perhaps using some
unassigned code point, or some otherwise impossible encoding.  That would
be distinctly unhelpful to users, but it would be legitimate, so surely it
doesn't invalidate Tuck's interpretation (with which I happen to agree).

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  6:25 AM

Well I take Tuck's interpretation to say that a line of 200 tabs could
be interpreted as being longer than 200 characters, and therefore
rejected.

I agree this is allowable on legalistic terms, but it would seem a very
bad choice to me in terms of portability.

Are you really saying you think this is an OK approach?

****************************************************************

From: Pascal Leroy
Date: Wednesday, December  7, 2005  6:52 AM

> Well I take Tuck's interpretation to say that a line of 200
> tabs could be interpreted as being longer than 200
> characters, and therefore rejected.

That's my interpretation too.

> I agree this is allowable on legalistic terms, but it would
> seem a very bad choice to me in terms of portability.

I agree with both parts of this sentence.

> Are you really saying you think this is an OK approach?

As you say, this is legitimate but unhelpful.  At any rate we don't want
to be in the business of legislating the effect of tabs, so I'd rather
leave these questions unanswered.  A user who writes a program with (large
numbers of) tabs is treading on thin ice anyway.  In particular, it is
very unclear what the effect of a tab in a string literal is (illegal?
some number of spaces? how many?).

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  7:02 AM

Right, but don't talk in terms of extreme usage, just think of someone
using an editor that routinely substitutes tabs for spaces, and now
has a line longer than 200 but does not realize it. It would be nice
if that program were clearly portable even with the tab characters.

Sometimes, I think we have gone too far with this insistence on not
specifying the representation of sources. At the very least, the RM
should have an implementation advice section saying how sources are
to be represented in "normal" environments.

****************************************************************

From: Pascal Leroy
Date: Wednesday, December  7, 2005  8:06 AM

> Right but don't talk in terms of extreme usage, just think of
> someone using an editor that routinely substitutes tabs for
> spaces, and now has a line longer than 200 but does not
> realize it. It would be nice if that program were clearly
> portable even with the tab characters.

As with every non-portability that can be detected statically, it would
be nice if a warning were emitted in this situation.

> Sometimes, I think we have gone too far with this insistence
> on not specifying the representation of sources. At the very
> least, the RM should have an implementation advice section
> saying how sources are to be represented in "normal" environments.

I have some sympathy with this viewpoint.  Long long ago we had an
environment where we didn't even have a source text, but we discovered the
hard way that it cannot be made to work right in all cases.  So in
practice there are source texts, and there is only a rather small set of
sensible choices for the source representation.

On the other hand, I am not sure that I want to get into the catfight of
arguing about the meaning of "tabs, line terminators and everything".  Not
to mention the difficulty of agreeing on a representation of Unicode
characters.  We only support UTF-8, so an IA that would recommend anything
else would not be acceptable to me.

I guess I am arguing in favor of letting that sleeping dog lie...

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  8:13 AM

> I have some sympathy with this viewpoint.  Long long ago we had an
> environment where we didn't even have a source text, but we discovered the
> hard way that it cannot be made to work right in all cases.  So in
> practice there are source texts, and there is only a rather small set of
> sensible choices for the source representation.

Agreed.

> On the other hand, I am not sure that I want to get into the catfight of
> arguing about the meaning of "tabs, line terminators and everything".  Not
> to mention the difficulty of agreeing on a representation of Unicode
> characters.  We only support UTF-8, so an IA that would recommend anything
> else would not be acceptable to me.

I actually think that all compilers handle "tabs, line terminators and
everything" pretty consistently, and don't think it would be a catfight.
As for wide characters, I think it would be fine to mandate that compilers
(or compilation systems, preprocessors, etc., are fine) handle UTF-8 and
brackets notation, period. I think you want to include brackets notation,
since this is a historically portable form used by the ACATS tests, but
certainly changing the ACATS tests to UTF-8 is discussable and should be
discussed.

> I guess I am arguing in favor of letting that sleeping dog lie...

Well unfortunately the line length limit issue is not just theoretical
for us, and this dog just woke up :-)

****************************************************************

From: Robert A. Duff
Date: Wednesday, December  7, 2005  8:28 AM

> Well I take Tuck's interpretation to say that a line of 200 tabs could
> be interpreted as being longer than 200 characters, and therefore
> rejected.

Right.  Earlier, I said a tab counts as one character.  And that's true, but if
a tab represents spaces, then it's not a tab.  Some other sequence of bytes
might represent tab.

Note that several rules can be made meaningless by this reasoning.  For example,
somebody might think it would be nice to allow string literals to cross line
boundaries.  So define LF and CR as representations of nothing, and NUL as a
representation of LF, and NUL-NUL as a representation of NUL, and ...

I can't get too hot and bothered by such possibilities, since even without them,
it would be trivial to write a conversion program, if I really want to use some
weird representation (maybe = represents := and == represents =).  The ARG
obviously has nothing to say about text-file conversion utilities!

> I agree this is allowable on legalistic terms, but it would seem a very
> bad choice to me in terms of portability.

I agree.

> Are you really saying you think this is an OK approach?

Depends what you mean by "OK".  As you said in the previous paragraph, it is
allowed, but ....  However, this is the ARG -- we're not in the business of
advising implementers about wise choices of source representation.  (Too bad,
perhaps -- in this day and age it's probably feasible to define a portable
source representation.)  I do not think an implementation should fail validation
because it unwisely chooses a weird source representation.  And GNAT supports
_way_ more than 200, however you count, so there's no danger there.

I would prefer the GNAT line-length style check to count tabs as (up to) 8
blanks, for the reason I illustrated with my little anecdote yesterday --
namely, it eases working on a project where some folks are continually inserting
tabs, while automatic tools periodically remove them.  But this has little to do
with the 200-character limit, because nobody sets that option to 200 -- they set
it to 79, or 80, or perhaps 100.

I don't think it's important to have a mode that guarantees portability _from_
GNAT in this regard.

But wait!  Wouldn't my preference above be an upward incompatible change to
GNAT?

P.S. All tab characters should be loaded into a rocket ship and launched into
the center of the Sun, never to be seen or heard from again.  How much human
intellectual effort has been wasted on such nonsense?!

****************************************************************

From: Gary Dismukes
Date: Wednesday, December  7, 2005  12:56 PM

> P.S. All tab characters should be loaded into a rocket ship and launched into
> the center of the Sun, never to be seen or heard from again.  How much human
> intellectual effort has been wasted on such nonsense?!

Hear, hear!  (Though I'd suggest a black hole:-)  I despise tabs.

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  2:50 PM

> Note that several rules can be made meaningless by this reasoning.  For example,
> somebody might think it would be nice to allow string literals to cross line
> boundaries.  So define LF and CR as representations of nothing, and NUL as a
> representation of LF, and NUL-NUL as a representation of NUL, and ...
>
> I can't get too hot and bothered by such possibilities

Exactly (though come to think of it, I am surprised that Ada 2005 did not
fix the problem of being able to write long string literals in a mechanical
way -- as we know concatenation does not always work).

> I would prefer the GNAT line-length style check to count tabs as (up to) 8
> blanks, for the reason I illustrated with my little anecdote yesterday --
> namely, it eases working on a project where some folks are continually inserting
> tabs, while automatic tools periodically remove them.  But this has little to do
> with the 200-character limit, because nobody sets that option to 200 -- they set
> it to 79, or 80, or perhaps 100.

Right, I agree.

> I don't think it's important to have a mode that guarantees portability _from_
> GNAT in this regard.
>
> But wait!  Wouldn't my preference above be an upward incompatible change to
> GNAT?

More like fixing a bug, I think; very few people use tabs in Ada code
in our experience.

> P.S. All tab characters should be loaded into a rocket ship and launched into
> the center of the Sun, never to be seen or heard from again.  How much human
> intellectual effort has been wasted on such nonsense?!

When I wrote operating systems for Honeywell, I took the view that tabs were
simply a way of compressing blanks in files, and it was a characteristic of
a file whether it allowed this compression. Reading the file expanded the
blanks, and writing the file put them back in.

Anything else is insanity :-)

****************************************************************

From: Randy Brukardt
Date: Wednesday, December  7, 2005  3:15 PM

...
> When I wrote operating systems for Honeywell, I took the view that tabs were
> simply a way of compressing blanks in files, and it was a characteristic of
> a file whether it allowed this compression. Reading the file expanded the
> blanks, and writing the file put them back in.

Indeed, the reason that all of the Janus/Ada files have tabs is that we
needed that compression back in the old days -- else the source files were
much larger and they didn't fit on a floppy disk. Since the machines in
question *only* had floppy disks, it was important to get an entire pass on
one disk (else the program couldn't be linked).

I suppose the reason for using them is a bit OBE, but it takes me a long
time to change... :-)

My DOS editor works this way, converting to blanks on reading and
reconverting to tabs on writing.

> Anything else is insanity :-)

Alas, I can't use that mode in my DOS editor, because it does this in Ada
strings -- which we know is wrong. So I have to turn that conversion off.

That apparently makes me insane (since I still insist on the tabs in the
source files), which probably confirms what some of you think anyway. :-)
:-)

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  3:16 PM

> Alas, I can't use that mode in my DOS editor, because it does this in Ada
> strings -- which we know is wrong. So I have to turn that conversion off.

Yes, you can: if tabs in files are only a way of compressing blanks, and
get expanded back to blanks on reading the file, it is just fine to do
this within strings.

****************************************************************

From: Robert A. Duff
Date: Wednesday, December  7, 2005  4:35 PM

> When I wrote operating systems for Honeywell, I took the view that tabs were
> simply a way of compressing blanks in files, and it was a characteristic of
> a file whether it allowed this compression. Reading the file expanded the
> blanks, and writing the file put them back in.

So no software outside the OS / file system sees the tabs?
I could live with that.

But then why not use a much more space-efficient compression algorithm?
And make it work for binary files, too.

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  5:46 PM

Indeed, that would be even better, but in those days, elaborate
compression was out of the question given available compute
power (we were dealing with processors about 1/8000 of the
power of current PC's).

****************************************************************

From: Pascal Leroy
Date: Wednesday, December  7, 2005  8:35 AM

> I think you want to include
> brackets notation, since this is a historically portable form
> used by the ACATS tests, but certainly changing the ACATS
> tests to UTF-8 is discussable and should be discussed.

Just for the record, we don't plan to support the bracket notation as a
first-class format in our compiler (or any other tool).  It's fine with us
if the ACATS uses that notation: we'll just convert the files to UTF-8
using some preprocessor.  But native support for the bracket notation
would have absolutely zero benefit for our customers, and it would be
quite a bit of work.
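
For what it's worth, the encoding step such a preprocessor performs is
small.  Here is a sketch for code points up to 16#FFFF#, purely
illustrative and not taken from any actual tool:

   function To_UTF_8 (Code : Natural) return String is
   begin
      if Code < 16#80# then
         return (1 => Character'Val (Code));
      elsif Code < 16#800# then
         return (Character'Val (16#C0# + Code / 2**6),
                 Character'Val (16#80# + Code mod 2**6));
      else
         return (Character'Val (16#E0# + Code / 2**12),
                 Character'Val (16#80# + (Code / 2**6) mod 2**6),
                 Character'Val (16#80# + Code mod 2**6));
      end if;
   end To_UTF_8;

For example, ["03C0"] (GREEK SMALL LETTER PI) becomes the two bytes
16#CF# 16#80#.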

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  10:14 AM

Fair enough, a preprocessor solution is fine. We find it useful in GNAT
to support the brackets notation natively, because it means that we can
easily deal with wide characters without needing special fonts and
fiddling. This is useful in tests, and in the runtime library:

package Ada.Numerics is
    pragma Pure;

    Argument_Error : exception;

    Pi : constant :=
           3.14159_26535_89793_23846_26433_83279_50288_41971_69399_37511;

    ["03C0"] : constant := Pi;
    --  This is the greek letter Pi (for Ada 2005 AI-388). Note that it is
    --  conforming to have this present even in Ada 95 mode, because there is
    --  no way for a normal mode Ada 95 program to reference this identifier.

    e : constant :=
          2.71828_18284_59045_23536_02874_71352_66249_77572_47093_69996;

end Ada.Numerics;

We really do NOT want to go to wide character encodings in the
standard run-time library :-)

****************************************************************

From: Randy Brukardt
Date: Wednesday, December  7, 2005  12:59 PM

> > to mention the difficulty of agreeing on a representation of Unicode
> > characters.  We only support UTF-8, so an IA that would recommend anything
> > else would not be acceptable to me.
>
> I actually think that all compilers handle "tabs, line terminators and
> everything" pretty consistently, and don't think it would be a catfight.

I agree. I would be stunned if there were Ada compilers that couldn't
natively compile the ACATS tests, which are formatted in MS-DOS ASCII format
(that is, lines ended by both CR and LF), with each control character mapped
as implied in the RM. Remember that the ACATS even has exhaustive
tests checking that you can't put control characters in string and character
literals. I don't remember any real disputes on these tests. (We used to
split them because the <CTRL>-Z ended the file, but that's all I know of.)

So, as a practical matter, Ada does have a de facto, defined source
representation (excluding wide character issues). It's annoying that the
standard doesn't recognize the truth of this.

> As for wide characters, I think it would be fine to mandate that compilers
> (or compilation systems, preprocessors etc are fine) handle UTF-8 and
> brackets notation period. I think you want to include brackets notation,
> since this is a historically portable form used by the ACATS tests, but
> certainly changing the ACATS tests to UTF-8 is discussable and should be
> discussed.

My preference is to use UTF-8 representations for all wide and wide wide
character tests. But that certainly is open for discussion.

And if we indeed do that, I would think that we again would have a de-facto
default source representation. And again it would be annoying that the
standard doesn't recognize that.

****************************************************************

From: Robert Dewar
Date: Wednesday, December  7, 2005  2:56 PM

> My preference is to use UTF-8 representations for all wide and wide wide
> character tests. But that certainly is open for discussion.

So let me argue against this. There are three problems with this proposal:

1. Very often you will be reading ACATS source texts in environments
where UTF-8 sequences display as gobbledygook (this is becoming a technical
term for wide stuff not properly displayed :-) That's distinctly unhelpful.

2. Even if you are in an environment where wide characters are displayed,
they may not be displayed using the approved ISO 10646 graphics.

3. Even if you are in an environment where wide characters are displayed
using the approved ISO 10646 graphics, you may be unfamiliar with e.g.
TAMIL SIGN VISARGA, so you will still have difficulty understanding the
test, whereas ["0B83"] is unambiguously interpretable in all three cases.

****************************************************************

From: Pascal Leroy
Date: Thursday, December  8, 2005  1:46 AM

Robert opined:

> 1. Very often you will be reading ACATS source texts in
> environments where UTF-8 sequences display as gobbledygook
> (this is becoming a technical term for wide stuff not
> properly displayed :-) That's distinctly unhelpful.
>
> 2. Even if you are in an environment where wide characters
> are displayed, they may not be displayed using the approved
> ISO 10646 graphics.
>
> 3. Even if you are in an environment where wide characters
> are displayed using the approved ISO 10646 graphics, you may
> be unfamiliar with e.g. TAMIL SIGN VISARGA, so you will still
> have difficulty understanding the test, whereas ["0B83"] is
> unambiguously interpretable in all three cases.

I cannot get excited about (3).  I don't believe that ["0B83"] is giving
you any useful information regarding the character.  I for one would
certainly have to look at the Unicode properties file to find out if it's
a letter, a number, etc.  So hopefully the ACATS tests for wide-wide
characters will include comments like "TAMIL SIGN VISARGA is a
letter_other, and this test checks that it can appear anywhere in an
identifier".

I agree that (1) and (2) are annoyances.  On the other hand, as Randy
points out, the ACATS gives us an opportunity to establish a de facto
source representation, and that's A Good Thing for the community.  As we
all know, the ACATS is a powerful way to establish commonality beyond the
scope of the RM, if only because filing a petition is such a pain.

We have to keep in mind that the ACATS is only read by compiler writers
and validation people, and they are used to reading various forms of
gobbledygook anyway.  So yeah, it's going to make their life a bit more
complicated, but shrug.

All in all, I guess I am siding with Randy here.

****************************************************************

