Version 1.1 of ai05s/ai05-0181-1.txt

!standard 3.5.2(2/2)          09-10-30 AI05-0181-1/01
!standard A.1(35/2)
!class binding interpretation 09-10-29
!status work item 09-10-29
!status received 09-10-22
!priority Low
!difficulty Easy
!qualifier Error
!subject Soft hyphen is a non-graphic character
!summary
Soft hyphen is a non-graphic character.
!question
In 3.5.2(2/2), the RM refers to the characters in positions 0000-001F and 007F-009F as nongraphic characters. This means that the soft hyphen, in position 00AD, is a graphic character.
However, 2.1(5) and (14) define what is a graphic (and thus nongraphic) character, by referring to the General Category defined in a document referred to by ISO/IEC 10646:2003. The General Category for soft hyphen is listed as Cf, which is an abbreviation for "Other, Format". Characters in this category are not graphic characters.
Thus these two definitions conflict. Which is right?
!response
The category defined by ISO/IEC 10646:2003 is always intended to be used. The wording in 3.5.2(2/2) should have been corrected but was not.
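The category can be checked mechanically. The following is a minimal sketch in Python, used only because its standard unicodedata module exposes the General Category. It assumes that the module's Unicode version (newer than the one referenced by ISO/IEC 10646:2003) still classifies position 00AD as Cf. The helper name row00_positions_with_category is hypothetical.

```python
import unicodedata

SOFT_HYPHEN = 0x00AD


def row00_positions_with_category(category):
    """Return the Row 00 positions (0000-00FF) with the given General Category."""
    return [cp for cp in range(0x100)
            if unicodedata.category(chr(cp)) == category]


# Name and General Category of position 00AD.
print(unicodedata.name(chr(SOFT_HYPHEN)))      # SOFT HYPHEN
print(unicodedata.category(chr(SOFT_HYPHEN)))  # Cf

# Row 00 positions in category Cf ("Other, Format").
print([f"{cp:04X}" for cp in row00_positions_with_category("Cf")])

# Row 00 positions in category Cc ("Other, Control"); these are the ranges
# 0000-001F and 007F-009F named in the old wording of 3.5.2(2/2).
print([f"{cp:04X}" for cp in row00_positions_with_category("Cc")])
```

Under that assumption, soft hyphen is the only Row 00 position in category Cf, which is why a fixed list of position ranges in 3.5.2(2/2) disagrees with the category-based definition.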
!wording
Modify 3.5.2(2/2):
... Each of the nongraphic positions of Row 00 [(0000-001F and 007F-009F)] has a corresponding language-defined name, ...
Modify A.1(35/2):
'¨', '©', 'ª', '«', '¬', ['­']{soft_hyphen}, '®', '¯', --168 (16#A8#) .. 175 (16#AF#)
!discussion
Note that non-graphic characters have names in A.1(35/2). Thus, soft hyphen needs a name in this table. We selected the name of the Unicode character ("Soft_Hyphen") for this name; we did not find any abbreviated Unicode names. Someone found the name "shy" in an unofficial chart of character names; that could be used instead if there is a desire to stick to unreadable three-character names (which would be more consistent with the rest of the definition).
The text defining the nongraphic characters is deleted from 3.5.2(2/2); otherwise it just provides a future maintenance hazard (although we sincerely hope that no further characters change category).
This should be documented as an Inconsistency from Ada 95 in the AARM: the result of Character'Image(Character'Val(16#00AD#)) should be "soft_hyphen", not '-'.
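The inconsistency can be illustrated with a small model. This is a sketch in Python, not a statement of Ada semantics: the function image and the partial NAMES table are hypothetical stand-ins for Character'Image and the A.1 name table, and the spelling "soft_hyphen" simply follows the text above.

```python
# Nongraphic positions of Row 00 as listed in the old wording of 3.5.2(2/2).
OLD_NONGRAPHIC = [(0x00, 0x1F), (0x7F, 0x9F)]

# The same list with soft hyphen (00AD) treated as nongraphic.
NEW_NONGRAPHIC = OLD_NONGRAPHIC + [(0xAD, 0xAD)]

# Partial, illustrative name table; only a few entries are modeled here.
NAMES = {0x00: "nul", 0x7F: "del", 0xAD: "soft_hyphen"}


def is_nongraphic(pos, ranges):
    """True if pos falls inside any (low, high) range, bounds included."""
    return any(low <= pos <= high for low, high in ranges)


def image(pos, ranges):
    """Model of an image: a name for nongraphic positions, else a quoted literal."""
    if is_nongraphic(pos, ranges):
        # Positions missing from the partial table get a visible placeholder.
        return NAMES.get(pos, f"<name of 16#{pos:02X}#>")
    return "'" + chr(pos) + "'"


# Old classification: a quoted character literal containing the soft hyphen.
print(repr(image(0xAD, OLD_NONGRAPHIC)))

# New classification: the language-defined name.
print(image(0xAD, NEW_NONGRAPHIC))  # soft_hyphen
```

A program that compares or parses the image of position 16#AD# would see the change, which is why the change is worth recording as an inconsistency.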
!ACATS Test
Consider an ACATS C-Test to check that the name Soft_Hyphen is recognized by Character'Value.
!appendix

!topic Is soft hyphen graphic or not?
!reference 3.5.2(2/2), A.1(35/2)
!from Adam Beneschan 09-10-22
!discussion

In 3.5.2(2/2), the RM refers to the characters in positions 0000-001F and
007F-009F as nongraphic characters.  This would mean that the soft hyphen, in
position 00AD, is a graphic character.

However, in AI95-395, in the table UTF_32_Non_Graphic, it lists 00AD as a
nongraphic character:

    UTF_32_Non_Graphic : constant UTF_32_Ranges := (
      (16#00000#, 16#0001F#),  -- <control> .. <control>
      (16#0007F#, 16#0009F#),  -- <control> .. <control>
      (16#000AD#, 16#000AD#),  -- SOFT HYPHEN .. SOFT HYPHEN
      ...

RM 2.1(5) and (14) define what is a graphic (and thus nongraphic) character, by
referring to the General Category defined in a document referred to by ISO/IEC
10646:2003.  I don't have a document that defines what all the categories are
for each character, and offhand it looks like I can't get it without shelling
out a few bucks, but I've been assuming that whoever wrote the
UTF_32_Non_Graphic and other tables in AI95-395 did have access to it.  Thus,
unless the person who wrote the UTF_32_Non_Graphic table made an error, it looks
like:

(1) 3.5.2(2/2) needs to be updated to include 00AD in the list of nongraphic
    characters, and

(2) since 3.5.2(2/2) says that the language-defined names for nongraphic
    characters in the 00-FF range are set in italics in A.1, then A.1 would also
    need to be modified to assign a language-defined name to this character.  (I
    did see a Unicode table that showed it as SHY.)

(By the way, when I look at A.1 in my browser, the character in this position
appears as '' , i.e. as two quote marks but with nothing in between.  It looks
like '-' in my printed Ada 95 manual.)

So who's wrong?  The RM, or the table in AI-395?

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 22, 2009  9:01 PM

...
> RM 2.1(5) and (14) define what is a graphic (and thus
> nongraphic) character, by referring to the General Category defined in
> a document referred to by ISO/IEC 10646:2003.  I don't have a document
> that defines what all the categories are for each character, and
> offhand it looks like I can't get it without shelling out a few bucks,
> but I've been assuming that whoever wrote the UTF_32_Non_Graphic and
> other tables in
> AI95-395 did have access to it.

10646:2003 is essentially the same as Unicode 4.0, and you can find out the
categories there by following the link that is in the AARM (AARM 2.1(14.h/2)).

What is found there is:

"00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;;"

which means the category is "Cf", which is short for "Other, Format". (Don't ask
me to explain how "Other, Format" got abbreviated "Cf".)

> Thus, unless the person who
> wrote the UTF_32_Non_Graphic table made an error, it looks like:
>
> (1) 3.5.2(2/2) needs to be updated to include 00AD in the list of
> nongraphic characters, and
>
> (2) since 3.5.2(2/2) says that the language-defined names for
> nongraphic characters in the 00-FF range are set in italics in A.1,
> then A.1 would also need to be modified to assign a language-defined
> name to this character.  (I did see a Unicode table that showed it as
> SHY.)
>
> (By the way, when I look at A.1 in my browser, the character in this
> position appears as '' , i.e. as two quote marks but with nothing in
> between.  It looks like '-' in my printed Ada
> 95 manual.)

That sounds like it is working as a "soft hyphen" in the browser; it is not
supposed to be displayed unless it is at the end of a line.

> So who's wrong?  The RM, or the table in AI-395?

I recall quite a bit of discussion about the soft hyphen character. In
particular, is it allowed in identifiers or between tokens? In the end, we got
it exactly wrong, and AI05-0091-1 changes identifiers to not allow them, and
AI05-0079-1 changes whitespace to allow them (the latter was a clear oversight -
we didn't change the wording to match the recommendation; the former was caused
by a change in the Unicode recommendation for identifiers).

Anyway, it is clear that we *knew* that soft hyphen was changing categories, so
not changing 3.5.2(2/2) and A.1(35/2) is purely an oversight.

****************************************************************

From: Adam Beneschan
Sent: Friday, October 23, 2009  10:22 AM

> What is found there is:
>
> "00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;;"
>
> which means the category is "Cf", which is short for "Other, Format".
> (Don't ask me to explain how "Other, Format" got abbreviated "Cf".)

Ah, that's the link I was looking for and couldn't find.  Thanks.


> Anyway, it is clear that we *knew* that soft hyphen was changing
> categories, so not changing 3.5.2(2/2) and A.1(35/2) is purely an oversight.

In that case, it will need a language-defined name.  Any idea what it will
probably be?  "shy"?  "soft_hyphen"?  Something else?  I just need to know what
to make 'Image return.  (I'll assume "shy" if I don't hear otherwise.)


****************************************************************

From: Randy Brukardt
Sent: Friday, October 23, 2009  9:16 PM

> In that case, it will need a language-defined name.  Any idea what it
> will probably be?  "shy"?  "soft_hyphen"?  Something else?  I just
> need to know what to make 'Image return.  (I'll assume "shy" if I
> don't hear otherwise.)

Based on the Unicode name, I would think "soft_hyphen" (I didn't see "shy" used
anywhere). But obviously this will take discussion, so be prepared to change it
again.

****************************************************************

From: Bob Duff
Sent: Saturday, October 24, 2009  11:31 AM

> Based on the Unicode name, I would think "soft_hyphen" (I didn't see "shy"
> used anywhere). But obviously this will take discussion, so be
> prepared to change it again.

Oh, boy!  I'm eagerly looking forward to flying thousands of miles to discuss
the name of some obscure Unicode character. (Not!)

;-) ;-)

****************************************************************

From: Georg Bauhaus
Sent: Monday, October 26, 2009  6:31 AM

(I see the smileys.  More often than I like to remember, though, I looked like
>:( >:(  when once again sloppiness with characters in software libraries
corrupted our software on integration---which might help explain the attitude of
this message.)

There is something to be said against using the "commonly abbreviated as" names
given in NamesList.txt, like SHY. Some examples to illustrate the points.

(1) Irregularity
034F, COMBINING GRAPHEME JOINER
* commonly abbreviated as CGJ
appears between other similarly named "COMBINING ...",  none of those is
"commonly abbreviated".  Switching to an abbreviation for this single one
introduces an irregularity.

(2) Ambiguity
Even when there are neighboring names many of which are "commonly abbreviated",
so that there would be no irregularity other than departure from the "standard"
names, I still see no point in trying to be terse, since the effects are not
always pleasant,
200E	LEFT-TO-RIGHT MARK
* commonly abbreviated LRM
Seeing "LRM" (LEFT-TO-RIGHT MARK) in a compiler diagnostic message or exception
information _not_ referring to Ada's LRM is, I guess, confusing at best...


(3) Arbitrary Choice

In other regions of NamesList.txt there are names, similar to each other, of
which some are "commonly abbreviated", some not.  Picking abbreviations for some
would render the choice rather arbitrary.  For example, what are the rules for
having an abbreviation of the "... SPACE" names around 16#200A#?  If and only if
they start with "ZERO" ? It seems so...

Thus abbreviating character names earns irregularity, ambiguity, and
arbitrariness all for saving a few keystrokes...

Seen from the viewpoint of the programmer, operator, or someone trying to glue
pieces of software together, I'd hate to have to remember another set of rules
about which characters need extra attention when it comes to their respective
set of possible names.  So please, please, don't be SHY, use some redundance to
make the name more obviously correspond to "SOFT HYPHEN".

****************************************************************

From: Adam Beneschan
Sent: Monday, October 26, 2009  12:43 PM

> Based on the Unicode name, I would think "soft_hyphen" (I didn't see "shy"
> used anywhere). But obviously this will take discussion, so be
> prepared to change it again.

FYI, I got the SHY from one of the charts, e.g.
http://www.unicode.org/charts/PDF/U0080.pdf .  I thought it would fit in, since
all the other "language-defined" names in Appendix A are two or three
letters/digits, except for the ones named "reserved_nnn". But if the other
control-character abbreviations are "official" abbreviations according to some
standard ISO/IEC/ANSI/something document, and SHY doesn't have the same sort of
official status, then I can understand why we would...ummmm...shy away from
using that abbreviation.  (Oh, come on, you all knew *somebody* was going to
make a bad pun out of this at some point.)

****************************************************************
