!standard 3.5.2(2/2)                                09-10-30  AI05-0181-1/01
!standard A.1(35/2)
!class binding interpretation 09-10-29
!status work item 09-10-29
!status received 09-10-22
!priority Low
!difficulty Easy
!qualifier Error
!subject Soft hyphen is a non-graphic character

!summary

Soft hyphen is a non-graphic character.

!question

In 3.5.2(2/2), the RM refers to the characters in positions 0000-001F and
007F-009F as nongraphic characters. This means that the soft hyphen, in
position 00AD, is a graphic character.

However, 2.1(5) and (14) define what is a graphic (and thus nongraphic)
character, by referring to the General Category defined in a document
referred to by ISO/IEC 10646:2003. The general category for soft hyphen is
listed as Cf which is an abbreviation for Other, Format. This is not a
graphic character.

Thus these two definitions conflict. Which is right?

!response

The category defined by ISO/IEC 10646:2003 is always intended to be used. The
wording in 3.5.2(2/2) should have been corrected but was not.

!wording

Modify 3.5.2(2/2):

... Each of the nongraphic positions of Row 00 [(0000-001F and 007F-009F)]
has a corresponding language-defined name, ...

Modify A.1(35/2):

    '¨', '©', 'ª', '«', '¬', ['­']{*soft_hyphen*}, '®', '¯',  --168 (16#A8#) .. 175 (16#AF#)

!discussion

Note that non-graphic characters have names in A.1(35/2). Thus, Soft hyphen
needs a name in this table. We selected the name of the Unicode character
("Soft_Hyphen") for this name; we did not find any abbreviated Unicode names.
Someone found the name "shy" in some unofficial chart of character names; one
could consider using that instead if there is a desire to stick to unreadable
3 character names (that would be more consistent with the rest of the
definition).

The text defining the nongraphic characters is deleted from 3.5.2(2/2);
otherwise it just provides a future maintenance hazard (although we sincerely
hope that no further characters change category).

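To illustrate the conflict described in the !question, here is a minimal
sketch, not proposed wording and not taken from any implementation. The
types Code_Range and Range_List and the function Listed are hypothetical
names introduced only for this example. The second table mirrors the three
entries of UTF_32_Non_Graphic quoted in the !appendix, restricted to Row 00.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Row_00_Nongraphic is

   --  Hypothetical helper types for this sketch (not from the RM).
   type Code_Range is record
      Lo, Hi : Natural;
   end record;

   type Range_List is array (Positive range <>) of Code_Range;

   --  The positions that 3.5.2(2/2) enumerated explicitly.
   Old_Wording : constant Range_List :=
     ((16#00#, 16#1F#), (16#7F#, 16#9F#));

   --  The same positions plus soft hyphen, as in the table excerpt
   --  quoted from AI95-395 in the appendix.
   By_Category : constant Range_List :=
     ((16#00#, 16#1F#), (16#7F#, 16#9F#), (16#AD#, 16#AD#));

   Soft_Hyphen_Pos : constant := 16#AD#;

   --  True if Pos falls within any of the given ranges.
   function Listed (Pos : Natural; Ranges : Range_List) return Boolean is
   begin
      for I in Ranges'Range loop
         if Pos in Ranges (I).Lo .. Ranges (I).Hi then
            return True;
         end if;
      end loop;
      return False;
   end Listed;

begin
   Put_Line ("Old wording lists 16#AD#:    "
             & Boolean'Image (Listed (Soft_Hyphen_Pos, Old_Wording)));
   Put_Line ("Category table lists 16#AD#: "
             & Boolean'Image (Listed (Soft_Hyphen_Pos, By_Category)));
end Row_00_Nongraphic;
```

The two lookups disagree only for position 16#AD#, which is the
inconsistency this AI removes by deleting the explicit list.
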
This should be documented as an Inconsistency from Ada 95 in the AARM: the
result of Character'Image(Character'Val(16#00AD#)) should be "soft_hyphen",
not '-'.

!ACATS Test

Consider an ACATS C-Test to check that the name Soft_Hyphen is recognized by
Character'Value.

!appendix

!topic Is soft hyphen graphic or not?
!reference 3.5.2(2/2), A.1(35/2)
!from Adam Beneschan 09-10-22
!discussion

In 3.5.2(2/2), the RM refers to the characters in positions 0000-001F
and 007F-009F as nongraphic characters. This would mean that the soft
hyphen, in position 00AD, is a graphic character. However, in
AI95-395, in the table UTF_32_Non_Graphic, it lists 00AD as a
nongraphic character:

   UTF_32_Non_Graphic : constant UTF_32_Ranges := (
     (16#00000#, 16#0001F#),  -- ..
     (16#0007F#, 16#0009F#),  -- ..
     (16#000AD#, 16#000AD#),  -- SOFT HYPHEN .. SOFT HYPHEN
     ...

RM 2.1(5) and (14) define what is a graphic (and thus nongraphic)
character, by referring to the General Category defined in a document
referred to by ISO/IEC 10646:2003. I don't have a document that
defines what all the categories are for each character, and offhand it
looks like I can't get it without shelling out a few bucks, but I've
been assuming that whoever wrote the UTF_32_Non_Graphic and other
tables in AI95-395 did have access to it. Thus, unless the person who
wrote the UTF_32_Non_Graphic table made an error, it looks like:

(1) 3.5.2(2/2) needs to be updated to include 00AD in the list of
    nongraphic characters, and

(2) since 3.5.2(2/2) says that the language-defined names for
    nongraphic characters in the 00-FF range are set in italics in
    A.1, then A.1 would also need to be modified to assign a
    language-defined name to this character. (I did see a Unicode
    table that showed it as SHY.)

(By the way, when I look at A.1 in my browser, the character in this
position appears as '' , i.e. as two quote marks but with nothing in
between. It looks like '-' in my printed Ada 95 manual.)

So who's wrong? The RM, or the table in AI-395?

****************************************************************

From: Randy Brukardt
Sent: Thursday, October 22, 2009  9:01 PM

...
> RM 2.1(5) and (14) define what is a graphic (and thus
> nongraphic) character, by referring to the General Category defined in
> a document referred to by ISO/IEC 10646:2003. I don't have a document
> that defines what all the categories are for each character, and
> offhand it looks like I can't get it without shelling out a few bucks,
> but I've been assuming that whoever wrote the UTF_32_Non_Graphic and
> other tables in AI95-395 did have access to it.

10646:2003 is essentially the same as Unicode 4.0, and you can find out the
categories there by following the link that is in the AARM (AARM
2.1(14.h/2)).

What is found there is:

"00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;;"

which means the category is "Cf", which is short for "Other, Format". (Don't
ask me to explain how "Other, Format" got abbreviated "Cf".)

> Thus, unless the person who
> wrote the UTF_32_Non_Graphic table made an error, it looks like:
>
> (1) 3.5.2(2/2) needs to be updated to include 00AD in the list of
>     nongraphic characters, and
>
> (2) since 3.5.2(2/2) says that the language-defined names for
>     nongraphic characters in the 00-FF range are set in italics in A.1,
>     then A.1 would also need to be modified to assign a language-defined
>     name to this character. (I did see a Unicode table that showed it as
>     SHY.)
>
> (By the way, when I look at A.1 in my browser, the character in this
> position appears as '' , i.e. as two quote marks but with nothing in
> between. It looks like '-' in my printed Ada 95 manual.)

That sounds like it is working as a "soft hyphen" in the browser; it is not
supposed to be displayed unless it is at the end of a line.

> So who's wrong? The RM, or the table in AI-395?

I recall quite a bit of discussion about the soft hyphen character. In
particular, is it allowed in identifiers or between tokens?
In the end, we got it exactly wrong, and AI05-0091-1 changes identifiers to
not allow them, and AI05-0079-1 changes whitespace to allow them (the latter
was a clear oversight - we didn't change the wording to match the
recommendation; the former was caused by a change in the Unicode
recommendation for identifiers).

Anyway, it is clear that we *knew* that soft hyphen was changing categories,
so not changing 3.5.2(2/2) and A.1(35/2) is purely an oversight.

****************************************************************

From: Adam Beneschan
Sent: Friday, October 23, 2009  10:22 AM

> What is found there is:
>
> "00AD;SOFT HYPHEN;Cf;0;ON;;;;;N;;;;;"
>
> which means the category is "Cf", which is short for "Other, Format".
> (Don't ask me to explain how "Other, Format" got abbreviated "Cf".)

Ah, that's the link that I was looking for and couldn't find. Thanks.

> Anyway, it is clear that we *knew* that soft hyphen was changing
> categories, so not changing 3.5.2(2/2) and A.1(35/2) is purely an oversight.

In that case, it will need a language-defined name. Any idea what it
will probably be? "shy"? "soft_hyphen"? Something else? I just
need to know what to make 'Image return. (I'll assume "shy" if I
don't hear otherwise.)

****************************************************************

From: Randy Brukardt
Sent: Friday, October 23, 2009  9:16 PM

> In that case, it will need a language-defined name. Any idea what it
> will probably be? "shy"? "soft_hyphen"? Something else? I just
> need to know what to make 'Image return. (I'll assume "shy" if I
> don't hear otherwise.)

Based on the Unicode name, I would think "soft_hyphen" (I didn't see "shy"
used anywhere). But obviously this will take discussion, so be prepared to
change it again.

****************************************************************

From: Bob Duff
Sent: Saturday, October 24, 2009  11:31 AM

> Based on the Unicode name, I would think "soft_hyphen" (I didn't see "shy"
> used anywhere).
> But obviously this will take discussion, so be
> prepared to change it again.

Oh, boy! I'm eagerly looking forward to flying thousands of miles to
discuss the name of some obscure Unicode character. (Not!)

;-) ;-)

****************************************************************

From: Georg Bauhaus
Sent: Monday, October 26, 2009  6:31 AM

(I see the smileys. More often than I like to remember, though,
I looked like >:( >:( when once again sloppiness with characters
in software libraries corrupted our software on integration---which
might help explain the attitude of this message.)

There is something to be said against using the
"commonly abbreviated as" names given in NamesList.txt, like SHY.
Some examples to illustrate the points.

(1) Irregularity

034F, COMBINING GRAPHEME JOINER
   * commonly abbreviated as CGJ

appears between other similarly named "COMBINING ...", none of those
is "commonly abbreviated". Switching to an abbreviation for this
single one introduces an irregularity.

(2) Ambiguity

Even when there are neighboring names many of which are "commonly
abbreviated", so that there would be no irregularity other than
departure from the "standard" names, I still see no point in trying
to be terse, since the effects are not always pleasant,

200E LEFT-TO-RIGHT MARK
   * commonly abbreviated LRM

Seeing "LRM" (LEFT-TO-RIGHT MARK) in a compiler diagnostic
message or exception information _not_ referring to Ada's LRM
is, I guess, confusing at best...

(3) Arbitrary Choice

In other regions of NamesList.txt there are names, similar to
each other, of which some are "commonly abbreviated", some not.
Picking abbreviations for some would render the choice rather
arbitrary. For example, what are the rules for having an
abbreviation of the "... SPACE" names around 16#200A#?
If and only if they start with "ZERO" ? It seems so...

Thus abbreviating character names earns irregularity, ambiguity,
and arbitrariness all for saving a few keystrokes...

Seen from the viewpoint of the programmer, operator, or someone
trying to glue pieces of software together, I'd hate to have to
remember another set of rules about which characters need extra
attention when it comes to their respective set of possible names.

So please, please, don't be SHY, use some redundancy to make the name
more obviously correspond to "SOFT HYPHEN".

****************************************************************

From: Adam Beneschan
Sent: Monday, October 26, 2009  12:43 PM

> Based on the Unicode name, I would think "soft_hyphen" (I didn't see "shy"
> used anywhere). But obviously this will take discussion, so be
> prepared to change it again.

FYI, I got the SHY from one of the charts, e.g.
http://www.unicode.org/charts/PDF/U0080.pdf . I thought it would fit
in, since all the other "language-defined" names in Appendix A are two
or three letters/digits, except for the ones named "reserved_nnn".
But if the other control-character abbreviations are "official"
abbreviations according to some standard ISO/IEC/ANSI/something
document, and SHY doesn't have the same sort of official status, then
I can understand why we would...ummmm...shy away from using that
abbreviation.

(Oh, come on, you all knew *somebody* was going to make a bad pun out
of this at some point.)

****************************************************************
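
The effect on Character'Image and Character'Value discussed in this thread
can be probed with a small program. This is only a sketch under stated
assumptions: the procedure name and the local constant are made up for the
example, the name "Soft_Hyphen" is the one proposed in this AI (the thread
notes it could still change), and what is printed depends on the compiler
and the language version it implements.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Soft_Hyphen is

   --  Local constant for position 16#AD# (name chosen for this sketch).
   Soft_Hyphen : constant Character := Character'Val (16#AD#);

begin
   --  If position 16#AD# is treated as nongraphic, 'Image yields a
   --  language-defined name; otherwise it yields the character itself
   --  between single quotes. No particular result is assumed here.
   Put_Line ("Image: " & Character'Image (Soft_Hyphen));

   --  Assumption: the language-defined name is Soft_Hyphen, as proposed
   --  in this AI. 'Value raises Constraint_Error if the string is not
   --  recognized, so that case is handled locally.
   begin
      if Character'Value ("Soft_Hyphen") = Soft_Hyphen then
         Put_Line ("Soft_Hyphen is recognized by Character'Value");
      else
         Put_Line ("Soft_Hyphen maps to a different character");
      end if;
   exception
      when Constraint_Error =>
         Put_Line ("Soft_Hyphen is not recognized by Character'Value");
   end;
end Show_Soft_Hyphen;
```

A check along these lines is roughly what the suggested ACATS C-Test would
need to perform, although an actual test would report failure instead of
printing a message.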