Version 1.1 of ai05s/ai05-0091-1.txt

!standard 2.3(3.1/2)          08-02-26 AI05-0091-1/01
!class binding interpretation 08-02-26
!status work item 08-02-26
!status received 07-10-25
!priority Low
!difficulty Medium
!qualifier Error
!subject An other_format character should not be allowed in an identifier
!summary
An other_format character should not be allowed inside of an identifier.
!question
Recent versions of the Unicode technical report on identifiers (TR31: http://www.unicode.org/reports/tr31/) say that characters in class other_format should not be allowed in programming language identifiers for security reasons. (This changed since the Amendment started standardization.)
The problem is basically that these characters can be used to write an identifier that looks like Foo_Bar but is actually Fo<Right-To-Left>aB_o<Left-To-Right>r. Stripping other_format results in FoaB_or, which is what the compiler sees, and lo and behold, I have introduced a vulnerability in a source file that looks perfectly kosher on your screen.
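As a rough illustration only (a Python sketch, not part of any Ada rule), the following mimics the stripping of other_format characters (Unicode general category Cf) described above. The helper name strip_other_format is hypothetical, and the use of U+202E and U+202D for the <Right-To-Left> and <Left-To-Right> marks is an assumption; the question does not say which directional characters are meant.

```python
import unicodedata

# Assumed stand-ins for the directional marks in the example above.
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE, general category Cf
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE, general category Cf


def strip_other_format(identifier: str) -> str:
    """Drop every character whose general category is Cf (other_format)."""
    return "".join(ch for ch in identifier if unicodedata.category(ch) != "Cf")


# The identifier from the example; a bidi-aware display may render it
# so that it looks like Foo_Bar.
spoofed = "Fo" + RLO + "aB_o" + LRO + "r"

# What remains once the other_format characters are eliminated.
seen_by_compiler = strip_other_format(spoofed)
```

Under these assumptions, seen_by_compiler is the string FoaB_or, which is a different identifier from the Foo_Bar that the reader believes is on the screen.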
Should the definition of Ada be changed? (Yes.)
!recommendation
(See Summary.)
!wording
Remove other_format from 2.3(3.1/2).
Change 2.3(4/2) as follows:
[After eliminating the characters in category other_format, an]{An} identifier shall not contain two consecutive characters in category punctuation_connector, or end with a character in that category.
Replace 2.3(5-5.3/2) with:
Two identifiers are considered the same if they consist of the same sequence of characters after converting the characters to upper case.
After converting to upper case, an identifier shall not be identical to a reserved word (in upper case).
Change the AARM Note 2.3(5.b/2) as follows:
We match the reserved words after [doing these transformations]{converting to upper case} so that the rules for identifiers and reserved words are the same. [(This allows other_format characters, which usually don't display, in a reserved word without changing it to an identifier.)] Since a compiler usually will lexically process identifiers and reserved words the same way (often with the same code), this will prevent a lot of headaches.
Change 2.9(2/2) as follows:
The following are the reserved words. Within a program, some or all of the letters of a reserved word may be in upper case[, and one or more characters in category other_format may be inserted within or at the end of the reserved word].
Change AARM note 6.1(10.a/2) as follows:
The "sequence of characters" of the string literal of the operator is a technical term (see 2.6), and does not include the surrounding quote characters. As defined in 2.2, lexical elements are "formed" from a sequence of characters. Spaces are not allowed, and upper and lower case is not significant. [See 2.2 and 2.9 for rules related to the use of other_format characters in delimiters and reserved words.]
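A non-normative sketch of how the revised 2.3 rules might be checked follows, written in Python for illustration. The function names are hypothetical. The sketch covers only the rules touched by this wording (no other_format characters, no doubled or trailing punctuation_connector, comparison after conversion to upper case); it does not check the rest of the identifier syntax or the reserved-word rule. It also assumes that Python's str.upper is an acceptable stand-in for "converting the characters to upper case", which may not hold for every character.

```python
import unicodedata


def violates_revised_rules(identifier: str) -> bool:
    """Return True if the identifier breaks one of the rules sketched here."""
    cats = [unicodedata.category(ch) for ch in identifier]
    # other_format (Cf) is no longer part of the identifier syntax.
    if "Cf" in cats:
        return True
    # An identifier shall not end with a punctuation_connector (Pc).
    if cats and cats[-1] == "Pc":
        return True
    # Nor contain two consecutive punctuation_connector characters.
    return any(a == "Pc" and b == "Pc" for a, b in zip(cats, cats[1:]))


def same_identifier(left: str, right: str) -> bool:
    """Compare two identifiers after converting both to upper case."""
    # Assumption: str.upper approximates the upper-case conversion intended.
    return left.upper() == right.upper()
```

With these definitions, Foo_Bar passes, while Foo__Bar, Foo_ and any identifier containing a character such as U+200D (ZERO WIDTH JOINER) are rejected. Because no characters are eliminated before comparison, two identifiers are the same only if they differ in nothing but letter case.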
!discussion
It is odd that other_format characters were ever thought to make sense in identifiers for Ada. By removing them now (rather than waiting for the next major revision), we reduce the possibility that anyone would write code that would depend on that functionality. We also simplify the language presentation.
The alternative view (which is to do nothing now) is based on the fact that Ada 2005 is based on ISO/IEC 10646:2003, which corresponds to Unicode 4.0, and we followed the recommendations of that version of Unicode. Moreover, changing identifier syntax introduces an incompatibility (probably fairly slight). One presumes a future version of Ada will update the Unicode reference, and that will have the effect of changing the characters allowed in identifiers slightly even without a change to the identifier syntax. (TR31 also has changed the character classifications that are allowed in identifiers.) That could also be mildly incompatible, and some think that would be a better time to make the change for other_format characters.
It should be noted that the TR31 pronouncement is not absolute; they suggest that perhaps some other_format characters should be allowed "in special cases" (see clause 2.2 in TR31). Essentially, they don't appear to be very certain what the right answer is. Perhaps they'll change their mind again in the future, which is a hazard no matter what choice we make now. OTOH, they have added "stability extensions" to the allowed characters, in order that identifiers will not become illegal in the future if characters are reclassified. Ada would need to define Unicode properties in some way, and then define the Other_Id_Start and Other_Id_Continue properties, and add them to the syntax of identifiers. (Note that these aren't categories!) For Unicode 5.0, these properties contain 4 and 9 characters, respectively (see clause 2.4 in TR31). The idea is that these properties would contain any characters that previously were allowed in identifiers but whose categories have changed such that they would no longer be allowed. This preserves compatibility.
--!corrigendum 2.3(3.1/2)
!ACATS Test
An ACATS C-Test could be tried.
!appendix

****************************************************************

