CVS difference for ais/ai-00285.txt

Differences between version 1.5 and version 1.6
--- ais/ai-00285.txt	2003/01/03 00:01:36	1.5
+++ ais/ai-00285.txt	2003/01/15 00:33:52	1.6
@@ -1,136 +1,170 @@
-!standard A.3.2(49)                                    02-09-24  AI95-00285/01
+!standard A.3.2(49)                                    03-01-10  AI95-00285/02
 !class amendment 02-01-23
 !status work item 02-09-24
 !status received 02-01-15
 !priority Medium
 !difficulty Hard
-!subject Latin-9, Ada.Characters.Handling, and 32-bit characters
+!subject Support for 16-bit and 32-bit characters
 
 !summary
 
+Support is added for program text using the entire set of characters from
+ISO/IEC 10646, and for operating on characters outside of the BMP at run-time.
+
 !problem
+
+SC22 directed its working groups to provide support for the ISO/IEC 10646
+character set:
 
-Latin-9 has been introduced by ISO/IEC 8859-15:1999.  Moreover, the working
-draft of ISO/IEC 10646:2003 makes use of planes other than the BMP.
+"JTC 1/SC 22 believes that programming languages should offer the appropriate
+support for ISO/IEC 10646, and the Unicode character set where appropriate."
 
+Moreover, the working draft of ISO/IEC 10646:2003 makes use of planes other
+than the BMP.
+
 !proposal
 
-In order to support Latin-9, we allow an implementation to provide a
-package named Ada.Characters.Latin_9, but we strictly restrict its contents
-to correspond to the characters defined by ISO/IEC 8859-15:1999.
-
-If an application chooses to use both Latin-1 and Latin-9, the package
-Ada.Characters.Handling is quite problematic, as it seems to assume that
-Character always corresponds to Latin-1, and it would get the
-classifications and the conversions wrong for some Latin-9 characters.  We
-deal with this by recognizing that this package is really specific to the
-character encoding being used, and making it a child of the proper Latin_n
-package.  A library-level renaming is provided for compatibility with
-existing applications.
-
-The constant Character_Set in Ada.Strings.Wide_Maps.Wide_Constants is
-similarly problematic as it is defined by reference to
-Ada.Characters.Handling.  We define both the Latin-1 and the Latin-9 sets
-in Wide_Constants, and provide a renaming for compatibility.  <<We should
-probably give permission to add more constants if an implementation wants
-to support other character encodings.>>
-
-ISO/IEC 8859-15 defines  as a letter and  as a ligature.  There seems to
-be no reason to make a distinction between the two, so for the purposes of
-Ada.Characters.Latin_9.Handling, we classify both as letters.
-
-Note that we are not proposing to change the lexical rules of the language,
-so it's still the case that only characters from row 00 of the BMP are
-allowed in identifiers (row 00 of the BMP is not affected by Latin-9).
-
-
-In order to support 32-bit characters, we allow an implementation to add
-new declarations to Standard.  If it does, it must provide the appropriate
-predefined units for 32-bit characters, and new attributes to convert
-discrete values to and from 32-bit strings.
-
-Again, we are not proposing to change the lexical rules of the language, so
-the character and string literals appearing in the program can only make
-use of the graphic symbols from the BMP.  The characters from other planes
-cannot be represented in literals; they must be obtained by evaluating more
-complex expressions; for instance, by evaluating Wide_Wide_Character'Val
-(16#1001B#) it is possible to access the Linear B syllable NI.
+The essence of this proposal is to allow the source of the program to be
+written using 16-bit characters (from the BMP) or 32-bit characters. Also,
+it makes it possible to operate on 32-bit characters at run-time.
+
+The main difficulty in supporting characters beyond Row 00 of the BMP in the
+program text is to define how identifiers and literals are built (which
+characters are letters, digits, etc.) and to define the lower/upper case
+equivalence rules. Fortunately, the people developing ISO/IEC 10646 have
+already done most of the work for us, so it's only a matter of defining how we
+want to piggyback on their categorization and conversion rules.
+
+For each character, ISO/IEC 10646 defines a "General Category". General
+categories are disjoint. For our purposes, the following categories are of
+interest:
+
+   - Letter, Uppercase      -- e.g., LATIN CAPITAL LETTER A
+   - Letter, Lowercase      -- e.g., LATIN SMALL LETTER A
+   - Letter, Titlecase      -- e.g., LATIN CAPITAL LETTER L WITH SMALL LETTER J
+   - Letter, Modifier       -- e.g., MODIFIER LETTER APOSTROPHE
+   - Letter, Other          -- e.g., HEBREW LETTER ALEF
+   - Mark, Non-Spacing      -- e.g., COMBINING GRAVE ACCENT
+   - Mark, Spacing Combined -- e.g., MUSICAL SYMBOL COMBINING AUGMENTATION DOT
+   - Number, Decimal Digit  -- e.g., DIGIT ZERO
+   - Number, Letter         -- e.g., ROMAN NUMERAL TWO
+   - Other, Control         -- e.g., NULL
+   - Other, Format          -- e.g., ACTIVATE ARABIC FORM SHAPING
+   - Other, Private Use     -- e.g., <Private Use, First>
+   - Other, Surrogate       -- e.g., <Non Private Use High Surrogate, First>
+   - Punctuation, Connector -- e.g., LOW LINE
+   - Separator, Space       -- e.g., SPACE
+
+(See http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt for
+details on the categorization).
+
+In paragraph 2.1 we define a non-terminal of the grammar for each of these
+categories, e.g., letter_uppercase, letter_lowercase, etc.
+
+Before determining the sequence of lexical elements in a program, all
+characters in the category other_format are filtered out. This means that
+these characters don't have any effect whatsoever on the semantics of the
+program text: they are used for presentation purposes only. (This seems to be
+the intent of ISO/IEC 10646, especially regarding the Row 14 tag characters;
+it is easier to describe the handling of these characters in terms of
+"filtering out" than to insert them in each and every production in chapter 2.)
+
+A character in the separator_space category is a separator except within a
+comment, a string_literal, or a character_literal.
+
+ISO/IEC 10646 proposes to define identifiers for programming languages as
+follows (see
+http://www.unicode.org/unicode/reports/tr15/tr15-22.html#Programming_Language_Identifiers):
+
+   identifier ::= identifier_start {identifier_start | identifier_extend}
+   identifier_start ::= letter_uppercase |
+                        letter_lowercase |
+                        letter_titlecase |
+                        letter_modifier |
+                        letter_other |
+                        number_letter
+   identifier_extend ::= mark_non_spacing |
+                         mark_spacing_combined |
+                         number_decimal_digit |
+                         punctuation_connector |
+                         other_format
+
+This definition was made with C in mind and is not exactly appropriate for
+Ada, since it would allow consecutive underlines. Because the underline
+is the only character of Row 00 of the BMP which is a punctuation_connector,
+it seems sensible to remain close to the existing syntax rules of 2.3(2-3),
+and to use the following definitions:
+
+   identifier_letter ::= letter_uppercase |
+                         letter_lowercase |
+                         letter_titlecase |
+                         letter_modifier |
+                         letter_other |
+                         number_letter
+   letter_or_digit ::= identifier_letter |
+                       mark_non_spacing |
+                       mark_spacing_combined |
+                       number_decimal_digit
+   identifier ::= identifier_letter {[punctuation_connector] letter_or_digit}
+
+ISO/IEC 10646 recommends that, for languages which have case-insensitive
+identifiers, Normalization Form KC be used before storing or comparing
+identifiers (see
+http://www.unicode.org/unicode/reports/tr15/tr15-22.html#Specification). This
+is to ensure that identifiers which look the same are considered identical,
+even if they are composed of different characters. Once normalization has been
+applied, _full_ case folding, as described in the table
+http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt, is used to
+find the uppercase version of each character. Two identifiers are equivalent
+if they result in the same string of characters after these transformations.
+
+ISO/IEC 10646 doesn't provide guidance for the composition of numeric literals,
+but it is apparent that we can use the character categories above. So we
+define:
+
+   numeral ::= number_decimal_digit {[punctuation_connector]
+               number_decimal_digit}
+
+For each character that is a number_decimal_digit, ISO/IEC 10646 defines the
+numeric value of this character in the field "Decimal digit value" of the
+character database. This is used to compute the value of a numeric literal.
+
+We don't change the syntax for extended_digit, so the extended_digits A
+through F and a through f are exclusively the characters at positions 16#41#
+to 16#46# and 16#61# to 16#66#.
+
+The definition and role of format_effectors is modified to include the
+characters at positions 16#85#, 16#2028# and 16#2029#. These characters may be
+used to terminate lines, as recommended by http://www.unicode.org/reports/tr13.
+
+We define as graphic_character any character which is not in the categories
+other_control, other_private_use or other_surrogate, and which is not at
+positions 16#FFFE# or 16#FFFF#. With this definition the syntax for
+character_literal and string_literal doesn't have to change.
+
+Note that no modification is performed for storing a character_literal or
+string_literal. In particular, Normalization Form KC is _not_ applied. (So two
+strings which look alike may not compare equal, but this is already the case.)
+
+[Author's note: There is a slight incompatibility here because currently
+characters in category other_format are preserved in string_ and
+character_literals. A character_literal containing such a character would
+become illegal, so the incompatibility is easy to catch. For string_literals
+the incompatibility would not be detected at compilation time, but then there
+are only 27 characters in the BMP which fall in this category, and using them
+in a string_literal is pretty odd anyway since they don't show up graphically.]
+
+In order to represent 32-bit characters at run-time, we add new declarations
+to Standard.  We also provide the appropriate predefined units for 32-bit
+characters, and new attributes to convert discrete values to and from 32-bit
+strings.
 
 !wording
-
-An implementation is allowed to provide a library package named
-Ada.Characters.Latin_9.  This package shall be identical to
-Ada.Characters.Latin_1, except for the following differences:
-
-- It doesn't declare the constants Currency_Sign, Broken_Bar, Diaeresis,
-Acute, Cedilla, Fraction_One_Quarter, Fraction_One_Half, and
-Fraction_Three_Quarter.
-
-- It declares the following constants:
-
-    Euro_Sign : constant Character := '€'; -- Character'Val (164)
-    UC_S_Caron : constant Character := 'Š'; -- Character'Val (166)
-    LC_S_Caron : constant Character := 'š'; -- Character'Val (168)
-    UC_Z_Caron : constant Character := 'Ž'; -- Character'Val (180)
-    LC_Z_Caron : constant Character := 'ž'; -- Character'Val (184)
-    UC_OE_Diphthong : constant Character := 'Œ'; -- Character'Val (188)
-    LC_OE_Diphthong : constant Character := 'œ'; -- Character'Val (189)
-    UC_Y_Diaeresis : constant Character := 'Ÿ'; -- Character'Val (190)
-
-In addition, an implementation which provides Ada.Characters.Latin_9 must
-provide two library packages named Ada.Characters.Latin_1.Handling and
-Ada.Characters.Latin_9.Handling, respectively.
-Ada.Characters.Latin_1.Handling must have the same contents and semantics
-as the package Ada.Characters.Handling defined in section A.3.2 of ISO/IEC
-8652:1995 with COR.1:2000 (except of course for the library unit name).
-For compatibility with existing applications, the following library-level
-renaming must also be provided:
-
-    package Ada.Characters.Handling renames
-       Ada.Characters.Latin_1.Handling;
-
-The package Ada.Characters.Latin_9.Handling has the same specification as
-Ada.Characters.Latin_1.Handling, but the following semantic differences:
-
-- The function Is_Letter returns True for the characters at positions 166,
-168, 180, 184, 188, 189 and 190 (in addition to those for which it is defined
-to return True in A.3.2(24)).
-
-- The function Is_Lower returns True for the characters at positions 168,
-184 and 189 (in addition to those for which it is defined to return True in
-A.3.2(25)).
-
-- The function Is_Upper returns True for the characters at positions 166,
-180, 188 and 190 (in addition to those for which it is defined to return
-True in A.3.2(26)).
-
-- The function Is_Basic returns True for the characters at positions 188 and
-189 (in addition to those for which it is defined to return True in A.3.2
-(27)).
-
-- The upper-case form of 'ÿ' is 'Ÿ' for the purposes of function To_Upper.
-
-- The function Is_Character returns True if the Wide_Character Item has a
-name in ISO/IEC 10646 which is the name of some Character in
-ISO/IEC 8859-15.
-
-- The function To_Character returns the Character which has the same name
-in ISO/IEC 8859-15 as the Wide_Character Item in ISO/IEC 10646.
-
-- The function To_Wide_Character returns the Wide_Character which has the
-same name in ISO/IEC 10646 as the Character Item in ISO/IEC 8859-15.
-
-The declaration of Character_Set in Ada.Strings.Wide_Maps.Wide_Constants is
-removed.  It is replaced by:
-
-    Latin_1_Character_Set : constant Wide_Maps.Wide_Character_Set;
-    Latin_9_Character_Set : constant Wide_Maps.Wide_Character_Set;
-    Character_Set : Wide_Maps.Wide_Character_Set renames
-                    Latin_1_Character_Set;
 
+[Author's note: there is no wording for the syntax changes, yet. I'll write it
+if/when we reach a consensus based on the !discussion section.]
 
-An implementation is allowed to add the following declarations to package
-Standard:
+The following declarations are added to package Standard:
 
     type Wide_Wide_Character is (nul, soh, ..., FFFE, FFFF,
                                  00010000, ..., 7FFFFFFF);
@@ -138,11 +172,10 @@
                                     Wide_Wide_Character;
     pragma Pack (Wide_Wide_String);
 
-The type Wide_Wide_Character has 2 ** 31 values.  Its first 2 ** 16
-positions must have the same contents as type Wide_Character.
+The type Wide_Wide_Character has 2 ** 31 values. Its first 2 ** 16 positions
+must have the same contents as type Wide_Character.
 
-If an implementation provides these two types, it must also provide the
-following packages:
+The following predefined packages are also added:
 
     Ada.Strings.Wide_Wide_Bounded
     Ada.Strings.Wide_Wide_Fixed
@@ -159,16 +192,16 @@
 
 It contains each Wide_Wide_Character value in the BMP of ISO/IEC 10646.
 
-The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width must
-also be provided.  Their definition is similar to that of Wide_Image,
-Wide_Value and Wide_Width, respectively, with Wide_Character and
-Wide_String replaced by Wide_Wide_Character and Wide_Wide_String.
+The attributes Wide_Wide_Image, Wide_Wide_Value and Wide_Wide_Width must also
+be provided.  Their definition is similar to that of Wide_Image, Wide_Value
+and Wide_Width, respectively, with Wide_Character and Wide_String replaced by
+Wide_Wide_Character and Wide_Wide_String.
 
 The semantics of Wide_Image are modified as follows: the image has the same
 sequence of graphic characters as that defined for S'Wide_Wide_Image if all
-the graphic characters are defined in Wide_Character; otherwise the
-sequence of characters is implementation defined (but no shorter than that
-of S'Wide_Wide_Image for the same value of Arg).
+the graphic characters are defined in Wide_Character; otherwise the sequence
+of characters is implementation defined (but no shorter than that of
+S'Wide_Wide_Image for the same value of Arg).
 
 !discussion
 

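
The identifier, equivalence and numeral rules proposed in version 1.6 can be
illustrated with a small sketch. It is written in Python only because the
standard unicodedata module exposes the "General Category" and "Decimal digit
value" fields of the character database. The function names
(strip_other_format, is_identifier, identifier_key, numeral_value) are
hypothetical and are not part of the AI. Two assumptions should be kept in
mind: Python ships a more recent character database than the 3.2 files cited
in the proposal, and str.casefold produces the folded form of each character
rather than an uppercase form, which should yield the same equivalence classes
but not the same stored strings.

```python
import unicodedata

# General Category codes corresponding to the proposal's non-terminals.
IDENTIFIER_LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
LETTER_OR_DIGIT = IDENTIFIER_LETTER | {"Mn", "Mc", "Nd"}
PUNCTUATION_CONNECTOR = "Pc"
OTHER_FORMAT = "Cf"
NUMBER_DECIMAL_DIGIT = "Nd"


def strip_other_format(text):
    """Filter out other_format characters, as is done before lexical analysis."""
    return "".join(c for c in text if unicodedata.category(c) != OTHER_FORMAT)


def is_identifier(text):
    """identifier ::= identifier_letter {[punctuation_connector] letter_or_digit}"""
    text = strip_other_format(text)
    if not text or unicodedata.category(text[0]) not in IDENTIFIER_LETTER:
        return False
    after_connector = False
    for c in text[1:]:
        category = unicodedata.category(c)
        if category == PUNCTUATION_CONNECTOR:
            if after_connector:
                return False  # two connectors in a row are not allowed
            after_connector = True
        elif category in LETTER_OR_DIGIT:
            after_connector = False
        else:
            return False
    return not after_connector  # a connector must be followed by letter_or_digit


def identifier_key(text):
    """Normalization Form KC, then full case folding; equal keys mean equivalence."""
    normalized = unicodedata.normalize("NFKC", strip_other_format(text))
    return normalized.casefold()


def numeral_value(text):
    """numeral ::= number_decimal_digit {[punctuation_connector] number_decimal_digit}"""
    value = 0
    need_digit = True  # True at the start and right after a connector
    for c in text:
        category = unicodedata.category(c)
        if category == NUMBER_DECIMAL_DIGIT:
            # "Decimal digit value" field of the character database.
            value = value * 10 + unicodedata.decimal(c)
            need_digit = False
        elif category == PUNCTUATION_CONNECTOR and not need_digit:
            need_digit = True
        else:
            raise ValueError(f"not a numeral: {text!r}")
    if need_digit:
        raise ValueError(f"not a numeral: {text!r}")
    return value
```

With these definitions, is_identifier("A__B") is rejected because of the
consecutive underlines, while identifier_key("Straße") and
identifier_key("STRASSE") compare equal because full case folding expands the
sharp s. Character and string literals would not go through identifier_key,
since the proposal applies no normalization to them.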