Annotated Ada Reference Manual

2.1 Character Set

1/3
{AI95-00285-01} {AI95-00395-01} {AI05-0266-1} The character repertoire for the text of an Ada program consists of the entire coding space described by the ISO/IEC 10646:2011 Universal Multiple-Octet Coded Character Set. This coding space is organized in planes, each plane comprising 65536 characters.
1.a/2
This paragraph was deleted. Ramification: {AI95-00285-01} Any character, including an other_control_function, is allowed in a comment.
1.b/2
This paragraph was deleted. {AI95-00285-01} Note that this rule doesn't really have much force, since the implementation can represent characters in the source in any way it sees fit. For example, an implementation could simply define that what seems to be a nongraphic, non-format-effector character is actually a representation of the space character. 
1.c/3
Discussion: {AI95-00285-01} {AI05-0266-1} It is our intent to follow the terminology of ISO/IEC 10646:2011 where appropriate, and to remain compatible with the character classifications defined in A.3, “Character Handling”.

Syntax

Paragraphs 2 and 3 were deleted. 
2/2
{AI95-00285-01} character ::= graphic_character | format_effector | other_control_function
3/2
{AI95-00285-01} graphic_character ::= identifier_letter | digit | space_character | special_character
3.1/3
{AI95-00285-01} {AI95-00395-01} {AI05-0266-1} A character is defined by this International Standard for each cell in the coding space described by ISO/IEC 10646:2011, regardless of whether or not ISO/IEC 10646:2011 allocates a character to that cell. 

Static Semantics

4/3
{AI95-00285-01} {AI95-00395-01} {AI05-0079-1} {AI05-0262-1} {AI05-0266-1} The coded representation for characters is implementation defined [(it need not be a representation defined within ISO/IEC 10646:2011)]. A character whose relative code point in its plane is 16#FFFE# or 16#FFFF# is not allowed anywhere in the text of a program. The only characters allowed outside of comments are those in categories other_format, format_effector, and graphic_character.
4.a
Implementation defined: The coded representation for the text of an Ada program.
4.b/2
Ramification: {AI95-00285-01} Note that this rule doesn't really have much force, since the implementation can represent characters in the source in any way it sees fit. For example, an implementation could simply define that what seems to be an other_private_use character is actually a representation of the space character. 
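Example (not part of the Standard): the plane organization and the exclusion of relative code points 16#FFFE# and 16#FFFF# can be sketched in a few lines of Python. The function names are hypothetical, and the upper bound 16#10FFFF# is an assumption taken from the Unicode coding space rather than from the wording above.

```python
from typing import Tuple

PLANE_SIZE = 0x10000          # each plane comprises 65536 characters
MAX_CODE_POINT = 0x10FFFF     # assumption: upper bound of the coding space


def plane_and_relative_code_point(code_point: int) -> Tuple[int, int]:
    """Split a code point into (plane number, relative code point in that plane)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point outside the assumed coding space")
    return code_point // PLANE_SIZE, code_point % PLANE_SIZE


def is_excluded_from_program_text(code_point: int) -> bool:
    """True for characters whose relative code point is 16#FFFE# or 16#FFFF#."""
    _, relative = plane_and_relative_code_point(code_point)
    return relative in (0xFFFE, 0xFFFF)
```

For instance, code point 16#1F600# lies in plane 1 at relative code point 16#F600#, while 16#1FFFE# is excluded from program text in every context, including comments.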
4.1/3
  {AI95-00285-01} {AI05-0266-1} {AI05-0299-1} The semantics of an Ada program whose text is not in Normalization Form KC (as defined by Clause 21 of ISO/IEC 10646:2011) is implementation defined. 
4.c/2
Implementation defined: The semantics of an Ada program whose text is not in Normalization Form KC.
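Example (not part of the Standard): a tool that wants to warn about source text falling under this implementation-defined case could test for Normalization Form KC. The sketch below uses Python's standard unicodedata module, whose tables follow whatever Unicode version the Python build ships with, so results may differ from tables based on ISO/IEC 10646:2011. The function name is hypothetical.

```python
import unicodedata


def is_normalization_form_kc(text: str) -> bool:
    """True if normalizing the text to NFKC leaves it unchanged."""
    return unicodedata.normalize("NFKC", text) == text
```

Plain ASCII text such as "Count" is already in Normalization Form KC. Text containing the ligature U+FB01 (LATIN SMALL LIGATURE FI) is not, because NFKC replaces the ligature with the two letters "fi".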
5/3
{AI95-00285-01} {AI05-0266-1} {AI05-0299-1} The description of the language definition in this International Standard uses the character properties General Category, Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of the documents referenced by the note in Clause 1 of ISO/IEC 10646:2011. The actual set of graphic symbols used by an implementation for the visual representation of the text of an Ada program is not specified.
6/3
{AI95-00285-01} {AI05-0266-1} Characters are categorized as follows: 
6.a/3
Discussion: {AI05-0005-1} {AI05-0262-1} {AI05-0266-1} Our character classification considers that the cells not allocated in ISO/IEC 10646:2011 are graphic characters, except for those whose relative code point in their plane is 16#FFFE# or 16#FFFF#. This seems to provide the best compatibility with future versions of ISO/IEC 10646, as future characters can already be used in Ada character and string literals. 
7/2
This paragraph was deleted. {AI95-00285-01} identifier_letter

upper_case_identifier_letter | lower_case_identifier_letter 
7.a/2
Discussion: We use identifier_letter instead of simply letter because ISO 10646 BMP includes many other characters that would generally be considered "letters." 
8/2
{AI95-00285-01} letter_uppercase

Any character whose General Category is defined to be “Letter, Uppercase”.
9/2
{AI95-00285-01} letter_lowercase

Any character whose General Category is defined to be “Letter, Lowercase”.
9.a/1
This paragraph was deleted. To be honest: {8652/0001} {AI95-00124-01} The above rules do not include the ligatures Æ and æ. However, the intent is to include these characters as identifier letters. This problem was pointed out by a comment from the Netherlands.
9.1/2
  {AI95-00285-01} letter_titlecase

Any character whose General Category is defined to be “Letter, Titlecase”.
9.2/2
  {AI95-00285-01} letter_modifier

Any character whose General Category is defined to be “Letter, Modifier”.
9.3/2
  {AI95-00285-01} letter_other

Any character whose General Category is defined to be “Letter, Other”.
9.4/2
  {AI95-00285-01} mark_non_spacing

Any character whose General Category is defined to be “Mark, Non-Spacing”.
9.5/2
  {AI95-00285-01} mark_spacing_combining

Any character whose General Category is defined to be “Mark, Spacing Combining”.
10/2
 {AI95-00285-01} number_decimal

Any character whose General Category is defined to be “Number, Decimal”.
10.1/2
   {AI95-00285-01} number_letter

Any character whose General Category is defined to be “Number, Letter”.
10.2/2
   {AI95-00285-01} punctuation_connector

Any character whose General Category is defined to be “Punctuation, Connector”.
10.3/2
   {AI95-00285-01} other_format

Any character whose General Category is defined to be “Other, Format”.
11/2
 {AI95-00285-01} separator_space

Any character whose General Category is defined to be “Separator, Space”.
12/2
 {AI95-00285-01} separator_line

Any character whose General Category is defined to be “Separator, Line”.
12.a/2
Ramification: Note that the no break space and soft hyphen are special_characters, and therefore graphic_characters. They are not the same characters as space and hyphen-minus. 
12.1/2
   {AI95-00285-01} separator_paragraph

Any character whose General Category is defined to be “Separator, Paragraph”.
13/3
 {AI95-00285-01} {AI05-0262-1} format_effector

The characters whose code points are 16#09# (CHARACTER TABULATION), 16#0A# (LINE FEED), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED), 16#0D# (CARRIAGE RETURN), 16#85# (NEXT LINE), and the characters in categories separator_line and separator_paragraph.
13.a/2
Discussion: ISO/IEC 10646:2003 does not define the names of control characters, but rather refers to the names defined by ISO/IEC 6429:1992. These are the names that we use here. 
13.1/2
   {AI95-00285-01} other_control

Any character whose General Category is defined to be “Other, Control”, and which is not defined to be a format_effector.
13.2/2
   {AI95-00285-01} other_private_use

Any character whose General Category is defined to be “Other, Private Use”.
13.3/2
   {AI95-00285-01} other_surrogate

Any character whose General Category is defined to be “Other, Surrogate”.
14/3
 {AI95-00285-01} {AI95-00395-01} {AI05-0262-1} graphic_character

Any character that is not in the categories other_control, other_private_use, other_surrogate, format_effector, and whose relative code point in its plane is neither 16#FFFE# nor 16#FFFF#.
14.a/2
This paragraph was deleted. Implementation defined: The control functions allowed in comments.
14.b/2
Discussion: {AI95-00285-01} We considered basing the definition of lexical elements on Annex A of ISO/IEC TR 10176 (4th edition), which lists the characters which should be supported in identifiers for all programming languages, but we finally decided against this option. Note that it is not our intent to diverge from ISO/IEC TR 10176, except to the extent that ISO/IEC TR 10176 itself diverges from ISO/IEC 10646:2003 (which is the case at the time of this writing [January 2005]).
14.c/2
More precisely, we intend to align strictly with ISO/IEC 10646:2003. It must be noted that ISO/IEC TR 10176 is a Technical Report while ISO/IEC 10646:2003 is a Standard. If one has to make a choice, one should conform with the Standard rather than with the Technical Report. And, it turns out that one must make a choice because there are important differences between the two:
14.d/2
ISO/IEC TR 10176 is still based on ISO/IEC 10646:2000 while ISO/IEC 10646:2003 has already been published for a year. We cannot afford to delay the adoption of our amendment until ISO/IEC TR 10176 has been revised.
14.e/2
There are considerable differences between the two editions of ISO/IEC 10646, notably in supporting characters beyond the BMP (this might be significant for some languages, e.g. Korean).
14.f/2
ISO/IEC TR 10176 does not define case conversion tables, which are essential for a case-insensitive language like Ada. To get case conversion tables, we would have to reference either ISO/IEC 10646:2003 or Unicode, or we would have to invent our own. 
14.g/2
For the purpose of defining the lexical elements of the language, we need character properties like categorization, as well as case conversion tables. These are mentioned in ISO/IEC 10646:2003 as useful for implementations, with a reference to Unicode. Machine-readable tables are available on the web at URLs: 
14.h/2
http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt
http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt
14.i/2
with an explanatory document found at URL: 
14.j/2
http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html
14.k/2
The actual text of the standard only makes specific references to the corresponding clauses of ISO/IEC 10646:2003, not to Unicode.
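Example (not part of the Standard): the categories above map directly onto the two-letter General Category codes found in the Unicode tables. The following Python sketch uses the standard unicodedata module as a stand-in for those tables; all function and table names are hypothetical, and the Unicode version behind unicodedata is whatever the Python build provides, not necessarily one corresponding to ISO/IEC 10646:2011.

```python
import unicodedata

# General Category code -> category name used in this subclause.
CATEGORY_NAMES = {
    "Lu": "letter_uppercase", "Ll": "letter_lowercase",
    "Lt": "letter_titlecase", "Lm": "letter_modifier",
    "Lo": "letter_other", "Mn": "mark_non_spacing",
    "Mc": "mark_spacing_combining", "Nd": "number_decimal",
    "Nl": "number_letter", "Pc": "punctuation_connector",
    "Cf": "other_format", "Zs": "separator_space",
    "Zl": "separator_line", "Zp": "separator_paragraph",
    "Cc": "other_control", "Co": "other_private_use",
    "Cs": "other_surrogate",
}

# Code points listed explicitly in the definition of format_effector.
FORMAT_EFFECTOR_CODE_POINTS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}


def has_excluded_relative_code_point(ch: str) -> bool:
    return (ord(ch) & 0xFFFF) in (0xFFFE, 0xFFFF)


def is_format_effector(ch: str) -> bool:
    return (ord(ch) in FORMAT_EFFECTOR_CODE_POINTS
            or unicodedata.category(ch) in ("Zl", "Zp"))


def is_graphic_character(ch: str) -> bool:
    """Not other_control, other_private_use, other_surrogate or
    format_effector, and not at relative code point FFFE or FFFF."""
    if has_excluded_relative_code_point(ch) or is_format_effector(ch):
        return False
    return unicodedata.category(ch) not in ("Cc", "Co", "Cs")


def is_allowed_outside_comments(ch: str) -> bool:
    """Categories other_format, format_effector and graphic_character."""
    if has_excluded_relative_code_point(ch):
        return False
    return (unicodedata.category(ch) == "Cf"
            or is_format_effector(ch)
            or is_graphic_character(ch))


def category_name(ch: str) -> str:
    """Name of the category a character falls in, as far as one is defined."""
    if is_format_effector(ch):
        return "format_effector"
    code = unicodedata.category(ch)
    return CATEGORY_NAMES.get(code, "no named category (" + code + ")")
```

With these definitions, category_name("A") yields "letter_uppercase" and category_name("_") yields "punctuation_connector". Unallocated cells have General Category "Cn", which is not in any excluded category, so the sketch treats them as graphic characters.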
15/3
 {AI95-00285-01} {AI05-0266-1} The following names are used when referring to certain characters (the first name is that given in ISO/IEC 10646:2011):
15.a/3
Discussion: {AI95-00285-01} {AI05-0266-1} This table serves to show the correspondence between ISO/IEC 10646:2011 names and the graphic symbols (glyphs) used in this International Standard. These are the characters that play a special role in the syntax of Ada.
  graphic symbol   name                      graphic symbol   name

         "         quotation mark                   :         colon
         #         number sign                      ;         semicolon
         &         ampersand                        <         less-than sign
         '         apostrophe, tick                 =         equals sign
         (         left parenthesis                 >         greater-than sign
         )         right parenthesis                _         low line, underline
         *         asterisk, multiply               |         vertical line
         +         plus sign                        /         solidus, divide
         ,         comma                            !         exclamation point
         -         hyphen-minus, minus              %         percent sign
         .         full stop, dot, point

Implementation Requirements

16/3
 {AI05-0286-1} An Ada implementation shall accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where every character is represented by its code point. The character pair CARRIAGE RETURN/LINE FEED (code points 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other occurrence of a format_effector other than the character whose code point position is 16#09# (CHARACTER TABULATION) also signifies a single end of line.
16.a/3
Reason: {AI05-0079-1} {AI05-0286-1} This is simply requiring that an Ada implementation be able to directly process the ACATS, which is provided in the described format. Note that files that only contain characters with code points in the first 128 (which is the majority of the ACATS) are represented in the same way in both UTF-8 and in "plain" string format. The ACATS includes a BOM in files that have any characters with code points greater than 127. Note that the BOM contains characters not legal in Ada source code, so an implementation can use that to automatically distinguish between files formatted as plain Latin-1 strings and UTF-8 with BOM.
16.b/3
We allow line endings to be both represented as the pair CR LF (as in Windows and the ACATS), and as single format_effector characters (usually LF, as in Linux), in order that files created by standard tools on most operating systems will meet the standard format. We specify how many line endings each represent so that compilers use the same line numbering for standard source files.
16.c/3
This requirement increases portability by having a format that is accepted by all Ada compilers. Note that implementations can support other source representations, including structured representations like a parse tree.
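Example (not part of the Standard): the line-ending rule can be sketched as a small Python routine that strips an optional UTF-8 BOM and counts ends of line, treating the pair CR LF as one end of line and every other format_effector except CHARACTER TABULATION as one end of line. The function names are hypothetical, and the sketch handles only the UTF-8 representation described in this requirement.

```python
import codecs

# Every format_effector except CHARACTER TABULATION (16#09#):
# LF, VT, FF, CR, NEL, plus the separator_line and separator_paragraph
# characters, which the sketch assumes to be U+2028 and U+2029 only.
LINE_ENDING_CHARACTERS = {"\n", "\x0b", "\x0c", "\r", "\x85", "\u2028", "\u2029"}


def decode_source(raw: bytes) -> str:
    """Decode UTF-8 source text, accepting it with or without a BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


def count_ends_of_line(text: str) -> int:
    """Count ends of line: CR LF is one, any other format_effector
    apart from CHARACTER TABULATION is also one."""
    count = 0
    index = 0
    while index < len(text):
        if text.startswith("\r\n", index):
            count += 1
            index += 2          # consume the pair as a single end of line
        elif text[index] in LINE_ENDING_CHARACTERS:
            count += 1
            index += 1
        else:
            index += 1
    return count
```

Under these rules a file written with CR LF line endings and the same file written with LF line endings produce identical line numbers, which is the stated purpose of fixing how many ends of line each sequence represents.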

Implementation Permissions

17/3
 {AI95-00285-01} {AI05-0266-1} The categories defined above, as well as case mapping and folding, may be based on an implementation-defined version of ISO/IEC 10646 (2003 edition or later).
17.a/2
Ramification: If an implementation supports other character sets, it defines which characters fall into each category, such as “identifier_letter,” and what the corresponding rules of this section are, such as which characters are allowed in the text of a program.
17.b/3
The exact categories, case mapping, and case folding chosen affect identifiers, the result of '[[Wide_]Wide_]Image, and packages Wide_Characters.Handling and Wide_Wide_Characters.Handling.
17.c/3
Discussion: This permission allows implementations to upgrade to using a newer character set standard whenever that makes sense, rather than having to wait for the next Ada Standard. But the character set standard used cannot be older than ISO/IEC 10646:2003 (which is essentially similar to Unicode 4.0). 
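Example (not part of the Standard): since ISO/IEC 10646:2003 is described above as essentially similar to Unicode 4.0, a tool built on Unicode tables could check that its tables are not older than that. The sketch below reads the table version from Python's unicodedata module; the function name and the use of Unicode version numbers as a proxy for ISO/IEC 10646 editions are assumptions.

```python
import unicodedata


def tables_meet_minimum(version: str = unicodedata.unidata_version,
                        minimum: tuple = (4, 0)) -> bool:
    """True if a dotted Unicode version string is at least the minimum."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum
```

Called without arguments, the function checks the tables bundled with the running Python interpreter.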
NOTES
18/2
1  {AI95-00285-01} The characters in categories other_control, other_private_use, and other_surrogate are only allowed in comments.
19/3
2  {AI05-0286-1} The language does not specify the source representation of programs. 
19.a/3
This paragraph was deleted. Discussion: {AI05-0286-1} Any source representation is valid so long as the implementer can produce an (information-preserving) algorithm for translating both directions between the representation and the standard character set. (For example, every character in the standard character set has to be representable, even if the output devices attached to a given computer cannot print all of those characters properly.) From a practical point of view, every implementer will have to provide some way to process the ACATS. It is the intent to allow source representations, such as parse trees, that are not even linear sequences of characters. It is also the intent to allow different fonts: reserved words might be in bold face, and that should be irrelevant to the semantics. 

Extensions to Ada 83

19.b
Ada 95 allows 8-bit and 16-bit characters, as well as implementation-specified character sets.

Wording Changes from Ada 83

19.c/3
{AI95-00285-01} {AI05-0299-1} The syntax rules in this subclause are modified to remove the emphasis on basic characters vs. others. (In this day and age, there is no need to point out that you can write programs without using (for example) lower case letters.) In particular, character (representing all characters usable outside comments) is added, and basic_graphic_character, other_special_character, and basic_character are removed. Special_character is expanded to include Ada 83's other_special_character, as well as new 8-bit characters not present in Ada 83. Ada 2005 removes special_character altogether; we want to stick to ISO/IEC 10646:2003 character classifications. Note that the term “basic letter” is used in A.3, “Character Handling” to refer to letters without diacritical marks.
19.d/2
{AI95-00285-01} Character names now come from ISO/IEC 10646:2003.
19.e/2
This paragraph was deleted. {AI95-00285-01} We use identifier_letter rather than letter since ISO 10646 BMP includes many "letters" that are not permitted in identifiers (in the standard mode). 

Extensions to Ada 95

19.f/2
{AI95-00285-01} {AI95-00395-01} Program text can use most characters defined by ISO-10646:2003. This subclause has been rewritten to use the categories defined in that Standard. This should ease programming in languages other than English. 

Inconsistencies With Ada 2005

19.g/3
{AI05-0299-1} {AI05-0266-1} An implementation is allowed (but not required) to use a newer character set standard to determine the categories, case mapping, and case folding. Doing so will change the results of attributes '[[Wide_]Wide_]Image and the packages [Wide_]Wide_Characters.Handling in the case of a few rarely used characters. (This also could make some identifiers illegal, for characters that are no longer classified as letters.) This is unlikely to be a problem in practice. Moreover, truly portable Ada 2012 programs should avoid using in these contexts any characters that would have different classifications in any character set standards issued since 10646:2003 (since the compiler can use any such standard as the basis for its classifications). 

Wording Changes from Ada 2005

19.h/3
{AI05-0079-1} Correction: Clarified that only characters in the categories defined here are allowed in the source of an Ada program. This was clear in Ada 95, but Amendment 1 dropped the wording instead of correcting it.
19.i/3
{AI05-0286-1} A standard source representation is defined that all compilers are expected to process. Since this is the same format as the ACATS, it seems unlikely that there are any implementations that don't meet this requirement. Moreover, other representations are still permitted, and the "impossible or impractical" loophole (see 1.1.3) can be invoked for any implementations that cannot directly process the ACATS. 

Ada 2005 and 2012 Editions sponsored in part by Ada-Europe