Rationale for Ada 2012
7.2 Strings and characters
Ada 95 added a number of packages for manipulating
strings and characters. Three child packages of Ada.Strings
enable the manipulation of fixed length, bounded and unbounded strings.
They are Ada.Strings.Fixed, Ada.Strings.Bounded
and Ada.Strings.Unbounded. The packages have
many subprograms with similar facilities.
In particular there are functions Index
and Index_Non_Blank which search through
a string and return the index of the first character satisfying some
criteria and procedures Find_Token which search
through a string and find the first instance of a slice satisfying some
other criteria.
As originally defined in Ada 95 these subprograms
all started the search at the beginning of the string. This proved to
be somewhat inconvenient and so in Ada 2005, versions of the functions
Index and Index_Non_Blank with an extra parameter From were added to
enable the search to be started at any position. However, the fact that
versions of the procedures Find_Token with an extra parameter From
should also have been added was overlooked. This is remedied in Ada 2012.
So in Ada 2012 corresponding
additional subprograms Find_Token are added
to the appropriate packages. They are
procedure Find_Token(
   Source: in String;
   Set: in Maps.Character_Set;
   From: in Positive;
   Test: in Membership;
   First: out Positive;
   Last: out Natural);

procedure Find_Token(
   Source: in Bounded_String;
   Set: in Maps.Character_Set;
   From: in Positive;
   Test: in Membership;
   First: out Positive;
   Last: out Natural);

procedure Find_Token(
   Source: in Unbounded_String;
   Set: in Maps.Character_Set;
   From: in Positive;
   Test: in Membership;
   First: out Positive;
   Last: out Natural);
Note also that the wording for Find_Token
is modified to make it clear that the values of First
and Last identify the longest possible slice beginning at the first
character at or after From that satisfies the test. If no characters
in Source(From .. Source'Last) satisfy the criteria then First is set
to From and Last is set to zero.
The existing procedures Find_Token
are now defined as calls of the new ones with From
set to Source'First.
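As a rough illustration of why the From parameter is convenient, the following sketch walks along a fixed-length string and prints each blank-separated token in turn. The procedure name Show_Tokens, the sample text and the local variable names are invented for the example; only Find_Token, Maps.To_Set and the literal Outside come from the predefined library.

```ada
with Ada.Text_IO;
with Ada.Strings;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;

procedure Show_Tokens is
   --  Hypothetical sample data; note the double blank after "alpha".
   Source : constant String := "alpha  beta gamma";
   Blank  : constant Ada.Strings.Maps.Character_Set :=
     Ada.Strings.Maps.To_Set (' ');

   From  : Positive := Source'First;
   First : Positive;
   Last  : Natural;
begin
   while From <= Source'Last loop
      --  Look for the next run of non-blank characters at or after From.
      Ada.Strings.Fixed.Find_Token
        (Source => Source,
         Set    => Blank,
         From   => From,
         Test   => Ada.Strings.Outside,
         First  => First,
         Last   => Last);

      exit when Last = 0;  --  no further token was found

      Ada.Text_IO.Put_Line (Source (First .. Last));
      From := Last + 1;    --  resume just after the token
   end loop;
end Show_Tokens;
```

Each call resumes at Last + 1, so the caller needs no slicing or index adjustment; with the sample text the loop should print alpha, beta and gamma on separate lines.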
The encodings UTF-8 and UTF-16 are now widely used
but Ada 2005 provides no mechanisms to convert between these encodings
and the types String, Wide_String,
and Wide_Wide_String.
The encoding UTF-8 works in terms of raw bytes and
is straightforward; it is defined in Annex D of ISO/IEC 10646. However,
UTF-16 comes in two forms according to whether the arrangement of two
bytes into a 16-bit word uses big-endian or little-endian packing. So
there are two forms UTF-16BE and UTF-16LE; they are defined in Annex
C of ISO/IEC 10646.
The different encodings can be distinguished by a
special value known as a BOM (Byte Order Mark) at the start of the string.
So we have BOM_8, BOM_16BE, BOM_16LE, and just BOM_16 (for wide strings).
To support these encodings,
Ada 2012 includes the following five new packages
Ada.Strings.UTF_Encoding
Ada.Strings.UTF_Encoding.Conversions
Ada.Strings.UTF_Encoding.Strings
Ada.Strings.UTF_Encoding.Wide_Strings
Ada.Strings.UTF_Encoding.Wide_Wide_Strings
The first package declares
items that are used by the other packages. It is
package Ada.Strings.UTF_Encoding is
   pragma Pure(UTF_Encoding);

   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

   subtype UTF_String is String;
   subtype UTF_8_String is String;
   subtype UTF_16_Wide_String is Wide_String;

   Encoding_Error: exception;

   BOM_8: constant UTF_8_String :=
      Character'Val(16#EF#) &
      Character'Val(16#BB#) &
      Character'Val(16#BF#);
   BOM_16BE: constant UTF_String :=
      Character'Val(16#FE#) &
      Character'Val(16#FF#);
   BOM_16LE: constant UTF_String :=
      Character'Val(16#FF#) &
      Character'Val(16#FE#);
   BOM_16: constant UTF_16_Wide_String :=
      (1 => Wide_Character'Val(16#FEFF#));

   function Encoding(
      Item: UTF_String;
      Default: Encoding_Scheme := UTF_8)
      return Encoding_Scheme;

end Ada.Strings.UTF_Encoding;
Note that the encoded forms are actually still held
in objects of type String or Wide_String.
However, in order to aid understanding, the subtypes UTF_String,
UTF_8_String and UTF_16_Wide_String
are introduced and these should be used when referring to objects holding
the encoded forms.
The type Encoding_Scheme
defines the various schemes. Note that an encoded string might or might
not start with the identifying BOM; it is optional. The function Encoding
takes a UTF_String (that is a plain old string),
checks the BOM if present and returns the value of Encoding_Scheme
identifying the scheme. If there is no BOM then it returns the value
of the parameter Default which itself by default
is UTF_8.
Note carefully that the function Encoding
does not do any encoding — that is done by functions Encode
in the other packages which will be described in a moment. Note also
that there is no corresponding function Encoding
for wide strings; that is because there is only one relevant scheme corresponding
to UTF_16_Wide_String, namely that with BOM_16.
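The behaviour of the function Encoding can be seen in a small sketch such as the following. The procedure name Show_Encoding and the two sample strings are made up for the illustration; the second string simply has no BOM, so the result is whatever Default says.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Show_Encoding is
   --  Hypothetical data: a little-endian BOM followed by the two bytes
   --  of the letter A in UTF-16LE.
   With_BOM    : constant UTF_String := BOM_16LE & "A" & Character'Val (0);
   Without_BOM : constant UTF_String := "plain";
begin
   --  The BOM identifies the scheme.
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (With_BOM)));

   --  No BOM, so the default value of Default (UTF_8) is returned.
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (Without_BOM)));

   --  No BOM, so the explicit Default is returned.
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (Without_BOM, Default => UTF_16BE)));
end Show_Encoding;
```

The three lines printed should be UTF_16LE, UTF_8 and UTF_16BE. Note that the function only inspects the leading bytes; it makes no attempt to guess the scheme from the rest of the string.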
We will now look at the other packages. The package
UTF_Encoding.Strings contains functions Encode
and Decode which convert between the raw type
String and the UTF forms. Similar packages
apply to wide and wide wide strings. The package UTF_Encoding.Conversions
contains functions Convert which convert between
the various UTF forms.
The package for the
type String is
package Ada.Strings.UTF_Encoding.Strings is
   pragma Pure(Strings);

   function Encode(
      Item: String;
      Output_Scheme: Encoding_Scheme;
      Output_BOM: Boolean := False)
      return UTF_String;
   function Encode(Item: String; Output_BOM: Boolean := False)
      return UTF_8_String;
   function Encode(Item: String; Output_BOM: Boolean := False)
      return UTF_16_Wide_String;

   function Decode(Item: UTF_String; Input_Scheme: Encoding_Scheme)
      return String;
   function Decode(Item: UTF_8_String)
      return String;
   function Decode(Item: UTF_16_Wide_String)
      return String;

end Ada.Strings.UTF_Encoding.Strings;
The functions Encode take a string and return it encoded. The first
function has a parameter Output_Scheme which determines whether the
encoding is to be to UTF_8, UTF_16BE or UTF_16LE. The second function
is provided as a convenience for the common case of encoding to UTF_8
and the third function is necessary for encoding to UTF_16_Wide_String.
In all cases there is a final optional parameter indicating whether or
not an appropriate BOM is to be placed at the start of the encoded string.
The functions Decode do
the reverse. Thus the first function takes a value of subtype UTF_String
and a parameter Input_Scheme giving the scheme
to be used and returns the decoded string. If a BOM is present which
does not match the Input_Scheme, then the
exception Encoding_Error is raised. The second
function is a convenience for the common case of decoding from UTF_8
and the third function is necessary for decoding from UTF_16_Wide_String;
again, if a BOM is present that does not match the expected scheme then
Encoding_Error is raised.
In all cases all the strings returned have a lower
bound of 1.
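A short round trip may help to fix the ideas. This is only a sketch: the procedure name Round_Trip, the package renamings and the sample text are assumptions made for the example. The accented character is written with the Val attribute so that the example does not depend on the encoding of the source file.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Round_Trip is
   package UTF renames Ada.Strings.UTF_Encoding;
   package UTF_Strings renames Ada.Strings.UTF_Encoding.Strings;

   --  "caf" followed by e-acute (position 16#E9# in Latin-1).
   Text : constant String := "caf" & Character'Val (16#E9#);

   --  The expected subtype (a String) selects the UTF-8 overloading.
   Plain    : constant UTF.UTF_8_String := UTF_Strings.Encode (Text);
   With_BOM : constant UTF.UTF_8_String :=
     UTF_Strings.Encode (Text, Output_BOM => True);

   Decoded : constant String := UTF_Strings.Decode (Plain);
begin
   Ada.Text_IO.Put_Line ("Original length:" & Integer'Image (Text'Length));
   Ada.Text_IO.Put_Line ("UTF-8 length:   " & Integer'Image (Plain'Length));
   Ada.Text_IO.Put_Line ("With BOM length:" & Integer'Image (With_BOM'Length));
   Ada.Text_IO.Put_Line ("Round trip OK:   " & Boolean'Image (Decoded = Text));
end Round_Trip;
```

The lengths reported should be 4, 5 and 8: the accented character occupies two bytes in UTF-8 and the BOM adds a further three. The overloadings returning UTF_8_String and UTF_16_Wide_String have identical parameter profiles, so it is the expected type of the result that selects between them.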
The packages UTF_Encoding.Wide_Strings and UTF_Encoding.Wide_Wide_Strings
are identical except that the type String is replaced by Wide_String or
Wide_Wide_String throughout.
Finally, the package
for converting between the various UTF forms is as follows
package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure(Conversions);

   function Convert(
      Item: UTF_String;
      Input_Scheme: Encoding_Scheme;
      Output_Scheme: Encoding_Scheme;
      Output_BOM: Boolean := False)
      return UTF_String;

   function Convert(
      Item: UTF_String;
      Input_Scheme: Encoding_Scheme;
      Output_BOM: Boolean := False)
      return UTF_16_Wide_String;
   function Convert(Item: UTF_8_String; Output_BOM: Boolean := False)
      return UTF_16_Wide_String;

   function Convert(
      Item: UTF_16_Wide_String;
      Output_Scheme: Encoding_Scheme;
      Output_BOM: Boolean := False)
      return UTF_String;
   function Convert(Item: UTF_16_Wide_String; Output_BOM: Boolean := False)
      return UTF_8_String;

end Ada.Strings.UTF_Encoding.Conversions;
The purpose of these should be obvious. The first
converts between encodings held as strings with parameters indicating
both the Input_Scheme and the Output_Scheme.
If the input string has a BOM that does not match the Input_Scheme
then the exception Encoding_Error is raised.
The final optional parameter indicates whether or not an appropriate
BOM is to be placed at the start of the converted string.
The other functions convert between UTF encodings
held as strings and wide strings. Two give the explicit Input_Scheme
or Output_Scheme and two are provided for
convenience for the common case of UTF_8.
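By way of illustration, the following sketch converts a two-byte UTF-8 sequence first into UTF-16BE held in a String and then into a UTF_16_Wide_String. The procedure name Show_Convert, the renamings and the sample data are invented for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Show_Convert is
   package UTF renames Ada.Strings.UTF_Encoding;
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  Hypothetical input: e-acute (code point 16#E9#) encoded in UTF-8.
   Source_8 : constant UTF.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  String to String, naming both schemes and asking for a BOM.
   As_16BE : constant UTF.UTF_String :=
     Conv.Convert
       (Item          => Source_8,
        Input_Scheme  => UTF.UTF_8,
        Output_Scheme => UTF.UTF_16BE,
        Output_BOM    => True);

   --  The convenience form for UTF-8 input, giving a wide string.
   As_Wide : constant UTF.UTF_16_Wide_String := Conv.Convert (Source_8);
begin
   Ada.Text_IO.Put_Line
     ("UTF-16BE bytes: " & Integer'Image (As_16BE'Length));
   Ada.Text_IO.Put_Line
     ("Wide characters:" & Integer'Image (As_Wide'Length));
end Show_Convert;
```

The first length should be 4 (two bytes of BOM and two bytes for the character) and the second should be 1, since the character fits in a single 16-bit value and no BOM was requested.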
The final topic in this section concerns the classification
and conversion of characters and strings. The package Ada.Characters.Handling
was introduced in Ada 95; this contains various classification functions
such as Is_Lower, Is_Digit
and so on. This package also contains functions such as To_Upper
and To_Lower which convert characters to upper
case or lower case.
These facilities are extended in Ada 2012 by the
addition of a few more classification functions in the package Ada.Characters.Handling
plus two similar packages named Ada.Wide_Characters.Handling
for dealing with wide characters and Ada.Wide_Wide_Characters.Handling
for dealing with wide wide characters.
It should be noticed that these new packages are children of
Ada.Wide_Characters and Ada.Wide_Wide_Characters respectively. These
packages were introduced in Ada 2005 but are empty other than for
pragmas Pure.
The additional character
classification functions in Ada.Characters.Handling
are
function Is_Line_Terminator ...
function Is_Mark(Item: Character) return Boolean;
function Is_Other_Format ...
function Is_Punctuation_Connector ...
function Is_Space ...
In each case they have a single parameter Item
of type Character and return a result of type
Boolean.
The meanings are as follows
Is_Line_Terminator: returns True if Item is one of Line_Feed (10),
Line_Tabulation (11), Form_Feed (12), Carriage_Return (13), or
Next_Line (133).
Is_Mark: always returns False.
Is_Other_Format: returns True if Item is Soft_Hyphen (173).
Is_Punctuation_Connector: returns True if Item is Low_Line (95); this
is often known as Underscore.
Is_Space: returns True if Item is Space (32) or No_Break_Space (160).
Readers might feel that Is_Mark
is a foolish waste of time. However, it is introduced because the corresponding
functions in the new packages for wide and wide wide characters can return
True.
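The additional functions can be exercised with a few characters from Ada.Characters.Latin_1, as in the following sketch. The procedure Show_Classification and its local procedure Report are hypothetical names used only for this illustration.

```ada
with Ada.Text_IO;
with Ada.Characters.Handling; use Ada.Characters.Handling;
with Ada.Characters.Latin_1;

procedure Show_Classification is
   package L1 renames Ada.Characters.Latin_1;

   --  Hypothetical helper that prints a label and a Boolean result.
   procedure Report (Label : String; Value : Boolean) is
   begin
      Ada.Text_IO.Put_Line (Label & " => " & Boolean'Image (Value));
   end Report;
begin
   Report ("Is_Line_Terminator (NEL)", Is_Line_Terminator (L1.NEL));
   Report ("Is_Space (No_Break_Space)", Is_Space (L1.No_Break_Space));
   Report ("Is_Punctuation_Connector ('_')",
           Is_Punctuation_Connector ('_'));
   Report ("Is_Other_Format (Soft_Hyphen)",
           Is_Other_Format (L1.Soft_Hyphen));
   Report ("Is_Mark ('a')", Is_Mark ('a'));
end Show_Classification;
```

Given the meanings listed above, the first four lines should report TRUE and the last FALSE.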
An important point is that these classifications
enable a compiler to analyze Ada source code without direct reference
to the definition of ISO/IEC 10646. Note further that case insensitive
text comparison, which is useful for the analysis of identifiers, is now
provided by new functions described in section 7.5.
The new package Wide_Characters.Handling is very similar to the package
Characters.Handling (as modified by the additional functions just
described) with Character and String everywhere replaced by
Wide_Character and Wide_String. However, there are no functions
corresponding to Is_Basic, Is_ISO_646, To_Basic and To_ISO_646.
In the case of Is_Basic this is because there is no categorization of
Basic in 10646. In the case of ISO-646 it is not really necessary
because it would seem rather unlikely that one would want to check a
wide character WC to see if it was one of the 7-bit ISO-646 set. In any
event, one could always write
WC in Wide_Character'Val(0) .. Wide_Character'Val(127)
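A test of this kind is easily wrapped in a function if it is needed in several places. The sketch below is one hypothetical way to do so; the names Show_ISO_646_Check and In_ISO_646 are invented here, and the bounds are written with the Val attribute so that they have the type Wide_Character.

```ada
with Ada.Text_IO;

procedure Show_ISO_646_Check is
   --  Hypothetical stand-in for the missing Is_ISO_646, written as an
   --  Ada 2012 expression function.
   function In_ISO_646 (WC : Wide_Character) return Boolean is
     (WC in Wide_Character'Val (0) .. Wide_Character'Val (127));

   --  Greek capital omega, well outside the 7-bit set.
   Omega : constant Wide_Character := Wide_Character'Val (16#03A9#);
begin
   Ada.Text_IO.Put_Line (Boolean'Image (In_ISO_646 ('A')));
   Ada.Text_IO.Put_Line (Boolean'Image (In_ISO_646 (Omega)));
end Show_ISO_646_Check;
```

This should print TRUE followed by FALSE.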
The package Wide_Characters.Handling
also has the new function Character_Set_Version
thus
function Character_Set_Version return String;
The string returned identifies the version of the
character set standard being used. Typically it will include either
"10646:" or "Unicode". This function is introduced because the
categorization of some wide characters depends
upon the version of 10646 or Unicode being used. So rather than specifying
that the package uses a particular set (which might be a nuisance in
the future if the character set standard changes), it seemed more appropriate
to enable the program to find out exactly which version is being used.
For most programs, it won't matter at all of course.
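A program that does care can simply log the version at start-up, as in this minimal sketch. The procedure name Show_Version is an assumption of the example, and the text printed depends entirely on the implementation.

```ada
with Ada.Text_IO;
with Ada.Wide_Characters.Handling;

procedure Show_Version is
begin
   --  The result is implementation-defined text naming the version of
   --  the character set standard used for the classifications.
   Ada.Text_IO.Put_Line
     ("Character set version: "
      & Ada.Wide_Characters.Handling.Character_Set_Version);
end Show_Version;
```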
Note that there is no corresponding function in Ada.Characters.Handling.
This is because the set used for the type Character
is frozen as at 1995 and the classification functions defined for the
type Character are frozen as well (and so
do not now exactly match 10646 which has since evolved). It might be
that classifications for wide and ever wider characters might change
in the future for some obscure characters but the programmer can rest
assured that Character is for ever reliable.
So Wide_Characters.Handling
in essence is
package Ada.Wide_Characters.Handling is
   pragma Pure(Handling);

   function Character_Set_Version return String;
   function Is_Control(Item: Wide_Character) return Boolean;
   ... -- and so on
   function To_Upper(Item: Wide_String) return Wide_String;

end Ada.Wide_Characters.Handling;
The new package Wide_Wide_Characters.Handling
is the same as Wide_Characters.Handling with
Wide_Character and Wide_String
replaced by Wide_Wide_Character and Wide_Wide_String
throughout.
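The following sketch tries a few of the wide character functions on characters outside Latin-1. The procedure name Show_Wide_Handling, the renaming WCH and the choice of characters are assumptions made for the example; since the results depend on the version of the character set standard used by the implementation, no particular output is claimed here.

```ada
with Ada.Text_IO;
with Ada.Wide_Characters.Handling;

procedure Show_Wide_Handling is
   package WCH renames Ada.Wide_Characters.Handling;

   --  Greek small omega and the combining acute accent.
   Small_Omega     : constant Wide_Character := Wide_Character'Val (16#03C9#);
   Combining_Acute : constant Wide_Character := Wide_Character'Val (16#0301#);
begin
   Ada.Text_IO.Put_Line
     ("Is_Lower (small omega): "
      & Boolean'Image (WCH.Is_Lower (Small_Omega)));

   --  Unlike the version for Character, Is_Mark can return True here.
   Ada.Text_IO.Put_Line
     ("Is_Mark (combining acute): "
      & Boolean'Image (WCH.Is_Mark (Combining_Acute)));

   --  Show the position of the upper case form rather than the
   --  character itself, to avoid depending on the terminal encoding.
   Ada.Text_IO.Put_Line
     ("Position of To_Upper (small omega):"
      & Integer'Image (Wide_Character'Pos (WCH.To_Upper (Small_Omega))));
end Show_Wide_Handling;
```

The same calls with Wide_Wide_Character and the package Ada.Wide_Wide_Characters.Handling would follow exactly the same pattern.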
© 2011, 2012, 2013 John Barnes Informatics.