Rationale for Ada 2012

John Barnes

7.2 Strings and characters

Ada 95 added a number of packages for manipulating strings and characters. Three child packages of Ada.Strings enable the manipulation of fixed length, bounded and unbounded strings. They are Ada.Strings.Fixed, Ada.Strings.Bounded and Ada.Strings.Unbounded. The packages have many subprograms with similar facilities.
In particular there are functions Index and Index_Non_Blank, which search through a string and return the index of the first character satisfying some criteria, and procedures Find_Token, which search through a string and find the first instance of a slice satisfying some other criteria.
As originally defined in Ada 95 these subprograms all started the search at the beginning of the string. This proved to be somewhat inconvenient and so in Ada 2005, versions of the functions Index and Index_Non_Blank with an extra parameter From were added to enable the search to be started at any position. However, the fact that versions of the procedures Find_Token with an extra parameter From should also have been added was overlooked. This is remedied in Ada 2012.
So in Ada 2012 corresponding additional subprograms Find_Token are added to the appropriate packages. They are
procedure Find_Token(
                         Source: in String;
                         Set: in Maps.Character_Set;
                         From: in Positive;
                         Test: in Membership;
                         First: out Positive;
                         Last: out Natural);
procedure Find_Token(
                         Source: in Bounded_String;
                         Set: in Maps.Character_Set;
                         From: in Positive;
                         Test: in Membership;
                         First: out Positive;
                         Last: out Natural);
procedure Find_Token(
                         Source: in Unbounded_String;
                         Set: in Maps.Character_Set;
                         From: in Positive;
                         Test: in Membership;
                         First: out Positive;
                         Last: out Natural);
Note also that the wording for Find_Token is modified to make it clear that the values of First and Last identify the longest possible slice starting at From. If no characters satisfy the criteria then First is set to From and Last is set to zero.
The existing procedures Find_Token are now defined as calls of the new ones with From set to Source'First.
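As an illustration of why the From parameter is useful, the following minimal sketch walks through a string token by token using the new Find_Token in Ada.Strings.Fixed. The procedure name Show_Tokens, the sample text and the choice of a blank as the only separator are assumptions made for the example.

```ada
with Ada.Text_IO;
with Ada.Strings;       use Ada.Strings;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;

procedure Show_Tokens is
   --  Hypothetical input; any string would do.
   Line   : constant String := "alpha  beta gamma";
   Blanks : constant Ada.Strings.Maps.Character_Set :=
              Ada.Strings.Maps.To_Set (' ');
   From   : Positive := Line'First;
   First  : Positive;
   Last   : Natural;
begin
   while From <= Line'Last loop
      --  Look for the next run of non-blank characters at or after From.
      Ada.Strings.Fixed.Find_Token
        (Source => Line,
         Set    => Blanks,
         From   => From,
         Test   => Outside,
         First  => First,
         Last   => Last);
      exit when Last = 0;   --  no further token was found
      Ada.Text_IO.Put_Line (Line (First .. Last));
      From := Last + 1;     --  resume the search just after this token
   end loop;
end Show_Tokens;
```

The loop condition keeps From within Line'Range, which matters because the new procedures raise Index_Error if From lies outside the range of a non-null Source.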
The encodings UTF-8 and UTF-16 are now widely used but Ada 2005 provides no mechanisms to convert between these encodings and the types String, Wide_String, and Wide_Wide_String.
The encoding UTF-8 works in terms of raw bytes and is straightforward; it is defined in Annex D of ISO/IEC 10646. However, UTF-16 comes in two forms according to whether the arrangement of two bytes into a 16-bit word uses big-endian or little-endian packing. So there are two forms UTF-16BE and UTF-16LE; they are defined in Annex C of ISO/IEC 10646.
The different encodings can be distinguished by a special value known as a BOM (Byte Order Mark) at the start of the string. So we have BOM_8, BOM_16BE, BOM_16LE, and just BOM_16 (for wide strings).
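To support these encodings, Ada 2012 includes the following five new packages
Ada.Strings.UTF_Encoding
Ada.Strings.UTF_Encoding.Conversions
Ada.Strings.UTF_Encoding.Strings
Ada.Strings.UTF_Encoding.Wide_Strings
Ada.Strings.UTF_Encoding.Wide_Wide_Strings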
The first package declares items that are used by the other packages. It is
package Ada.Strings.UTF_Encoding is
   pragma Pure(UTF_Encoding);
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   subtype UTF_String is String;
   subtype UTF_8_String is String;
   subtype UTF_16_Wide_String is Wide_String;
   Encoding_Error: exception;
   BOM_8: constant UTF_8_String :=
                      Character'Val(16#EF#) &
                      Character'Val(16#BB#) &
                      Character'Val(16#BF#);
   BOM_16BE: constant UTF_String :=
                      Character'Val(16#FE#) &
                      Character'Val(16#FF#);
   BOM_16LE: constant UTF_String :=
                      Character'Val(16#FF#) &
                      Character'Val(16#FE#);
   BOM_16: constant UTF_16_Wide_String :=
                      (1 => Wide_Character'Val(16#FEFF#));
   function Encoding(
                      Item: UTF_String;
                      Default: Encoding_Scheme := UTF_8)
                          return Encoding_Scheme;
end Ada.Strings.UTF_Encoding;
Note that the encoded forms are actually still held in objects of type String or Wide_String. However, in order to aid understanding, the subtypes UTF_String, UTF_8_String and UTF_16_Wide_String are introduced and these should be used when referring to objects holding the encoded forms.
The type Encoding_Scheme defines the various schemes. Note that an encoded string might or might not start with the identifying BOM; it is optional. The function Encoding takes a UTF_String (that is a plain old string), checks the BOM if present and returns the value of Encoding_Scheme identifying the scheme. If there is no BOM then it returns the value of the parameter Default which itself by default is UTF_8.
Note carefully that the function Encoding does not do any encoding — that is done by functions Encode in the other packages which will be described in a moment. Note also that there is no corresponding function Encoding for wide strings; that is because there is only one relevant scheme corresponding to UTF_16_Wide_String, namely that with BOM_16.
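A small sketch may help to show how the function Encoding behaves with and without a BOM. The procedure name and the sample byte sequences are assumptions chosen for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Show_Encoding is
   --  Hypothetical payloads: "Hi" in UTF-16LE preceded by its BOM,
   --  and a plain string with no BOM at all.
   With_BOM    : constant UTF_String :=
     BOM_16LE & 'H' & Character'Val (0) & 'i' & Character'Val (0);
   Without_BOM : constant UTF_String := "Hi";
begin
   --  A BOM is present, so it determines the result.
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (With_BOM)));

   --  No BOM, so the default of the parameter Default (UTF_8) is returned.
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (Without_BOM)));

   --  No BOM, so the explicitly supplied Default is returned.
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (Without_BOM, Default => UTF_16BE)));
end Show_Encoding;
```

Note that in the absence of a BOM the function makes no attempt to inspect the remainder of the string; it simply returns Default.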
We will now look at the other packages. The package UTF_Encoding.Strings contains functions Encode and Decode which convert between the raw type String and the UTF forms. Similar packages apply to wide and wide wide strings. The package UTF_Encoding.Conversions contains functions Convert which convert between the various UTF forms.
The package for the type String is
package Ada.Strings.UTF_Encoding.Strings is
   pragma Pure(Strings);
   function Encode(
            Item: String;
            Output_Scheme: Encoding_Scheme;
            Output_BOM: Boolean := False)
                return UTF_String;
   function Encode(Item: String; Output_BOM: Boolean := False)
                return UTF_8_String;
   function Encode(Item: String; Output_BOM: Boolean := False)
                return UTF_16_Wide_String;
   function Decode(Item: UTF_String; Input_Scheme: Encoding_Scheme)
                return String;
   function Decode(Item: UTF_8_String)
                return String;
   function Decode(Item: UTF_16_Wide_String)
                return String;
end Ada.Strings.UTF_Encoding.Strings;
The functions Encode take a string and return it encoded. The first function has a parameter Output_Scheme which determines whether the encoding is to be to UTF_8, UTF_16BE or UTF_16LE. The second function is provided as a convenience for the common case of encoding to UTF_8 and the third function is necessary for encoding to UTF_16_Wide_String. In all cases there is a final optional parameter indicating whether or not an appropriate BOM is to be placed at the start of the encoded string.
The functions Decode do the reverse. Thus the first function takes a value of subtype UTF_String and a parameter Input_Scheme giving the scheme to be used and returns the decoded string. If a BOM is present which does not match the Input_Scheme, then the exception Encoding_Error is raised. The second function is a convenience for the common case of decoding from UTF_8 and the third function is necessary for decoding from UTF_16_Wide_String; again, if a BOM is present that does not match the expected scheme then Encoding_Error is raised.
In all cases all the strings returned have a lower bound of 1.
The packages UTF_Encoding.Wide_Strings and UTF_Encoding.Wide_Wide_Strings are identical except that the type String is replaced by Wide_String or Wide_Wide_String throughout.
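The following sketch shows a round trip through UTF-8 using the convenience forms of Encode and Decode for the type String. The procedure name and the sample text are assumptions; the text is built with Character'Val so that the example does not depend on the encoding of the source file.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Round_Trip is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  "cafe" with an e-acute (Latin-1 position 233) as its last character.
   Plain : constant String := "caf" & Character'Val (233);

   --  The result subtype selects the UTF-8 form of Encode.
   UTF8  : constant UTF_8_String := Enc.Encode (Plain, Output_BOM => True);

   --  The BOM matches the expected scheme and so is skipped.
   Back  : constant String := Enc.Decode (UTF8);
begin
   --  3 bytes of BOM + 3 ASCII bytes + 2 bytes for the e-acute.
   Ada.Text_IO.Put_Line ("Encoded length:" & Integer'Image (UTF8'Length));
   Ada.Text_IO.Put_Line ("Round trip OK: " & Boolean'Image (Back = Plain));
end Round_Trip;
```

The same pattern should apply to the wide and wide wide packages with the type of Plain and Back changed accordingly.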
Finally, the package for converting between the various UTF forms is as follows
package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure(Conversions);
   function Convert(
            Item: UTF_String;
            Input_Scheme: Encoding_Scheme;
            Output_Scheme: Encoding_Scheme;
            Output_BOM: Boolean := False)
                return UTF_String;
   function Convert(
            Item: UTF_String;
            Input_Scheme: Encoding_Scheme;
            Output_BOM: Boolean := False)
                return UTF_16_Wide_String;
   function Convert(Item: UTF_8_String; Output_BOM: Boolean := False)
                return UTF_16_Wide_String;
   function Convert(
            Item: UTF_16_Wide_String;
            Output_Scheme: Encoding_Scheme;
            Output_BOM: Boolean := False)
                return UTF_String;
   function Convert(Item: UTF_16_Wide_String; Output_BOM: Boolean := False)
                return UTF_8_String;
end Ada.Strings.UTF_Encoding.Conversions;
The purpose of these should be obvious. The first converts between encodings held as strings with parameters indicating both the Input_Scheme and the Output_Scheme. If the input string has a BOM that does not match the Input_Scheme then the exception Encoding_Error is raised. The final optional parameter indicates whether or not an appropriate BOM is to be placed at the start of the converted string.
The other functions convert between UTF encodings held as strings and wide strings. Two give the explicit Input_Scheme or Output_Scheme and two are provided for convenience for the common case of UTF_8.
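As a sketch of the Convert functions, the following example takes the UTF-8 encoding of a single character and converts it to the two UTF-16 representations. The procedure name and the choice of the euro sign (code point 16#20AC#) are assumptions made for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Show_Convert is
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  The euro sign, code point 16#20AC#, is E2 82 AC in UTF-8.
   Euro_8  : constant UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);

   --  Convenience form: UTF-8 held in a String to UTF-16 in a Wide_String.
   Euro_16 : constant UTF_16_Wide_String := Conv.Convert (Euro_8);

   --  General form: both schemes explicit, result held in a String.
   Euro_BE : constant UTF_String :=
     Conv.Convert (Euro_8, Input_Scheme => UTF_8, Output_Scheme => UTF_16BE);
begin
   --  One 16-bit code unit whose position is 16#20AC# (8364).
   Ada.Text_IO.Put_Line
     (Integer'Image (Wide_Character'Pos (Euro_16 (Euro_16'First))));

   --  The same code unit as two bytes, most significant first.
   Ada.Text_IO.Put_Line ("Bytes:" & Integer'Image (Euro_BE'Length));
end Show_Convert;
```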
The final topic in this section concerns the classification and conversion of characters and strings. The package Ada.Characters.Handling was introduced in Ada 95; this contains various classification functions such as Is_Lower, Is_Digit and so on. This package also contains functions such as To_Upper and To_Lower which convert characters to upper case or lower case.
These facilities are extended in Ada 2012 by the addition of a few more classification functions in the package Ada.Characters.Handling plus two similar packages named Ada.Wide_Characters.Handling for dealing with wide characters and Ada.Wide_Wide_Characters.Handling for dealing with wide wide characters.
It should be noticed that these new packages are children of Ada.Wide_Characters and Ada.Wide_Wide_Characters respectively. These packages were introduced in Ada 2005 but are empty other than for pragmas Pure.
The additional character classification functions in Ada.Characters.Handling are
function Is_Line_Terminator ...
function Is_Mark(Item: Character) return Boolean;
function Is_Other_Format ...
function Is_Punctuation_Connector ...
function Is_Space ...
In each case they have a single parameter Item of type Character and return a result of type Boolean.
The meanings are as follows

Is_Line_Terminator returns True if Item is one of Line_Feed (10), Line_Tabulation (11), Form_Feed (12), Carriage_Return (13), or Next_Line (133).

Is_Mark always returns False.

Is_Other_Format returns True if Item is Soft_Hyphen (173).

Is_Punctuation_Connector returns True if Item is Low_Line (95); this is often known as Underscore.

Is_Space returns True if Item is Space (32) or No_Break_Space (160).
Readers might feel that Is_Mark is a foolish waste of time. However, it is introduced because the corresponding functions in the new packages for wide and wide wide characters can return True.
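The new classification functions can be exercised as in the following sketch, which also contrasts Is_Mark for Character with its wide counterpart. The procedure name and the choice of test characters are assumptions; the combining acute accent (code point 16#0301#) is used as an example of a character that 10646 categorizes as a non-spacing mark, so the wide function would be expected to return True for it.

```ada
with Ada.Text_IO;                  use Ada.Text_IO;
with Ada.Characters.Handling;
with Ada.Characters.Latin_1;
with Ada.Wide_Characters.Handling;

procedure Show_Classification is
   package CH  renames Ada.Characters.Handling;
   package WCH renames Ada.Wide_Characters.Handling;
   package L1  renames Ada.Characters.Latin_1;

   --  Combining acute accent, chosen here as a sample non-spacing mark.
   Combining_Acute : constant Wide_Character :=
     Wide_Character'Val (16#0301#);
begin
   Put_Line ("NEL is a line terminator: "
             & Boolean'Image (CH.Is_Line_Terminator (L1.NEL)));
   Put_Line ("Soft_Hyphen is other format: "
             & Boolean'Image (CH.Is_Other_Format (L1.Soft_Hyphen)));
   Put_Line ("Low_Line is a punctuation connector: "
             & Boolean'Image (CH.Is_Punctuation_Connector (L1.Low_Line)));
   Put_Line ("No_Break_Space is a space: "
             & Boolean'Image (CH.Is_Space (L1.No_Break_Space)));

   --  For the type Character this is always False.
   Put_Line ("Character Is_Mark: " & Boolean'Image (CH.Is_Mark ('a')));

   --  For wide characters the answer depends on the character.
   Put_Line ("Wide Is_Mark: "
             & Boolean'Image (WCH.Is_Mark (Combining_Acute)));
end Show_Classification;
```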
An important point is that these classifications enable a compiler to analyze Ada source code without direct reference to the definition of ISO/IEC 10646. Note further that case insensitive text comparison which is useful for the analysis of identifiers is now provided by new functions described in section 7.5.
The new package Wide_Characters.Handling is very similar to the package Characters.Handling (as modified by the additional functions just described) with Character and String everywhere replaced by Wide_Character and Wide_String. However, there are no functions corresponding to Is_Basic, Is_ISO_646, To_Basic and To_ISO_646. In the case of Is_Basic this is because there is no categorization of Basic in 10646. In the case of ISO-646 it is not really necessary because it would seem rather unlikely that one would want to check a wide character WC to see if it was one of the 7-bit ISO-646 set. In any event, one could always write
WC in Wide_Character'Val(0) .. Wide_Character'Val(127)
The package Wide_Characters.Handling also has the new function Character_Set_Version thus
function Character_Set_Version return String;
The string returned identifies the version of the character set standard being used. Typically it will include either "10646:" or "Unicode". The reason for introducing this function is that the categorization of some wide characters depends upon the version of 10646 or Unicode being used. So rather than specifying that the package uses a particular set (which might be a nuisance in the future if the character set standard changes), it seemed more appropriate to enable the program to find out exactly which version is being used. For most programs, it won't matter at all of course.
Note that there is no corresponding function in Ada.Characters.Handling. This is because the set used for the type Character is frozen as at 1995 and the classification functions defined for the type Character are frozen as well (and so do not now exactly match 10646 which has since evolved). It might be that classifications for wide and ever wider characters might change in the future for some obscure characters but the programmer can rest assured that Character is for ever reliable.
So Wide_Characters.Handling in essence is
package Ada.Wide_Characters.Handling is
   pragma Pure(Handling);
   function Character_Set_Version return String;
   function Is_Control(Item: Wide_Character) return Boolean;
   ... -- and so on
   function To_Upper(Item: Wide_String) return Wide_String;
end Ada.Wide_Characters.Handling;
The new package Wide_Wide_Characters.Handling is the same as Wide_Characters.Handling with Wide_Character and Wide_String replaced by Wide_Wide_Character and Wide_Wide_String throughout.
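To round off, here is a minimal sketch using Wide_Characters.Handling. The procedure name is an assumption, and the local function Is_ISO_646 is a hypothetical helper (not part of the package) showing how the omitted test could be written by the user as a membership test.

```ada
with Ada.Text_IO;
with Ada.Wide_Characters.Handling;

procedure Show_Wide_Handling is
   package WH renames Ada.Wide_Characters.Handling;

   --  Hypothetical helper standing in for the omitted Is_ISO_646.
   function Is_ISO_646 (WC : Wide_Character) return Boolean is
     (WC in Wide_Character'Val (0) .. Wide_Character'Val (127));

   Upper : constant Wide_String := WH.To_Upper ("ada");
begin
   --  The exact text is implementation-defined.
   Ada.Text_IO.Put_Line (WH.Character_Set_Version);

   Ada.Text_IO.Put_Line
     ("Upper case OK: " & Boolean'Image (Upper = "ADA"));
   Ada.Text_IO.Put_Line
     ("'A' is ISO 646: " & Boolean'Image (Is_ISO_646 ('A')));
   Ada.Text_IO.Put_Line
     ("Euro is ISO 646: "
      & Boolean'Image (Is_ISO_646 (Wide_Character'Val (16#20AC#))));
end Show_Wide_Handling;
```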

© 2011, 2012, 2013 John Barnes Informatics.
Sponsored in part by the Ada Resource Association and Ada-Europe.