Version 1.1 of ais/ai-00301.txt

Unformatted version of ais/ai-00301.txt version 1.1
Other versions for file ais/ai-00301.txt

!standard A.4.5          02-06-13 AI95-00301/01
!class amendment 02-06-12
!status work item 02-06-12
!status received 02-06-12
!priority Medium
!difficulty Easy
!subject Missing operations in Ada.Strings.Unbounded
!summary
Some additional operations on unbounded strings are proposed to be added to Ada.Strings.Unbounded, an I/O operations on unbounded strings are proposed to be provided in a new child package of Ada.Strings.Unbounded.
!problem
The interface defined for Ada.Strings.Unbounded offers many operations only with one parameter of type Unbounded_String and a second parameter of type String. Corresponding operations with both parameters of type Unbounded_String are missing, which requires that these have to be emulated by first converting the second parameter to a temporary string. This unnecessary time and storage overhead can be eliminated by adding the missing operations to Ada.Strings.Unbounded.
A variation of this theme occurs if one wants to work with slices of Unbounded_Strings, for which these operations typically do not exist either, again forcing a detour through an intermediary conversion to String by calling To_String or Slice.
Furthermore, search operations (Index) always search the whole unbounded string, which makes it very cumbersome to repeatedly search for a pattern in a string. A variant with a starting index as a parameter would be very useful.
Also, while there is a function To_Unbounded_String, there is no procedural variant for this operation. However, such a procedure would be very useful, for it might be more efficient than the function.
And finally, I/O operations with Unbounded_Strings as parameters are missing. These, too, would be useful to avoid these extra copies as Strings. Such I/O operations could be provided in a child package of Ada.Strings.Unbounded.
!proposal
It is proposed to add the following operations to package Ada.Strings.Unbounded:
package Ada.Strings.Unbounded is
...
--------------------------------------------------------------------- -- Additional string-to-unbounded-string conversion:
procedure Set (Target : in out Unbounded_String; Str : in String); -- Identical in effect to Target := To_Unbounded_String (Str);
--------------------------------------------------------------------- -- Additional operations for obtaining slices:
function Slice (Source : in Unbounded_String; Low : in Positive; High : in Natural) return Unbounded_String; -- Identical to To_Unbounded_String (Slice (Source, Low, High));
procedure Slice (Source : in Unbounded_String; Target : out Unbounded_String; Low : in Positive; High : in Natural); -- Identical to Target := To_Unbounded_String (Slice (Source, Low, High));
--------------------------------------------------------------------- -- Index operations with a From parameter giving the place where the -- search should start.
function Index (Source : in Unbounded_String; Pattern : in String; From : in Positive; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping := Maps.Identity) return Natural; -- If Going = Forward, identical in effect to -- Index (Slice (Source, From, Length (Source)), -- Pattern, Going, Mapping); -- otherwise Going is Backward and the operation is identical to -- Index (Slice (Source, 1, From), Pattern, Going, Mapping);
function Index (Source : in Unbounded_String; Pattern : in Unbounded_String; From : in Positive := 1; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping := Maps.Identity) return Natural; -- Identical in effect to -- Index (Source, To_String (Pattern), From, Going, Mapping);
function Index (Source : in Unbounded_String; Pattern : in Unbounded_String; Low : in Positive; High : in Natural; From : in Positive := 1; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping := Maps.Identity) return Natural; -- "Sliced" Index: identical in effect to -- Index (Source, Slice (Pattern, Low, High), From, Going, Mapping);
function Index (Source : in Unbounded_String; Pattern : in String; From : in Positive; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping_Function) return Natural; -- If Going = Forward, identical in effect to -- Index (Slice (Source, From, Length (Source)), -- Pattern, Going, Mapping); -- otherwise Going is Backward and the operation is identical to -- Index (Slice (Source, 1, From), Pattern, Going, Mapping);
function Index (Source : in Unbounded_String; Pattern : in Unbounded_String; From : in Positive := 1; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping_Function) return Natural; -- Identical in effect to -- Index (Source, To_String (Pattern), From, Going, Mapping);
function Index (Source : in Unbounded_String; Pattern : in Unbounded_String; Low : in Positive; High : in Natural; From : in Positive := 1; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping_Function) return Natural; -- "Sliced" Index: identical in effect to -- Index (Source, Slice (Pattern, Low, High), From, Going, Mapping);
------------------------------------------------------------------- -- Slice operations. These are useful because they avoid the need of -- an intermediary explicit representation of the slice, which would -- require an intermediary extra copy of the string data.
function Append_Slice (Source : in Unbounded_String; From : in Unbounded_String; Low : in Positive; High : in Natural) return Unbounded_String; -- Identical in effect to -- Append (Source, Slice (From, Low, High));
procedure Append_Slice (Source : in out Unbounded_String; From : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to -- Append (Source, Slice (From, Low, High));
function Replace_Slice (Source : in Unbounded_String; Low : in Positive; High : in Natural; By : in Unbounded_String; From : in Positive; To : in Natural) return Unbounded_String; -- Identical in effect to -- Replace_Slice (Source, Low, High, Slice (By, From, To));
procedure Replace_Slice (Source : in out Unbounded_String; Low : in Positive; High : in Natural; By : in Unbounded_String; From : in Positive; To : in Natural); -- Identical in effect to -- Replace_Slice (Source, Low, High, Slice (By, From, To));
function Insert (Source : in Unbounded_String; Before : in Positive; New_Item : in Unbounded_String) return Unbounded_String; -- Identical in effect to Insert (Source, Before, To_String (New_Item));
procedure Insert (Source : in out Unbounded_String; Before : in Positive; New_Item : in Unbounded_String); -- Identical in effect to Insert (Source, Before, To_String (New_Item));
function Insert (Source : in Unbounded_String; Before : in Positive; New_Item : in Unbounded_String; Low : in Positive; High : in Natural) return Unbounded_String; -- Identical in effect to Insert (Source, Before, -- Slice (New_Item, Low, High));
procedure Insert (Source : in out Unbounded_String; Before : in Positive; New_Item : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to Insert (Source, Before, -- Slice (New_Item, Low, High));
function Overwrite (Source : in Unbounded_String; Position : in Positive; New_Item : in Unbounded_String) return Unbounded_String; -- Identical in effect to -- Overwrite (Source, Position, To_String (New_Item));
procedure Overwrite (Source : in out Unbounded_String; Position : in Positive; New_Item : in Unbounded_String); -- Identical in effect to -- Overwrite (Source, Position, To_String (New_Item));
function Overwrite (Source : in Unbounded_String; Position : in Positive; New_Item : in Unbounded_String; Low : in Positive; High : in Natural) return Unbounded_String; -- Identical in effect to Overwrite (Source, Position, -- Slice (New_Item, Low, High));
procedure Overwrite (Source : in out Unbounded_String; Position : in Positive; New_Item : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to Overwrite (Source, Position, -- Slice (New_Item, Low, High));
...
private
...
end Ada.Strings.Unbounded;
Implementation advice: all these new operations should be implemented such that no extra (intermediary) copy of the string (slice) data is created.
It is also proposed to add a child package to Ada.Strings.Unbounded:
with Ada.Text_IO;
package Ada.Strings.Unbounded.Text_IO is
procedure Put (File : in Ada.Text_IO.File_Type; Item : in Unbounded_String); -- Identical in effect to -- Ada.Text_IO.Put (File, To_String (Item));
procedure Put (Item : in Unbounded_String); -- Identical in effect to -- Ada.Text_IO.Put (To_String (Item));
procedure Put_Line (File : in Ada.Text_IO.File_Type; Item : in Unbounded_String); -- Identical in effect to -- Ada.Text_IO.Put_Line (File, To_String (Item));
procedure Put_Line (Item : in Unbounded_String); -- Identical in effect to -- Ada.Text_IO.Put_Line (To_String (Item));
function Get_Line (File : in Ada.Text_IO.File_Type) return Unbounded_String; -- Reads up to the end of the line, returning the whole line as an -- unbounded string. If a line terminator is met, Skip_Line is (in effect) -- called with a spacing of 1, and the characters up to but not including -- the line terminator are returned. If a file terminator is met, the -- characters up to but not including the file terminator are returned, -- except if the file terminator is met before any characters have been -- read; in this case, Ada.IO_Exceptions.End_Error is raised.
function Get_Line return Unbounded_String; -- Identical to Get_Line (Ada.Text_IO.Current_Input);
--------------------------------------------------------------------- -- Slice operations:
procedure Put (File : in Ada.Text_IO.File_Type; Item : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to -- Put (File, Slice (Item, Low, High));
procedure Put (Item : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to -- Put (Slice (Item, Low, High));
procedure Put_Line (File : in Ada.Text_IO.File_Type; Item : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to -- Put_Line (File, Slice (Item, Low, High));
procedure Put_Line (Item : in Unbounded_String; Low : in Positive; High : in Natural); -- Identical in effect to -- Put_Line (Slice (Item, Low, High));
end Ada.Strings.Unbounded.Text_IO;
Implementation advice: the Put and Put_Line operations should be implemented such that no extra copy of the string data occurs.
BTW: since Ada.Text_IO.End_Of_Line returns True when a line *or a file* terminator are next in the input (A.10.5(13)), I believe it would make sense to clearly define that Ada.Text_IO.Get_Line does *not* raise End_Error upon the last line of a file if that last line has no line terminator, but ends on EOF directly. Otherwise, it may get rather difficult to read the last line of such a file. in other words, define "end of the line" as "line terminator or EOF", and Get_Line to read up to the end of the line (and call Skip_Line (1) if it stopped on a line terminator, but not if it stopped on EOF).
It should also be evaluated whether similar additions as proposed above for bounded strings (Ada.Strings.Bounded) would make sense.
!discussion
The point of the proposed additions is not so much missing functionality, as a user can always get the same functionality by converting the unbounded string (slice) to a plain String using To_String or Slice and then using the existing operation, but a sometimes significant efficiency gain by the new operations precisely because they avoid this clumsy method with an intermediary representation of one parameter as a String.
All of these operations can be implemented very simply. As an example assume the implementation of unbounded strings suggested in the Rationale and consider for instance the last Overwrite variant proposed above:
procedure Overwrite
(Source : in out Unbounded_String;
Position : in Positive; New_Item : in Unbounded_String; Low : in Positive; High : in Natural)
is begin
if Low > New_Item.Reference'Last or High > New_Item.Reference'Last then
raise Ada.Strings.Index_Error;
end if; Overwrite (Source, Position, New_Item.Reference (Low .. High));
end Overwrite;
The I/O operations can be implemented similarly. E.g.
procedure Put
(Item : in Unbounded_String)
is begin
Ada.Text_IO.Put (Item.Reference.all);
end Put;
function Get_Line (File : in Ada.Text_IO.File_Type) return Unbounded_String is Result : Unbounded_String; Buffer : String (1 .. 500); -- Or whatever size is deemed appropriate Last : Natural; begin loop Ada.Text_IO.Get_Line (File, Buffer, Last); Append (Result, Buffer (1 .. Last)); exit when Last < Buffer'Last or else Ada.Text_IO.End_Of_File (File); end loop; return Result; end Get_Line;
While the Get_Line operations could be written by a user, this is not possible for the Put and Put_Line operations, since there is no access to the internal component Reference. However, it seems only natural to provide the Get_Line, too.
!example
Consider the task of finding all occurrences of a given string Pattern in a given Unbounded_String Source. Currently, one either has to convert the whole unbounded string into a string and search on that, or implement the search oneself (ignoring character mappings here):
declare Src_Length : constant Natural := Length (Source); From : Positive := 1; begin while From + Pattern'Length - 1 <= Src_Length loop if Slice (Source, From, From + Pattern'Length - 1) = Pattern then -- Found an occurrence of Pattern; process it, then advance From From := From + Pattern'Length; -- No overlapping occurrences! else From := From + 1; end if; declare end loop; end;
This incurs quite some extra storage overhead because the (necessary!) call to Slice is required to return a copy of the slice of Source.Reference. With the proposed new Index functions, this could be simplified to:
declare Src_Length : constant Natural := Length (Source); From : Positive := 1; Idx : Natural; begin while From + Pattern'Length - 1 <= Src_Length loop Idx := Ada.Strings.Unbounded.Index (Source, Pattern, From); exit when Idx = 0; -- Found an occurrence of Pattern starting at Idx. -- Process this occurrence, then advance From From := Idx + Pattern'Length; -- No overlapping occurrences! end loop; end;
which may be much more efficient because there are no intermediary copies to String involved.
The same is true for writing an Unbounded_String. Currently, one has to do
Ada.Text_IO.Put_Line (Ada.Strings.Unbounded.To_String (My_Unbounded_String));
which also incurs this extra intermediary String representation, which could be avoided with the new operation
Ada.Strings.Unbounded.Text_IO (My_Unbounded_String);
Note: GNAT already provides the package Ada.Strings.Unbounded.Text_IO.
!appendix

Editor's note: The original proposal was submitted by Thomas Wolf on
Wednesday, June 12, 2002. It was edited slightly, but otherwise used intact
as the initial draft of this AI.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, June 12, 2002  10:14 PM

Thanks for the submission.

I've recently had similar problems using Unbounded_Strings. The fact that
you have to frequently convert to strings to get anything done is rather
annoying. I'm surprised I haven't heard about it before. (I personally
hadn't tried to use Unbounded_Strings before the recent program, a spam
scanner.)

Some comments:

I ran into a significant performance problem when I needed to test whether
the start of an unbounded string matched some other string. The simple
solution of
   if Index (Source, To_String(Pattern)) = 1 then
turned out to be horribly inefficient.

The better solution of
   if Slice (Source, 1, Length(Pattern)) = To_String(Pattern) then
is wrong if Source is too short (Constraint_Error is raised).

The correct solution of
   if Length (Source) >= Length (Pattern) and then
      Slice (Source, 1, Length(Pattern)) = To_String(Pattern) then
also is rather inefficient, because it has 5 function calls and two
unconstrained string returning function calls, which need to use heap or
other expensive memory allocation, and of course, two extra copies.

Being a compiler writer, I ended up declaring
    Is_Prefix (Source, Pattern : in Unbounded_String)
in a child package, and implemented it myself. But I don't recommend that
solution to everyone.

I also ran into the missing Index from the middle issue (which comes up when
you want to process all of the matches for the pattern), which I solved by
punting: convert to a String and use Ada.Strings.Fixed.Index.

> Also, while there is a function To_Unbounded_String, there is
> no procedural variant for this operation. However, such a
> procedure would be very useful, for it might be more efficient
> than the function.

Arguably, a procedural variant of To_String also would be useful, and it
certainly would be more efficient. (Indeed, the To_Unbounded_String
function, which returns a bounded record type, is usually cheap in most
implementations.)

> Implementation advice: all these new operations should be implemented such
> that no extra (intermediary) copy of the string (slice) data is created.

I'm not sure there is much point to this. The existing implementations of
Ada.Strings.Unbounded vary wildly in efficiency, I don't see any reason to
try to say anything more about any new functions.

I realize that you are trying to prevent people from implementing these
simply by calling Slice or To_String, but an implementor that would do that
probably has done it in many other parts of Ada.Strings.Unbounded.

...

>   function Get_Line
>     (File : in Ada.Text_IO.File_Type)
>     return Unbounded_String;
>   --  Reads up to the end of the line, returning the whole line as an
>   --  unbounded string. If a line terminator is met, Skip_Line is (in effect)
>   --  called with a spacing of 1, and the characters up to but not including
>   --  the line terminator are returned. If a file terminator is met, the
>   --  characters up to but not including the file terminator are returned,
>   --  except if the file terminator is met before any characters have been
>   --  read; in this case, Ada.IO_Exceptions.End_Error is raised.

There are two problems with this. First, you've left out the part of about
what happens if the string is full. This is a real issue on compilers where
Integer is 16-bit. The standard has to cover all of the cases, even if they
are unlikely on most implementations.

Second, there is no good reason to change the semantics of Get_Line at a
file terminator. But this is based on a serious misconception about Ada (see
below). And in any case, Get_Line should operate like the Get_Line we all
expect.

> Implementation advice: the Put and Put_Line operations should be implemented
> such that no extra copy of the string data occurs.

This one if even more dubious than the first one. It's likely that the
implementation has to copy the string data for the existing Put and
Put_Line, to put it into a buffer or to pass it to the operating system.
You're saying that it can't do that? I think every implementation would have
to violate this advice, making it useless. I realize your point is that you
want the string directly written out, but again I think you have to leave
this to the implementor.


> BTW: since Ada.Text_IO.End_Of_Line returns True when a line *or a file*
> terminator are next in the input (A.10.5(13)), I believe it would make sense
> to clearly define that Ada.Text_IO.Get_Line does *not* raise End_Error upon
> the last line of a file if that last line has no line terminator, but ends
> on EOF directly. Otherwise, it may get rather difficult to read the last
> line of such a file. In other words, define "end of the line" as "line
> terminator or EOF", and Get_Line to read up to the end of the line (and
> call Skip_Line (1) if it stopped on a line terminator, but not if it
> stopped on EOF).

Why do you think this is not true? There is always an line terminator (and
page terminator) before the file terminator, see A.10(7). If a file does not
explicitly have a line terminator at the end of the file, the implementation
has to implicitly provide one. That's been true since Ada 80.


> It should also be evaluated whether similar additions as proposed above
> for bounded strings (Ada.Strings.Bounded) would make sense.

Deleting bounded strings would make more sense. :-)


In your discussion section, you probably should mention that being forced to
leave the unbounded string abstraction to do common operations substantially
weakens the abstraction. If you have to do it often enough, you begin to
wonder what exactly you are gaining by using Unbounded_String. (I doubt that
I will use Unbounded_String again, because it was as messy or messier than
simply using good old regular strings - in large part because I continually
had to convert back to String.)

****************************************************************

From: Ted Baker
Sent: Thursday, June 13, 2002  5:55 AM

I would second Randy's comments on Unbounded_String.  Back when we
were teaching Ada here I tried to introduce this package for
students to use in programming assignments.  I ended up giving up.
I reverted to an Ada 83-style package of my own construction in
which the (length, string) components were both visible.
Otherwise, the students ended up doing so many copying conversion
operations that I felt embarassed.  I was embarassed because the
less perceptive students might think I meant to teach that
sloppiness about recopying arrays is good style, and the better
students might conclude that Ada is an inherently inefficient
language.

****************************************************************

From: Thomas Wolf
Sent: Thursday, June 13, 2002  5:20 AM

On 12 Jun 2002 at 22:14, Randy Brukardt wrote:

> I've recently had similar problems using Unbounded_Strings.

Good to see that I'm not the alone with this!

> Second, there is no good reason to change the semantics of Get_Line at a
> file terminator. But this is based on a serious misconception about Ada (see
> below). And in any case, Get_Line should operate like the Get_Line we all
> expect.

About the misconception, see below.

> > Implementation advice: the Put and Put_Line operations should be
> > implemented such that no extra copy of the string data occurs.
>
> This one if even more dubious than the first one. It's likely that the
> implementation has to copy the string data for the existing Put and
> Put_Line, to put it into a buffer or to pass it to the operating system.
> You're saying that it can't do that? I think every implementation would have
> to violate this advice, making it useless. I realize your point is that you
> want the string directly written out, but again I think you have to leave
> this to the implementor.

No. The intention of this is just to discourage an implementation like

procedure Put_Line (S : in Unbounded_String) is
begin
  Ada.Text_IO.Put_Line (To_String (S));
end Put_Line;

(well, at least discourage it unless the compiler is smart enough to
avoid doing in effect

  declare
     Tmp : String := To_String (S);
  begin
     Ada.Text_IO.Put_Line (Tmp);
  end;

That's what I meant by "extra copy". Note that I wrote "extra copy",
not just "copy".)

> Why do you think this is not true? There is always an line terminator (and
> page terminator) before the file terminator, see A.10(7). If a file does not
> explicitly have a line terminator at the end of the file, the implementation
> has to implicitly provide one. That's been true since Ada 80.

Indeed. I misread A.10(7) and was tricked into thinking it wasn't so
because I have come across at least one implementation that didn't do
it this way and failed horribly on the last line of a file if that line
didn't have and end-of-line. But you're right, the whole issue is moot.

> > It should also be evaluated whether similar additions as proposed above
> > for bounded strings (Ada.Strings.Bounded) would make sense.
>
> Deleting bounded strings would make more sense. :-)

I do see the smiley, but much as I'd like to see it go, it wouldn't be
backwards compatible. So I'm afraid somebody will have to think about
whether there should be similar operations for bounded strings, too.

****************************************************************

From: Robert A. Duff
Sent: Thursday, June 13, 2002  9:27 AM

> No. The intention of this is just to discourage an implementation like
>
> procedure Put_Line (S : in Unbounded_String) is
> begin
>   Ada.Text_IO.Put_Line (To_String (S));
> end Put_Line;

It's a mistake to try to put this sort of thing in a standard.
If you want the compiler to do things efficiently, pester your
compiler vendor (preferably with checkbook in hand).  ;-)

The language definition should make it feasible, and perhaps even easy,
to do things efficiently.  But it should not try to *force* efficiency.

> (well, at least discourage it unless the compiler is smart enough to
> avoid doing in effect
>
>   declare
>      Tmp : String := To_String (S);
>   begin
>      Ada.Text_IO.Put_Line (Tmp);
>   end;
>
> That's what I meant by "extra copy". Note that I wrote "extra copy",
> not just "copy".)

OK, I have an implementation that does 37 copies, but it doesn't do the
"extra" 38'th one I was thinking of.  Is that good enough?  ;-)

My point is that defining "extra" in the context of a standard is not
feasible.  So don't waste a lot of energy trying.

Compiler writers do not deliberately try to make their products
inefficient.  Of course they cut corners to save money.  So what they
need is pressure from paying customers, so they can set their
optimization priorities right.

****************************************************************

From: Robert A. Duff
Sent: Thursday, June 13, 2002  9:39 AM

Randy said:

> Deleting bounded strings would make more sense. :-)

A year or so ago, I was writing a lexical analyzer.
I needed a buffer to keep the token text in (for identifiers
and the like), and there's a max length for tokens,
so I used Bounded_Strings, with a max length of 1000 or so.

I expected the lexer to be slower than the parser, since lexers look at
each character, whereas parsers look only at each token.  But the lexer
was 60 *times* slower, which surprised me.

After some investigation, I discovered that for each token, and for each
whitespace and comment character, it was entering the block that
declared the buffer.  One might expect that to be nearly free -- it has
to initialize the buffer length to 0.

But the implementation of Bounded_Strings initialized all 1000
characters (because some AI says "=" has to compose on these things)!

Changing it to use my own record type (length plus array of characters,
just like Bounded_String, but without the useless initialization),
increased the speed of the lexer by a factor of 100.

So much for reusable abstractions.

****************************************************************


Questions? Ask the ACAA Technical Agent