Version 1.1 of acs/ac-00028.txt

Unformatted version of acs/ac-00028.txt version 1.1
Other versions for file acs/ac-00028.txt

!standard A.04.05(00)          02-02-19 AC95-00028/01
!class amendment 01-12-28
!status received no action 02-02-19
!subject Splitting and Joining Strings
!summary
!appendix

!topic Splitting and Joining Strings
!reference RM95-A.4
!from Christoph Grein 02-01-28
!keywords strings, lists of strings, split, join, Perl
!discussion

The capability to split strings to form a list and to join list elements to
form strings is very useful for text processing applications. This is the sort
of thing Perl is famous for.

So I want to make a proposal for a new package Ada.Strings.Unbounded.Lists,
which I found handy for some code generator application. (This proposal has
been inspired by Tash <http://www.adatcl.com>).

The proposal might look like a nice to have feature only and so not worth the
effort of standardization, since anybody can implement something like it
themselves easily. However this is a functionality quite often convenient when
processing strings and, because of its simplicity, should not need much work to
add to the standard, but be useful to many.

The library has the following declaration:
----------------------------------------------------------------------------
--with Ada.Finalization;  -- only needed for current example implementation

with Ada.Strings.Maps;

package Ada.Strings.Unbounded.Lists is

  type String_List is private;

  Null_String_List: constant String_List;

  function "=" (Left, Right: String_List) return Boolean;

  function Length (Source: String_List) return Natural;

  function To_String_List (Singleton: String) return String_List;
  function To_String_List (Singleton: Unbounded_String)
    return String_List;

  function "&" (Left, Right: String_List) return String_List;
  function "&" (Left: String; Right: String_List) return String_List;
  function "&" (Left: String_List; Right: String) return String_List;
  function "&" (Left: Unbounded_String; Right: String_List)
    return String_List;
  function "&" (Left: String_List; Right: Unbounded_String)
    return String_List;

  function Split (Source  : String;
                  Splitter: Maps.Character_Set) return String_List;
  function Split (Source  : Unbounded_String;
                  Splitter: Maps.Character_Set) return String_List;

  function Join (Source: String_List;
                 Joiner: Character) return String;
  function Join (Source: String_List;
                 Joiner: Character) return Unbounded_String;

  function Element (Source: String_List; Index: Positive)
    return String;
  function Element (Source: String_List; Index: Positive)
    return Unbounded_String;

  function Pop      (Source: String_List) return String_List;
  function Condense (Source: String_List) return String_List;
  function Slice    (Source: String_List;
                     Low: Positive; High: Natural) return String_List;

private
  -- not specified by the language

  -- This is an example implementation:

  type String_Array is array (Positive range <>) of Unbounded_String;
  type String_Array_Access is access String_Array;

  type String_List is new Finalization.Controlled with record
    Contents: String_Array_Access;
  end record;

  procedure Adjust   (Obj: in out String_List);
  procedure Finalize (Obj: in out String_List);

  Null_String_List: constant String_List :=
    (Finalization.Controlled with Contents => null);

end Ada.Strings.Unbounded.Lists;
----------------------------------------------------------------------------
An object of type String_List represents a list of (unbounded) strings.
Null_String_List represents the list containing no strings. If an object
of type String_List is not otherwise initialized, it will be initialized
to the same value as Null_String_List.

  function "=" (Left, Right: String_List) return Boolean;

The function "=" returns True if Left and Right represent identical lists,
and False otherwise.

  function Length (Source: String_List) return Natural;

The function Length returns the length of the list represented by Source, i.e.
the number of strings in the list. A list's index range starts always with 1
and extends to its length.

  function To_String_List (Singleton: String) return String_List;
  function To_String_List (Singleton: Unbounded_String)
    return String_List;

Each of the functions To_String_List returns a list with only one string
element, Singleton.

  function "&" (Left, Right: String_List) return String_List;
  function "&" (Left: String; Right: String_List) return String_List;
  function "&" (Left: String_List; Right: String) return String_List;
  function "&" (Left: Unbounded_String; Right: String_List)
     return String_List;
  function "&" (Left: String_List; Right: Unbounded_String)
     return String_List;

Each of the "&" functions returns a String_List obtained by concatenating the
(unbounded) string or list of strings given by the Left parameter to the
(unbounded) string or list of strings given by the Right parameter.

  function Split (Source  : String;
                  Splitter: Maps.Character_Set) return String_List;
  function Split (Source  : Unbounded_String;
                  Splitter: Maps.Character_Set) return String_List;

Each of the Split functions splits a string represented by the Source parameter
at each occurrence of one of the characters in the Splitter set to form a
String_List. The String_List will be composed of elements which are the
sequences of characters between each split character. Adjacent split characters
result in empty list elements. A split character at the first position of the
string results in an empty list element at the head of the list; a split
character at the last position of the string results in an empty list element
at the tail of the list.
If the Splitter set is empty, the source string is split at each character.
An empty Source string results in a list with one empty list element.

    Thus if N is the number of split characters in the string, you will
    always get N+1 elements in the resulting list.

    For example, the following code splits a Windows file name into
    path elements.

      List := Split (Source   => "\My_Path\My_File.ext",
                     Splitter => To_Set ('\'));

    This will result in a list with three elements: "", "My_Path", and
    "My_File.ext".
    Note that Null_String_List is different from a list holding only an
    empty string "".

  function Join (Source: String_List;
                 Joiner: Character) return String;
  function Join (Source: String_List;
                 Joiner: Character) return Unbounded_String;

Join is the inverse of Split: Each of these functions creates an (unbounded)
string by joining all the elements of a list given by Source. It inserts one
Joiner character between each adjacent list elements.
If Source is equal to Null_String_List, Index_Error is raised.

    For example, the following code joins the file name split above
    to a Unix file name:

      File_Name := Join (Source => List,
                         Joiner => '/');

  function Element (Source: String_List; Index: Positive)
    return String;
  function Element (Source: String_List; Index: Positive)
    return Unbounded_String;

Each of the Element functions returns the (unbounded) string in Source at the
Index position. If Index > Length (Source), Index_Error is raised.

  function Pop (Source: String_List) return String_List;

Pop drops the first element of Source. If Source is empty, Index_Error is
raised.

  function Condense (Source: String_List) return String_List;

Condense removes all empty list elements. It is idempotent, so applying it to
Null_String_List does not raise an error.

  function Slice (Source: String_List;
                  Low: Positive; High: Natural) return String_List;

Slice returns the sublist of Source within the range Low .. High [the slice's
index range is of course 1 .. High - Low + 1]; it raises Index_Error if
Low > Length (Source) + 1 or High > Length (Source).

The following presents an implementation:
----------------------------------------------------------------------------
with Ada.Strings.Fixed;
with Ada.Unchecked_Deallocation;

package body Ada.Strings.Unbounded.Lists is

  function "=" (Left, Right: String_List) return Boolean is
    -- Empty arrays and null pointers are also equal.
  begin
    return Length (Left) = Length (Right) and
          (Length (Left) = 0 or else Left.Contents.all = Right.Contents.all);
  end "=";

  function Length (Source: String_List) return Natural is
  begin
    if Source.Contents = null then
      return 0;
    else
      return Source.Contents'Last;
    end if;
  end Length;

  function To_String_List (Singleton: String) return String_List is
  begin
    return (Finalization.Controlled with
            new String_Array'(1 => To_Unbounded_String (Singleton)));
  end To_String_List;

  function To_String_List (Singleton: Unbounded_String)
    return String_List is
  begin
    return (Finalization.Controlled with new String_Array'(1 => Singleton));
  end To_String_List;

  function "&" (Left, Right: String_List) return String_List is
    Contents: constant String_Array := Left.Contents.all & Right.Contents.all;
  begin
    return (Finalization.Controlled with new String_Array'(Contents));
  end "&";

  function "&" (Left: String; Right: String_List) return String_List is
    Contents: constant String_Array :=
      To_Unbounded_String (Left) & Right.Contents.all;
  begin
    return (Finalization.Controlled with new String_Array'(Contents));
  end "&";

  function "&" (Left: String_List; Right: String) return String_List is
    Contents: constant String_Array :=
      Left.Contents.all & To_Unbounded_String (Right);
  begin
    return (Finalization.Controlled with new String_Array'(Contents));
  end "&";

  function "&" (Left: Unbounded_String; Right: String_List)
    return String_List is
    Contents: constant String_Array := Left & Right.Contents.all;
  begin
    return (Finalization.Controlled with new String_Array'(Contents));
  end "&";

  function "&" (Left: String_List; Right: Unbounded_String)
    return String_List is
    Contents: constant String_Array := Left.Contents.all & Right;
  begin
    return (Finalization.Controlled with new String_Array'(Contents));
  end "&";

  function Split (Source  : String;
                  Splitter: Maps.Character_Set) return String_List is
    Index: Natural;
    use type Maps.Character_Set;
  begin
    if Splitter = Maps.Null_Set then
      if Source'First >= Source'Last then
        return To_String_List (Source);
      else
        return Source (Source'First .. Source'First) &
               Split (Source (Source'First + 1 .. Source'Last), Splitter);
      end if;
    else
      Index := Fixed.Index (Source, Splitter);
      if Index = 0 then
        return To_String_List (Source);
      else
        return Source (Source'First .. Index - 1) &
               Split (Source (Index + 1 .. Source'Last), Splitter);
      end if;
    end if;
  end Split;

  function Split (Source  : Unbounded_String;
                  Splitter: Maps.Character_Set) return String_List is
  begin
    return Split (To_String (Source), Splitter);
  end Split;

  function Join (Source: String_List; Joiner: Character) return String is
    Tail: constant String_List := Pop (Source);
  begin
    if Length (Tail) = 0 then
      return Element (Source, 1);
    else
      return Element (Source, 1) & Joiner & Join (Pop (Source), Joiner);
    end if;
  end Join;

  function Join (Source: String_List; Joiner: Character)
    return Unbounded_String is
    Tail: constant String_List := Pop (Source);
  begin
    if Length (Tail) = 0 then
      return Element (Source, 1);
    else
      return Unbounded_String'(Element (Source, 1)) & Joiner &
             Unbounded_String'(Join (Pop (Source), Joiner));
    end if;
  end Join;

  function Element (Source: String_List; Index: Positive) return String is
  begin
    if Index > Length (Source) then
      raise Index_Error;
    end if;
    return To_String (Source.Contents (Index));
  end Element;

  function Element (Source: String_List; Index: Positive)
    return Unbounded_String is
  begin
    if Index > Length (Source) then
      raise Index_Error;
    end if;
    return Source.Contents (Index);
  end Element;

  function Pop (Source: String_List) return String_List is
  begin
    if Length (Source) = 0 then
      raise Index_Error;
    end if;
    declare
      Contents: constant String_Array (1 .. Source.Contents'Last - 1) :=
        Source.Contents (2 .. Source.Contents'Last);
    begin
      return (Finalization.Controlled with new String_Array'(Contents));
    end;
  end Pop;

  function Condense (Source: String_List) return String_List is
  begin
    if Length (Source) = 0 then
      return Null_String_List;
    else
      declare
        Contents: String_Array := Source.Contents.all;
        Result  : String_Array (Contents'Range);
        Position: Natural := 0;
      begin
        for I in Contents'Range loop
          if Contents (I) /= Null_Unbounded_String then
            Position := Position + 1;
            Result (Position) := Contents (I);
          end if;
        end loop;
        return (Finalization.Controlled with
                new String_Array'(Result (1 .. Position)));
      end;
    end if;
  end Condense;

  function Slice (Source: String_List;
                  Low: Positive; High: Natural) return String_List is
  begin
    if Low > Length (Source) + 1 or High > Length (Source) then
      raise Index_Error;
    end if;
    declare
      Contents: constant String_Array (1 .. High - Low + 1) :=
        Source.Contents (Low .. High);
    begin
      return (Finalization.Controlled with new String_Array'(Contents));
    end;
  end Slice;

  procedure Adjust (Obj: in out String_List) is
    -- Deep copy.
  begin
    if Obj.Contents /= null then
      Obj.Contents := new String_Array'(Obj.Contents.all);
    end if;
  end Adjust;

  procedure Free is new
    Unchecked_Deallocation (String_Array, String_Array_Access);

  procedure Finalize (Obj: in out String_List) is
  begin
    Free (Obj.Contents);
  end Finalize;

end Ada.Strings.Unbounded.Lists;

****************************************************************

From: Pascal Obry
Sent: Wednesday, January 30, 2002  4:30 AM

I agree and I had the need for such package since a long time. I have built
my first Strings_Cutter package on 1998 and I have found it very useful since
then on many projects. I would definitly like to have such package added in
the standard.

****************************************************************

From: Christoph Grein
Sent: Thursday, January 31, 2002  2:57 AM

Oops, I've forgotten a set of important subprograms:

Each of the Set_Element procedures sets in Source the element at
the Index position to the (unbounded) string To_Value.
If Index > Length (Source), Index_Error is raised.

  procedure Set_Element (Source  : in out String_List;
                         Index   : in Positive;
                         To_Value: in String);
  procedure Set_Element (Source  : in out String_List;
                         Index   : in Positive;
                         To_Value: in Unbounded_String);

  procedure Set_Element (Source  : in out String_List;
                         Index   : in Positive;
                         To_Value: in String) is
  begin
    Set_Element (Source, Index, To_Unbounded_String (To_Value));
  end Set_Element;

  procedure Set_Element (Source  : in out String_List;
                         Index   : in Positive;
                         To_Value: in Unbounded_String) is
  begin
    if Index > Length (Source) then
      raise Index_Error;
    end if;
    Source.Contents (Index) := To_Value;
  end Set_Element;

Other operations like

  function Delete_Slice (Source: String_List;
                         Low: Positive; High: Natural) return String_List;

might also be added, but can be constructed from the already defined ones by the
users themselves and do not seem vital for the package:

  Slice (Source, Low =>        1, High => Low - 1        ) &
  Slice (Source, Low => High + 1, High => Length (Source))

****************************************************************

From: Craig Carey
Sent: Thursday, January 31, 2002  1:26 PM

The proposal would implement code that is too slow for the problem of
splitting one String into two.

A convenient user package that implements regular expressions may return
an array of some custom Unbounded Strings. This package too has an array
of special access Strings and it is not easy to convert from one to the
other. If a user wanted to call "Join ()" on the array of Regex package
access Strings, then user could write a new Join(). But once on that
path then the whole package & its specs could be redone.

The proposal has two packages calling their own Adjust()s. And both
Adjust()s do memory allocations that can be unnecessary.

It seems that "String_List" could be an array instead of a pointer to an
array.

Some users may want to not lose the text that matched, when splitting a
String. Code that does not lose that text, is this:

-----------------------------------------------------------------------------

    type V_Str_Rec is     --  Some of Ref.all may be waste space
       record
          Ref         : String_Access := Null_String_Access;
          Len         : Natural := 0;                  --  The current length
       end record;

    type V_Str_Ptr is access all V_Str_Rec;
    pragma Volatile (V_Str_Ptr);
    ...
    type V_Str is new Ada.Finalization.Limited_Controlled with
       record
          S     : V_Str_Ptr := Null_V_Str_Ptr;
       end record;


    type Split_Mode is (Remove_Pattern, Keep_Pattern);

    procedure Split (
             Source, Pat : String;
             R1, R2      : out V_Str;
             Mode        : Split_Mode := Remove_Pattern;
             Pad         : Character := Space)
    is
          --  Pat is returned in R2 if Pat is returned. Index is a fast
          --  version of Ada.Strings.Fixed.Index.

       Divide      : Natural := Index (Source, Pat);
       From        : Natural;
    begin    --  A competent programmer could improve on this
       if 0 = Divide then
          Assign (R1, Source);
          Assign (R2, "");
       else
          Assign (R1, Trim (Source (Source'First .. Divide - 1), Both, Pad));
          if Mode = Remove_Pattern then
             From := Divide + Pat'Length;
          else
             From := Divide;
          end if;
          Assign (R2, Trim (Source (From .. Source'Last), Both, Pad));
       end if;
    end Split;   --  (Not tested after getting modified)

-----------------------------------------------------------------------------

The specs of my Split() above allow it to be called from inside of a loop.
There can be less inflexibility in splitting a String using a loop.
Also it might be faster.

The parameters of the proposed Split () differ from the one above.

Mr Grein's code could be put onto a webpage or an archive.

****************************************************************

From: Christoph Grein
Sent: Friday, February 1, 2002  12:48 AM

> The proposal would implement code that is too slow for the problem of
> splitting one String into two.

The implementation shown was just given to let you quickly play with the
package. It was in no way meant as a reference implementation. A child of
Ada.Strings.Unbounded sees the implementation of Unbounded_Strings (if
implemented by the implementer of Ada.Strings.Unbounded) and can thus provide
_much_more_ efficient code than an Ada user can do who cannot access the
innards  (because they are not defined by the language; only with Gnat you
have full  access to the implementation).

  >> This is another reason why such a package should be added <<
  >>                     to the standard.                      <<

Like in Strings.Fixed the Index, Split could, additionally to a character set,
provide a Split on a Pattern string (possibly with a Mapping). Also the option
to keep the Splitter character or Pattern string in the string list is worth
condering (two further options: together with the slice preceeding or
succeeding  it).

  type Split_Mode is (Remove, Keep_with_Predecessor, Keep_with_Successor);

  function Split (Source : [Unbounded_]String;
                  Pattern: String;
                  Mode   : Split_Mode := Remove;
                  Mapping: Maps.Character_Mapping := Maps.Identity)
    return String_List;

Split ("Exa(:-)mple",
       Pattern => "(:-)",
       Mode    => Keep_with_Predecessor)

would result in "Exa(:-)" "mple", whereas

Split ("Exa(:-)mple",
       Pattern => "(:-)",
       Mode    => Keep_with_Successor)

would result in "Exa" "(:-)mple".

But then as an inverse, we would also need a Join without a Joiner character:

  function Join (Source: String_List) return [Unbounded_]String;

You can always invent further operations that seem useful. This was just a
basic  proposal.

****************************************************************

From: Christoph Grein
Sent: Friday, February 1, 2002  1:52 AM

Also note that the type String_List must be tagged for "=" to compose properly,
as per AI95-00123.

****************************************************************

From: Nick Roberts
Sent: Thursday, January 31, 2002  5:54 PM

My 'Tenet' container library (to be released as soon as I can!) will
effectively subsume this functionality. E.g.:

   package Character_Lists is
      new Tenet.Lists.Unbounded(Character);

   use Character_Lists, Ada.Strings.Unbounded;

   L: List;
   S: Unbounded_String;

   ...

   Clear(L);
   Append(L,Element_Array(To_String(S)));
   ... -- manipulate list L
   S := To_Unbounded_String(String(To_Array(L)));

Tenet will provide a wide variety of basic operations for list manipulation.

Presumably the same technique will be possible with other list container
packages (maybe less easily).

I agree there is a need for the suggested functionality. I do not mean to
dismerit the original suggestion, and I'm not commenting on the question of
whether such functionality needs to be specified by the standard (or can be
left to secondary libraries).

****************************************************************


Questions? Ask the ACAA Technical Agent