Version 1.2 of acs/ac-00325.txt
!standard A.4.3(16) 20-01-31 AC95-00325/00
!class Amendment 20-01-31
!status received no action 20-01-31
!status received 19-12-02
!subject Simplified string splitting / tokenizing with Procedural Iterators
!summary
!topic Iterating over substrings in a delimited string
!reference Ada 202x Draft 23 RM5.5.3
!reference Ada 202x Draft 23 RMA.4.3
!from Egil Harald Hovik 2019-12-06
!keywords iterator string split
!discussion
Parsing / tokenizing a string is often as simple as splitting the string
on delimiters and processing each substring individually. In many
other languages this means just looping over the result of a call
to a split function which is included in the respective standard
libraries. In Ada, string handling is one of the main issues for
beginners, and the lack of such a split function seems to be an
increasingly common surprise for those coming from other
languages.
With the introduction of Procedural Iterators (RM 5.5.3), this can
easily be remedied by a simple addition to the standard library,
specifically in Ada.Strings.Fixed (RM A.4.3)
(and similar for the bounded and unbounded versions):
   procedure Split
     (Item       : in String;
      Delimiters : in Ada.Strings.Maps.Character_Set;
      Process    : not null access procedure(Substring : in String));
or
   procedure Split
     (Item       : in String;
      Delimiters : in Ada.Strings.Maps.Character_Set;
      Process    : not null access procedure(First : Positive; Last : Natural));
Looping over the substrings would then be as easy as:
   declare
      use Ada.Strings.Maps;
      use Ada.Strings.Fixed;
      Line : constant String := "Foo;Bar:Baz,Qux";
   begin
      for (Substring) of Split(Line, To_Set(":;,")) loop
         Ada.Text_IO.Put_Line(Substring);
      end loop;
   end;
or
   declare
      use Ada.Strings.Maps;
      use Ada.Strings.Fixed;
      Line : constant String := "Foo;Bar:Baz,Qux";
   begin
      for (First, Last) of Split(Line, To_Set(":;,")) loop
         Ada.Text_IO.Put_Line( Line(First..Last) );
      end loop;
   end;
If I understand correctly, this would also play nicely with the
container aggregates, and allow a simple tokenizer to do:
   package Token_Vectors is
      new Ada.Containers.Indefinite_Vectors(Positive, String);
   Tokens : Token_Vectors.Vector :=
      [for (Substring) of Split(Line, To_Set(":;,")) => Substring];
Such a simple way to split a string would be a very nice and simple
addition to the standard library and would greatly lower the bar for
new-comers to the language.
I realize it's too late for new ideas for Ada 202x, but this is more
building on one of the new ideas already introduced, rather than an
idea on its own, so I think it's worth considering...
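For illustration only, here is a minimal sketch of how the first Split profile
above might be written today as an ordinary user procedure on top of
Find_Token. The body, the names Split_Demo and Print, and the choice to skip
empty substrings between adjacent delimiters are assumptions made for this
sketch; none of it is part of the proposal as submitted. It calls Split
directly with Print'Access rather than through a procedural iterator.

```ada
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Ada.Text_IO;

procedure Split_Demo is

   --  Hypothetical user-level Split; not an existing library routine.
   procedure Split
     (Item       : in String;
      Delimiters : in Ada.Strings.Maps.Character_Set;
      Process    : not null access procedure (Substring : in String))
   is
      From  : Integer := Item'First;  --  start of the unscanned remainder
      First : Positive;
      Last  : Natural;
   begin
      while From <= Item'Last loop
         --  Outside: a token is a maximal run of non-delimiter characters.
         Ada.Strings.Fixed.Find_Token
           (Source => Item,
            Set    => Delimiters,
            From   => From,
            Test   => Ada.Strings.Outside,
            First  => First,
            Last   => Last);
         exit when Last = 0;  --  only delimiters remain
         Process (Item (First .. Last));
         From := Last + 1;
      end loop;
   end Split;

   procedure Print (Substring : in String) is
   begin
      Ada.Text_IO.Put_Line (Substring);
   end Print;

begin
   --  Prints Foo, Bar, Baz and Qux, one per line.
   Split ("Foo;Bar:Baz,Qux", Ada.Strings.Maps.To_Set (":;,"), Print'Access);
end Split_Demo;
```

With this sketch, a run of adjacent delimiters is treated as a single
separator, so no empty substrings are passed to Process; a library version
would have to decide whether that is the desired behaviour.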
***************************************************************
From: Joey Fish
Sent: Friday, December 6, 2019 10:50 AM
> Parsing / tokenizing a string is often as simple as splitting the string
> on delimiters and processing each substring individually. In many
> other languages this means just looping over the result of a call
> to a split function which is included in the respective standard
> libraries. In Ada, string handling is one of the main issues for
> beginners, and the lack of such a split function seems to be an
> increasingly common surprise for those coming from other
> languages.
No, it is NOT that simple.
To properly parse anything more complex than the "regular languages"
(everything that is arbitrarily nested, like HTML, balanced parentheses,
or CSV) simply CANNOT be done by "simple string splitting".
...
>procedure Split
> (Item : in String;
> Delimiters : in Ada.Strings.Maps.Character_Set;
> Process : not null access procedure(Substring : in String));
And what of multiple-character delimiters?
Ada's own open-label (<<), close-label (>>), and exponent (**).
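As a rough sketch of what a multiple-character delimiter involves, the
existing Ada.Strings.Fixed.Index function can search for a Pattern string
instead of a character set. The procedure name and the sample line below are
made up for illustration, and this is only one possible approach under those
assumptions.

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO;

procedure Split_On_Pattern_Demo is
   Line      : constant String := "Foo<<Bar<<Baz";
   Delimiter : constant String := "<<";
   From      : Positive := Line'First;  --  start of the current piece
   Pos       : Natural;
begin
   while From <= Line'Last loop
      --  Index returns 0 when the pattern does not occur at or after From.
      Pos := Ada.Strings.Fixed.Index
        (Source  => Line,
         Pattern => Delimiter,
         From    => From);
      exit when Pos = 0;
      Ada.Text_IO.Put_Line (Line (From .. Pos - 1));
      From := Pos + Delimiter'Length;
   end loop;
   --  Whatever follows the last delimiter (possibly an empty string).
   Ada.Text_IO.Put_Line (Line (From .. Line'Last));
end Split_On_Pattern_Demo;
```

Unlike a character-set search, this keeps empty pieces between adjacent
delimiters, and it still handles only one fixed delimiter string at a time.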
> If I understand correctly, this would also play nicely with the
> container aggregates, and allow a simple tokenizer to do:
>
> package Token_Vectors is
> new Ada.Containers.Indefinite_Vectors(Positive, String);
>
> Tokens : Token_Vectors.Vector :=
> [for (Substring) of Split(Line, To_Set(":;,")) => Substring];
Perhaps; but as shown above, the presence of multiple-character delimiters
confounds the simplicity, as does the handling of any non-regular language. If
you've done maintenance on systems that use RegEx for 'parsing' information, you
should be familiar with how easy it is to break out of the restrictions of a
"regular language".
> Such a simple way to split a string would be a very nice and simple
> addition to the standard library and would greatly lower the bar for
> new-comers to the language.
The problem I see with this proposal, as-is, is that it introduces a simple and
partial solution to the problem. This could be very bad for the newcomer or
novice as its inclusion could give the impression that it is more 'powerful' or
complete than it really is; this is so common in languages that use RegEx that
there is a not-insignificant portion of programmers who believe that HTML can
be parsed with RegEx.
>I realize it's too late for new ideas for Ada 202x, but this is more
>building on one of the new ideas already introduced, rather than an
>idea on its own, so I think it's worth considering...
I think it is a thrust in the right direction, overall.
Perhaps designing it as a library [built on Ada 2020] first would help flesh the
idea out. (I think this is how the Ada containers came into existence; but that
was before I'd learned Ada so I don't know.)
***************************************************************
From: Richard Wai
Sent: Friday, December 6, 2019 2:32 PM
> Parsing / tokenizing a string is often as simple as splitting the string
> on delimiters and processing each substring individually. In many
> other languages this means just looping over the result of a call
> to a split function which is included in the respective standard
> libraries. In Ada, string handling is one of the main issues for
> beginners, and the lack of such a split function seems to be an
> increasingly common surprise for those coming from other
> languages.
This is an interesting approach... I haven't seen this done very often in the
wild. I definitely do not share your experience that Ada beginners struggle with
the lack of string splitting facilities, as I don't see that getting used as
much as you've experienced. As for Strings more generally, the general
difficulty that some newcomers have usually follows from their prior focus on
garbage collected languages like JavaScript, Python, Java, etc., where strings can
be thrown away willy-nilly. If the concepts are properly taught, most newcomers
are able to work with Ada String comfortably. It's all about having the right
mindset - approaching the problem the Ada way.
For all the parsing engines that we've worked with, it is almost universally a
stream manipulation operation. You simply have a Character stream that you parse
sequentially out of. Though one time we used Bounded_Strings and slices to
achieve something tangentially similar to what you describe. I'm of the position
that having a few extra steps using, say, Find_Token, and then taking a slice of
the string from that only enhances readability.
I can't help but feel that you're really seeing a pathology. I think the real
answer is that newcomers to Ada should learn how to do things the Ada-way,
instead of trying to bend Ada to be more like all the other languages. This is
especially important since the Ada-way has very specific rationale behind it,
and is not generally a matter of style. Splitting strings in this way seems to
be functionally redundant. It can be easily achieved through existing means.
Furthermore, parsing strings instead of streams is probably not the "right" way
to parse text anyway. I think slices with existing search procedures can handle
everything that splitting could.
***************************************************************
From: Randy Brukardt
Sent: Friday, December 6, 2019 8:34 PM
...
> Parsing / tokenizing a string is often as simple as splitting the
> string on delimiters and processing each substring individually. In
> many other languages this means just looping over the result of a
> call to a split function which is included in the respective standard
> libraries.
Ada.Strings.Fixed and so on provide Find_Token for this purpose.
> In Ada, string
> handling is one of the main issues for beginners, and the lack of such
> a split function seems to be an increasingly common surprise for those
> coming from other languages.
This seems like a problem down in the weeds for beginners. Why put lipstick on a
pig?
...
> Looping over the substrings would then be as easy as:
>
> declare
> use Ada.Strings.Maps;
> use Ada.Strings.Fixed;
> Line : constant String := "Foo;Bar:Baz,Qux";
> begin
> for (Substring) of Split(Line, To_Set(":;,")) loop
> Ada.Text_IO.Put_Line(Substring);
> end loop;
> end;
...
OK, but if someone knows enough to do this, why don't they know enough to be
able to use Find_Token for this purpose:
   declare
      use Ada.Strings.Fixed;
      Line : constant String := "Foo;Bar:Baz,Qux";
      Working, First, Last : Natural;
   begin
      Working := Line'First;
      while Working <= Line'Last loop
         --  Outside: a token is a run of characters not in the delimiter set.
         Ada.Strings.Fixed.Find_Token (Line, Ada.Strings.Maps.To_Set(":;,"),
                                       From => Working, Test => Ada.Strings.Outside,
                                       First => First, Last => Last);
         exit when Last = 0; --  No token found: only delimiters remain.
         Ada.Text_IO.Put_Line (Line(First..Last));
         Working := Last+1;
      end loop;
   end;
This version is longer than yours mainly because of the extra declarations and
using named parameters in the call for readability. (I'd never trust myself to
get them in the right order.)
And surely your "Split" routine would have a "Test : in Membership" parameter as
every other set-based search routine in Ada.Strings does, so that lengthens your
version.
> If I understand correctly, this would also play nicely with the
> container aggregates, and allow a simple tokenizer to do:
>
> package Token_Vectors is
> new Ada.Containers.Indefinite_Vectors(Positive, String);
>
> Tokens : Token_Vectors.Vector :=
> [for (Substring) of Split(Line, To_Set(":;,")) => Substring];
>
>
> Such a simple way to split a string would be a very nice and simple
> addition to the standard library and would greatly lower the bar for
> new-comers to the language.
The only way to "lower the bar" to newcomers vis-a-vis string handling would be
to completely dump the existing mess and start over with a clean slate.
A string is not an array!
A string's representation is not relevant to its operations!
No one wants to have to write (or read) Wide_Wide_xxxx nonsense.
You should be able to do all of the operations on a single type (try using an
Unbounded_String without writing operations on type String).
But this is way too radical (and incompatible) to do for Ada. We would need a
reimagined Ada successor to do that. (One would have to start with a
Root_String'Class and build from there.)
Your idea is cool, but as others have pointed out, it almost never works in real
situations. (Neither does Find_Token for that matter.) Dubious if it would help
much for newcomers; the problem is dealing with fixed length strings and the
fact that you can't get away from that even when using Unbounded_Strings. (And
if we fixed that, then the problem would be dealing sanely with encoded strings
- which ought to be strongly typed and have appropriate operations.)
***************************************************************
From: Randy Brukardt
Sent: Friday, December 6, 2019 8:56 PM
>>Parsing / tokenizing a string is often as simple as splitting the
>>string on delimiters and processing each substring individually. In
>>many other languages this means just looping over the result of a
>>call to a split function which is included in the respective standard
>>libraries. In Ada, string handling is one of the main issues for
>>beginners, and the lack of such a split function seems to be an
>>increasingly common surprise for those coming from other languages.
>No, it is NOT that simple.
>
>To properly parse anything more complex than the "regular languages"
>(everything that is arbitrarily nested, like HTML, balanced
>parentheses, or CSV) simply CANNOT be done by "simple string splitting".
I suspect that you are reading parsing like I do, whereas the original poster is
confusing "parsing" and "lexical analysis" (usually shortened to "lexing").
One can't parse anything by string manipulation, by definition: the string
manipulation is a separate lexing step. Sometimes people think they are parsing
when dealing with ultra-simple languages, but in fact they have (and need) no
parsing at all.
>And what of multiple-character delimiters?
>
>Ada's own open-label (<<), close-label (>>), and exponent (**).
That's the easy part when dealing with Ada. The hard part is dealing with:
Character'('A')
in which the interpretation of the first ' is determined by the token that it
follows. (Assuming we don't change the Ada grammar too much and destroy this
property.) This requires having the state of the preceding token.
...
>>Such a simple way to split a string would be a very nice and simple
>>addition to the standard library and would greatly lower the bar for
>>new-comers to the language.
>The problem I see with this proposal, as-is, is that it introduces a
>simple and partial solution to the problem.
Right. It could be useful for simple problems (splitting text into words, for
instance; I used Find_Token for that, and this is just Find_Token on steroids).
But there should be no assumption that it is good for many real problems,
because it's not enough to lex any programming language, and it has nothing to
do with parsing of anything.
>This could be very bad for the newcomer or novice as its inclusion
>could give the impression that it is more 'powerful' or complete than
>it really is; this is so common in languages that
>use RegEx that there is a not-insignificant portion of programmers
>that believe that HTML can be parsed with RegEx.
HTML and XML don't need to be parsed at all, they are simple enough to skip that
step altogether. And since much of the HTML in the wild is malformed,
traditional parsing would be more of a problem than a solution.
In my uses of HTML analysis, parsing has to be avoided: the HTML may be
malformed (and possibly on purpose in the case of spam), and moreover many small
tasks are doing the analysis, so a recursive descent parser would be at risk of
running out of stack space. (And we all know that Storage_Error is a parachute
that opens on impact -- thanks, Dave Emery, for that truism. :-)
Most of what is typically described as "parsing" of HTML and XML is better
described as semantic analysis - type checking and the like is not parsing!
(Sorry, this is one of my pet peeves about the common description of HTML and
XML processing. :-)
In any case, I don't see much benefit to this operation, unless we're just
looking for additional "cool" examples for Ada. Even then, I don't see much
point in adding it to the Standard; there are a lot of cooler things out there in
libraries that people have developed.
***************************************************************
From: Jean-Pierre Rosen
Sent: Saturday, December 7, 2019 12:36 AM
> That's the easy part when dealing with Ada. The hard part is dealing with:
> Character'('A')
Or my favorite:
subtype C is Character;
V : String := C'(')')'Image;
***************************************************************
From: Pascal Pignard
Sent: Saturday, December 7, 2019 12:45 AM
> A string is not an array!
> A string's representation is not relevant to its operations!
> No one wants to have to write (or read) Wide_Wide_xxxx nonsense.
> You should be able to do all of the operations on a single type (try
> using an Unbounded_String without writing operations on type String).
>
> But this is way too radical (and incompatible) to do for Ada. We would
> need a reimagined Ada successor to do that. (One would have to start
> with a Root_String'Class and build from there.)
Is it so definitive?
String is defined as an array in package Standard but obviously doesn't meet the
expected semantics. Why not first explicitly rename String to String_Array in
the standard (and perhaps Wide_String to Wide_String_Array, and so on)? The
renaming would apply across the whole Ada library, so there is no
self-compatibility issue. There is a user compatibility issue, though, which can
be minimised by doing a search and replace over all the user source code, by
adding the declaration "subtype String is String_Array;", or by using convenient
conversion subroutines (see next).
Then you can declare in Standard for instance:
type Root_String is tagged private;
type String is Root_String'Class;
-- declare adequate useful subroutines
-- declare convenient conversion subroutines with String_Array, Wide_String_Array...
You can add the direct assignment with string literals:
S1 : String := "my string";
Then you have the possibility of adding "new" subroutines using the new String
to the Ada library (the old ones are renamed to use String_Array), for instance
in Ada.Text_IO:
function Name (File : in File_Type) return String; --
If the compatibility issues are too heavy, change the name:
type Enhanced_String is Root_String'Class;
Is it so naive?
***************************************************************
From: Randy Brukardt
Sent: Monday, December 9, 2019 7:24 PM
> > A string is not an array!
> > A string's representation is not relevant to its operations!
> > No one wants to have to write (or read) Wide_Wide_xxxx nonsense.
> > You should be able to do all of the operations on a single type (try
> > using an Unbounded_String without writing operations on type String).
> >
> > But this is way too radical (and incompatible) to do for Ada. We
> > would need a reimagined Ada successor to do that. (One would have to
> > start with a Root_String'Class and build from there.)
>
> Is it so definitive?
> String is defined as an array in Standard package but obviously
> doesn't meet the expected semantics.
> Why not first rename String into explicitly String_Array in the
> standard?
That would be way too incompatible to do for Ada (as opposed to a successor Ada
language). We want the vast majority of Ada code to continue to compile and work
in new Ada versions. We generally only allow incompatibilities that occur in
unlikely cases or those that actually fix a bug. The definition of String was a
mistake but it would be hard to argue that it is a bug.
> Though there is a user compatibility issue which can be minimised by
> doing search and replace over all the user source code or adding the
> declaration of "subtype String is String_Array;" or using convenient
> conversion subroutines (see next).
It's not as easy as just replacing "String" globally. Just doing that with a
dumb text search would be a disaster, as it would change all of the names that
include String as part of them (pretty common), and also would clobber the
contents of strings and comments. Even if one is using a smarter Ada-aware
substitutor, one has to check and update comments manually (some will be talking
about type String and some will be talking more generally). As well as any
separate documentation.
One could imagine adding an abstract Root_String alongside the existing String
type, but that has its own problems, and doing so would require adding
additional versions of most of the Ada.Strings packages. (Again, we want to be
able to compile existing Ada code unchanged in most cases.)
***************************************************************
From: Joshua Fletcher
Sent: Monday, December 9, 2019 1:25 PM
> Perhaps designing it as a library [built on Ada 2020] first would help flesh
> the idea out. (I think this is how the Ada containers came into existence; but
> that was before I'd learned Ada so I don't know.)
The Ada Containers had their start as a library called Charles (I believe it was
named after Charles Babbage), written by Matthew Heaney, published in Ada-Europe
2003
https://link.springer.com/chapter/10.1007/3-540-44947-7_20
***************************************************************