Version 1.2 of acs/ac-00325.txt


!standard A.4.3(16)          20-01-31 AC95-00325/00
!class Amendment 20-01-31
!status received no action 20-01-31
!status received 19-12-02
!subject Simplified string splitting / tokenizing with Procedural Iterators
!summary
!topic Iterating over substrings in a delimited string
!reference Ada 202x Draft 23 RM5.5.3
!reference Ada 202x Draft 23 RMA.4.3
!from Egil Harald Hovik 2019-12-06
!keywords iterator string split
!discussion
Parsing / tokenizing a string is often as simple as splitting the string on delimiters and processing each substring individually. In many other languages this means just looping over the result of a call to a split function which is included in the respective standard libraries. In Ada, string handling is one of the main issues for beginners, and the lack of such a split function seems to be an increasingly common surprise for those coming from other languages.
With the introduction of Procedural Iterators (RM 5.5.3), this can easily be remedied by a simple addition to the standard library, specifically in Ada.Strings.Fixed (RM A.4.3) (and similar for the bounded and unbounded versions):
   procedure Split
     (Item       : in String;
      Delimiters : in Ada.Strings.Maps.Character_Set;
      Process    : not null access procedure(Substring : in String));
or
   procedure Split
     (Item       : in String;
      Delimiters : in Ada.Strings.Maps.Character_Set;
      Process    : not null access procedure(First : Positive; Last : Natural));
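The proposal gives only the profiles. As a minimal sketch, assuming the first form is layered on the existing Find_Token, a body could look like the following. The body, the demo procedure Split_Demo, and the helper Print are illustrative assumptions, not part of the proposal or of any standard library. Split is called with an explicit 'Access rather than the procedural iterator syntax, so the example does not depend on compiler support for RM 5.5.3.

```ada
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Ada.Text_IO;

procedure Split_Demo is

   --  Hypothetical body for the proposed Split; not a standard routine.
   procedure Split
     (Item       : in String;
      Delimiters : in Ada.Strings.Maps.Character_Set;
      Process    : not null access procedure (Substring : in String))
   is
      From  : Integer := Item'First;
      First : Positive;
      Last  : Natural;
   begin
      while From <= Item'Last loop
         --  A token is a maximal run of characters Outside the delimiter set.
         Ada.Strings.Fixed.Find_Token
           (Source => Item,
            Set    => Delimiters,
            From   => From,
            Test   => Ada.Strings.Outside,
            First  => First,
            Last   => Last);
         exit when Last = 0;            --  no token left
         Process (Item (First .. Last));
         exit when Last = Item'Last;    --  avoids stepping past the end
         From := Last + 1;
      end loop;
   end Split;

   procedure Print (Substring : in String) is
   begin
      Ada.Text_IO.Put_Line (Substring);
   end Print;

begin
   Split ("Foo;Bar:Baz,Qux", Ada.Strings.Maps.To_Set (":;,"), Print'Access);
end Split_Demo;
```

Because Find_Token looks for maximal runs, adjacent delimiters are collapsed in this sketch: "Foo;;Bar" yields two substrings, not three. Whether empty fields should be reported is a design decision the proposal leaves open.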
Looping over the substrings would then be as easy as:
   declare
      use Ada.Strings.Maps;
      use Ada.Strings.Fixed;
      Line : constant String := "Foo;Bar:Baz,Qux";
   begin
      for (Substring) of Split(Line, To_Set(":;,")) loop
         Ada.Text_IO.Put_Line(Substring);
      end loop;
   end;
or
   declare
      use Ada.Strings.Maps;
      use Ada.Strings.Fixed;
      Line : constant String := "Foo;Bar:Baz,Qux";
   begin
      for (First, Last) of Split(Line, To_Set(":;,")) loop
         Ada.Text_IO.Put_Line( Line(First..Last) );
      end loop;
   end;
If I understand correctly, this would also play nicely with the container aggregates, and allow a simple tokenizer to do:
   package Token_Vectors is
      new Ada.Containers.Indefinite_Vectors(Positive, String);

   Tokens : Token_Vectors.Vector :=
      [for (Substring) of Split(Line, To_Set(":;,")) => Substring];
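For comparison, the same vector can be filled with facilities that already exist in Ada 2012. This is only a sketch: the demo procedure name and the local variable names are assumptions made for illustration, and it uses a plain loop around Find_Token instead of the proposed Split and the container aggregate.

```ada
with Ada.Containers.Indefinite_Vectors;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Ada.Text_IO;

procedure Token_Vector_Demo is
   package Token_Vectors is
     new Ada.Containers.Indefinite_Vectors (Positive, String);

   Line   : constant String := "Foo;Bar:Baz,Qux";
   Seps   : constant Ada.Strings.Maps.Character_Set :=
     Ada.Strings.Maps.To_Set (":;,");
   Tokens : Token_Vectors.Vector;
   From   : Integer := Line'First;
   First  : Positive;
   Last   : Natural;
begin
   --  Collect each maximal run of non-delimiter characters.
   while From <= Line'Last loop
      Ada.Strings.Fixed.Find_Token
        (Source => Line,
         Set    => Seps,
         From   => From,
         Test   => Ada.Strings.Outside,
         First  => First,
         Last   => Last);
      exit when Last = 0;   --  no token left
      Tokens.Append (Line (First .. Last));
      From := Last + 1;
   end loop;

   for Token of Tokens loop
      Ada.Text_IO.Put_Line (Token);
   end loop;
end Token_Vector_Demo;
```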
Such a simple way to split a string would be a very nice and simple addition to the standard library and would greatly lower the bar for new-comers to the language.
I realize it's too late for new ideas for Ada 202x, but this is more building on one of the new ideas already introduced, rather than an idea on its own, so I think it's worth considering...
***************************************************************
From: Joey Fish
Sent: Friday, December 6, 2019 10:50 AM
> Parsing / tokenizing a string is often as simple as splitting the string
> on delimiters and processing each substring individually. In many
> other languages this means just looping over the result of a call
> to a split function which is included in the respective standard
> libraries. In Ada, string handling is one of the main issues for
> beginners, and the lack of such a split function seems to be an
> increasingly common surprise for those coming from other
> languages.
No, it is NOT that simple.
Properly parsing anything more complex than the "regular languages" (everything that is arbitrarily nested, like HTML, balanced parentheses, or CSV) simply CANNOT be done by "simple string splitting".
...
>procedure Split
>  (Item : in String;
>   Delimiters : in Ada.Strings.Maps.Character_Set;
>   Process : not null access procedure(Substring : in String));
And what of multiple-character delimiters? Consider Ada's own open-label (<<), close-label (>>), and exponent (**).
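A Character_Set cannot describe a delimiter longer than one character, but the pattern-based Index function in Ada.Strings.Fixed can locate one. A minimal sketch, assuming a single fixed, non-empty delimiter string: Split_On_Pattern and its profile are hypothetical names introduced for illustration, not part of the proposal.

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO;

procedure Split_On_Pattern_Demo is

   --  Hypothetical helper, not a standard routine: calls Process for each
   --  slice of Item lying between occurrences of Delimiter.
   --  Delimiter must not be empty (Index raises Pattern_Error otherwise).
   procedure Split_On_Pattern
     (Item      : in String;
      Delimiter : in String;
      Process   : not null access procedure (Substring : in String))
   is
      From : Integer := Item'First;
      Hit  : Natural;
   begin
      while From <= Item'Last loop
         Hit := Ada.Strings.Fixed.Index
                  (Source => Item, Pattern => Delimiter, From => From);
         if Hit = 0 then
            Process (Item (From .. Item'Last));   --  final piece
            return;
         end if;
         Process (Item (From .. Hit - 1));        --  may be an empty slice
         From := Hit + Delimiter'Length;
      end loop;
   end Split_On_Pattern;

   procedure Print (Substring : in String) is
   begin
      Ada.Text_IO.Put_Line (Substring);
   end Print;

begin
   Split_On_Pattern ("Foo**Bar**Baz", "**", Print'Access);
end Split_On_Pattern_Demo;
```

This handles one fixed delimiter only. Choosing among several delimiters of different lengths (for example `*` versus `**`) needs longest-match logic, which is where simple splitting stops being enough.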
> If I understand correctly, this would also play nicely with the
> container aggregates, and allow a simple tokenizer to do:
>
> package Token_Vectors is
>    new Ada.Containers.Indefinite_Vectors(Positive, String);
>
> Tokens : Token_Vectors.Vector :=
>    [for (Substring) of Split(Line, To_Set(":;,")) => Substring];
Perhaps; but as shown above, the presence of multiple-character delimiters confounds the simplicity, as does the handling of any non-regular language. If you've done maintenance on systems that use RegEx for 'parsing' information, you should be familiar with how easy it is to break out of the restrictions of a "regular language".
> Such a simple way to split a string would be a very nice and simple
> addition to the standard library and would greatly lower the bar for
> new-comers to the language.
The problem I see with this proposal, as-is, is that it introduces a simple and partial solution to the problem. This could be very bad for the newcomer or novice, as its inclusion could give the impression that it is more 'powerful' or complete than it really is; this is so common in languages that use RegEx that there is a not-insignificant portion of programmers who believe that HTML can be parsed with RegEx.
>I realize it's too late for new ideas for Ada 202x, but this is more
>building on one of the new ideas already introduced, rather than an
>idea on its own, so I think it's worth considering...
I think it is a thrust in the right direction, overall.
Perhaps designing it as a library [built on Ada 2020] first would help flesh the idea out. (I think this is how the Ada containers came into existence; but that was before I'd learned Ada so I don't know.)
***************************************************************
From: Richard Wai
Sent: Friday, December 6, 2019 2:32 PM
> Parsing / tokenizing a string is often as simple as splitting the string
> on delimiters and processing each substring individually. In many
> other languages this means just looping over the result of a call
> to a split function which is included in the respective standard
> libraries. In Ada, string handling is one of the main issues for
> beginners, and the lack of such a split function seems to be an
> increasingly common surprise for those coming from other
> languages.
This is an interesting approach... I haven't seen this done very often in the wild. I definitely do not share your experience that Ada beginners struggle with the lack of string splitting facilities, as I don't see that getting used as much as you've experienced. As for Strings more generally, the general difficulty that some newcomers have usually follows from their prior focus on garbage collected languages like JavaScript, Python, Java etc, where strings can be thrown away willy-nilly. If the concepts are properly taught, most newcomers are able to work with Ada String comfortably. It's all about having the right mindset - approaching the problem the Ada way.
For all the parsing engines that we've worked with, it is almost universally a stream manipulation operation. You simply have a Character stream that you parse sequentially out of. Though one time we used Bounded_Strings and slices to achieve something tangentially similar to what you describe. I'm of the position that having a few extra steps using, say, Find_Token, and then taking a slice of the string from that only enhances readability.
I can't help but feel that you're really seeing a pathology. I think the real answer is that newcomers to Ada should learn how to do things the Ada-way, instead of trying to bend Ada to be more like all the other languages. This is especially important since the Ada-way has very specific rationale behind it, and is not generally a matter of style. Splitting strings in this way seems to be functionally redundant. It can be easily achieved through existing means. Furthermore, parsing strings instead of streams is probably not the "right" way to parse text anyway. I think slices with existing search procedures can handle everything that splitting could.
***************************************************************
From: Randy Brukardt
Sent: Friday, December 6, 2019 8:34 PM
...
> Parsing / tokenizing a string is often as simple as splitting the
> string on delimiters and processing each substring individually. In
> many other languages this means just looping over the result of a
> call to a split function which is included in the respective standard
> libraries.
Ada.Strings.Fixed and so on provide Find_Token for this purpose.
> In Ada, string
> handling is one of the main issues for beginners, and the lack of such
> a split function seems to be an increasingly common surprise for those
> coming from other languages.
This seems like a problem down in the weeds for beginners. Why put lipstick on a pig?
...
> Looping over the substrings would then be as easy as:
>
> declare
>    use Ada.Strings.Maps;
>    use Ada.Strings.Fixed;
>    Line : constant String := "Foo;Bar:Baz,Qux";
> begin
>    for (Substring) of Split(Line, To_Set(":;,")) loop
>       Ada.Text_IO.Put_Line(Substring);
>    end loop;
> end;
...
OK, but if someone knows enough to do this, why don't they know enough to be able to use Find_Token for this purpose:
   declare
      use Ada.Strings.Fixed;
      Line : constant String := "Foo;Bar:Baz,Qux";
      Working, First, Last : Natural;
   begin
      Working := Line'First;
      while Working <= Line'Last loop
         Ada.Strings.Fixed.Find_Token
           (Line, Ada.Strings.Maps.To_Set(":;,"),
            From  => Working,
            Test  => Ada.Strings.Outside, -- tokens are runs of non-delimiters
            First => First,
            Last  => Last);
         exit when Last = 0; -- no further token (e.g. trailing delimiter)
         Ada.Text_IO.Put_Line (Line(First..Last));
         Working := Last+1;
      end loop;
   end;
This version is longer than yours mainly because of the extra declarations and using named parameters in the call for readability. (I'd never trust to get them in the right order.)
And surely your "Split" routine would have a "Test : in Membership" parameter as every other set-based search routine in Ada.Strings does, so that lengthens your version.
> If I understand correctly, this would also play nicely with the
> container aggregates, and allow a simple tokenizer to do:
>
> package Token_Vectors is
>    new Ada.Containers.Indefinite_Vectors(Positive, String);
>
> Tokens : Token_Vectors.Vector :=
>    [for (Substring) of Split(Line, To_Set(":;,")) => Substring];
>
>
> Such a simple way to split a string would be a very nice and simple
> addition to the standard library and would greatly lower the bar for
> new-comers to the language.
The only way to "lower the bar" to newcomers vis-a-vis string handling would be to completely dump the existing mess and start over with a clean slate.
A string is not an array! A string's representation is not relevant to its operations! No one wants to have to write (or read) Wide_Wide_xxxx nonsense. You should be able to do all of the operations on a single type (try using an Unbounded_String without writing operations on type String).
But this is way too radical (and incompatible) to do for Ada. We would need a reimagined Ada successor to do that. (One would have to start with a Root_String'Class and build from there.)
Your idea is cool, but as others have pointed out, it almost never works in real situations. (Neither does Find_Token for that matter.) Dubious if it would help much for newcomers; the problem is dealing with fixed length strings and the fact that you can't get away from that even when using Unbounded_Strings. (And if we fixed that, then the problem would be dealing sanely with encoded strings - which ought to be strongly typed and have appropriate operations.)
***************************************************************
From: Randy Brukardt
Sent: Friday, December 6, 2019 8:56 PM
>>Parsing / tokenizing a string is often as simple as splitting the
>>string on delimiters and processing each substring individually. In
>>many other languages this means just looping over the result of a
>>call to a split function which is included in the respective standard
>>libraries. In Ada, string handling is one of the main issues for
>>beginners, and the lack of such a split function seems to be an
>>increasingly common surprise for those coming from other languages.
>No, it is NOT that simple.
>
>To properly parse anything more complex than the "regular languages"
>(everything that is arbitrarily nested, like HTML, balanced
>parentheses, or CSV) simply CANNOT be done by "simple string splitting".
I suspect that you are reading parsing like I do, whereas the original poster is confusing "parsing" and "lexical analysis" (usually shortened to "lexing").
One can't parse anything by string manipulation, by definition: the string manipulation is a separate lexing step. Sometimes people think they are parsing when dealing with ultra-simple languages, but in fact they have (and need) no parsing at all.
>And what of multiple-character delimiters?
>
>Ada's own open-label (<<), close-label (>>), and exponent (**).
That's the easy part when dealing with Ada. The hard part is dealing with:
   Character'('A')
in which the interpretation of the first ' is determined by the token that it follows. (Assuming we don't change the Ada grammar too much and destroy this property.) This requires having the state of the preceding token.
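The dependence on the preceding token can be sketched as follows. This is a deliberately reduced illustration, not a real Ada lexer: Tick_Demo, Starts_Literal, and Prev_Ends_Name are hypothetical names, and the rule used (an apostrophe after an identifier or a right parenthesis is never the start of a character literal) ignores reserved words, numeric literals, comments, and other cases a real lexer must handle.

```ada
with Ada.Characters.Handling; use Ada.Characters.Handling;
with Ada.Text_IO;             use Ada.Text_IO;

procedure Tick_Demo is

   --  Hypothetical helper: True when the apostrophe at Text (Pos) opens a
   --  character literal. Prev_Ends_Name is the state kept from the preceding
   --  token (True after an identifier or a right parenthesis).
   function Starts_Literal
     (Text : String; Pos : Positive; Prev_Ends_Name : Boolean) return Boolean
   is (not Prev_Ends_Name
       and then Pos <= Text'Last - 2
       and then Text (Pos + 2) = ''');

   Source         : constant String := "Character'('A')";
   Pos            : Positive := Source'First;
   Start          : Positive;
   Prev_Ends_Name : Boolean := False;
begin
   while Pos <= Source'Last loop
      if Is_Letter (Source (Pos)) then
         --  Identifier: letters, digits and underscores.
         Start := Pos;
         while Pos <= Source'Last
           and then (Is_Alphanumeric (Source (Pos))
                     or else Source (Pos) = '_')
         loop
            Pos := Pos + 1;
         end loop;
         Put_Line ("identifier        " & Source (Start .. Pos - 1));
         Prev_Ends_Name := True;
      elsif Source (Pos) = ' ' then
         Pos := Pos + 1;   --  blanks do not change the remembered state
      elsif Source (Pos) = '''
        and then Starts_Literal (Source, Pos, Prev_Ends_Name)
      then
         Put_Line ("character literal " & Source (Pos .. Pos + 2));
         Pos := Pos + 3;
         Prev_Ends_Name := False;
      else
         --  Anything else, including an attribute or qualification tick.
         Put_Line ("delimiter         " & Source (Pos));
         Prev_Ends_Name := Source (Pos) = ')';
         Pos := Pos + 1;
      end if;
   end loop;
end Tick_Demo;
```

The same characters `'('` appear at two places in the input, but only the state carried from the previous token tells the lexer that the first apostrophe is a tick and the later one opens the literal 'A'. A delimiter-set split has no such state.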
...
>>Such a simple way to split a string would be a very nice and simple
>>addition to the standard library and would greatly lower the bar for
>>new-comers to the language.
>The problem I see with this proposal, as-is, is that it introduces a
>simple and partial solution to the problem.
Right. It could be useful for simple problems (splitting text into words, for instance; I used Find_Token for that, and this is just Find_Token on steroids). But there should be no assumption that it is good for many real problems, because it's not enough to lex any programming language, and it has nothing to do with parsing of anything.
>This could be very bad for the newcomer or novice as its inclusion
>could give the impression that it is more 'powerful' or complete than
>it really is; this is so common in languages that
>use RegEx that there are a not-insignificant portion of programmers
>that believe that HTML can be parsed with RegEx.
HTML and XML don't need to be parsed at all, they are simple enough to skip that step altogether. And since much of the HTML in the wild is malformed, traditional parsing would be more of a problem than a solution.
In my uses of HTML analysis, parsing has to be avoided: the HTML may be malformed (and possibly on purpose in the case of spam), and moreover many small tasks are doing the analysis, so a recursive descent parser would be at risk of running out of stack space. (And we all know that Storage_Error is a parachute that opens on impact -- thanks, Dave Emery, for that truism. :-)
Most of what is typically described as "parsing" of HTML and XML is better described as semantic analysis - type checking and the like is not parsing!
(Sorry, this is one of my pet peeves about the common description of HTML and XML processing. :-)
In any case, I don't see much benefit to this operation, unless we're just looking for additional "cool" examples for Ada. Even then, I don't see much point in adding it to the Standard; there's a lot cooler things out there in libraries that people have developed.
***************************************************************
From: Jean-Pierre Rosen
Sent: Saturday, December 7, 2019 12:36 AM
> That's the easy part when dealing with Ada. The hard part is dealing with:
>    Character'('A')
Or my favorite:
   subtype C is Character;
   V : String := C'(')')'Image;
***************************************************************
From: Pascal Pignard
Sent: Saturday, December 7, 2019 12:45 AM
> A string is not an array!
> A string's representation is not relevant to its operations!
> No one wants to have to write (or read) Wide_Wide_xxxx nonsense.
> You should be able to do all of the operations on a single type (try
> using an Unbounded_String without writing operations on type String).
>
> But this is way too radical (and incompatible) to do for Ada. We would
> need a reimagined Ada successor to do that. (One would have to start
> with a Root_String'Class and build from there.)
Is it so definitive? String is defined as an array in package Standard but obviously doesn't have the expected semantics. Why not first rename String explicitly to String_Array in the standard (and perhaps Wide_String to Wide_String_Array, and so on)? The renaming would apply across the whole Ada library, so there is no internal compatibility issue. There is a user compatibility issue, though, which can be minimised by doing a search and replace over all the user source code, by adding the declaration "subtype String is String_Array;", or by using convenient conversion subroutines (see next).
Then you can declare in Standard, for instance:
   type Root_String is tagged private;
   type String is Root_String'Class;
   -- declare adequate useful subroutines
   -- declare convenient conversion subroutines with String_Array, Wide_String_Array...
You can add direct assignment from string literals:
   S1 : String := "my string";
Then you have the possibility of adding "new" subroutines that use the new String to the Ada library (the old ones are renamed to use String_Array), for instance in Ada.Text_IO:
   function Name (File : in File_Type) return String; -- using the new String type
If the compatibility issues are too heavy, change the name:
   type Enhanced_String is Root_String'Class;
Is it so naive?
***************************************************************
From: Randy Brukardt
Sent: Monday, December 9, 2019 7:24 PM
> > A string is not an array!
> > A string's representation is not relevant to its operations!
> > No one wants to have to write (or read) Wide_Wide_xxxx nonsense.
> > You should be able to do all of the operations on a single type (try
> > using an Unbounded_String without writing operations on type String).
> >
> > But this is way too radical (and incompatible) to do for Ada. We
> > would need a reimagined Ada successor to do that. (One would have to
> > start with a Root_String'Class and build from there.)
>
> Is it so definitive?
> String is defined as an array in Standard package but obviously
> doesn't meet the expected semantic.
> Why not first rename String into explicitly String_Array in the
> standard?
That would be way too incompatible to do for Ada (as opposed to a successor Ada language). We want the vast majority of Ada code to continue to compile and work in new Ada versions. We generally only allow incompatibilities that occur in unlikely cases or those that actually fix a bug. The definition of String was a mistake but it would be hard to argue that it is a bug.
> Though there is a user compatibility issue which can be minimised by
> doing search and replace over all the user source code or adding the
> declaration of "subtype String is String_Array;" or using convenient
> conversion subroutines (see next).
It's not as easy as just replacing "String" globally. Just doing that with a dumb text search would be a disaster, as it would change all of the names that include String as part of them (pretty common), and also would clobber the contents of strings and comments. Even if one is using a smarter Ada-aware substitutor, one has to check and update comments manually (some will be talking about type String and some will be talking more generally). As well as any separate documentation.
One could imagine adding an abstract Root_String alongside the existing String type, but that has its own problems, and doing so would require adding additional versions of most of the Ada.Strings packages. (Again, we want to be able to compile existing Ada code unchanged in most cases.)
***************************************************************
From: Joshua Fletcher
Sent: Monday, December 9, 2019 1:25 PM
> Perhaps designing it as a library [built on Ada 2020] first would help flesh
> the idea out. (I think this is how the Ada containers came into existence; but
> that was before I'd learned Ada so I don't know.)
The Ada Containers had their start as a library called Charles (I believe it was named after Charles Babbage), written by Matthew Heaney and published in Ada-Europe 2003:
https://link.springer.com/chapter/10.1007/3-540-44947-7_20
***************************************************************
