Version 1.2 of ai12s/ai12-0196-1.txt


!standard A(3/4)          16-09-23 AI12-0196-1/02
!standard A.18(5)
!class Amendment 16-06-07
!status work item 16-06-07
!status received 16-06-05
!priority Low
!difficulty Hard
!subject Concurrent access to Ada container libraries
!summary
Certain language-defined container operations are specified to be reentrant for concurrent calls involving cursors that designate different elements, allowing parallel processing of a single container.
!problem
With the advent of multi-core computers, the performance of certain algorithms and data structures can be improved. The Ada standard containers are data structures that can be involved in lengthy processing and are therefore candidates for parallelism. Unfortunately, Ada currently says that concurrent calls on language-defined subprograms are only required to perform as specified (i.e., be reentrant) if the objects denoted by the parameters of the calls are nonoverlapping. Most container calls involve a controlling parameter of the container type, yet a call on the container may affect only a subset of the elements in the container, and may not affect state that is global to all of the elements being modified. Other calls only read elements of the container without modification. It should be possible to apply parallelism to such cases so long as each independent thread operates on a unique, nonoverlapping set of elements from the container.
!proposal
We propose to allow certain operations that manipulate or examine cursors (i.e., Next, Previous, and Element) to be reentrant if the calls only accept a Cursor as a parameter. This is achieved by specifying that cursors overlap only with the element they designate. For Vector containers, we also specify that To_Cursor is reentrant.
!wording
Apply the following modifications:
A 3(4) "The implementation shall ensure that each language-defined subprogram is reentrant in the sense that concurrent calls on any language-defined subprogram{s (possibly the same)} perform as specified, so long as all objects that are denoted by parameters that could be passed by reference or designated by parameters of an access type are nonoverlapping."
A.18 (2/2) "A cursor referencing an element in a container is considered to be
overlapping {only} with the {element}[container] object itself."
A.18 (2.a/2) "Reason: The last sentence is intended to clarify that operations
that just use a cursor are {reentrant if the cursor objects designate different elements of the container}[on the same footing as operations that use a container] in terms of the reentrancy rules of Annex A. "
A.18 (2.b/4) Add "Ramification: A cursor is not considered to overlap with
the associated container, thus parallel operations involving a set of cursors each operating on mutually exclusive sets of elements from the same container are expected to work."
A.18 (5.o/2)
"Thus, it is only necessary for a user to protect a container if a single
container needs to be used by multiple tasks {and concurrent calls to the container are not reentrant}."
A.18.2 (125) Append "To_Cursor is reentrant."
Add A.18.2 (125.1) "Ramification: Without the preceding rule, To_Cursor would not be considered reentrant for concurrent calls on the same container object by the reentrancy rules in A.3, since the container object of the concurrent calls would overlap with itself. We want this to be reentrant to allow parallel processing of Vectors, so that the Vector elements can be split into separate "chunks" of processing."
A.18.6 (147.1) Append "An object of a Constant_Reference_Type
is considered to be overlapping only with the object designated by Element."
A.18.6 (147.1) Append "An object of a Reference_Type
is considered to be overlapping only with the object designated by Element."
Similarly apply the above cursor and reference modifications to the other containers.
!discussion
The needs of parallel programs are quite different from those of general concurrent programs. The goal is to achieve performance rather than to satisfy needs for access from multiple threads of execution. Parallel programs typically apply a divide-and-conquer strategy to reduce a bigger problem into smaller ones that can be handed out to separate cores of a multicore processor. By their very nature, the smaller pieces of work tend to be separate, independent work items. A good example of such a problem is container iteration, where similar processing needs to be applied to all the elements of the container.
The Ada standard container libraries do not support such concurrent access, even though implementations exist that provide this support. This level of support should be standardized so that parallel programs can be portable across different compiler implementations.
The first obstacle is that the standard says that a language-defined subprogram call is only reentrant if all the parameters of the call are nonoverlapping with respect to other active calls.
The standard also says that a cursor for a container is to be treated as though it overlaps with the container object.
To iterate through a container in parallel, however, multiple concurrent cursor objects are needed, so that each can be assigned to process a separate portion of the container's elements. To allow this, we relax the above overlapping rule to say that a cursor is treated as though it overlaps only with the element in the container that it designates. So long as cursor operations are not applied to two or more cursors designating the same element of the container, concurrent calls on separate cursors of the same container object are possible. This allows various cursor manipulation subprograms (e.g., Next, Previous, and Element) to be executed concurrently in parallel, since they accept only a single Cursor parameter.
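To illustrate the intent, consider two threads of control each advancing a private cursor over disjoint elements of the same list, calling only Next and Element concurrently. This is a minimal sketch; the task structure and the names (Split_Read, Int_Lists, Low_Half, etc.) are invented for illustration, and a real parallelism library would manage the chunking.

```ada
--  Sketch only: two tasks read disjoint halves of one list through
--  separate cursors.  Only the cursor operations Next and Element are
--  called concurrently, and no element is designated by both cursors,
--  so under the proposed rules the calls are nonoverlapping.
with Ada.Containers.Doubly_Linked_Lists;

procedure Split_Read is
   package Int_Lists is new Ada.Containers.Doubly_Linked_Lists (Integer);
   use Int_Lists;

   List           : Int_Lists.List;
   Sum_Lo, Sum_Hi : Integer := 0;  --  distinct objects; no data race
   Mid            : Cursor;
begin
   for I in 1 .. 10 loop
      List.Append (I);
   end loop;

   --  Find the split point sequentially, before any parallelism starts.
   Mid := List.First;
   for I in 1 .. 5 loop
      Mid := Next (Mid);
   end loop;

   declare
      task Low_Half;
      task body Low_Half is
         C : Cursor := List.First;
      begin
         while C /= Mid loop           --  elements 1 .. 5
            Sum_Lo := Sum_Lo + Element (C);
            C := Next (C);
         end loop;
      end Low_Half;
   begin
      --  The main task processes the upper half concurrently.
      declare
         C : Cursor := Mid;
      begin
         while C /= No_Element loop    --  elements 6 .. 10
            Sum_Hi := Sum_Hi + Element (C);
            C := Next (C);
         end loop;
      end;
   end;  --  waits here for Low_Half to complete
end Split_Read;
```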
For Vector containers, this still leaves us with the problem of obtaining the multiple cursors in the first place, since one normally would want to directly obtain the initial cursor for each chunk of the vector. The To_Cursor subprogram however has a parameter that is the container associated with the cursor to be generated, and thus still is considered to be an overlapping call that is not reentrant if concurrent calls are applied to the same container object.
To_Cursor, however, is trivial to implement as a reentrant routine: in a typical implementation it only needs to construct an object that designates the container and contains an index into the vector. To_Cursor has no side effects, and is thus what we might call a pure function, if the language defined such a thing. We specify in the wording that this subprogram is reentrant to ensure that it can be safely called.
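For illustration only, here is one possible implementation model. It assumes a cursor is represented as a (container reference, index) pair, and uses a GNAT-style 'Unrestricted_Access to obtain that reference; implementations that store element pointers, or serial numbers for dangling-cursor detection, are equally legitimate and would need to read the container object.

```ada
--  Illustrative sketch, NOT normative: assumes a vector cursor holds a
--  reference to the container plus an element index.
function To_Cursor
  (Container : Vector; Index : Extended_Index) return Cursor is
begin
   --  Nothing global is read or written; the result merely designates
   --  the container and records the index, so concurrent calls on the
   --  same container object cannot interfere with one another.
   return (Container => Container'Unrestricted_Access,  --  GNAT-specific
           Index     => Index);
end To_Cursor;
```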
Containers that allow random access such as vectors are ideally suited for parallel processing, but other containers such as Doubly_Linked_Lists can still be processed in parallel. For those cases, it can be worthwhile if the amount of time needed to iterate through the container is small relative to the amount of time needed to process the elements in the container.
In this case, one thread of control can be used to sequentially iterate through the entire container, while other threads of execution can be started as the iterating thread reaches the next chunk of iterations to be processed in parallel. These parallel threads then iterate through their respective chunks applying the processing in parallel.
Since cursors are no longer considered to overlap with their container, concurrent calls on Next, Previous, and Element are now required to be reentrant whenever the cursors designate different elements. As these operations are all that is needed for this type of parallel processing, no container-specific wording is needed for these types of containers.
This AI provides the bare minimum to allow parallel processing of the container libraries. All other rules involving tampering and parameter overlapping still apply, which means that inserting elements into, deleting elements from, or replacing elements of a container is not allowed while parallel operations on that container are occurring concurrently.
Calls such as Replace_Element cannot be made concurrently because of the tampering rules. Many of the container library calls could be made reentrant without requiring locking, particularly if tampering checks are suppressed.
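For reference, the suppression identifiers discussed in the appendix look like the following. These are GNAT-specific and not part of the standard; they are shown only to ground the discussion of suppressed tampering checks.

```ada
--  Implementation-specific (GNAT, per the appendix discussion); these
--  would typically appear in a configuration pragmas file.
pragma Suppress (Tampering_Check);   --  omit only the tampering checks
pragma Suppress (Container_Checks);  --  omit all container run-time checks
--  With tampering checks suppressed, the shared tamper counters are not
--  updated at all, removing both the lock contention and the counter
--  corruption observed when a container is accessed by multiple tasks.
```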
A full parallelism solution is needed for parallel loops and container iteration and usage, but that is expected to be covered by a different AI or set of AIs that deal with other issues such as eliminating data races.
This AI just allows some basic parallelism to occur with minimal impact to the standard.
!examples
As an example of usage for a Vector container, consider the problem of adding two vectors together to produce a result vector. Suppose a user-defined parallelism library call exists with the following profile:
   procedure Parallel_Loop
     (From    : Natural;
      To      : Natural;
      Process : not null access procedure (Start, Finish : Natural));
Here From and To identify the full range of iteration for the loop, and Process identifies a user-supplied routine to process a separate "chunk" of loop iterations.
The following code fragment should compile and execute to generate the expected results.
   declare
      N : constant := 1_000_000;
      subtype Loop_Index is Natural range 1 .. N;

      package Integer_Vectors is new Ada.Containers.Vectors
        (Index_Type => Loop_Index, Element_Type => Integer);

      Vector1, Vector2 : constant Integer_Vectors.Vector :=
        Integer_Vectors.To_Vector
          (New_Item => 1, Length => Ada.Containers.Count_Type (N));

      Result : Integer_Vectors.Vector :=
        Integer_Vectors.To_Vector (Length => Ada.Containers.Count_Type (N));

      procedure Process (Start, Finish : Natural) is
      begin
         for I in Start .. Finish loop
            Result (I) := Vector1 (I) + Vector2 (I);
         end loop;
      end Process;

   begin
      Parallel_Loop (From => 1, To => N, Process => Process'Access);
   end;
As an example of processing a Doubly_Linked_List in parallel, consider the problem of applying the same update to all elements in a linked list.
For this example, assume a different user-defined library exists with the following profile:
   generic
      type Cursor is private;
   procedure Generic_Parallel_Loop
     (From       : Cursor;  -- First cursor of the container
      Iterations : Natural;
      Get_Cursor : not null access function
                     (Position : Cursor; Offset : Natural) return Cursor;
      Process    : not null access procedure
                     (Iterations : Natural; Work_Item : in out Cursor));
From is passed a cursor corresponding to the first element in the container. Iterations identifies the number of elements in the container. The Get_Cursor function is a user-supplied function that generates a new cursor as an offset from an existing cursor. The Process parameter identifies the user-supplied processing that is to be applied to each element in the container.
   declare
      N : constant := 1_000_000;

      package Integer_Lists is new Ada.Containers.Doubly_Linked_Lists
        (Element_Type => Integer);

      procedure Parallel_Loop is new Generic_Parallel_Loop
        (Cursor => Integer_Lists.Cursor);

      function Get_Cursor
        (Position : Integer_Lists.Cursor;
         Offset   : Natural) return Integer_Lists.Cursor
      is
         New_Position : Integer_Lists.Cursor := Position;
      begin
         for I in 1 .. Offset loop
            New_Position := Integer_Lists.Next (New_Position);
         end loop;
         return New_Position;
      end Get_Cursor;

      Integer_List : Integer_Lists.List;

      procedure Process
        (Iterations : Natural;
         Work_Item  : in out Integer_Lists.Cursor) is
      begin
         for I in 1 .. Iterations loop
            Integer_List (Work_Item) := Integer_List (Work_Item) + 1;
            Integer_Lists.Next (Work_Item);
         end loop;
      end Process;

   begin
      Integer_List.Append
        (New_Item => 1, Count => Ada.Containers.Count_Type (N));
      Parallel_Loop
        (From       => Integer_List.First,
         Iterations => N,
         Get_Cursor => Get_Cursor'Access,
         Process    => Process'Access);
   end;
!ASIS
No impact.
!ACATS test
An ACATS C-Test can be constructed to check that parallel access works, but the probability of such a test detecting an error is low.
!appendix

From: Brad Moore
Sent: Thursday, June 2, 2016  1:49 AM

My task was to look specifically at referencing separate elements of a container
in parallel. The example to consider is adding two Vectors element by element to
produce a third result vector.

I tried this with the GNAT implementation of Ada.Containers.Vectors, and found
that the issue shows itself during finalization of the container, as the
tampering counts get messed up which concurrent access, and should be zero when
the Vector is finalized but are not.

I found I could correct this issue, by wrapping the updates of these tampering
checks with lock-free atomic primitives.

Using a lock free compare and swap operation, provides the needed safety, while
introducing minimal overhead.

Is this the sort of solution we are looking for?

Another idea would be to have a way to disable the tampering checks altogether
such as via an aspect or pragma.

Eg. pragma Suppress (Tamper_Check).

This would have the benefit of providing a faster implementation of containers,
which I recall was a concern raised in the past, but at the expense of safety.
This would also likely result in eliminating the finalization bug mentioned
above for concurrent access, since there would be no tampering flags to get
messed up, and otherwise the concurrent access seems to work fine for the
problem at hand.

Perhaps both solutions would be needed, the first provides better guarantees
that the tamper checks are working properly, for concurrent usage, and the
second allows more efficient processing when tampering is known not to occur in
the users code?

****************************************************************

From: Randy Brukardt
Sent: Thursday, June 2, 2016  6:35 PM

I don't think the exact implementation is particularly relevant (other than to
provide a proof of concept). I trust implementers to figure that sort of thing
out. Especially as the Standard has no such thing as "tampering counts"; they
themselves are an implementation artifact.

What we do need, however, is to figure out what rules need to be changed in the
Standard in order to provide the needed operations. Which operations are those
exactly? Making everything task-safe is clearly too expensive (and seems like
overkill to me).

> Another idea would be to have a way to disable the tampering checks
> altogether such as via an aspect or pragma.
>
> Eg. pragma Suppress (Tamper_Check).

That idea was suggested long ago, supposedly is implemented in GNAT (I say
"supposedly" only because I've never had any reason to try it), and is currently
assigned to Bob Duff to write up. Ergo, it's hardly novel here, and probably
should only be mentioned in your write-up for completeness. (If Bob never
follows through, we can revisit it.)

****************************************************************

From: Tucker Taft
Sent: Thursday, June 2, 2016  9:43 PM

> That idea was suggested long ago, supposedly is implemented in GNAT (I
> say "supposedly" only because I've never had any reason to try it),
> and is currently assigned to Bob Duff to write up.

Yes it is implemented.  The relevant Suppress identifiers are:

   Container_Checks
      -- controls all sorts of run-time checks on containers, including
      -- tampering
   Tampering_Check
      -- controls only the tampering check

If you suppress either one, the tampering checks are omitted.

****************************************************************

From: Brad Moore
Sent: Friday, June 3, 2016 10:28 AM

Wow. That makes quite a difference!

(About 10 times faster for adding two Vectors of 100_000_000 integers for the
sequential case)

I also notice that the just released GNAT GPL version also fixed the problems I
was seeing with vector finalization for the parallel case caused by corruption
of the tampering counts.

I think these suppress pragmas were not available in last year's GPL release.

Interesting, when I try to parallelize this, with tampering checks enabled, it
actually takes quite a bit longer than sequential. 1 min 6 seconds instead of 40
seconds. I believe this is because of all the locking needed to make the
tampering checks work correctly. (It appears GNAT is using lock free mechanisms
similar to the approach I described, but there is overhead with using these
locks when there is enough contention for the locks.)

With tampering checks disabled however, and 4 cores, I see 4 times faster than
sequential. The run took 1 second instead of 4 seconds. (or 66 times faster for
the parallel case, when tampering checks are enabled. )

So it seems to me that the approach described in AI12-0139-1 of creating a copy
of the entire Ada libraries under a Ada.Protected parent package is unnecessary,
(at least to allow parallel usage of the containers) But we need the Suppress
pragmas to make parallelizing the containers worthwhile. Assuming Bob will
eventually write that part up, I will focus on a writeup that just says that the
containers should work for parallel processing.

****************************************************************

From: Tucker Taft
Sent: Friday, June 3, 2016  10:36 AM

...
> (About 10 times faster for adding two Vectors of 100_000_000 integers
> for the sequential case) ...
> With tampering checks disabled however, and 4 cores, I see 4 times
> faster than sequential.
> The run took 1 second instead of 4 seconds. (or 66 times faster for
> the parallel case, when tampering checks are enabled. )

Glad to hear that!

> So it seems to me that the approach described in AI12-0139-1 of
> creating a copy of the entire Ada libraries under a Ada.Protected
> parent package is unnecessary, (at least to allow parallel usage of
> the
> containers) But we need the Suppress pragmas to make parallelizing the
> containers worthwhile. Assuming Bob will eventually write that part
> up, I will focus on a writeup that just says that the containers
> should work for parallel processing.

Sounds good.

****************************************************************

From: Brad Moore
Sent: Friday, September 23, 2016  7:35 PM

Here is my homework for ai12-0196-1. [This is version /02 of the AI - Editor.]

****************************************************************

From: Bob Duff
Sent: Monday, September 26, 2016  8:09 AM

Looks good.

Are things like First reentrant?  Do we need to say so?

> overlaps with the element in the container that it designates. So song 
> as
                                                                    ^^^^ "So long as"

> cursor operations are not applied to two or more cursors designating 
> the same

>       Parallel_Loop (From           => Integer.First,

Integer_List.First

****************************************************************

From: Brad Moore
Sent: Monday, September 26, 2016  10:05 AM

> Are things like First reentrant?  Do we need to say so?

First/Last is not reentrant since it has a Container parameter, and also
because it is not needed for parallelism. For Vectors, To_Cursor would be used
to generate all the cursors from index values. Length would be called before
the parallelism to determine the chunk boundaries. For the other containers,
First (or Last) is used to obtain the initial cursor, but that is done
sequentially before the parallelism starts, and is not called again.

To add some basic parallelism capabilities to the containers, we only need
Next, Previous, Element, and To_Cursor to be reentrant.

We would really like calls like Replace to be reentrant as well, but I'm
thinking we would want to rely on the capabilities of another AI, such as
the Global aspects, to make this possible. I suppose we could just say that
Replace is reentrant, but it might be tricky to explain why or how it is.

I think it would probably be good to mention this, probably as an AARM note
or ramification.

****************************************************************

From: Randy Brukardt
Sent: Monday, October 3, 2016  9:12 PM

Question: Is "reentrant" sufficiently defined to use it in the way that it is
used in A.18.2(125) here: "To_Cursor is reentrant."

A 3(4) sort of defines it, but I don't see how that definition is sufficient
to apply it as a subprogram requirement. After all, it says that all
language-defined subprograms are reentrant, so that would make the above
statement the null statement. I'm pretty sure that's not what we want.

My presumption from the meeting is that we were going to define the term
"reentrant subprogram" in some meaningful way and then we could apply that
to To_Cursor and any other subprograms that need it.

Perhaps an alternative would be to say that "for the purposes of determining
whether To_Cursor is required to be reenterant, the Container parameter is
considered to not overlap with any other object Redundant[even with itself]".
Since this seems to be the only place where you need to override the default
meaning of A 3(4).

---

I have to wonder if it would be better to define that all language-defined
functions with Global = null (what Brad mentions as a "pure function") are
fully reenterant. Then apply Global = null to To_Cursor and the other functions
in question. Maybe too radical.

---

OTOH, the description of To_Cursor is pretty simplistic. It says:

  To_Cursor however, is trivial to implement as a reentrant routine since it
  only needs to contruct an object that designates the container and contains
  an index to the vector. To_Cursor has no side effects, and is thus what we
  might call a pure function, if the language defined such a thing. We specify
  in the wording that this subprogram is reentrant to ensure that it can be
  safely called.

This assumes a very particular implementation of both cursors and To_Cursor.

First of all, a vector cursor needs to have some sort of reference to the
element, but it doesn't necessarily need to be the index of the element. It
could use a pointer at the element instead (Matt and I had a long discussion
on this point). This implementation would necessarily make To_Index more
expensive, but that could be implemented with an address calculation, and it
certainly would make element access by cursor cheaper. That's especially true
if the implementation choose a chunked allocation implementation rather than
allocating one large glob. (That would still meet the O(n) requirements if the
chunk size were fixed and it used a two-level index. Such an implementation
would be optimized for Vector expansion, which might be valuable in some
circumstances.) Also note that if both of these things are true, creating a
cursor will require reading data from the container object.

Similarly, the Janus/Ada cursors will include serial numbers used to detect
dangling cursors. Those serial numbers will need to be initialized from the
matching element in the vector -- so again To_Cursor will need to read the
container object. (Indeed, that is true for all cursor operations that create
a new cursor, including Next and Previous.)

We definitely don't want to prevent implementations like either of these. So
we need to be careful about what assumptions we make about the implementation.
(I could imagine To_Cursor having a side-effect; luckily neither of these
implementations do. It might be OK to ban such things, but we have to remember
that we're adding additional constraints on implementations, and ensure that
we're not throwing out the baby with the bathwater.)

At the very least, the discussion of To_Cursor shouldn't refer to any
particular implementation model, but rather discuss the abstract properties of
a cursor.

---

Since Next, Previous, and Element are all that is needed for this type of
parallel processing, which are now all defined to be reentrant, no container
specific wording is needed for these types of containers.

???

This statement seems like nonsense. You seem to have left out a bunch of steps!

You ought to explain why Next, et. al. are "now defined to be reentrant", since
there is no wording that does that and the logic needed to figure that out is
not obvious. (That is, say something about cursors no longer being overlapping,
so these functions are now required to be reenterant. The normative wording is
all about overlapping, after all.)

---


Typos: paragraph 5 of the discussion: "(eg. next, previous, and Elemen)," the
capitalization is wrong and a T is missing.

Bob had two typos as well.

I updated the AI fixing these typos after posting your original version.

****************************************************************

Questions? Ask the ACAA Technical Agent