Version 1.20 of ai12s/ai12-0119-1.txt

Unformatted version of ai12s/ai12-0119-1.txt version 1.20
Other versions for file ai12s/ai12-0119-1.txt

!standard 5.5.2(2/3)          18-06-23 AI12-0119-1/10
!class Amendment 14-06-20
!status work item 14-06-20
!status received 14-06-17
!priority Medium
!difficulty Hard
!subject Parallel operations
!summary
We propose new syntax and semantics to facilitate parallelism via Parallel Loops and Parallel Blocks.
!problem
The increased presence of parallel computing platforms brings concerns to the general purpose domain that were previously prevalent only in the specific niche of high-performance computing. As parallel programming technologies become more prevalent in the form of new emerging programming languages and extensions of existing languages, safety concerns need to consider the paradigm shift from sequential to parallel behavior.
Ada needs mechanisms to allow the programmer to specify explicity in the code where parallelism should be applied. In particular, loops are often a good candidate for parallelisation, but there also needs to be a way to specify parallelism more generally to allow potentially different groups of calls to be executed in parallel with other groups of calls. While Ada tasks can be used to generate parallel execution, they are better suited for separating concerns of independent processing where the need for parallelisation is mostly a perforamance concern, where it should be easy to convert an existing sequential algorithm into a parallel one, or vice versa. Declaring tasks for this purpose is too error prone, requires significant rework, more difficult to mainain, and not as easily transferable to different hardware platforms. It is better to rely more on the compiler to map the parallelism to the target platform.
!proposal
We propose two parallel constructs, namely parallel blocks and parallel loops. A parallel block consists of a set of concurrent activities each specified by a handled sequence of statements, separated by the reserved word "and," analogous to the syntax for a select statement where the alternatives are separated by the reserved word "or." A parallel loop defines a loop body which is designed such that the various iterations of the loop can run concurrently. The implementation is expected to group the iterations into "chunks" to avoid creating an excessive number of physical threads of control, but each iteration is nevertheless considered for most purposes as its own separate logical thread of control.
Both constructs start with the new reserved word "parallel" to clearly indicate that these constructs are designed for parallel execution. The implementation might still not execute the constructs in parallel, but the intent is that if multiple processors are available, some or all of them should be allocated to the execution of the construct.
!wording
Add to the end of Introduction (28):
Various parallel constructs, including parallel loops and parallel blocks, support the initiation of multiple logical threads of control designed to execute in parallel when multipler processors are available.
Modify 2.9(2/3):
Add "parallel" to the list of reserved words.
Add a sentence to the end of 5.1(1):
A /parallel construct/ is a construct that introduces additional logical threads of control (see 9) without creating a new task. Parallel loops (see 5.5) and parallel_block_statements (see 5.6.1) are parallel constructs.
Modify 5.1(5/2):
add "parallel_block_statement" to the list of compound statements.
Add after 5.1(15):
Within a parallel construct, if a transfer of control out of the construct is initiated by one of the logical threads of control, an attempt is made to /cancel/ all other logical threads of control initiated by the parallel construct. Once all other logical threads of control of the construct either complete or are canceled, the transfer of control occurs. If two or more logical threads of control of the same construct initiate such a transfer of control concurrently, one of them is chosen arbitrarily and the others are canceled.
When a logical thread of control is canceled, it causes it to complete as though it had performed a transfer of control to the point where it would have finished its execution. Such a cancellation is deferred while it is executing within an abort-deferred operation (see 9.8), and may be deferred further, but not past a point where the logical thread initiates a new nested parallel construct, or reaches an exception handler.
Bounded (Run-Time) Errors
During the execution of a parallel construct, it is a bounded error to invoke an operation that is potentially blocking (see 9.5). Program_Error is raised if the error is detected by the implementation; otherwise, the execution of the potentially blocking operation might proceed normally, or it might result in the indefinite blocking of some or all of the logical threads of control making up the current task.
Replace 5.5(3/3):
Iteration_scheme ::= while condition | [parallel] for loop_parameter_specification | for iterator_specification
Add after 5.5(6/5):
[editor: this rule is still part of the syntax rules] An iteration_scheme that begins with the reserved word parallel shall not have the reserved word reverse in its loop_parameter_specification.
Modify 5.5(7):
For the execution of a loop_statement, the sequence_of_statements is executed [repeatedly,] zero or more times, until the loop_statement is complete. The loop_statement is complete when a transfer of control occurs that transfers control out of the loop, or, in the case of an iteration_scheme, as specified below.
Modify 5.5(9/4):
For the execution of a loop_statement with the iteration_scheme being for loop_parameter_specification, the loop_parameter_specification is first elaborated. This elaboration creates the loop parameter and elaborates the discrete_subtype_definition. If the discrete_subtype_definition defines a subtype with a null range, the execution of the loop_statement is complete. Otherwise, the sequence_of_statements is executed once for each value of the discrete subtype defined by the discrete_subtype_definition that satisfies the predicates of the subtype (or until the loop is left as a consequence of a transfer of control). Prior to each such iteration, the corresponding value of the discrete subtype is assigned to the loop parameter. {If the reserved word parallel is present (a /parallel/ loop), each iteration is a separate logical thread of control (see 9), with its own copy of the loop parameter; otherwise the} [These] values are assigned {to the loop parameter} in increasing order unless the reserved word reverse is present, in which case the values are assigned in decreasing order.
Add after 5.5(9/4):
AARM Implementation note: -- Each iteration of a parallel loop is a separate logical thread of control. Normally these logical threads of control will be grouped into "chunks" for the purpose of execution to reduce the cost of creating large numbers of (physical) threads of control.
Add after 5.5(21):
Example of a parallel loop:
-- See 3.6(30/2) parallel for I in Grid'Range(1) loop
Grid(I, 1) := (for all J in Grid'Rang(2)e => Grid(I,J) = True);
end loop;
Add a new subclause:
5.6.1 Parallel Block Statements
A parallel_block_statement comprises two or more handled_sequence_of_statements separated by and where each represents an independent activity that is intended to proceed concurrently with the others.
Syntax
parallel_block_statement ::= parallel do handled_sequence_of_statements and handled_sequence_of_statements {and handled_sequence_of_statements} end do;
Static Semantics
Each handled_sequence_of_statements represents a separate logical thread of control that proceeds independently and concurrently. The parallel_block_statement is complete once every one of the handled_sequence_of_statements has completed, either by reaching the end of its execution, or due to a transfer of control out of the construct by one of the handled_sequence_of_statements (see 5.1).
AARM Implementation Note -- Each handled_sequence_of_statements of a parallel block is a separate logical thread of control. The implementation may choose to combine two or more such logical threads of control into a single physical thread of control to reduce the cost of creating numerous physical threads of control.
Examples
procedure Traverse
(T : Expr_Ptr) -- see 3.9
is begin
if T /= null
and then T.all in Binary_Operation'Class -- see 3.9.1
then -- recurse down the binary tree
parallel do Traverse (T.Left); and Traverse (T.Right); and Ada.Text_IO.Put_Line ("Processing " & Ada.Tags.Expanded_Name (T'Tag)); end do;
end if;
end Traverse;
function Search (S : String; Char : Character) return Boolean is begin if S'Length <= 1000 then -- Sequential scan return (for some C of S => C = Char); else -- Parallel divide and conquer declare Mid : constant Positive := S'First + S'Length/2 - 1; begin parallel do for C of S(S'First .. Mid) loop if C = Char then return True; -- Terminates enclosing "do" end if; end loop; and for C of S(Mid + 1 .. S'Last) loop if C = Char then return True; -- Terminates enclosing "do" end if; end loop; end do; -- Not found return False; end; end if; end Contains;
Modify 9(1/3):
The execution of an Ada program consists of the execution of one or more tasks. Each task represents a [separate thread of control] {separable activity} that proceeds independently and concurrently between the points where it interacts with other tasks. {A single task, when within the context of a parallel construct, can represent multiple logical threads of control which can proceed in parallel; in other contexts, each task represents one logical thread of control.
}The various forms of task interaction are described in this clause, and include:
Add a sentence to the end of 9(10):
... In the context of a parallel construct, a single task can utilize multiple processing resources simultaneously.
Modify 9(11) (aka "Note 1"):
1 Concurrent execution may be implemented on multicomputers,
multiprocessors, or with interleaved execution on a single physical processor. On the other hand, whenever an implementation can determine that the required semantic effects can be achieved when parts of the execution of a [given task] {single logical thread of control} are performed by multiple physical processors acting in parallel, it may choose to perform them in this way.
Modify 9.10(1/3):
If two different objects, including nonoverlapping parts of the same object, are independently addressable, they can be manipulated concurrently by two different [tasks] {logical threads of control} without synchronization. Any two nonoverlapping objects are independently addressable if either object is specified as independently addressable (see C.6). Otherwise, two nonoverlapping objects are independently addressable except when they are both parts of a composite object for which a nonconfirming value is specified for any of the following representation aspects: (record) Layout, Component_Size, Pack, Atomic, or Convention; in this case it is unspecified whether the parts are independently addressable.
Modify 9.10(2):
[Redundant: Separate [tasks] {logical threads of control} normally proceed independently and concurrently with one another. However, task interactions can be used to synchronize the actions of two or more [tasks] {logical threads of control} to allow, for example, meaningful communication by the direct updating and reading of variables shared between [the tasks] {them}.] The actions of two different [tasks] {logical threads of control} are synchronized in this sense when an action of one [task] signals an action of the other [task]; an action A1 is defined to signal an action A2 under the following circumstances:
Modify 9.10(13):
Both actions occur as part of the execution of the same [task] {logical thread of control};
!discussion
It is important for the programmer to be able to indicate places in the code where parallelism is desired, which allows the reader to better understand the algorithm, and allows the programmer to receive confirmation from the compiler whether the desired parallelism is safe.
This proposed model does not define syntax for the explicit parallel invocation of an individual subprogram call, since a single subprogram call is rarely the best unit of parallelism. More frequently there is some setup and/or some post-call actions that should be grouped with the call, so the parallel block notation was felt the more appropriate way to perform a "parallel" call.
parallel Blocks ---------------
example: declare X, Y : Integer; Z : Float; begin
parallel do X := Foo(100); and Z := Sqrt(3.14) / 2.0; Y := Bar(Z); end do;
Put_Line("X + Y=" & Integer'Image(X + Y)); end;
In this example, the calculation of Z and Y occur sequentially with respect to each other, but concurrently with the calculation of X. Note that the compiler, using the rules specified in AI12-0267-1, may complain if the parallel sequences might have conflicting global side-effects, or if the parallel sequences might have Potentially Blocking operations.
The parallel block construct is flexible enough to support recursive usage as well, such as:
function Fibonacci (N : Natural) return Natural is X, Y : Natural; begin if N < 2 then return N; end if;
parallel do X := Fibonacci (N - 2); and Y := Fibonacci (N - 1); end do;
return X + Y; exception when others => Log ("Unexpected Error"); end Fibonacci;
We considered allowing the parallel block construct to be preceded with an optional declare part, but it was observed that it was more likely to be useful to have objects that are shared across multiple parallel sequences to outlive the parallel block. Therefore we reverted to the simpler syntax proposed above. Because there are no local declarations, there was also no point in having a statement_identifier (block label) for a parallel block.
We considered what the semantics might be for a parallel block if the parallel keyword were absent. This might be a good syntactic construct to use for supporting coroutines, for example. Rather than deal with that question in this AI, we leave that for consideration in separate AI's.
We considered whether blocking calls should be allowed in a parallel block statement. We felt that allowing that could add significant complexity for the implementor, as well as introduce safety concerns about potential deadlocking. While such a capability is possible, it was felt that it should be disallowed for now. If the demand and need is felt in the future, it could be added then, but it is better to not standardize that capability until we know it is needed.
By disallowing potentially blocking operations, we can permit the implementation to group the various "logical" threads of the parallel block into a smaller number of "physical" threads. If blocking operations were permitted, this could easily lead to deadlock.
parallel Loops --------------
In most compute-intensive applications, a significant proportion of the computation time is spent in loops, either iterating over arrays/container data structures, or systematically searching a large solution space. To benefit from parallel hardware, the computation associated with a loop should be spread across the available processors.
For example, here is a simple use of a parallelized loop, to add two arrays together to produce a result array:
parallel for I in Result'range loop Result (I) := A(I) + B(I); end loop;
Note: Allowing the parallel keyword on while loops was considered, but was discarded since while loops cannot be easily parallelized, because the control variables are inevitably global to the loop.
We have introduced the term "logical thread of control" to describe the new kinds of parallelism introduced by these parallel constructs, rather than "tasklet" or "executor" or other such term. In various existing rules, we have substituted "logical thread of control" for "task." Shared variables now include any variable shared between two logical threads of control, even if they are from the same task. Note that the compiler, using the rules specified in AI12-0079-1, may complain if the parallel sequences might have conflicting access to shared variables. We don't define such rules in this AI.
Note that there are other AIs that address the issues of incorporating a reduction operation with a parallel iterator (see AI12-0262-1) and with providing more control over "chunking" of parallel loops (see AI12-0251-1/2).
AI12-0267-1 provides rules to check for blocking operations and for race conditions within parallel constructs.
!ASIS
** TBD.
!ACATS test
ACATS B-Test and C-Test are needed for these features.
!appendix

From: Tucker Taft
Sent: Tuesday, June 17, 2014  10:18 AM

Here is the paper the "gang of 4" is submitting to HILT 2014 with a relatively
detailed proposal for parallel programming in Ada 202x:

     http://bit.ly/parallada-gangof4

If there is time at the upcoming ARG meeting, some discussion and feedback from
ARG members would be appreciated.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, June 17, 2014  12:52 PM

Are we going to get formal proposals for some/all of these features? For
instance, we have an open AI on global in/out (AI12-0079-1) which is certainly
more broadly applicable than just parallelism. There is also an AI for
Potentially_Blocking, AI12-0026-1, misnamed "Task_Safe", which is useful for any
program including tasking. In both of these cases, an important part of the
proposals will be how they interact with predefined and language-defined
subprograms. (If they're not specified for predefined "+", nothing useful could
be done, for example.)

As with all of these sorts of things, the devil is in the details. The basic
ideas look fine, but we'll never really know without detailed proposals that can
be tested against.

****************************************************************

From: Tucker Taft
Sent: Tuesday, June 17, 2014   1:54 PM

True, but it is not worth making a more detailed proposal if the overall
reaction is "yuck"!  ;-)

****************************************************************

From: Randy Brukardt
Sent: Tuesday, June 17, 2014   3:10 PM

Well, it looks great to me, but that's mainly because all of the details are
missing. A lot of stuff looks great without the details (it's how modeling
works, after all). The "yuck" only comes when the details are fleshed out...

****************************************************************

From: Bob Duff
Sent: Tuesday, June 17, 2014   5:52 PM

Sounds like there are some good ideas in the paper, but I am very strongly
opposed to ARG getting "detailed proposals" about how to do parallelism, or even
wasting time discussing the matter.  ARG should be moving more to the
"standardize existing practise" way of doing.  And that goes double for features
whose ONLY purpose is efficiency -- such as parallelism.

We (ARG) don't need detailed proposals in that area; we need an experimental
implementation, and MEASUREMENTS of the efficiency of realistic programs.  (And
don't try to show me speedups of the recursive Fibonacci -- that's downright
dishonest!)  Until then, ARG should avoid wasting any time on the matter.  Once
we have proven practise, then ARG should take charge of standardizing it.

My comments above don't apply so much to things like Globals, which are more
about correctness/safety than efficiency.  ARG could reasonably discuss Globals,
although even there, an experimental implementation would be helpful.

Regarding those: You can put Globals and Potentially_Blocking on a package.
That's good.  I think it should apply to child and nested packages of that
package. I think you also want a way to give a default for Overlaps_Storage: "no
overlaps unless specified, in this package and children".

****************************************************************

From: Robert Dewar
Sent: Wednesday, June 18, 2014  2:52 AM

> Sounds like there are some good ideas in the paper, but I am very
> strongly opposed to ARG getting "detailed proposals" about how to do
> parallelism, or even wasting time discussing the matter.  ARG should
> be moving more to the "standardize existing practise" way of doing.
> And that goes double for features whose ONLY purpose is efficiency --
> such as parallelism.

I am in full agreement with this.

At the last RTW, which I attended, I was struck by the fact that the RTW was
debating removing some features in Annex D which no one has ever implemented,
yet alone found useful in actual practice.

The RM is not a research proposal :-)

****************************************************************

From: Alan Burns
Sent: Wednesday, June 18, 2014  5:59 AM

The policy currently in Ada has been implemented by the Ada run-time from
Santandar, and is supported in EDF operating systems (not Ada), though I agree
these are mainly research and experimental platforms, not mainstream industrial
platforms.

****************************************************************

From: Robert Dewar
Sent: Wednesday, June 18, 2014  10:39 AM

But I am talking about the specific Ada features, which for sure have not been
used by anyone in an Ada program :-)

****************************************************************

From: Tucker Taft
Sent: Wednesday, June 18, 2014  8:38 AM

> The RM is not a research proposal :-)

I agree that many language standards come after the fact, and are used to try to
tame a wildly varying set of ad hoc extensions.  That has never been the story
with Ada.  There are relatively few extensions (other than as packages, pragmas,
or attributes) that Ada vendors or users try in any serious way, before the Ada
Rapporteur Group begins to talk seriously about them.  The ideal is that the ARG
gets the ball rolling, indicates a direction we are serious about, and then
vendors try prototyping the ideas, and then we come around to picking among the
"serious" ideas based on prototyping results.  I really don't see anyone beside
the ARG being taken seriously at this point as a fountain for "ideas worth
prototyping" (apologies to TED ;-).

****************************************************************

From: Stephen Michell
Sent: Sunday, June 22, 2014  10:24 PM

Attached is a report that ARG requested from Brad, Tucker  and myself in
investigating how OpenMP could be used to implement Ada parallelism. For
discussion at the ARG meeting this week.

-----

				Report to WG 9/ARG
 		  on the use of OpenMP for Parallelism in Ada

Stephen Michell
with contributions from
Luis Miguel Pinho
Brad Moore
Tucker Taft


Foreward

At the ARG meeting November 2014, ideas were presented about ways to
implement parallelism in Ada, and ways to interoperate with other
language systems while using the multicore capabilities of modern
computers. We were requested to investigate OpenMP from a technical
perspective as well as a business perspective as a potential solution
to implementing multicore parallelism in Ada.


OpenMP Characteristics

OpenMP is a set of libraries and language extensions that permit a
user to access the capabilities of multicore systems from within
certain languages - namely Fortran, C and C++.

OpenMP's parallelism is "explicit parallelism" - i.e. all aspects of
the parallelism are specified by compiler directives in the code that
invoke library routines at the place of the directive through OpenMP API's.

OpenMP lets the user specify code segments (usually "for loops",
blocks and functions) that can be executed by threads (possibly
operating on multiple cores) that are co-ordinated to implement an
algorithm. The language extensions are converted into library calls to
interact with a runtime to implement the algorithm.

In trying to implement OpenMP on an Ada system, one would note that
the concurrency model is quite different than Ada's. The
"threads"  that are created interact with other threads that use OpenMP
primitives. This means that they will not interact with Ada tasks
using Ada tasking primitives.

For the interoperability with Fortran, C and C++ systems that are
using OpenMP, an Ada OpenMP capability would be useful, whetehr or not
it interacted with Ada tasks. This brings us to the business model.

Business Model

To implement An OpenMP for Ada, one must participate in the OpenMP
community. One could develop compiler directives, either as aspects or
pragmas and convert calls to meet the API, but to have it
"standardized", one must be a full participant in the OpenMP group.

One can participate in OpenMP as a university, as an individual or as
a corporate participant. AdaCore was approached to see if they were
interested in joining OpenMP (and paying fees), but declined. There is
no point joining as individuals, as individuals cannot propose new
language specifications for standardization.

This leaves participation through a university. Barcelona
Supercomputing Group is a participant, and Miguel Pinho has approached them about
participating through Barcelona. Barcelona is receptive, but
participation requires that Miguel find a PhD student willing to
participate. To date there has been no final resolution.

There is a GNU OpenMP binding (GOMP) that supports the GNU Fortran, C
and C++ GNU compilers. Ada could define a thin binding to that
interface. This would still leave the challenge of defining Ada syntax
for the Open MP compiler directives and determining how such a library
would interact with the Ada runtime (Tasking is not the only issue -
elaboration issues are also a potential issue).

Conclusion

There are business reasons and technical reasons for creating an
OpenMP Ada implementation. If an interface with OpenMP through
Barcelona Supercomputing Group becomes possible, further exploration should
happen. Otherwise, no further action in this area is anticipated.

****************************************************************

From: Brad Moore
Sent: Monday, October 3, 2016  10:09 AM

[This is version /02 of the AI - Editor.]

Here is my attempt to refine ideas from the gang of 4 working on parallelism,
so we might have something more to talk about on this topic at the upcoming
ARG meeting. There are many ideas to choose from, so I chose what I thought
were the most promising, I'm not sure if all 4 of us would agree, though it is
possible that we might be in agreement, as we haven't yet gotten around to
deciding on the best way forward. Most of the writing related to parallel
blocks is unchanged from the previous version of this AI, and was mostly
written by Tucker. The updates I have are mostly to do with parallel loops
and reduction.

I have attempted to put in wording for these changes, but the wording is not
polished. The wording hopefully does capture and describe the intent at least.

The main hope is to gain some feedback, to see if this direction makes sense,
or if other directions should be pursued.

It would probably also be good to discuss whether this parallel loop proposal
is an improvement over what was written in the previous version of the AI,
which discussed the use of parallel arrays.

****************************************************************

From: Brad Moore
Sent: Saturday, June 10, 2017  1:18 AM

[This is version /03 of the AI - Editor.]

Here is a bit of a redesign in the area of parallel loops, and a new approach,
parallel reduction expressions. The parallel block part is mostly unchanged.

The parallel loop design in much simpler. It now just does parallel iteration,
leaving parallel reductions to reduction expressions.

We can add reduction back to loops later, possibly in a different AI if we
decide we need it, but for the time being, it seems like reduction expressions
are a better fit for the problem.

Parallel reduction expressions are very much similar to quantified expressions,
which is where I started thinking about this idea. It turns out that Tucker was
also thinking along these lines in Parasail, as Parasail apparently already has
this feature.

After some email exchange with the gang of 4, the more this idea evolved, the
more similar it became to the Parasail approach.

The basic idea is that a Reduction expression applies an operation using
iteration, where the previous result of the last iteration is fed back as an
input into the next iteration.

Special syntax, "<value>" is used to indicate where this feedback occurs, which
also encloses the initial value to be used for the first iteration.

Here are a few examples that are in the AI, to give a basic idea.

    -- Sum of the elements of an array
    Sum : Integer := (for parallel Element of Arr => <0> + Element);

    -- Minimum value of an array
    Min : Integer :=
     (for parallel X of Arr => Integer'Min(<Integer'Last>,  X));

    -- Construct the alphabet
    Alphabet : Unbounded_String :=
      (for Letter in 'A' .. 'Z' => <Null_Unbounded_String> & Letter);

    -- Number of people who are 30 or older
    ThirtySomething : contant Natural :=
      (for P of parallel database =>
          <0> + (if Age(P) > 30 then 1 else 0));

    -- Factorial function
    function Fact(N : Natural) return Natural is
      (for J in 1..N => <1> * J);

    -- Taxable_Income : constant Float := Total_Income +
      (for Deduction of Deductions => <0.0> - Deduction);

    --  Compute Taylor expansion for Sin of X
    function Sin(X : Float; Num_Terms : Positive := 5) return Float is
      (for I in 1..Num_Terms =>
         <0.0> + (-1.0)**I * X**(2*I-1)/Float(Fact(2*I-1)));

    -- Compute Pi
    Number_Of_Steps : constant := 100_000;
    Step : constant := 1.0 / Number_Of_Steps;

    Pi : constant Long_Float := Step *
      (for I in parallel 1 .. Number_Of_Steps =>
          <0.0> + (4.0 / (1.0 + ((Long_Float (I) - 0.5) * Step)**2)));

    -- Print the sum of squares
    Put_Line ("Sum of Squares is" &
              Integer'Image(for I in 1 .. 10 => <0> + I**2));

To see the similarity between Quantified expressions and the more general form,
Reduction expressions, consider:

    -- Quantified expressions about the graduating class

    All_Graduated : constant Boolean :=
      (for all Student of parallel Class => Passed_Exam (Student));

    Someone_Graduated : constant Boolean :=
      (for some Student of parallel Class => Passed_Exam (Student));


    -- The same result can be obtained by writing these as
    -- reduction expressions

    All_Graduated : constant Boolean :=
      (for Student of parallel Class =>
         <True> and Passed_Exam (Student));

    Someone_Graduated : constant Boolean :=
      (for Student of parallel Class =>
         <False> or Passed_Exam (Student));

I've taken a first crack at writing this up, but didn't spend a lot of time
wordsmithing. I'd hope to first get some feedback in Vienna to see this idea has
merit.

****************************************************************

From: Randy Brukardt
Sent: Monday, June 12, 2017  5:35 PM

> Here is a bit of a redesign in the area of parallel loops, and a new
> approach, parallel reduction expressions. The parallel block part is
> mostly unchanged.

Here's a couple barely informed comments on this.

First, a bit of a rant. It's disconcerting that we get a wildly different
proposal at each meeting. There's no opportunity to get the details right if we
change the basic proposal completely between every meeting.

In Pittsburgh, we spent a significant amount of time figuring out the right
model for reducers. I see here that all of the conclusions of that meeting have
completely been discarded in favor of what appears on first glance the
reducer-de-jour. Why should we even spend time trying to figure out this wildly
different proposal when it seems obvious that whatever we decide will be
discarded all over again next time?

I want to at least feel like we are making progress toward a solution, rather
than starting from nearly ground zero with each proposal. End of mini-rant.

...
> Special syntax, "<value>" is used to indicate where this feedback
> occurs, which also encloses the initial value to be used for the first
> iteration.

Regardless of the merits of the proposal itself (which I haven't thought about),
I don't think angle brackets is going to fly as syntax in Ada. There would be
substantial confusion with relational operators, which would likely make such
text unparsable. (We discussed this extensively with some of the assignment
target proposals; it didn't fly there and I don't see what could have changed in
8 months.)

You could use square or curly brackets for this, but in any case, such syntax
doesn't seem very Ada-like. I'd suggest using an attribute or "magic" function
call as the appropriate syntax.

> Here are a few examples that are in the AI, to give a basic idea.
>     -- Minimum value of an array
>     Min : Integer :=
>      (for parallel X of Arr => Integer'Min(<Integer'Last>,  X));

     Min : Integer :=
      (for parallel X of Arr => Integer'Min(Integer'Last'Initial,  X));

or

     Min : Integer :=
      (for parallel X of Arr => Integer'Min(Initial(Integer'Last),  X));

Maybe there is a better idea out there (neither of these look very promising),
but whatever we do needs to look and feel like Ada.

****************************************************************

From: Brad Moore
Sent: Tuesday, June 13, 2017  9:21 AM

> First, a bit of a rant. It's disconcerting that we get a wildly
> different proposal at each meeting. There's no opportunity to get the
> details right if we change the basic proposal completely between every
> meeting.
>
> In Pittsburgh, we spent a significant amount of time figuring out the
> right model for reducers. I see here that all of the conclusions of
> that meeting have completely been discarded in favor of what appears
> on first glance the reducer-de-jour. Why should we even spend time
> trying to figure out this wildly different proposal when it seems
> obvious that whatever we decide will be discarded all over again next time?

I started with that proposal, but the more I tried to apply it to real problems,
the more problems I saw.

To describe the problem in a nutshell, for parallel reduction problems there are
two mostly independent algorithms to be expressed, which I call the "what" and
the "how"

The "what" answers the question: What is the application doing?

- This includes the application logic that mostly remains the same regardless
  whether parallelism is being applied or not. It also includes an expression of
  the desire that the application code should be optimized to run in parallel.

The "how" answers the question: How is the application code implemented?
e.g. How does the optimization such as parallelism happen? How is chunking
performed, how are the results being reduced and combined? With parallelism,
this is complex enough that it can be viewed as a separate algorithm being
overlaid on top of the application's algorithm.

I found that syntax that tries to express both "how" and "what" ends up being
messy, and too constraining.

One problem I could see, is that by using array syntax to hold chunk results,
implies that there is an array of chunk results that exists in time that can be
iterated over and examined. In the parallelism libraries I've been working with,
that isn't the case, the chunk results come and go and get combined during the
computation. There typically never is a point in time where all the chunk
results exist.

Having an array of chunk results implies that the reduction does not occur until
after the looping has completed for all the chunks. This is eliminating a
parallelism possibility where the reduction occurs in parallel during the
computation of the loop, rather than sequentially after the looping of all the
chunks is complete.

There are some algorithms where reducers can be written that depend on the
reduction occurring in parallel during the loop processing.

If we look at the example from the minutes, to sum the elements of an array we
have;

    My_Sum := 0;

    parallel [(Num_CPUs*2)]
       Sum : array (<>) of Integer := 0;
    for I in 1 .. 1_000_000 loop
       Sum (<>) := @ + A(I);
    then
       for S of Sum loop
          My_Sum := @ + S;
       end loop;
    end parallel loop;

The middle loop expands into two loops:
    for Chunk_Index in Sum’Range loop
       for I in Split(1..1000000, Chunk_Index, Num_CPUs*2) loop
          Sum(Chunk_Index) := @ + A(I);
       end loop;
    end loop;

The parallel array of Sum, the "then" keyword, and the explanation for the
middle loop demonstrates the intent that the reduction occurs after the main
loop processing.

The point of parallelism is to make the applications logic run as fast as
possible. Why are we attempting to go with syntax that is injecting sequential
processing when it isn't needed? It would be better more parallelism can be
applied, if it is possible, and it is.

Having an array of chunk results also implies that the chunking is more static
in nature, suggesting that the computation of chunks occurs before the loop is
executed, which rules out other implementation possibilities where the chunking
is more dynamic in nature, such as some of the work stealing algorithms where
chunks can be dynamically resized as workers become free, they steal work from
other chunks being processed by other workers. Why are we choosing syntax that
forces implementors to apply a limited range of possible approaches?

If the chunking is computed before the loop begins executing, then it implies
there can be quite a large number of chunk results, if there are a lot of loop
iterations to be chunked, as the chunks should be smaller to allow for better
load balancing between the cores. The number of chunks would be a calculation
similar to;

     Number_of_Cores * x, where x is some factor based on the number of iterations.

This could require larger storage needs, for the case where the reduction result
is a larger data type such as a large matrix.

With the parallelism libraries I have been working with, using dynamic chunking,
the only needs to be a single chunk result for each core.

These are just some of the problems I see with the design we came up with last
meeting.

I also find that there is a lot we expecting the reader to understand. -
- The concept of the parallel arrays
- That chunking is being applied
- That each chunk is going to a different element in the array
- That a second loop that looks like a nested loop is not actually nested, but
  is executed after the outer loop, etc


Being able to express the above problem as an expression such as;

    (for I of Arr => <0> + I)

Seems more useful, a lot simpler to parse for the reader, allows implementors
more freedom, allows for better parallelism, etc

It also seems like an expression is a better tool for the job, when it comes to
reduction problems.

The point of looping is to apply some processing multiple times.
The point of an expression is to produce a result.

I would argue that the point of reduction problems is mainly to produce a
result. Looping is needed as part of the implementation to produce the result,
but looping is not the main goal here. It is secondary.

Further, this reduction expression syntax can be compatible with whatever
parallel loop syntax we might decide to go with, including what we came up with
at the last ARG meeting. This syntax raises the question about whether we even
need reduction syntax to be combined with loop syntax however.

> I want to at least feel like we are making progress toward a solution,
> rather than starting from nearly ground zero with each proposal. End
> of mini-rant.

I feel like is considerable progress toward a solution. It's not that this is
just another alternative that more or less accomplishes the same thing as what
was proposed during the ARG meeting. To me, this feels like a new idea that is a
significant improvement over what we had before.

> ...
>> Special syntax, "<value>" is used to indicate where this feedback
>> occurs, which also encloses the initial value to be used for the
>> first iteration.
>
> Regardless of the merits of the proposal itself (which I haven't
> thought about), I don't think angle brackets is going to fly as syntax
> in Ada. There would be substantial confusion with relational
> operators, which would likely make such text unparsable. (We discussed
> this extensively with some of the assignment target proposals; it
> didn't fly there and I don't see what could have changed in 8 months.)

But we already have angle brackets in Ada, <> box notation appears in several
places in the syntax. This is just another case of box, except that we stick a
value in between.  Tucker has already indicated several other possible syntax
replacements here, so if we dont like the box notation, it shouldn't be too hard
to come up with something else.

I think we could potentially just use <> also, without providing an intial
value, and just say that there has to be at least one iteration, and the intial
value is the value of the first iteration.

> You could use square or curly brackets for this, but in any case, such
> syntax doesn't seem very Ada-like.

Its hard to see how <> isnt Ada-like, since its already in Ada.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, June 13, 2017 12:59 PM

...
> One problem I could see, is that by using array syntax to hold chunk
> results, implies that there is an array of chunk results that exists
> in time that can be iterated over and examined. In the parallelism
> libraries I've been working with, that isn't the case, the chunk
> results come and go and get combined during the computation. There
> typically never is a point in time where all the chunk results exist.

Arrays in this context is just a syntax, there's no reason that it has to be
materialized. So I find this concern mainly one about reading too much into the
syntax. One hopes that the array *isn't* materialized, and the actual semantic
rules allow that.

However, we want the syntax to be as easy as possible to understand by a normal
sequential programmer, and they certainly will understand the use of arrays in
this context.

...
> Being able to express the above problem as an expression such as;
>
>     (for I of Arr => <0> + I)
>
> Seems more useful, a lot simpler to parse for the reader, allows
> implementors more freedom, allows for better parallelism, etc

This seems very limited to me, as it limits what can be parallelized with
reductions to things that can be written as a single expression. It seems as
thought people will end up writing helper functions (which necessarily reduce
parallelism as the compiler can't see the contents) or extremely complex single
expressions (hard to read, hard to debug).

It also seems to indicate (along with the thought to add anonymous subprograms,
lets, and the like) that some people want to force Ada into being a functional
language. Since there is a vocal part of the Ada community that finds functional
languages to be evil, that could be a very controversial direction. (I'm
undecided on this point.)

> > ...
> >> Special syntax, "<value>" is used to indicate where this feedback
> >> occurs, which also encloses the initial value to be used for the
> >> first iteration.
> >
> > Regardless of the merits of the proposal itself (which I haven't
> > thought about), I don't think angle brackets is going to fly as
> > syntax in Ada. There would be substantial confusion with relational
> > operators, which would likely make such text unparsable. (We
> > discussed this extensively with some of the assignment target
> > proposals; it didn't fly there and I don't see what could have
> > changed in
> 8 months.)
>
> But we already have angle brackets in Ada, <> box notation appears in
> several places in the syntax. This is just another case of box, except
> that we stick a value in between.

Umm, no. <> is a lexical symbol. If you try to put "something in between", you
have (at least) three lexical symbols: "<" expression ">", and that has to be
parsed to be understood.

> Tucker has already indicated several other possible syntax
> replacements here, so if we dont like the box notation, it shouldn't
> be too hard to come up with something else.
>
> I think we could potentially just use <> also, without providing an
> intial value, and just say that there has to be at least one
> iteration, and the intial value is the value of the first iteration.

Almost anything is better than "<expr>". Maybe not "+expr+". :-)

****************************************************************

From: Brad Moore
Sent: Thursday, June 15, 2017  2:04 AM

>> One problem I could see, is that by using array syntax to hold chunk
>> results, implies that there is an array of chunk results that exists
>> in time that can be iterated over and examined. In the parallelism
>> libraries I've been working with, that isn't the case, the chunk
>> results come and go and get combined during the computation. There
>> typically never is a point in time where all the chunk results exist.
>
> Arrays in this context is just a syntax, there's no reason that it has
> to be materialized. So I find this concern mainly one about reading
> too much into the syntax. One hopes that the array *isn't*
> materialized, and the actual semantic rules allow that.

Its not just about the storage layout though. If it looks like an array, people
will tend to use it that way. This includes doing slices, copying, bit
operations (if its an array of bits), etc

I can imagine cases where one will want to iterate through the array initially
after loop processing to perhaps do intermediate processing and adjustments to
the values before applying the reduction. Nested functions may be needed to do
this. For example, one might want to apply data smoothing the values. The more
one tries to use this as an array, I think the danger is the more that ends up
having to be the implementation.

> However, we want the syntax to be as easy as possible to understand by
> a normal sequential programmer, and they certainly will understand the
> use of arrays in this context.

That part is pretty easy to understand, since it looks like existing Ada.
However, the chunking, the meaning of (<>), and the overall construct, I would
think will be quite foreign to sequential programmers, and existing Ada
programmers.

> ...
>> Being able to express the above problem as an expression such as;
>>
>>     (for I of Arr => <0> + I)
>>
>> Seems more useful, a lot simpler to parse for the reader, allows
>> implementors more freedom, allows for better parallelism, etc
>
> This seems very limited to me, as it limits what can be parallelized
> with reductions to things that can be written as a single expression.
> It seems as thought people will end up writing helper functions (which
> necessarily reduce parallelism as the compiler can't see the contents)
> or extremely complex single expressions (hard to read, hard to debug).

Similar to the rewriting of array syntax, I would think the compiler could also
do some rearranging of at least certain helper functions, such as inline
functions, to further optimize what actually gets generated by the compiler, if
it is needed.

> It also seems to indicate (along with the thought to add anonymous
> subprograms, lets, and the like) that some people want to force Ada into
> being a functional language. Since there is a vocal part of the Ada
> community that finds functional languages to be evil, that could be a very
> controversial direction. (I'm undecided on this point.)

I think an important point to consider is whether this syntax is appealing
enough to add to Ada. It seems to be useful for other things than parallelism,
for example it is convenient to be able to express factorial as an expression
function, where parallelism likely isn't needed or wanted.

It doesn't necessarily need to be the only way to write reductions, just as
quantified expressions can also be expressed in Ada in other ways, such as via
looping and function calls. We can also have a parallel loop solution, such as
the Pittsburgh proposal as well.

I think this important to discuss the parallel reduction syntax however, because
  a) If we do want to support this, this likely will impact the syntax proposal
     for parallel loop reductions
  b) Even if we decide to not support this, (at least not now), it offers some
     ideas that still could impact the parallel loop proposal.

In particular, I'm referring to the idea for the Reducer aspect, and Identity
aspect that can be applied to function declarations. This is useful to separate
the "what" from the "how" as I described in my previous email, where the "what"
syntax shows what application code to be parallelized, and the "how" syntax
which is separate, provides details on how this is to happen.

The reducer for "+" is always going to be "+" itself, so why force the
programmer to write this out every time one want to add numbers in parallel,
this can be baked into the standard in the form of aspects applied to the
function declarations?

This could eliminate the need for both the "then" clause as well as the need for
the parallel array, which could result in a much simpler syntax construct
involving only adding the work parallel to a loop. This also makes sense because
adding parallelism I think in a lot of cases will be a later step in the
development process of an application - The optimisation phase.

In that case you have existing sequential code to be parallelized, and it is
desirable to have to make minimal changes to the code to get the parallelism.

Further, one might find that adding the parallelism doesn't help performance, or
may even make things worse, in which case the programmer will want to back out
those changes. It would be desirable to be able to undo changes with minimal
effort. For example, just adding or removing the keyword "parallel" would be
appealing.

If we take the sequential code:

   declare
      Sum : Integer := 0;
   begin
      for I in 1 .. 1_000_000 loop
         Sum := @ + I;
      end loop;

      Put_Line ("Sum:" & Integer'Image (Sum));
   end;

And convert it to parallel just by adding the parallel keyword

   declare
      Sum : Integer := 0;
   begin
      for I in parallel 1 .. 1_000_000 loop
         Sum := @ + I;
      end loop;

      Put_Line ("Sum:" & Integer'Image (Sum));
   end;

I think that is closer to what programmers would want to be able to do.

We could say that as written, the compiler would reject this, because the global
aspect could indicate this is a data race for parallel operations.

We could then have a Boolean aspect, "Reduction" that can be applied to a
variable declaration that indicates that variable involves parallel reduction.
This would then allow the compiler to accept the loop, and provides better
feedback to the user that parallel reduction is occurring.

i.e.

   declare
      Sum : Integer with Reduction := 0;
   begin
      for I in parallel 1 .. 1_000_000 loop
         Sum := @ + I;
      end loop;

      Put_Line ("Sum:" & Integer'Image (Sum));
   end;


The compiler and the reader can see that the operation being applied to the
global result is

   function "+" (L, R : Integer) with Identity => 0;

Both can also see that this is also a reducer function because it accepts two
parameters of the same type, and returns a result of that same type, and because
it has the Identity aspect. So this shows how the results of the chunks are
combined. The compiler now has the necessary inputs to be able to implement the
parallelism, and the reader who is mostly only concerned with the application
logic can see that the code is to run in parallel, and in many cases, the
programmer will not care how this happens, but if he does he can look up the
reducer function for the loop, as well as the identity value for that function.

I would think the sequential programmer which prefer to see the above solution
for parallel loops, over the syntax proposed in Pittsburgh.

This also is more harmonious with the parallel reduction expression syntax as
currently written in the AI.

Note that writing as a reduction expression, isn't a data race because the
result is computed before it is assigned to the global value, so in that case,
the Reduction aspect would not be needed on the Sum declaration, though it could
be optionally added as a confirming aspect.

i.e. the above code could be replaced with;

   declare
      Sum : constant Integer := (for I in parallel 1 .. 1_000_000 => <> + I);
   begin
      Put_Line ("Sum:" & Integer'Image (Sum));
   end;

or even just;

   Put_Line ("Sum:" & Integer'Image (for I in parallel 1 .. 1_000_000 => <> + I));

****************************************************************

From: Brad Moore
Sent: Saturday, June 17, 2017 11:53 PM

I think we can further simplify and clarify the Reduction Expression syntax, and
at the same time eliminate some problems.

Specifically, we can eliminate the Reducer aspect and Identity aspect by saying
that the top level operation of the reduction expression must be a reducer
function. (i.e. a function that takes two arguments of the same type, and
returns a result of that same type)

In that case, we would not need to know of an identity value.
Rather, the result of the first iteration of any chunk effectively becomes the
initial value for that chunk. The accumulation back into the same result only
occurs if there are at least two iterations for a chunk.

The expression inside the box is then not the initial value, but instead is the
default value, whose only purpose is to provide the result of the expression if
the reduction expression iterates over a null range, i.e. 0 iterations.

A <> with no initial value would signify that there is no default value for the
expression, which implies that an exception would be raised if a null range were
specified.

This actually is needed for operations such as set intersection, where the
intersection of 0 sets is not defined mathematically, or at least difficult to
produce a value for, if we wanted it to mean the universe.

consider:

We want to find the intersection of animals that can be found in a set of zoos.

Set_Array(Vienna_Zoo).Include("Gorilla");
Set_Array(Vieena_Zoo).Include("Giraffe")
...
Set_Array(Calgary_Zoo).Include("Giraffe");
Set_Array(Calgary_Zoo).Include("Buffalo");


Common_Animals : constant Set := (for Set of Set_Array => Intersection(<>, Set));

It would be pretty much impossible to create an initial value for a zoo that
contains all known animals.

Since that case is only needed when there are zero zoos being examined, we do
not need to supply that value. Also in this case, the programmer knows that
there are at 1 or more zoos in the array, so the default value will not be
needed, and and exception would not be raised.

Probably it would make sense that the exception raised would be
Constraint_Error, if one did try to iterate with a Reduction expression that did
not provide a default value.

This solves another problem. Yesterday Erhard brought up the example of matrix
to the power of N.

Now that the Identity aspect is not needed, the Identity Matrix is only needed
as a default value for the reduction expression. This can now be trivially
supplied by the programmer by a function call.

eg.

   Matrix_To_The_Nth_Power : constant Matrix := (for I in 1 .. N => <Identity (A)> * A;

Where, Identity (A) is a function of a Matrix abstraction that returns the
Identity matrix for the matrix, A.

If we know that N is of subtype Positive, we can further simplify this to,

   Matrix_To_The_Nth_Power : constant Matrix := (parallel for I in 1 .. N => <> * A;

Since, the Identity matrix would not be needed to evaluate the expression.

You may ask, what about the case in the previous examples as currently written
in the AI where the top level operation is not a reducer function.

The answer is that it should be trivial to rewrite the expression so that a
reducer function is used.

Instead of;

    Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => Append(<>, Letter));

One could write;

    Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => <> & To_Unbounded_String ("" & Letter));

Does this seem like a way forward?

****************************************************************

From: Brad Moore
Sent: Sunday, June 18, 2017  12:26 AM

Another problem that would benefit from this simplication is bitwise "and" for
modular integer types.

It otherwise would be difficult to produce an initial value for modular integer
types where the modulus is not a power of two. Attempting to set all bits to 1
could result in a value that is outside the range of values for the type.

With this simplification, it should not be a problem anymore, since we are only
"and"ing together supplied values that are all valid, and the result will also
be valid and belong to the subtype.

eg.

   (for Value of Arr => <> & Value)

****************************************************************

From: Tucker Taft
Sent: Sunday, June 18, 2017  2:02 AM

I would rather leave it the way we discussed yesterday, as these changes in my
view are making the construct more complex from the user’s point of view.  It is
easy enough for the compiler to recognize special cases.  It is a pain for the
user to have to obey additional restrictions.

****************************************************************

From: Brad Moore
Sent: Sunday, July  9, 2017  1:23 AM

I have been thinking about iteration schemes, for for loops, and I think there
are 5 that we'd want to support in Ada. By iteration scheme, I am referring to
the combination of chunking and iteration direction.

The 5 are;

- Forward Sequential
- Reverse Sequential
- Forward Parallel
- Reverse Parallel
- All     Parallel


- Forward Sequential
Already exists in Ada. (A for loop without the reverse keyword) Sequential and
Forward are both implicit. Semantically, this is a single chunk, executed by a
single tasklet, with iteration from lowest to highest.

- Reverse Sequential
Already exists in Ada. (A for loop with the reverse keyword) Sequential is
implicit, Reverse is explicit. Semantically, also a single chunk, executed by a
single tasklet, with iteration from highest to lowest.

- Forward Parallel
Forward is implicit, parallel is explicit.
The for loops with the parallel keyword we were discussing in Vienna
Semantically, multiple chunks, each corresponding to a tasklet where iteration
within each chunk is from lowest to highest.

- Reverse Parallel
Both Reverse and Parallel are explicit.
A for loop with both the parallel keyword, and the reverse keyword.
Semantically, multiple chunks, each corresponding to a tasklet where iteration
within each chunk is from highest to lowest.

- All Parallel
Both all and Parallel are explicit.
A for loop with the parallel keyword, and the "all" keyword.
Semantically, each iteration is a separate chunk, or alternatively viewed as
there not being any chunks. There is no order to the iteration, any iteration
can occur before any other.

The case for reverse parallel would admittedly be quite rare, just as Reverse
Sequential loops are quite rare today in Ada, however it is not hard to come up
with an example.

Consider finding the highest index of an array whose element satisfies some
property of interest. If chunking is involved, it makes sense to want to search
in reverse order, because once you've found a candidate, you know that there is
no need to search further within the chunk, and processing can be abandoned.
With a forward sequential loop, one cannot be sure that they've found a
candidate without searching the entire chunk.

Eg.

    --  Note: We do not want exit to pro-actively terminate processing in
    --  other chunks here. We just want to exit processing for the
    --  current chunk if any of the exit conditions hold true.
    --  Updater is a protected object in this example.

    --  Find maximum array index whose element satisfies some property

    for I in reverse parallel Arr'Range loop

       -- This first check causes other chunks to stop processing
       -- if a higher chunk has already found a result
       if I < Updater.Value then
          exit;

       -- This second check abandons the current chunk if a
       -- result is found, since any other finds in the same
       -- chunk will have a lower index
       elsif Satisfies (Arr (I)) then
          Updater.Update (I);
          exit;
       end if;
    end loop;

    Put_Line ("The answer is" & Integer'Image (Updater.Value));

The all parallel case is useful when the programmer wants to do manual chunking.
This is useful for instance in problems such as generating a cumulative array
sum of another array, as two loops are needed and generally both loops should be
chunked the same way.

Another useful case for all parallel loops is when one wants to mix sequential
processing within the parallel processing, for instance if one wants to output
some results at the end of each iteration through the chunks. Consider the
following example to compute a 7 day weather forecast for various cities in the
world, as well as an average of all these values for each day.

    declare
       type Cities is (London,
                       Tokyo,
                       New_York,
                       Beijing,
                       Calgary,
                       Paris,
                       Los_Angeles);

       Today : constant Calendar.Time := Calendar.Clock;
       Forecast : array (Cities) of Float := (others => 0.0);

       Done_Parallelism, Done_Sequential : Synchronous_Barrier
         (Release_Threshold => Forecast'Length);

       function Max_Temp
          (City : Cities; Day : Calendar.Time) return Float is
       begin
         ...
          --  Presumably some complex computer intensive calculation....
          return Result;
       end Max_Temp;

       Notified : Boolean := False;
       use type Calendar.Arithmetic.Day_Count;
    begin

       Day_Loop :
       for Day in 1 .. 7 loop -- Compute 7 day forecast

          City_Loop :
          for City in all parallel Cities'Range loop

             -- Calculated in parallel for each city
             Forecast (City) :=
               Max_Temp (City,
                         Today + Calendar.Arithmetic.Day_Count (Day));

             Synchronous_Barriers.Wait_For_Release (Done_Parallelism,
                                                    Notified);

             -- Transitioning from parallel to sequential
             if Notified then

                Output_Loop :
                for City in Cities'Range loop
                   Put (Cities'Image (City) &
                          Float'Image (Forecast (City)) & ' ');

                   -- Note use of parallel reduction expression

                   Put ("World Avg" &
                          Float'Image
                          ((for Temp of parallel Forecast =>
                            <0.0> + Temp)
                           / Float (Forecast'Length)));
                end loop Output_Loop;
             end if;

            -- Transitioning from Sequential back to Parallel
             Synchronous_Barriers.Wait_For_Release (Done_Sequential,
                                                    Notified);
          end loop City_Loop;
       end loop Day_Loop;

    end;

The synchronous barriers are used to transition between parallel and sequential
processing, and then back to to parallel processing for each day of the
forecast. For this to work, each iteration in City_Loop must iterate in parallel
with each other since the Release_Threshold of the barriers needs to be known in
advance. The calculation of each forecast for each city for each day is done in
parallel, but presumably, the weather calculations are such that one cannot
proceed to the next day until all the weather forecasting is complete for the
previous day for all the cities.

I know in Vienna, we had taken a straw poll, which favored putting the keyword
parallel before the word "for" in the for loop. While this is a useful data
point, I think if we want to support "all parallel" loops, I find it makes a lot
more sense to have all parallel together after the word "in" or "of".

eg
    for I in all parallel 1 .. 10 loop

or

    for Element of all parallel List loop

rather than;

all parallel for i in 1 .. 10 loop
or
parallel all for I in 1 .. 10 loop


as it reads more like an English sentence, similar to reverse parallel loops.

for I in reverse parallel 1 .. 10 loop

instead of;

parallel for I in reverse 1 .. 10 loop

The other point that was not made in Vienna, is that there is a case for saying
that the syntax for specifying iteration scheme should be grouped together in
the same place in the syntax.

Having parallel at the beginning of the loop statement, and reverse in the
middle of the loop statement means you have to look in two places for the reader
to determine the iteration scheme.

To recap some of my other arguments that were made in Vienna, which might
resonate better with the above considerations

1. for ... loop  identifies the control construct which is the most important
   part of the syntax. That the iterations occur in parallel, reverse, forward,
   etc is secondary, which suggests that parallel is not the most important part
   of the syntax and therefore should not appear first.

2. This point was not specifically made in Vienna, but the parallel block is
   also a control structure, so in that case, it makes sense that parallel is
   the first word.

3. In Vienna, we decided that since parallel for loops with parallel at the
   front of the syntax would have some ambiguity with parallel blocks, we felt
   we had to add the Begin keyword to the parallel blocks. I did not like that
   we were being forced to alter our other syntax on account of the choice of
   parallel keyword placement in for loops. In the examples I have used I have
   found that any declarations typically need to have greater scope than the
   parallel block, so it is not clear to me that having a begin declaration area
   as part of the parallel block is particularly useful.

Comments?

****************************************************************

From: Tucker Taft
Sent: Sunday, July  9, 2017  8:08 AM

...
> Both Reverse and Parallel are explicit.
> A for loop with both the parallel keyword, and the reverse keyword.
> Semantically, multiple chunks, each corresponding to a tasklet where
> iteration within each chunk is from highest to lowest.
>

Given where we are placing the “parallel” word in the latest proposal discussed
by the ARG (at the beginning), there seems no particular reason to disallow
“reverse.”  But I don’t find your example very compelling:

...
> Consider finding the highest index of an array whose element
> satisfies some property of interest. If chunking is involved, it makes
> sense to want to search in reverse order, because once you've found a
> candidate, you know that there is no need to search further within the
> chunk, and processing can be abandoned. …

>
> Eg.
>
>   --  Note: We do not want exit to pro-actively terminate processing in
>   --  other chunks here. We just want to exit processing for the
>   --  current chunk if any of the exit conditions hold true.
>   --  Updater is a protected object in this example.
>
>   --  Find maximum array index whose element satisfies some property
>
>   for I in reverse parallel Arr'Range loop
>
>      -- This first check causes other chunks to stop processing
>      -- if a higher chunk has already found a result
>      if I < Updater.Value then
>         exit;
>
>      -- This second check abandons the current chunk if a
>      -- result is found, since any other finds in the same
>      -- chunk will have a lower index
>      elsif Satisfies (Arr (I)) then
>         Updater.Update (I);
>         exit;
>      end if;
>   end loop;
>
>   Put_Line ("The answer is" & Integer'Image (Updater.Value));

The semantics for exiting a loop early as currently proposed is that if any
iteration invokes an “exit,” all other iterations are abandoned.  So you
wouldn’t end up with the highest index overall.  You would end up with the one
that is the highest within its chunk, but of which chunk would be somewhat
random.  And along the way you have added a synchronized operation in each
iteration (if I < Updater.Value) which seems highly likely to slow things down
significantly.

One place where "reverse parallel" definitely makes sense is in the proposed
map/reduce construct, where the “parallel” is really about doing the “map” part
in parallel, and then doing the “reduction” part in some kind of parallel tree,
preserving order.  So “reverse” has a well-defined meaning.  I don’t see
“reverse" having much meaning in the more conventional   parallel iteration
schemes.  It just produces a bit of a bias within each chunk, but has no
particularly well-defined overall semantic effect.

...
> - All Parallel
> Both all and Parallel are explicit.
> A for loop with the parallel keyword, and the "all" keyword.
> Semantically, each iteration is a separate chunk, or alternatively
> viewed as there not being any chunks. There is no order to the
> iteration, any iteration can occur before any other.

I would say using “all” to signify a universal quantified expression as well as
this new idea is asking for confusion.

...
> The all parallel case is useful when the programmer wants to do manual
> chunking. This is useful for instance in problems such as generating a
> cumulative array sum of another array, as two loops are needed and
> generally both loops should be chunked the same way.
...
> The synchronous barriers are used to transition between parallel and
> sequential processing, and then back to to parallel processing for
> each day of the forecast. For this to work, each iteration in
> City_Loop must iterate in parallel with each other since the
> Release_Threshold of the barriers needs to be known in advance. The
> calculation of each forecast for each city for each day is done in
> parallel, but presumably, the weather calculations are such that one
> cannot proceed to the next day until all the weather forecasting is complete
> for the previous day for all the cities.

I guess I understand the semantics, but it just feels too complicated to me.
For something like this, two separate parallel loops, or an explicit array of
tasks might be more appropriate.  Trying to synchronize across tasklets is going
to be complex, and the semantics will be subject to a lot of caveats, I suspect.
Tasklets can generally use run-to-completion semantics, and that clearly won’t
work here with barriers in the middle.  If you want a barrier, you can end the
parallel loop, and then start a new one, which creates a natural, highly
visible, easy-to-understand barrier.

> I know in Vienna, we had taken a straw poll, which favored putting the
> keyword parallel before the word "for" in the for loop. While this is
> a useful data point, I think if we want to support "all parallel"
> loops, I find it makes a lot more sense to have all parallel together
> after the word "in" or "of”.  …

Personally I would rather not re-open this decision.  We have already gone round
and round on a number of these issues.  The usual rule is that only someone who
voted *for* the chosen approach can re-open the discussion.

****************************************************************

From: Brad Moore
Sent: Monday, July 10, 2017  2:24 PM

>> … - Reverse Parallel
>> Both Reverse and Parallel are explicit.
>> A for loop with both the parallel keyword, and the reverse keyword.
>> Semantically, multiple chunks, each corresponding to a tasklet where
>> iteration within each chunk is from highest to lowest.
>>
>
> Given where we are placing the “parallel” word in the latest proposal
> discussed by the ARG (at the beginning), there seems no particular reason to
> disallow “reverse.”  But I don’t find your example very compelling:

I will try to see if I can come up with some more compelling examples.
But I agree that reverse parallel loops are not likely to be used much.
It might be that the best use case is finding the max occurrence of something,
similar to the example I provided, which is not a common problem, but might be
common enough to allow the syntax to cover that case.

...
> The semantics for exiting a loop early as currently proposed is that if any
> iteration invokes an “exit,” all other iterations are abandoned.

Not quite, the AI adds this note:

"Note that aborting a tasklet need not be preemptive, but should prevent the
initiation of further nested parallel blocks or parallel loops."

So if the loop is running its chunks non-preemptively, then presumably the code
would work as I suggested.

I think there are two cases to consider.
1) A high number of iterations with small processing per iteration
      eg. (incrementing integer elements of a large array)
2) A small number of iterations with large processing per iteration.
      eg. (Calculating weather for a handful of cities)


In case 1, the majority of time is being spent doing the iteration. I suspect
that preemptive tasklet abortion in that case will introduce significant
overhead and interfere too much with the parallelism. This is better suited for
the non-preemptive case.

In case 2, the actual iteration is insignificant, and there is much to be gained
by preemptively aborting other tasklets.

I can imagine we might want to give the programmer control of this decision via
an aspect or pragma, such as Preemptive_Aborts but otherwise let the compiler
decide how to deal with such transfer of control.

> One place where "reverse parallel" definitely makes sense is in the proposed
> map/reduce construct, where the “parallel” is really about doing the “map” part
> in parallel, and then doing the “reduction” part in some kind of parallel tree,
> preserving order.

Agreed.

...
>> - All Parallel
>> Both all and Parallel are explicit.
>> A for loop with the parallel keyword, and the "all" keyword.
>> Semantically, each iteration is a separate chunk, or alternatively
>> viewed as there not being any chunks. There is no order to the
>> iteration, any iteration can occur before any other.
>
> I would say using “all” to signify a universal quantified expression as well
> as this new idea is asking for confusion.

I dont think this is a real problem, just as it isn't for using "all"
for dereferencing access objects, because the surrounding context is different.
As in english, "all" typically describes what follows,

for all X ...

    -- meaning for all values of X

for X in all parallel ....

     -- meaning for all parallel iterations

Also Quantified expressions and Loops statements appear in very different
contexts in Ada code.

> ...
>> The synchronous barriers are used to transition between parallel and
>> sequential processing, and then back to to parallel processing for
>> each day of the forecast. For this to work, each iteration in
>> City_Loop must iterate in parallel with each other since the
>> Release_Threshold of the barriers needs to be known in advance. The
>> calculation of each forecast for each city for each day is done in
>> parallel, but presumably, the weather calculations are such that one
>> cannot proceed to the next day until all the weather forecasting is complete
>> for the previous day for all the cities.
>
> I guess I understand the semantics, but it just feels too complicated to me.
> For something like this, two separate parallel loops, or an explicit array
> of tasks might be more appropriate.  Trying to synchronize across tasklets
> is going to be complex, and the semantics will be subject to a lot of
> caveats, I suspect.  Tasklets can generally use run-to-completion semantics,
> and that clearly won’t work here with barriers in the middle.  If you want a
> barrier, you can end the parallel loop, and then start a new one, which
> creates a natural, highly visible, easy-to-understand barrier.

To do this as two separate loops, you need to store much more state. In this
case, you'd need a two dimensional array and store all the values for all days
of the forecast. Using barriers, you only need to track the current day
forecast, so a single dimensional array works here.

There are other similar problems that benefit from the use of barriers.

One is the n-body problem, where the masses of heavenly bodies and direction of
movement are tracked so that the orbital paths can be predicted. The technique
is to use small increments of time and then update the location of all the
heavenly bodies taking gravitational pulls into account, and the position of the
heavenly bodies can then be output for each successive time increment.

Using barriers, only the current location of each body needs to be tracked.
Using your suggestion of two separate loops, one would have to store all the
updated movements in memory before outputting the results. This could
potentially and literally require an astronomical amount of storage.

Using barriers really is the only reasonable solution for this, that I can see.

Yet another example of this sort of problem is of linear algebra solving solving
matrices using gaussian elimination.  As each column is processed, barriers
allow one to find the pivot point for the next column. I dont think I can
imagine how to solve this by breaking into separate loops, not that I have
tried, but it seems like it would be very difficult. I think using barriers is
the best solution for that problem when solving in parallel.

I also think in the AI we already talk about calling protected objects from
tasklets.

" calls by different tasklets of the same task into the same protected object
are treated as different calls resulting in distinct protected actions;
therefore synchronization between tasklets can be performed using non-blocking
protected operations. Note that this is consistent with the current standard
which already supports multiple concurrent calls by a single task in the
presence of the asynchronous transfer of control capability (RM 9.7.4 (23))."

So I think the current AI design suggests that barriers should work as I
describe in the example.

>> I know in Vienna, we had taken a straw poll, which favored putting
>> the keyword parallel before the word "for" in the for loop. While
>> this is a useful data point, I think if we want to support "all
>> parallel" loops, I find it makes a lot more sense to have all
>> parallel together after the word "in" or "of”.  …
>
> Personally I would rather not re-open this decision.  We have already gone
> round and round on a number of these issues.  The usual rule is that only
> someone who voted *for* the chosen approach can re-open the discussion.

I wouldn't ordinarily have suggested reexamining this issue, but I would hope
that it should be oK for anyone to open a discussion if new ideas and arguments
are brought to light that were not considered as inputs in the original
decision.

****************************************************************

From: Tucker Taft
Sent: Monday, July 10, 2017  2:43 PM

>>> ...
>> The semantics for exiting a loop early as currently proposed is that if any
>> iteration invokes an “exit,” all other iterations are abandoned.
>
> Not quite, the AI adds this note:
>
> "Note that aborting a tasklet need not be preemptive, but should
> prevent the initiation of further nested parallel blocks or parallel loops."
>
> So if the loop is running its chunks non-preemptively, then presumably
> the code would work as I suggested.

That is certainly not something to be relied upon.  And even if things run
non-preemptively, there might be tasklets that have never been started, so
certainly those would not start after an “exit.”

> I think there are two cases to consider.
> 1) A high number of iterations with small processing per iteration
>     eg. (incrementing integer elements of a large array)
> 2) A small number of iterations with large processing per iteration.
>     eg. (Calculating weather for a handful of cities)
>
>
> In case 1, the majority of time is being spent doing the iteration. I
> suspect that preemptive tasklet abortion in that case will introduce
> significant overhead and interfere too much with the parallelism. This
> is better suited for the non-preemptive case.
>
> In case 2, the actual iteration is insignificant, and there is much to
> be gained by preemptively aborting other tasklets.
>
> I can imagine we might want to give the programmer control of this
> decision via an aspect or pragma, such as Preemptive_Aborts but
> otherwise let the compiler decide how to deal with such transfer of
> control.

This is adding yet more complexity to the model.  And as pointed out above,
if you have more tasklets than cores, some of them might not be started at
all before the “exit."

>>> ...
>>>
>> I guess I understand the semantics, but it just feels too complicated to
>> me.  For something like this, two separate parallel loops, or an explicit
>> array of tasks might be more appropriate.  Trying to synchronize across
>> tasklets is going to be complex, and the semantics will be subject to a lot
>> of caveats, I suspect.  Tasklets can generally use run-to-completion
>> semantics, and that clearly won’t work here with barriers in the middle.
>> If you want a barrier, you can end the parallel loop, and then start a new
>> one, which creates a natural, highly visible, easy-to-understand barrier.
>
> To do this as two separate loops, you need to store much more state.

You should factor in the amount of space needed to store the stacks of all of
the temporarily suspended tasklets.  You are requiring that all but one tasklet
be suspendible until they all reach the barrier.

****************************************************************

From: Randy Brukardt
Sent: Monday, July 10, 2017  3:16 PM

Just a few words on this, because I don't think it pays to get bogged down in
detail...

...
> So if the loop is running its chunks non-preemptively, then presumably
> the code would work as I suggested.
>
> I think there are two cases to consider.
> 1) A high number of iterations with small processing per iteration
>       eg. (incrementing integer elements of a large array)
> 2) A small number of iterations with large processing per iteration.
>       eg. (Calculating weather for a handful of cities)

This is the crux of the issue. There clearly are multiple termination approaches
that make sense for a parallel loop, and those are mostly independent of the
form of the loop itself. (And there's more than two!)

   Preemptive vs. not preemptive
   Order significant vs. not significant

Specifically, one possibility is the proposed rule that an exit of any tasklet
terminated the others. Another possibility is that the exit effectively works
the same as a sequential loop -- meaning that an exit terminates any tasklets
with indicies above the exiting one, and the tasklets with indicies below
continue (this would work well for the search application among others). Since
the latter is likely to have a higher overhead, one probably would want the
option to select the termination mechanism.

There's other details that might be worth controlling.

My overall point is that it is unlikely that the details could be controlled
solely with keywords, there are just too many. We need something like an aspect
specification for these loops so that the details can be controlled without
introducing loads of keywords (which will be hard to remember).

And the default should be that semantics are as close as reasonable to
sequential loops; many cases can use simpler termination but I think the default
should be fewest surprises.

...
> >> I know in Vienna, we had taken a straw poll, which favored putting
> >> the keyword parallel before the word "for" in the for loop. While
> >> this is a useful data point, I think if we want to support "all
> >> parallel" loops, I find it makes a lot more sense to have all
> >> parallel together after the word "in" or "of".  .
> >
> > Personally I would rather not re-open this decision.  We
> have already gone round and round on a number of these issues.  The
> usual rule is that only someone who voted *for* the chosen approach
> can re-open the discussion.
>
> I wouldn't ordinarily have suggested reexamining this issue, but I
> would hope that it should be oK for anyone to open a discussion if new
> ideas and arguments are brought to light that were not considered as
> inputs in the original decision.

There are always going to be "new ideas" for an AI like this one that it at the
bleeding edge. Allowing unlimited reopening only ensures that nothing ever will
be decided and no progress ever will be made. That's certainly the case for this
AI in the last year or so. Keep in mind that to stay on schedule this thing has
to be mostly frozen in little over a year. Going back to square one for every
decision ensures that there is almost no chance of that happening. (The only
reason IMHO that we didn't completely change parallel blocks at this last
meeting is that we barely talked about them.)

****************************************************************

From: Brad Moore
Sent: Monday, July 11, 2017  9:18 AM

>>>> ...
>>> The semantics for exiting a loop early as currently proposed is that if any
>>> iteration invokes an “exit,” all other iterations are abandoned.
>>
>> Not quite, the AI adds this note:
>>
>> "Note that aborting a tasklet need not be preemptive, but should
>> prevent the initiation of further nested parallel blocks or parallel loops."
>>
>> So if the loop is running its chunks non-preemptively, then
>> presumably the code would work as I suggested.
>
> That is certainly not something to be relied upon.  And even if things run
> non-preemptively, there might be tasklets that have never been started, so
> certainly those would not start after an “exit.”

I think if we did support some sort of non-preemptive mode, probably the most
sensible behaviour for that mode is to just let the tasklets complete on their
own, without any attempt to shut them down midstream. Any tasklets that hit the
exit, or other transfer of control would terminate early, but otherwise tasklets
generated at a POP would run to completion.

...
>> I can imagine we might want to give the programmer control of this
>> decision via an aspect or pragma, such as Preemptive_Aborts but
>> otherwise let the compiler decide how to deal with such transfer of
>> control.
>
> This is adding yet more complexity to the model.  And as pointed out above, if
> you have more tasklets than cores, some of them might not be started at all
> before the "exit."

I think there are a number of controls that the real time community might want,
for example. I think all of these should be the subject of a different AI
though, as they can be added later on their own merit, including an explicit
non-preemptive mode for tasklet termination, which I think should probably allow
starting after an "exit", (or continuing after an exit if tasklets get initially
created to cover the full iterations of the for loop, which tends to be how I
look at things.)

...
>>>>
>>> I guess I understand the semantics, but it just feels too complicated to me.
>>> For something like this, two separate parallel loops, or an explicit array
>>> of tasks might be more appropriate.  Trying to synchronize across tasklets
>>> is going to be complex, and the semantics will be subject to a lot of
>>> caveats, I suspect.  Tasklets can generally use run-to-completion
>>> semantics, and that clearly won’t work here with barriers in the middle.
>>> If you want a barrier, you can end the parallel loop, and then start a new
>>> one, which creates a natural, highly visible, easy-to-understand barrier.
>>
>> To do this as two separate loops, you need to store much more state.
>
> You should factor in the amount of space needed to store the stacks of all
> of the temporarily suspended tasklets.  You are requiring that all but one
> tasklet be suspendible until they all reach the barrier.

Yes, but that storage is bounded, and the programmer can always do manual
chunking if necessary to to bring the number of chunks down to the number of
available cores, or a small multiple of that. The simulation can run
indefinitely in theory without having to worry about running out of storage.
This tasklet stack storage becomes insignificant for the n-body problem,
compared to the real bounds of N * state, where N is the number of bodies, and
state is the amount of state variables for each body.

With separate loops however, the storage needed for the simulation however would
be unbounded, in the sense that the longer the simulation runs, the more storage
is needed, and you need much more of it.

The needs become N * state * time-increments, where time-increments are the
number of simulation steps that are being computed. The notion of running a
simulation for an unspecified amount of time becomes problematic, because of the
risk of running out of storage.

I think this barrier pattern could be commonly applied to a lot of
simulation/modeling scenarios.

****************************************************************

From: Brad Moore
Sent: Tuesday, July 11, 2017  9:45 AM

> This is the crux of the issue. There clearly are multiple termination
> approaches that make sense for a parallel loop, and those are mostly
> independent of the form of the loop itself. (And there's more than
> two!)
>
>     Preemptive vs. not preemptive
>     Order significant vs. not significant

In the gang of four, we had discussed the idea of declaring a local subtype for
a loop iterator, and then applying aspects to that declaration to provide
specific controls.

eg.

declare
   subtype Loop_Iterator is Integer range 1 .. 100000
            with Executors => 10, Chunk_Size => 100,
            Load_Sensitivity, Preemptive_Termination => False; begin
    parallel for I in Loop_Iterator'range loop
       ....
    end loop;
end;

Adding these sort of specific controls should be a separate AI, and can be added
later independent of this AI.

We could say that the "all parallel" notion could be a similar aspect that is
specified this way, but to me at least it seems like this would be an
inconsistency, since all other iteration schemes are built right into the loop
syntax. I think the "all parallel" scheme is useful enough that it ought to be
visible at the same level as the parallel keyword, and the reverse keyword,
regardless whether these two keywords appear adjacent to each other, or
separated as we have coming out of Vienna.

The question is how best to do this in the syntax. It seemed to fit nicely in
with the version of the AI going into Vienna. It doesn't seem to fit very well
with the version of the AI we'll have coming out of Vienna, but maybe we'll
either live with that, or come up with a better idea.

...
> Specifically, one possibility is the proposed rule that an exit of any
> tasklet terminated the others. Another possibility is that the exit
> effectively works the same as a sequential loop -- meaning that an
> exit terminates any tasklets with indicies above the exiting one, and
> the tasklets with indicies below continue (this would work well for
> the search application among others). Since the latter is likely to
> have a higher overhead, one probably would want the option to select
> the termination mechanism.
>
> There's other details that might be worth controlling.
>
> My overall point is that it is unlikely that the details could be
> controlled solely with keywords, there are just too many. We need
> something like an aspect specification for these loops so that the
> details can be controlled without introducing loads of keywords (which will be
> hard to remember).

I think the aspect specification like I described above would cover most of
these, but "all parallel" feels like it should be more integrated with the loop
syntax, for consistency, and importance for semantic behaviour. I think there
are enough problems that would need to rely on these semantics. Most other
controls such as number of executors, or chunk size do not really affect the
semantics. They just impact the speed of the processing.

> And the default should be that semantics are as close as reasonable to
> sequential loops; many cases can use simpler termination but I think
> the default should be fewest surprises.

Agreed.

****************************************************************

From: Tucker Taft
Sent: Tuesday, July 11, 2017  9:55 PM

>>> So if the loop is running its chunks non-preemptively, then
>>> presumably the code would work as I suggested.
>>
>> That is certainly not something to be relied upon.  And even if things run
>> non-preemptively, there might be tasklets that have never been started, so
>> certainly those would not start after an “exit.”
>
> I think if we did support some sort of non-preemptive mode, probably the most
> sensible behaviour for that mode is to just let the tasklets complete on their
> own, without any attempt to shut them down midstream. Any tasklets that hit
> the exit, or other transfer of control would terminate early, but otherwise
> tasklets generated at a POP would run to completion.  ...

I believe this is creating two significantly different semantics for a single
syntactic construct, which is bad news in my view.  You are heading toward
“programming by pragma” (or "by aspect"), I fear.  That is largely what OpenMP
does, and I believe we should avoid it.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 19, 2017  8:36 PM

...
> > And the default should be that semantics are as close as reasonable
> > to sequential loops; many cases can use simpler termination but I
> > think the default should be fewest surprises.
>
> Agreed.

(I think this is the only thing that I wrote that Brad agreed with, but don't
look behind that curtain! :-)

I had the thought that any loop exits have to be explicitly in the loop body
(they can't be in some subprogram that is called). Thus, we could make the
termination semantics depend upon that. (The same applies to AI12-0189-1, the
loop body as anomymous procedure, BTW).

The basic idea is that one could depend on the loop index value determined at
the point of an exit; if the loop finishes normally or exits abnormally, you
couldn't have such a dependence. Thus if there is no exit, the termination is a
bit simpler (not much, really); if there is, we require the loop to get the same
loop index as the similar sequential loop.

You could use an exception for a non-local exit, but that is so ugly I think it
would be OK to ignore it for the purposes of termination semantics.

Thus a loop like the following would work in parallel:

             parallel for I in 1 .. Limit loop
                if DB(I).Name = Name_to_Find then
                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
                   exit; -- Done looking.
                -- else continue
                end if;
             end loop;

P.S. Unrelated aside: I have to wonder about the wisdom of considering atomic
objects task-safe, since Ada doesn't have any form of atomic update. (The only
safe atomic writing operation is initialization.) Almost any use of an atomic
object would have a race condition. I originally wrote the above My_PO use as:
          if Result > I then
              Result := I;
          end if;
with Result an atomic object initialized to Integer'Last (something I do
sequentially all the time) -- but the above is a race condition and definitely
is NOT task-safe. The same would be true for any other test of an atomic object
followed by a write of that object -- but that is the natural sequential code.

It seems better (given the current facilities) to insist on a PO here (since the
natural code in a PO would do the right thing).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 19, 2017  9:06 PM

...
> >>> So if the loop is running its chunks non-preemptively, then
> >>> presumably the code would work as I suggested.
> >> That is certainly not something to be relied upon.  And
> even if things run non-preemptively, there might be tasklets that have
> never been started, so certainly those would not start after an
> "exit."
> >
> > I think if we did support some sort of non-preemptive mode,
> probably the most sensible behaviour for that mode is to just let the
> tasklets complete on their own, without any attempt to shut them down
> midstream. Any tasklets that hit the exit, or other transfer of
> control would terminate early, but otherwise tasklets generated at a
> POP would run to completion.  ...
>
> I believe this is creating two significantly different semantics for a
> single syntactic construct, which is bad news in my view.  You are
> heading toward "programming by pragma"
> (or "by aspect"), I fear.  That is largely what OpenMP does, and I
> believe we should avoid it.

This is a real concern, but I have to wonder if we really have any choice.

If our primary goal is (as I hope it is) to make it as easy as possible to
convert sequential code into parallel code, the semantics have to be a close as
possible to the equivalent sequential loop. That means that there can't be any
erroneous execution from aborts, the loop index has be meaningful at a normal
exit, and so on. That's going to be more expensive than the straight
abort-everything-on-any-sort-of-completion (I'm skeptical that it would be much
more expensive in usual cases, but I also know that I know just enough to be
dangerous in this area). So people who need absolute performance may need some
other semantics (including abort, for instance).

I suspect that Brad's non-premptive semantics make the most sense for the
default semantics for parallel loops (probably not for parallel blocks; the
problems are very different). That means that anyone that wants preemptive
semantics is going to have to use a pragma or aspect. But I don't think that is
a huge problem semantically; the difference only matters if the loop has a bug
already, or depends on the loop index at exit (in which case the preemptive
semantics is plain wrong and we could even make it illegal to use it in that
case as noted in my other message).

Anyway, back to real work...

****************************************************************

From: Brad Moore
Sent: Thursday, July 20, 2017  12:16 AM

> ...
>> > And the default should be that semantics are as close as reasonable
>> > to sequential loops; many cases can use simpler termination but I
>> > think the default should be fewest surprises.
>>
>> Agreed.
>
> (I think this is the only thing that I wrote that Brad agreed with,
> but don't look behind that curtain! :-)

I disagree. :-)
I'd have to check, but my overall sense was that I was agreeing with more of
what you said.

> I had the thought that any loop exits have to be explicitly in the
> loop body (they can't be in some subprogram that is called). Thus, we
> could make the termination semantics depend upon that. (The same
> applies to AI12-0189-1, the loop body as anomymous procedure, BTW).
>
> The basic idea is that one could depend on the loop index value
> determined at the point of an exit; if the loop finishes normally or
> exits abnormally, you couldn't have such a dependence. Thus if there
> is no exit, the termination is a bit simpler (not much, really); if
> there is, we require the loop to get the same loop index as the similar sequential loop.
>
> You could use an exception for a non-local exit, but that is so ugly I
> think it would be OK to ignore it for the purposes of termination semantics.
>
> Thus a loop like the following would work in parallel:
>
>             parallel for I in 1 .. Limit loop
>                if DB(I).Name = Name_to_Find then
>                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
>                   exit; -- Done looking.
>                -- else continue
>                end if;
>             end loop;

Maybe I am not fully understanding this idea, or maybe these semantics are the
ones I had in my mind all the time, and have not considered the alternate
semantics that you might otherwise have assumed?

> P.S. Unrelated aside: I have to wonder about the wisdom of considering
> atomic objects task-safe, since Ada doesn't have any form of atomic update.
> (The only safe atomic writing operation is initialization.) Almost any
> use of an atomic object would have a race condition. I originally
> wrote the above My_PO use as:
>          if Result > I then
>              Result := I;
>          end if;
> with Result an atomic object initialized to Integer'Last (something I
> do sequentially all the time) -- but the above is a race condition and
> definitely is NOT task-safe. The same would be true for any other test
> of an atomic object followed by a write of that object -- but that is
> the natural sequential code.
>
> It seems better (given the current facilities) to insist on a PO here
> (since the natural code in a PO would do the right thing).

Actually, if I were to write this, I was thinking that a PO would be used to
update a global volatile/atomic variable, but otherwise the global variable
could be read/tested directly.

Something like;

    parallel for I in 1 .. Limit loop
       if DB(I).Name = Name_to_Find and then I < Result then
          My_PO.Save_Found_Index (I); -- Save the lowest index found in Result.
          exit; -- Done looking.
    -- else continue
       end if;
    end loop;

The extra check "and then I < Result" would potentially provide an
improvement/short cut on performance, assuming it is cheaper to test an
atomic/volatile integer, than to always have to access the PO.

On a further aside, I recognise that one would ordinarily consider atomic as the
safe choice for such a variable declaration, but I found that declaring the
global variable to be volatile rather than atomic provided significant
performance improvements. I'm not sure if that is just an artefact of the
combination of compilers and target platforms I was using, or Ada semantics
somehow demands this, but if the type of the object is such that it can be read
atomically without putting memory fences or guards around the access, eg. such
as via a single machine instruction, then I found I'd get consistently correct
results, with improved performance, over declaring the global as atomic. Ada
doesn't distinguish, but I wonder if there are certain data types that when
declared to be volatile, are also atomic. We already know that variable declared
to be atomic are also volatile in Ada. Or maybe there are certain volatile types
that have some useful properties of atomic, but are somehow otherwise
lighter-weight than full atomic?

****************************************************************

From: Brad Moore
Sent: Thursday, July 20, 2017  12:37 AM

> I suspect that Brad's non-premptive semantics make the most sense for
> the default semantics for parallel loops (probably not for parallel
> blocks; the problems are very different). That means that anyone that
> wants preemptive semantics is going to have to use a pragma or aspect.

Not necessarily a pragma or aspect. See below.

> But I don't think that
> is a huge problem semantically; the difference only matters if the
> loop has a bug already, or depends on the loop index at exit (in which
> case the preemptive semantics is plain wrong and we could even make it
> illegal to use it in that case as noted in my other message).

I think the non-preemptive semantics probably also are less likely to introduce
erroneous or unexpected results more generally. But that is just my overall
sense.

I agree here, that I think the non-preemptive behaviour might be the best
default.

Also, I am not advocating a programming by pragma view. The goal should be that
for most cases (> 90-99%), no pragmas would be necessary.

Note, we have currently have other cases in Ada where two very different
semantics can be invoked with respect to preemptive and non-preemtive behaviour,
and I think we should try to be consistent with those already existing syntax if
possible.

In particular, we have;
   requeue
and
   requeue with abort

I am now thinking that we could use a "with abort" on a parallel loop, to get
the preemptive semantics, but otherwise the semantics would be non-preemptive.

eg.

parallel with abort
  for I in 1 .. limit loop
     Long_Calculation (I);
  end loop;

Here, for example, an exception raised in Long_Calculation would be expected to
preempt other parallel iterations involving the same call.

This might also be a good way to tie in the "all" for the "all parallel" loops
that I was describing in recent emails, with the current parallel keyword
placement. a "with all", would imply that any iteration can be expected to
potentially execute in parallel with any other iteration. (ie. no chunking)

parallel with all
   for I in 1 .. limit loop
     Long_Calculation (I);
   end loop;

****************************************************************

From: Randy Brukardt
Sent: Thursday, July 20, 2017  1:32 AM

...
> > Thus a loop like the following would work in parallel:
> >
> >             parallel for I in 1 .. Limit loop
> >                if DB(I).Name = Name_to_Find then
> >                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
> >                   exit; -- Done looking.
> >                -- else continue
> >                end if;
> >             end loop;
>
> Maybe I am not fully understanding this idea, or maybe these semantics
> are the ones I had in my mind all the time, and have not considered
> the alternate semantics that you might otherwise have assumed?

I think you and I agree on this point, but Tucker seems to be using strict
parallel block semantics for this. And that wouldn't work for this loop, since
at the first exit, all of the other tasklets would be aborted (by the parallel
block semantics). So it would be non-deterministic which exit was ultimately
executed (assuming there is more than one).

It's pretty clear to me that "exit" is not like an exception, and should be
treated differently. (Not as certain about gotos, since ones clear out of the
loop may or may not be like an exit.) A block doesn't have anything like exit
(although it could be emulated with a goto).

...
> Actually, if I were to write this, I was thinking that a PO would be
> used to update a global volatile/atomic variable, but otherwise the
> global variable could be read/tested directly.
>
> Something like;
>
>     parallel for I in 1 .. Limit loop
>        if DB(I).Name = Name_to_Find and then I < Result then
>           My_PO.Save_Found_Index (I); -- Save the lowest index found in Result.
>           exit; -- Done looking.
>        -- else continue
>        end if;
>     end loop;
>
> The extra check "and then I < Result" would potentially provide an
> improvement/short cut on performance, assuming it is cheaper to test
> an atomic/volatile integer, than to always have to access the PO.

A PO is supposed to be a monitor, and its rather abstraction breaking to put the
monitored object outside of it. (Yes, you sometimes have to do that, but I don't
recommend it if you don't have to do so.)

> On a further aside, I recognise that one would ordinarily consider
> atomic as the safe choice for such a variable declaration, but I found
> that declaring the global variable to be volatile rather than atomic
> provided significant performance improvements. I'm not sure if that is
> just an artefact of the combination of compilers and target platforms
> I was using, or Ada semantics somehow demands this, but if the type of
> the object is such that it can be read atomically without putting
> memory fences or guards around the access, eg. such as via a single
> machine instruction, then I found I'd get consistently correct
> results, with improved performance, over declaring the global as
> atomic. Ada doesn't distinguish, but I wonder if there are certain
> data types that when declared to be volatile, are also atomic. We
> already know that variable declared to be atomic are also volatile in
> Ada.
> Or maybe there are certain volatile types that have some useful
> properties of atomic, but are somehow otherwise lighter-weight than
> full atomic?

I think you just got lucky. My understanding is that fences are needed to ensure
memory consistency across cores. So you can only see problems if you have two
tasks running simultaneously on two different cores -- and the OS might try to
avoid this particular situation. I don't think there are any memory reads or
writes that work in that situation without a fence; besides, ALL atomic objects
use a single instruction on most implementations.

Volatile is definitely not enough for any data that needs to be synchronized
between multiple tasks (or tasklets); certainly not from a language perspective.
Only atomic or a PO synchronizes data (you can use one of those to synchronize
volatile data). And I don't think experimental results mean much here, because
some of these situations are pretty rare -- but Murphy would say that they'll
only strike at the worst time.

And Atomic isn't really enough, either, since real problems always require some
sort of update.

****************************************************************

From: Jeffrey Cousins
Sent: Thursday, July 20, 2017  11:44 AM

>I know in Vienna, we had taken a straw poll, which favored putting the
>keyword parallel before the word "for" in the for loop. While this is a
>useful data point, I think if we want to support "all parallel" loops, I
>find it makes a lot more sense to have all parallel together after the word
>"in" or "of".
> eg
>    for I in all parallel 1 .. 10 loop
>
> or
>
>    for Element of all parallel List loop


Much as I sympathise with the logic of Brad’s argument, I think we should stick
with what was agreed at Vienna for the time being, whilst we sort out what
behaviour we want.

Terminating loops early seems to be a particular sticking point.

Only yesterday we had a case where one of our apps has M tasks checksumming N
files, where N >> M.  Currently when the abort button is clicked, no new files
are checksummed, but checksumming the files already in progress is allowed to
complete.  (Non-pre-emptive).  The user would like the current checksumming to
be aborted immediately (pre-emptive; maybe he wants to checksum a different set
of files a.s.a.p.).

But I think it equally likely that there will be a case where finishing the
current processing cleanly is the imperative.

So I think we need to provide both options (somehow).

Brad also wrote:

>Note, we have currently have other cases in Ada where two very different
>semantics can be invoked with respect to preemptive and non-preemtive behaviour,
>and I think we should try to be consistent with those already existing syntax
>if possible.
>
>In particular, we have;
>   requeue
>and
>   requeue with abort
>
>I am now thinking that we could use a "with abort" on a parallel loop, to get
>the preemptive semantics, but otherwise the semantics would be non-preemptive.
>
>eg.
>
>parallel with abort
>  for I in 1 .. limit loop
>     Long_Calculation (I);
>  end loop;

No doubt someone will point out holes in the above, but I quite like the above
suggestion.

****************************************************************

From: Erhard Ploedereder
Sent: Thursday, July 20, 2017  4:31 PM

> If our primary goal is (as I hope it is) to make it as easy as
> possible to convert sequential code into parallel code, the semantics
> have to be a close as possible to the equivalent sequential loop.

No. Particularly, if the syntax distinguishes the "new" semantics (or else, why
introduce new language features?)


Canonical example:
> That means that .... the loop index has be meaningful at a normal exit, and so on.

>             parallel for I in 1 .. Limit loop
>                 if DB(I).Name = Name_to_Find then
>                    My_PO.Save_Found_Index (I); -- Save the lowest index found.
>                    exit; -- Done looking.
>                 -- else continue
>                 end if;
>              end loop;

NO, again, to the comment in the example. Consider an array with duplicates.
Canonical semantics gives you the lowest index indeed. If you want that, do not
write "parallel"!

If you do write "parallel":
A parallel loop surely should be allowed to give you some found index but
whether it is the lowest is completely non-deterministic (I hope).

(Execution-wise, I am open to both models of "last one wins" or "exit terminates
the parallel execution".)

****************************************************************

From: Randy Brukardt
Sent: Thursday, July 20, 2017  6:26 PM

> > If our primary goal is (as I hope it is) to make it as easy as
> > possible to convert sequential code into parallel code, the
> > semantics have to be a close as possible to the equivalent sequential loop.
>
> No. Particularly, if the syntax distinguishes the "new"
> semantics (or else, why introduce new language features?)

It's much easier to convert from one to the other if the semantics doesn't
change much, if at all. Indeed, I'm not sure there is any point to having a
parallel loop construct at all if this isn't true, because the existing Ada
facilities are just fine (as demonstrated by Brad's libraries) for people who
are already comfortable with writing parallel code. It's the majority of
programmers who don't understand concepts like race conditions that need help.

> Canonical example:
> > That means that .... the loop index has be meaningful at a normal
> > exit, and so on.
>
> >             parallel for I in 1 .. Limit loop
> >                 if DB(I).Name = Name_to_Find then
> >                    My_PO.Save_Found_Index (I); -- Save the lowest index found.
> >                    exit; -- Done looking.
> >                 -- else continue
> >                 end if;
> >              end loop;
>
> NO, again, to the comment in the example. Consider an array with
> duplicates. Canonical semantics gives you the lowest index indeed. If
> you want that, do not write "parallel"!

So you think it is OK to make it impossible to write a significant proportion of
real problems in parallel? You're essentially saying that this search has to run
4 (or 8 or whatever) times slower, or the programmer has to create their own
solution using Ada tasks. That's insane.

> If you do write "parallel":
> A parallel loop surely should be allowed to give you some found index
> but whether it is the lowest is completely non-deterministic (I hope).
>
> (Execution-wise, I am open to both models of "last one wins"
> or "exit terminates the parallel execution".)

I don't think that the cost of the model I'm suggesting is significantly
different than either of these. (I again note that I know just enough about such
things to be dangerous, so there might be some magic solution that I cannot
imagine.) For each of the models, you have between 2 and 4 possible outcomes
when a tasklet finishes a chunk (normal completion, exit, goto, exception; some
of these might be treated the same in some models, but surely the first has to
be treated differently in all models).

So the only difference between a model that is index-aware and one that is not
is that the indexes have to be known when the chunk finishes. They're obviously
known when the chunk starts, so in the worst case, the starting index has to be
saved with the other tasklet information. That's hardly a hardship, and the loop
termination code is essentially the same in any of these models.

Ergo, I don't see much reason to make the semantics non-deterministic.
Non-determinism is evil; sometimes you can't avoid it, but it should never be
employed if there is an option. And in this case, there is an option.

****************************************************************

From: Brad Moore
Sent: Friday, July 21, 2017 12:30 AM

> I suspect that Brad's non-premptive semantics make the most sense for
> the default semantics for parallel loops (probably not for parallel
> blocks; the problems are very different).

A parallel block is effectively just a parallel loop with a small number of
iterations that has been unrolled. One should be able to convert source code
from either construct to the other (at least for simpler loops with a small
number of iterations).

If anything, I see an even greater need for non-preemptive semantics for
parallel blocks.

A parallel loop is typically applying similar processing iteratively. A parallel
block can also be that, (for example calculating Fibonacci recursively), but it
can also be suitable for completely independent processing. In one branch you
might want to exit the processing of that branch early, but (I think more
typically) without affecting the processing in the other branches.

eg.

Note: I am not using the exact syntax that we arrived at in Vienna, only because
I cannot remember exactly what it was, until I see the minutes.

parallel

   exit when Do_A_Step1 = Failure;
   exit when Do_A_Step2 = Failure;
   exit when Do_A_Step3 = Failure;
   exit when Do_A_Step4 = Failure;
   Result := Do_A_Step5;

and
   Do_B;
end parallel

-- Activity B has completed here, regardless whether Activity finished
-- successfully.

If one wants the processing for activity A to kill the processing for activity
B, then I think the programmer ought to explicitly ask for those semantics. The
default should be the more deterministic semantics.

Again, the "with abort" clause seems like a consistent and natural way to do
this.

parallel with abort

   exit when Do_A_Step1 = Failure;
   exit when Do_A_Step2 = Failure;
   exit when Do_A_Step3 = Failure;
   exit when Do_A_Step4 = Failure;
   Result := Do_A_Step5;

and
   Do_B;
end parallel

-- A failure in activity A, may have caused activity B to terminate early.

****************************************************************

From: Brad Moore
Sent: Thursday, July 20, 2017  12:40 AM

> ? I know in Vienna, we had taken a straw poll, which favored putting
> the ? keyword parallel before the word "for" in the for loop. While
> this is a ? useful data point, I think if we want to support "all
> parallel" loops, I ? find it makes a lot more sense to have all
> parallel together after the word ? "in" or "of".
>>
>> eg
>>    for I in all parallel 1 .. 10 loop
>>
>> or
>>
>>    for Element of all parallel List loop
>
> Much as I sympathise with the logic of Brad’s argument, I think we
> should stick with what was agreed at Vienna for the time being, whilst
> we sort out what behaviour we want.
> Terminating loops early seems to be a particular sticking point.

I have actually changed my view here. I am now in favour of having the parallel
keyword for loops at the beginning, as we discussed in Vienna. The reason is
that I find that if we do want to specify controls on the parallelism, such as
"with abort", or "with all", or anything else we come up with, applying these to
the keyword parallel is the best way I've seen so far to do this, in my opinion.
If "parallel" is located in the middle of the loop syntax, I don't think this
would work (too ugly and messy), but when parallel is at the beginning, this
provides a "header" that describes the parallelism without interfering with the
loop logic.

****************************************************************

From: Randy Brukardt
Sent: Thursday, July 20, 2017  1:22 PM

> If anything, I see an even greater need for non-preemptive semantics
> for parallel blocks.
>
> A parallel loop is typically applying similar processing iteratively.
> A parallel block can also be that, (for example calculating Fibonacci
> recursively), but it can also be suitable for completely independent
> processing. In one branch you might want to exit the processing of
> that branch early, but (I think more typically) without affecting the
> processing in the other branches.

Well, we don't have "exit" for blocks (it would be very incompatible); I suppose
we could have it only for parallel blocks but that seems like asking for trouble
with the forces of consistency. One could use a goto for that effect, but that
seems less likely than  an exit from a loop.

So, in the block case I was thinking mainly about exceptions; and exceptions
usually (not always) represent bugs. I don't think it makes much sense to
continue buggy code -- indeed I thought that it made sense to use some sort of
"terminate as soon as practical" semantics for exceptions in all cases. It's
only exit (and similar gotos) that it makes sense to treat non-preemptively.

(And I was thinking that "non-preemptive" simply meant that the tasklet ran to
the next task dispatching point, not that it necessarily ran to completion. A
queued and blocked tasklet might never finish otherwise, especially in the
exception case.)

...
> Again, the "with abort" clause seems like a consistent and natural way
> to do this.
>
> parallel with abort
>

Yes, that syntax seems OK. Better than some your other recent ideas. ;-)

****************************************************************

From: Tucker Taft
Sent: Tuesday, July 25, 2017  4:36 PM

I am quite strongly against having multiple interpretations for "exit" from a
parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
have some convincing to do.  The interpretation I recommend is that an “exit” or
“return” (or “raise”) within a parallel loop causes all other iterations to be
terminated as soon as possible.  For a “return,” there is a “race” to see which
iteration reaches a return first, and that is resolved  using some kind of
synchronization, and then only the “return” that “wins” is allowed to proceed to
evaluate its return expression (presuming it is returning from a function).
Similarly, if an iteration would propagate an exception, then again, a
synchronization is performed against other operations that can terminate the
loop, and the winner gets to determine the net effect, be it a simple exit, an
exception, a return, or (if we permit it) a goto.  Essentially any operation
that would normally terminate a loop goes through a “two-phase” commit, where
the first phase requests the “right” to decide the fate of the loop, and then
the winner determines what happens next.

I have not seen any convincing (to me!) argument to allow unstarted iterations
to continue processing, as this seems to defeat one of the whole points of a
parallel “divide-and-conquer” approach.  I can see some arguments to not
terminating iterations that are already executing, if there is danger of them
leaving some object visible outside the iteration in an abnormal state.  In
ParaSail, if a loop includes an operation that could result in an early exit, it
cannot manipulate a variable that is declared outside the loop unless it is
protected.  This doesn’t quite work in Ada since any parallel loop could have an
iteration die due to the failure of a run-time check.  But to some extent that
is already a problem because we recognize that the failure of a run-time check
can always leave certain variables in an abnormal state.  For a parallel loop, a
run-time check failure could leave any variable manipulated by any iteration in
an abnormal state.  But if you look at 11.6(6/3), even a “regular” loop that
dies due to a failed run-time check can have the same effect, since the Ada RM
has always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9 —
aka RM 9(11)) and instruction re-ordering.  So we can effectively ignore failed
run-time checks, and only worry about explicit “raise” as relevant to this
discussion.  So what it effectively means is that the action of propagating a
“raise” out of an iteration is just like a “return” or an “exit” in terms of
having to “compete" to decide which terminating construct gets to determine the
“fate” of the parallel loop.

The bottom line to me is that we should terminate parallel loops as soon as
possible, be it from an exit, return, raise (or goto or requeue for that
matter), using a two-phase approach that first determines which one will
actually decide the fate of the loop as a whole, and then a second step that
actually does the deed.

****************************************************************

From: Tuillo Vardanega
Sent: Wednesday, July 26, 2017  7:43 AM

I think I understand the semantics as Tuck proposes and I find it to make sense
in the specific context of parallel computation. To me, wanting exactly the same
source code to operate identically within and without a parallel construct
requires writing it to that end, which means understanding the implications of
the most demanding side. In a sense, this is no different (only more complex)
than it was for concurrency: the code that you embed in concurrent threads must
be amenable to concurrent semantics else problems occur. I suppose that what
matters most here is that the programmer has the means (for syntax and
semantics) to ensure that the externally visible effect of a parallel construct
"stabilizes" before it can be used outside of it. To my reading, Tuck's
"two-phase commit" notion does that quite fine, including the
"as-soon-as-possible" part of it.

****************************************************************

From: Brad Moore
Sent: Thursday, July 27, 2017  12:26 PM

> I am quite strongly against having multiple interpretations for "exit" from a
> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
> have some convincing to do.

Unfortunately, I am not yet convinced. I think if we want to have only one
interpretation for Exit, then that interpretation should be non-preemptive. Lets
see if I can convince you.

First I have some findings to report first however that should be of interest.

I have added preemptive termination to most of my Paraffin libraries (except the
Ravenscar ones) so that if any exception occurs, it preemptively terminates
other tasklets.

I found that adding this, (using asynchronous transfer of control), did not add
any significant performance hit to any of the libraries with GNAT at least. This
applies even to problems such as adding the sum of a large range of integers,
which I had mentioned previously might be a case where preemptive abort might
introduce significant overhead.

So, for exceptions, I think it makes sense to have preemptive termination,
generally. However, I think exit, return, and goto should be handled
differently.

Here are some arguments

1. I think it is quite a common problem to be searching for the first or last
   occurrence of something in an array or container. For example, consider an
   array or container of events stored in chronological order.

I should think answering question such as;
   When is the first time that conditions X occurred?, or when was the last time
   that conditions Y occurred?

I find it quite objectionable if we have to say, sorry, but you cannot do such
searches efficiently in native Ada using parallelism. You should either use
libraries such as Brad Moore's Paraffin libraries, or call out to another
language such as C or C++ and write that code in OpenMP.

If Exit is preemptive, then we cannot continue searching in other tasklets to
find better or correct answers. We effectively have to not use Exit at all, and
search the full range of iterations, which can be a waste of time if we already
know the answer.

We can use exceptions as the mechanism for preemptive abort. If a programmer
wants preemptive termination, they could raise their own exception in the loop,
and write an exception handler at the appropriate level to handle that. We could
even define a language defined exception, such as
Parallelism_Termination_Exception, which could be defined as an exception that
is implicitly handled prior to the completion of the loop, and can be explicitly
raised by the user if preemptive termination is desired.

We still need a mechanism for non-preemptive abort to solve problems such as the
above however, and exit, return, and gotos could provide that mechanism.

2. Having Exit and Return treated effectively as an exception and causing
   preemptive is very different from sequential Ada, and likely to introduce a
   lot of unexpected errors to users programs. For instance, I found adding the
   preemptive termination to Paraffin introduced at least one bug where one task
   might be waiting on an entry while the aborted code that would have
   ultimately serviced that code would have left another task waiting on the
   entry indefinitely. Another example could be where a loop iteration creates
   dependent tasks, and aborting the loop iteration might shortcut interactions
   with those dependent tasks, leaving those tasks in a state where they cannot
   terminate, and where the aborted iteration is left in an abort-deferred
   operation, effectively deadlocking.

   I think when the user's code explicitly uses a select then abort statement,
   the user is fully aware of the asynchronous transfer of control, and can take
   care to ensure that the preemptive behaviour makes sense. I think it would be
   undesirable however to have the compiler implicitly introducing such
   asynchronous transfers of control, where the user is likely in many cases to
   not even be aware of this, which could lead to difficult to reproduce
   failures in deployed code.

3. I think we should also be supporting parallelism in Ravenscar compliant code.
   I have Paraffin libraries that do this, which shows that it can be done.
   However, in Ravenscar, I believe all termination needs to be non-preemptive,
   including exceptions, since Ravenscar has the restrictions:
   No_Abort_Statements, No_Dependence => Ada.Asynchronous_Task_Control,
   No_Select_Statements.

It seems pretty clear that preemptive aborting behaviour is not supported in a
simpler Ravenscar runtime. I think this is OK, as you still get the benefits of
divide and conquer with non-preemptive termination, just maybe not as responsive
for early exits, but that is traded off with better deterministic behaviour
which is generally desired in Ravenscar. The main point here is that if we have
Exit, goto, and return as non-preemptive, then we have consistent, single
semantic interpretations for these constructs for all of Ada. If we make these
preemptive, then we are actually forcing two interpretations if we want these to
work in Ravenscar also.

Another suggestion, if we really do want preemptive exit, return, or goto, then
perhaps we need a different flavour of syntax for those.

I had previously suggested that a "with abort" clause could be added to the
parallel keyword of the loop. Rather than do that, I think it would be better to
attach that clause to the actual Exit, return, or goto statement, as in;

parallel
    for I in 1 .. N loop
       if Arr (I) > X then
          exit with abort;
       end if;
    end loop;

It is clearer to tie the "with abort" closer to where the premptive termination is desired.


   The interpretation I recommend is that an “exit” or “return” (or
“raise”) within a parallel loop causes all other iterations to be
terminated as soon as possible.  For a “return,” there is a “race” to
see which iteration reaches a return first, and that is resolved  using
some kind of synchronization, and then only the “return” that “wins” is
allowed to proceed to evaluate its return expression (presuming it is
returning from a function).  Similarly, if an iteration would propagate
an exception, then again, a synchronization is performed against other
operations that can terminate the loop, and the winner gets to determine
the net effect, be it a simple exit, an exception, a return, or (if we
permit it) a goto.  Essentially any operation that would normally
terminate a loop goes through a “two-phase” commit, where the first
phase requests the “right” to decide the fate of the loop, and then the
winner determines what happens next.

A two phase commit is pretty much the only possibility that I see.
Paraffin has always had such a two phase commit for handling exceptions,
except until just recently, the exception handling was non-preemptive.
If multiple exceptions were raised in different tasklets, the
implementation decides which one wins, prior to re-raising that
exception and propagating it out into the user's code. The two-phase
commit idea is orthogonal to this discussion about how Exit statements
are treated I think.

> I have not seen any convincing (to me!) argument to allow unstarted iterations
> to continue processing, as this seems to defeat one of the whole points of a
> parallel "divide-and-conquer" approach.

You do get benefits of divide and conquer regardless whether termination
is preemptive or not. The preemptive termination can be viewed as an
additional optimization that can be added on top of the divide and
conquer optimization, but only if it does not introduce errors!


   I can see some arguments to not terminating iterations that are
already executing, if there is danger of them leaving some object
visible outside the iteration in an abnormal state.  In ParaSail, if a
loop includes an operation that could result in an early exit, it cannot
manipulate a variable that is declared outside the loop unless it is
protected.  This doesn’t quite work in Ada since any parallel loop could
have an iteration die due to the failure of a run-time check.  But to
some extent that is already a problem because we recognize that the
failure of a run-time check can always leave certain variables in an
abnormal state.  For a parallel loop, a run-time check failure could
leave any variable manipulated by any iteration in an abnormal state.
But if you look at 11.6(6/3), even a “regular” loop that dies due to a
failed run-time check can have the same effect, since the Ada RM has
always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9
— aka RM 9(11)) and instruction re-ordering.  So we can effectively
ignore failed run-time checks, and only worry about explicit “raise” as
relevant to this discussion.  So what it effectively means is that the
action of propagating a “raise” out of an iteration is just like a
“return” or an “exit” in terms of having to “compete" to decide which
terminating construct gets to determine the “fate” of the parallel loop.

I dont think this last sentence follows from the previous, particularly
in the case of Exit. In Paraffin, an Exit statement is simply treated as
an Exit statement in the user's code. It causes that tasklet to exit,
without affecting any others, no special treatment than other sequential
statements in Ada. There is no need to decide which exit wins prior to
leaving the loop, as there no result or exception being propagated out
of the loop. The implementation requires no special treatment of this.
Exceptions do require special treatment, since a specific exception
should be propagated out of the loop.

For a return statement, there is a stronger case for preemptive
behaviour, but I think considering the other arguments, it should be
left non-preemptive to side with Exit statements, rather than side with
the preemptive nature of exceptions.

Goto's should probably not be too relevant to this discussion if we even
support them at all. I doubt this feature would be very rarely used or
useful in a parallel loop, but they are similar to return statements.

>
> The bottom line to me is that we should terminate parallel loops as soon as
> possible, be it from an exit, return, raise (or goto or requeue for that
> matter), using a two-phase approach that first determines which one will
> actually decide the fate of the loop as a whole, and then a second step that
> actually does the deed.

The bottom line to me is that the user should be able to write programs
that execute as fast as possible. Sometimes terminating loops as soon as
possible comes at a greater expense than being able to use a more
optimal algorithm. The user should be able to choose either option.

Below I have execution times showing a case where the non-preemptive
algorithm beats the preemptive one.

A two phase approach is recommended for sure, but Exit's do not need to
be involved in the second stage of the commit. They do not contribute to
any result associated with the loop. Return results do need to be
considered in the second phase, but I would think that any exception
raised should trump any return result.

Here are some more results comparing performances using preemptive vs
non-preemptive termination.

I have a simple program that scans an array of 600_000_000 integers
looking for the value in the highest index of the array that satisifies
some property. In this case, the element in Arr'Last is the one that
satisfies the query.

Using a Reverse parallel loop with non-premptive exit, the elapsed time
for the loop is 0.13 seconds.

The parallel loop looks like;

      Maximum_Index : Index := Index'First with Volatile;

      parallel
       for I in reverse Start .. Finish loop
          if I < Maximum_Index then
             exit;
          elsif Satisfies (Arr (I)) and then I > Maximum_Index then
             Maximum_Index := I;
          end if;
       end loop;

If I take out the reverse keyword, I still get the correct result, but
the elapsed time becomes 1.4 seconds, since with forward iteration, the
Maximum_Index has to iterate through the full chunk of the highest
chunk, since each iteration improves upon the previous value.

If we assume that Exit must be preemptive, then we cannot use Exit,
because it gives us an incorrect result, so I have to rewrite the loop
as the following if we assume that reverse parallel loops are still allowed;

     parallel
       for I in reverse Start .. Finish loop
          if Satisfies (Arr (I)) and then I > Maximum_Index then
             Maximum_Index := I;
          end if;
       end loop;

This produces an elapsed time of 1.3 seconds;

If we say that reverse parallel loops are not allowed, then I have to
remove the reverse keyword, and the elapsed time becomes 3.8 seconds.

So in this particular example, it appears that the decision to not
support non-preemptive Exit results in a program that runs 9.8 times
slower on my computer than the version that does allow non-premptive
Exit, for reverse parallel loops.  For the non-reverse loops, the
preemptive version runs 2.7 slower.

The preemptive forward loop runs 29.2 times slower than the
non-preemptive reverse parallel loop.

****************************************************************

From: Tucker Taft
Sent: Thursday, July 27, 2017  4:29 PM

>> I am quite strongly against having multiple interpretations for "exit" from a
>> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
>> have some convincing to do.
>
> Unfortunately, I am not yet convinced. I think if we want to have only
> one interpretation for Exit, then that interpretation should be
> non-preemptive. Lets see if I can convince you.
> ...
>
> Here are some arguments
>
> 1. I think it is quite a common problem to be searching for the first
> or last occurrence of something in an array or container. For example,
> consider an array or container of events stored in chronological order.
>
> I should think answering question such as;  When is the first time
> that conditions X occurred?, or when was the last time that conditions
> Y occurred?
>
> I find it quite objectionable if we have to say, sorry, but you cannot
> do such searches efficiently in native Ada using parallelism. You
> should either use libraries such as Brad Moore's Paraffin libraries,
> or call out to another language such as C or C++ and write that code
> in OpenMP.

There are many ways to search efficiently, including in parallel. If you want
the lowest-numbered element that satisfies a particular criteria, then that
should be explicit in the algorithm, rather than something that is hidden in an
assumption about how non-preemptive chunked parallel iteration works.  It seems
unwise to make requirements about the order in which things are done, e.g.
whether lower-numbered or higher-numbered iterations are processed first,
whether "exit" ends one chunk but not all chunks, whether exit stops new chunks
from starting, etc.  This could be seen as defeating the purpose of parallel
iteration, and forcing the semantics to be restricted in what seem somewhat
unnatural ways just for this particular case.

> If Exit is preemptive, then we cannot continue searching in other
> tasklets to find better or correct answers. We effectively have to not
> use Exit at all, and search the full range of iterations, which can be
> a waste of time if we already know the answer. ...

In my view, "exit" means the entire loop is done, and trying to imply that under
certain circumstances it means "well not really if there is a better answer yet
to be found" seems like it would require a compiler that can guess your
intentions.  I suspect what would be most efficient for such a problem is a more
explicit divide and conquer, not something where you are implicitly relying on
chunking to accomplish what you want.

For example, if you want the item with the lowest index which satisfies some
predicate, you would want all of the cores looking at the first few items in the
array, and then if nothing is found there, you would have them all start looking
at the next few items in the array.  It would be less efficient to break the
array into big chunks, and have half of the cores looking at the last half of
the array.

Here is an example algorithm that would devote all of the cores to search the
beginning of the array first.  This is just one example, but I suspect it would
be faster than one that made various assumptions about chunking and assumed that
"exit" didn't really mean “exit” when there were better answers waiting.  My
point is that having multiple meanings for "exit" adds a lot of complexity
without providing sufficient general benefit.

   Assume we have "N" cores, and an array A where we are trying to find the
   lowest-indexed element of A that satisfies the predicate P.

function Find_First(A : Elem_Arr; P : access function(E : Elem_Type) return Boolean)
   return Integer is
   N : constant Positive := Num_CPUs;
   Lowest_Index : array(1..N) of Integer := (others => A’Last+1)
   Found : Integer := A'Last+1 with Atomic;
       --  Atomic flag used to stop other threads early begin
   parallel for I in 0 .. N-1 loop
      -- find A(J) with lowest valued J satisfying P(A(J)) by looking
      -- at A(A'First+I), A(A'FIrst + I + N), A(A'First + I + 2*N),
      declare
         J : Integer := A'First + I;
      begin
         while J <= A'Last and then J < Found loop
             if P(A(J)) then
                 -- This is the lowest-numbered element
                 -- satisfying P with index of form A’First + I + K*N
                 Lowest_Index(I) := J;
		 Found := J;  --  Stop others as soon as possible.
                    -- Note that there is a benign race condition here, since
                    -- we are merely trying to get other threads to stop if further
                    -- searching would be a waste of time.  "Found" is not presumed
                    -- to be the overall lowest value.
                 exit;
             end if;
             J := J + N;
          end loop;
      end;
   end loop;

   -- Use proposed "map/reduce" construct to find overall minimum index
   return (for I in 1 .. N => Integer'Min(<A'Last+1>, Lowest_Index(I)); end;

****************************************************************

From: Brad Moore
Sent: Tuesday, August 1, 2017  2:08 AM

> There are many ways to search efficiently, including in parallel. If you want
> the lowest-numbered element that satisfies a particular criteria, then that
> should be explicit in the algorithm, rather than something that is hidden in
> an assumption about how non-preemptive chunked parallel iteration works.  It
> seems unwise to make requirements about the order in which things are done,
> e.g. whether lower-numbered or higher-numbered iterations are processed first,
> whether “exit” ends one chunk but not all chunks, whether exit stops new chunks
> from starting, etc.  This could be seen as defeating the purpose of parallel
> iteration, and forcing the semantics to be restricted in what seem somewhat
> unnatural ways just for this particular case.

I see your point, but I also think treating an exit like an exception is also
asking for trouble. Your arguments have moved me somewhat though. I now feel
that If we cannot have both preemptive and non-premptive exits, returns, and
gotos, then I think we shouldn't allow them to transfer control out of a
parallel loop at all. I think they introduce too much in the way of confusion,
unexpected behaviour, and semantic problems.

If one really wants to preemptively terminate a parallel loop, they can always
raise an exception and handle it at the appropriate level. Otherwise I think
gotos, returns, and exits should only be allowed to exit loops that do not have
the parallel keyword. This also simplifies the parallelism proposal, I should
think.

> For example, if you want the item with the lowest index which satisfies some
> predicate, you would want all of the cores looking at the first few items in
> the array, and then if nothing is found there, you would have them all start
> looking at the next few items in the array.  It would be less efficient to
> break the array into big chunks, and have half of the cores looking at the
> last half of the array.

That can be tricky to program though without introducing too much overhead of
having the cores needing to interact with each other to get the next search
region, and it depends on the probability of finding things in the bottom half
of the array. If what you are looking for is rare enough, that could raise the
chances that what you are looking for can be found in the upper half of the
array. But yes, it could be an alternate strategy that might work well enough in
certain situations.

> Here is an example algorithm that would devote all of the cores to search the
> beginning of the array first.  This is just one example, but I suspect it
> would be faster than one that made various assumptions about chunking and
> assumed that "exit" didn't really mean "exit" when there were better answers
> waiting.  My point is that having multiple meanings for "exit" adds a lot of
> complexity without providing sufficient general benefit.

This approach can be summarized as manual chunking, where the programmer writes
an outer parallel loop that executes a number of chunks in any order, and an
inner sequential loop that calculates the chunk boundaries, where you can then
use sequential semantics including reverse loops, with non-preemptive exits.

I think given the syntax we are proposing, manual chunking is likely the only
solution that that works for this problem, if we do not have non-preemptive
exits and want to write all the code in line. A point to mention is that with
manual chunking perhaps that can eliminate the need for reverse parallel loops,
since one can use the inner sequential loop to provide the reverse searching.

I think I have a better solution for this problem though, because there are a
number of criticisms I see here.

For starters, I dislike that the programmer is having to make decisions about
how many chunks are needed, and the values of the chunk boundaries.

The reasons are;

   1) Choosing the number of cores as the number of chunks is likely too course
      a chunk size. If one core finishes its chunk early, there are no
      opportunities to steal or seek work from the other cores. It is better to
      choose a smaller chunk size to provide better load balancing, but having
      the programmer choose this, is likely not going to be as good a decision
      that a compiler or library writer might choose.

   2) Manual chunking in this manner is a form of static parallelism, where the
      decisions on chunk boundaries is made before loop execution begins. One
      ideally shouldn't necessarily have to commit to static chunking decisions,
      as dynamic chunking algorithms are possible that might be beneficial in
      cases where the amount of time to evaluate a single iteration can vary
      wildly between iterations.

   3) Having to write the code to convert back and forth from chunk index to
      actual index is messy and error prone. It is more difficult to both read
      and write.


I think this particular problem is best solved using a library approach, in
combination with the anonymous loop body AI.

A library approach can provide the outer parallel loop that corresponds to the
outer loop of the manual chunking solution, without the programmer having to
deal with decisions about chunking, or converting from chunk numbers to real
index values. A library also can provide more sophisticated parallelism
strategies that might be too difficult for a programmer to write from scratch.

If we take a library approach to find the maximum index that satisfies some
condition, using anonymous loop bodies from AI we could write;

       declare
         package Search is new
           Parallel.Loops.Static_Reduce (Loop_Index  => Index,
                                         Result_Type => Index,
                                         Reducer     => Index'Max,
                                         Identity    => Index'First,
                                         Result      => Maximum_Index);
        begin
         for (Start, Finish, Maximum_Index) of Search.Parallel_Loop
            for I in reverse Start .. Finish loop
               if I < Maximum_Index then
                  exit;
               elsif Satisfies (Arr (I)) and then I > Maximum_Index then
                  Maximum_Index := I;
               end if;
            end loop;
         end loop;
       end;

Note, the Maximum_Index of the Parallel_Loop anonymous procedure is an in out
parameter. The value coming in, is the best value so far for that executor.

Otherwise to do this with explicit manual chunking, I would write;

         declare
            Number_Of_Chunks : constant Positive := 50;
            Iterations : constant Positive
              := Positive (Arr'Last - Arr'First + 1);
            Chunk_Size : constant Positive :=
              Iterations / Number_Of_Chunks;
            Results : array (1 .. Number_Of_Chunks) of Index :=
               (others => Index'First);
         begin
            parallel
              for I in 1 .. Number_Of_Chunks loop
                 for J in reverse Arr'First +
                    (Index (I) - 1) * Index (Chunk_Size)
                 .. (if I = Number_Of_Chunks then Arr'Last
                     else Arr'First + Index (I * Chunk_Size) - 1) loop

                    if Satisfies (Arr (J)) then
                       Results (I) := I;
                       exit;
                    end if;
                 end loop;
              end loop;

           Maximum_Index := (for Element of Results => <0> + Element);

         end;

I can tell you that the amount of time it took me to write the second version
involved quite a bit more time and mental power than the previous version. I
think the previous one is also a lot more readable, because it provides a higher
level of abstraction. The code only focuses on the actual algorithm, whereas the
second version mixes the algorithm with a simplistic chunking algorithm.

Note: The exit in the last example is non-preemptive, since the inner loop is
not a parallel loop. If you look at that exit statement, one might worry for a
moment whether that is a preemptive exit or a non-preemptive exit. If we
disallow exits to transfer control out of a parallel loop, then suddenly that
exit feels "safer" to me, since all exits would not have preemptive abortive
behavior to worry about.

Otherwise, I can imagine one might want to scan all the source code in a system
looking for preemptive exits when doing code reviews, and feeling the need to
provide extra analysis to show that any found are safe.

****************************************************************

From: Tucker Taft
Sent: Tuesday, August 1, 2017  12:45 PM

I think you have confused the issue of "exit" meaning "exit" with the issue of
whether "abort" semantics or "as soon as possible" are used to terminate the
other iterations.

lI can understand the concern about using "abort" semantics, but after an exit
by one iteration, when any other iteration hits a “natural” stopping point, or
hasn't started at all, it should terminate.

So let's separate these two issues, if possible.  What I mostly object to is the
notion of giving meaning to multiple exits from the same parallel loop.  I don’t
mind waiting a bit for iterations that are already running, for the loop to
actually terminate.  But trying to define what it means to have multiple exits
is a mistake in my view.  I believe the two-phase approach is implementable, and
I believe should be used for any kind of "early" termination of a parallel loop,
be it exception propagation, exit, return, goto, or requeue.  That is, if an
iteration wants to terminate the loop, it "requests" a termination, and then if
it “wins” the race, it determines the fate.  How quickly the loop actually
finishes may vary, but it should be as soon as possible, and certainly no
additional iterations should be started.

So no need to bring up the "boogie" man of "abort."  I could imagine some aspect
that might affect how quickly termination happens, but it shouldn’t affect the
basic semantics of the loop, which are “terminate as soon as possible.”

****************************************************************

From: Brad Moore
Sent: Wednesday, August 2, 2017  9:03 AM

> So let's separate these two issues, if possible.

OK, I can buy this. We really now have 3 potential ways to exit a loop though.
What I would call hard preemption, soft preemption, and non-preemption, where
hard preemption is the abortive exit likely involving asynchronous transfer of
control, soft preemption which is a polled exit that only occurs on iteration
boundaries (grain or chunk boundaries most likely), and non-preemption which
wouldn't be directly supported in syntax, but if one does manual chunking either
inline, or via a library call, an exit is exiting a sequential inner loop, which
then is non-preemptive. The searching example of the previous email represents a
loop that must involve manual chunking that results in a non-preemptive exit.

I think hard preemption should probably be reserved for exceptions, which would
allow breaking out of a loop like the following on a quad-core machine, which
with only soft preemption would likely not result in any savings in execution
time;

     parallel
        for I in 1 .. 4 loop
            if Flip_Coin then
               raise Get_Me_Out_Of_Here;
            else
               Do_12_Hour_Calculation;
            end if;
        end loop;

All parallel loops involving chunking are actually nested loops at the
implementation level where the outer loop is the parallel loop that is provided
by the implementation, and the inner sequential loop is a callout to user code.
So in this sense, an exit from the inner loop should be treated as a
non-preemptive exit. But we want to create the illusion that there is only a
single loop at the syntax level, so an exit would involve soft preemption to
more closely mirror sequential behaviour that when an exit statement is
executed, there are no further iterations. Parallel execution is not exactly
like that, as execution likely continues after an exit, and further iterations
can be started, particularly if a tasklet involves multiple iterations and soft
preemption polling occurs at tasklet boundaries. But we want to at least try to
approximate the sequential behaviour as best we safely can.

I actually have these three exit strategies implemented in Paraffin now. I had
forgotten that I had implemented soft preemption some time ago, where a special
exception can be raised from a user callback and is handled by the library which
sets a flag that is polled at grain/tasklet boundaries to effect early
termination.

Any other exception raised in a Paraffin library call results in a hard
preemptive abort involving asynchronous transfer of control, which is a feature
I just added.

Any exit statement in a user provided callback is treated as a regular exit
statement in Ada, which is just a non-preemptive exit, which would also work
well for manual chunking examples involving the anonymous loop body procedure
AI.

So I am happy now with this treatment of exit, return, and gotos if we say it
has the soft-preemption model.

I do think we probably want to distinguish between exceptions and the other
forms of transfer of control out of the loop (exit, gotos, returns, requeues) to
consider treating exceptions as the hard-abortive case, particularly for
exceptions such as Constraint_Error at least, but probably all exceptions for
consistency sake.

****************************************************************

From: Jeff Cousins
Sent: Thursday, August 3, 2017  11:16 AM

It's good to see this progressing even if it's taking a lot of to and fro.
Tucker's ideas sound desirable, but are the others from the implementors' side
happy that the "some kind of synchronization" won't be an undue implementation
burden?

****************************************************************

From: Randy Brukardt
Sent: Thursday, August 3, 2017  8:54 PM

> It's good to see this progressing even if it's taking a lot of to and fro.
> Tucker's ideas sound desirable, but are the others from the
> implementors' side happy that the "some kind of synchronization" won't
> be an undue implementation burden?

To answer your question directly, I don't think there is any alternative. What
Tucker describes is a bit more complex than the model I had in mind, but they're
pretty similar. That is, there is a loop "manager" (for the lack of a better
term) that determines the chunks, creates (or more likely, requests existing)
tasklets for execution, gives the tasklets chunks, and then waits for the
tasklets to finish their chunk (or complete some other way, such as exit or
propagate an exception). That waiting implies some sort of synchronization, but
it seems to me to be the cheapest kind available. The manager would generally be
asleep waiting for tasklets, and would only wake up when one completes - it
would then either assign another chunk (if there aren't any, just sleeping
waiting for the others), or if the loop needs to complete (exit or exception),
wait for the other tasklets to reach a termination point (or abort them - in
Janus/Ada, these are effectively the same but probably not in most
implementations).

I definitely would allow iteration termination at any task dispatching point
(especially for tasklets that are blocked), lest a blocked tasklet prevent loop
termination indefinitely.

With this sort of structure, I don't really see any issue with having exit work
deterministically. The only requirement are to ensure that any iterations on the
left of the iteration of the exit (that is, with lower indexes) run to
completion. There certainly isn't any requirement to start new tasklets after an
exit; you just have to assign chunks to tasklets in a left-to-right order (and
it would be a lot of extra work to do anything else). Thus, if an exit occurs,
any chunk not ever assigned to a tasklet can be immediately discarded (it
necessarily comes after the exit). Only tasklets logically before the exit would
need to continue.

Exceptions are different, of course, they terminate everything as soon as
possible as the loop is not going to provide an answer. Those are rather
different use-cases.

----

I want to step back a minute and think a bit about the uses for these constructs
and our actual goals for them. I'm getting concerned that there can be no real
uses of the parallel loop construct.

We've already eliminated reduction loops from using the parallel loop (they have
their own construct now). That's a substantial number of the loops that I've
written that would be sensibly parallel.

Almost all of the other loops that I've written or maintained are searches of
some sort. Depending upon the rules adopted, it's quite likely that many of
those could not use the parallel loop construct. First of all, most of the
search loops I write are intentionally not going to execute long, and many are
walking through linked lists; probably the parallel overhead would be far more
than any possible savings from parallel execution. (The cost to walk the list to
find chunks would be more than the comparison costs.) More of the loops wouldn't
work if a non-deterministic solution for exit is adopted; you wouldn't be able
to find the *first* match or the *next* match as opposed to some random match.
Coding this using a non-deterministic loop is very expensive - you want to avoid
any further iterations above any match but continue to run the ones below. The
only way I could imagine doing it would require pre-checking on every iteration
if something is found below: but that would require synchronization, so a lock
or fence would be needed -- most likely costing more than the iteration itself.
It probably wouldn't be worth doing, meaning that one would have to search the
entire space regardless of where or when any matches are found. And, depending
on the likelihood of matches in the first third of the data, probably means that
a parallel solution would perform worse than a sequential one.

There must be some searches that would work with non-deterministic exit, but I
can't think of any at the moment.

So, if a loop is neither a reduction or a search (that is, no early exits), what
else could it do? Something like a matrix multiply comes to mind, but that is
very likely to violate the static safety check on the loop contents. (The
compiler probably can't prove that the array elements being written are
independent for each iteration, and certainly any fully static check based on
Global alone cannot do that - you need information about how the array is
indexed.) So, what can such a loop do? Is there enough to justify the extensive
work on this feature (as opposed to parallel blocks and parallel reductions)??

----

Going back to first principles. The most important thing about these constructs
is that they tell the compiler that the user doesn't care about the details of
which iteration runs first, or last, or in between. Because of exceptions (and
finalization), an Ada compiler cannot parallelize any code itself (unless of
course it can prove the absence of such things, but that's hard and unusual).
Most of the time, the programmer doesn't really care about the state of the loop
when an exception is raised; the loop has a bug and we're not going to use the
results of the loop at all. OTOH, a "normal" exit (whether via an exit
statement, return statement, or a goto) usually just means that we have the
needed results and don't need to continue.

We're also looking to make checks to ensure that these constructs are "safe",
that is free of race conditions and conflicts. The vast majority of programmers
are not going to be able to successfully avoid these things (even if they know
to look for them) -- I know I can't. Ideally, I want to have a programming
construct that gives the benefit of parallel execution without any of the pain.

For all of these reasons, I much prefer a model where exceptions and exits are
treated differently, and exits give the same answer as a sequential loop. In
such a model, almost all searching loops can be written in parallel. In the
"first exit wins" scenario, only a few searching loops could be written in
parallel. Since most other loops are reductions or would be prohibited by the
parallel safety checks, that model seems necessary for parallel loops to be
useful at all.

****************************************************************

From: Tucker Taft
Sent: Monday, August 14, 2017  8:46 AM

> ...
> With this sort of structure, I don't really see any issue with having
> exit work deterministically. The only requirement are to ensure that
> any iterations on the left of the iteration of the exit (that is, with lower
> indexes) run to completion. There certainly isn't any requirement to
> start new tasklets after an exit; you just have to assign chunks to
> tasklets in a left-to-right order (and it would be a lot of extra work
> to do anything else). Thus, if an exit occurs, any chunk not ever
> assigned to a tasklet can be immediately discarded (it necessarily
> comes after the exit). Only tasklets logically before the exit would
> need to continue. ...

I am not convinced we should imply there is any ordering of iterations in a
parallel loop. This just adds complexity to the implementation with little
benefit in my mind.  If you have a “many-core” machine, then it is quite
possible that all of the iterations (or at least all of the chunks) get going
simultaneously, so what value is there to promising ordering in some particular
cases?  As I suggested in an earlier e-mail, if you really want the
lowest-indexed element that satisfies your predicate, then you wouldn’t want to
use “normal” chunking, but rather you would want to bias things to have multiple
cores looking at the early part of the range first.  In any case, are there
really that many cases where you want the lowest-indexed item?  More likely in
my view is that you would either want any item, or you would want all items that
satisfy the predicate.  If you want all items, then of course “exit” would not
be used, and you would build up a list of all desired items.  If y want any
item, then the first you find should “win”.

****************************************************************

From: Randy Brukardt
Sent: Monday, August 14, 2017  9:23 PM

At the risk of continuing to beat this same horse...

> ... If you
> have a "many-core" machine, then it is quite possible that all of the
> iterations (or at least all of the chunks) get going simultaneously,
> so what value is there to promising ordering in some particular cases?

The benefit is obvious: loop algorithms work as they are written. No mental
gymnastics needed.

> ...  As I suggested in an
> earlier e-mail, if you really want the lowest-indexed element that
> satisfies your predicate, then you wouldn't want to use "normal"
> chunking, but rather you would want to bias things to have multiple
> cores looking at the early part of the range first.

That was a terrible idea:
(1) It inverts the looping algorithm, obscuring the actual operative code;
(2) It forces the use of complex code to get parallel operation, defeating
    the purpose of making getting better performance easy;
(3) It assumes a particular chunking mechanism, which makes the code much less
    portable (at least if performance matters).

To elaborate on (3): A good compiler will tailor the chunking to the iteration
being executed (specifically whether it is balanced so the iterations probably
take the same amount of time, or if they are obviously unbalanced, and whether
or not there are any explicit exits), the hardware (the number of cores and how
they are mapped to Ada tasks), and possibly do that at runtime (especially as
the tasks that are running aren't known until the moment the loop is launched).
An algorithm using what Brad called "manual chunking" is going to end up
fighting that compiler-generated chunking, and that is going to scrub off
performance. You'd be better off using a library like Parafin to implement such
a loop -- but in that case, there is no longer enough value to the syntactic
parallel loop to bother with defining it.

But an algorithm like the one you propose makes a lot of sense as the
implementation generated by the *compiler* when the code contains explicit
exits. It can do that with all of the knowledge needed and adding no portability
issues; there's surely no requirement that a chunk run consecutive iterations!

> ... In any case, are there really that many cases where you want the
> lowest-indexed item?

Yes, this is pretty much the only thing I search for. Usually, I need to be able
to restart the search if the found item eventually turns out to not work for
some complex reason. That would be difficult if some random result was returned.

> More likely in my view is
> that you would either want any item, or you would want all items that
> satisfy the predicate.  If you want all items, then of course "exit"
> would not be used, and you would build up a list of all desired items.

Which you can't do with a parallel loop; you have to use a reduction expression
for that.

> If you want any item, then the first you find should "win".

The only case where it makes sense to me to want "any item" is when you are
looking for the absence of something (in such a case there shouldn't be any
match). That's pretty rare in my view.

I still think that "exit" should produce a deterministic answer; there's no
problem if you don't need a deterministic answer -- don't use "exit" in that
case. The goal here is to make it as easy as possible to write a parallel loop.
If you have to know lots of unusual patterns in order to get anything to work,
then only experts will be able to write parallel code. But that's already true;
it's not the least bit hard for an Ada real-time expert to write parallel
algorithms (as Brad has proved). The whole point of these features is to bring
this programming capabilities to the rest of us. If we fail to do that, then
these features become essentially useless (and a massive amount of work for
that).

****************************************************************

From: Brad Moore
Sent: Tuesday, August 15, 2017  10:12 AM

> I want to step back a minute and think a bit about the uses for these
> constructs and our actual goals for them. I'm getting concerned that
> there can be no real uses of the parallel loop construct.

I have been thinking about this also, and I am thinking we probably should
consider dropping parallel loop syntax from the proposal.

The discussion about find first, find last and preemptive exits has led to show
that the anonymous loop body AI coupled with a parallelism library call is a
better solution to these problems than the parallel loop syntax solutions we
have been discussing.

Once you come to that realization, it opens a bit of a slippery slope which
suggests that once you need parallelism libraries to solve certain problems,
probably they are also the best way to solve other manual chunking problems as
well.

This includes what I was calling the "parallel all" loops where an outer
sequential loop encloses an inner nested loop that is desired to be executed in
parallel. The best performance for these sorts of problems, which I have
encountered several, is to use barriers with manual chunking to push the
parallelism out to an outer loop, that encloses the sequential loop, and use
barriers to transition between parts of the algorithm that need to run
sequential and other parts that need to run in parallel.

It felt admittedly a bit like trying to force a square peg into a round hole to
graft the "parallel all" syntax onto the parallel loop syntax.

If you take out all the manual chunking problems and reductions problems out of
the picture, whats left?

Not much, as I would say easily more than half of the parallel loop problems I
have encountered that are not reductions are of the manual chunking variety.

I have only two representative problems left that would fit under parallel loop
syntax.

  1. Find any
  2. Combining arrays/containers

The Find any problems are the ones that Tucker was talking about where a
preemptive exit would be needed to break out of the loop.

I think these sorts problems are actually better served by quantified
expressions, which as you may recall are actually a form of reduction
expression.

A quantified expression is a boolean expression that a compiler should break out
of evaluating once it has found a case that determines the value of the
expression. This also applies for parallel quantified expressions. In this case,
the programmer doesn't even need to write the exit statement. It is expected
that the compiler will implicitly do this.

Consider determining if a number is a prime number.

Is_Prime : constant Boolean :=
   (parallel for all I in 1 .. Integer(Sqrt(N)) => N mod I != 0);

Whether the parallel keyword is present or not, a compiler should stop further
evaluation once if finds a factor of N.

Sometimes you want a non boolean value as the result of what you are looking
for. A quantified expressions works here also, except a side affect of the
function is needed to store the result. As an output, you still need a boolean
result also, to indicate if you found what you were looking for.

Consider finding any prime value within a range of integers.

Prime_Number : Integer := 0 with Atomic;

function Test_And_Store_Prime (N : Integer) return Boolean is begin
  if Is_Prime (N) then
     Prime_Number := N;
     return True;
  else
     return False;
  end if;
end Test_And_Store_Prime;

Prime_Found : constant Boolean :=
    (parallel for all I in Start .. Finish => Test_And_Store_Prime (I));

if Prime_Found then
   Put_Line (Integer'Image (Prime_Number) & " is a prime number"); end if;

There will be some naysayers who dislike seeing a side-effect in a quantified
expression, but I think for this particular problem it feels good enough to use
here.

That leaves combining arrays/containers, however those can be written as a
reduction expression also.

Array_Result : array (ArrA'First .. ArrA'Last) of Integer :=
   (for I in ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I)));
Where concatenation is he reducing operator.

Perhaps this looks inefficient, though whose is to say how the compiler can
optimize this. It is otherwise a simple enough expression to read or write, I
think.

A more visually appealing way to do this would be to allow the keyword parallel
with the new array aggregate syntax, which eliminates the concatenation
semantics.

Array_Result : array (ArrA'First .. ArrA'Last) of Integer :=
    (parallel for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I)));

See RM 4.3.3 5.1/5 for the array initialization syntax.

I have not been able to yet think of any other cases where parallel loop syntax
would be needed, so it seems to me that we should not introduce parallel loop
syntax given that there are no use cases for parallelism problems that cant be
better solved with reduction expressions or anonymous loop body syntax combined
with parallelism library calls.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 16, 2017  12:22 AM

...
> The discussion about find first, find last and preemptive exits has
> led to show that the anonymous loop body AI coupled with a parallelism
> library call is a better solution to these problems than the parallel
> loop syntax solutions we have been discussing.

There's a couple of problems using the "anonymous loop body" AI in this way:
(1) Without some additional mechanism, we'd lose the safety checks of the
    parallel loop. They don't make any sense in the context of the "anonymous
    loop body" proposal per-se. So I think we'd still need a "parallel" keyword
    or something like that to trigger the safety checks. (Without them, parallel
    operations are just a giant race condition; no human programmer could avoid
    them.)
(2) There's no certainty that the "anonymous loop body" proposal even makes it
    into the Standard. Several people have expressed strong reservations about
    that proposal.

As an aside, I'd like to note that the "exit" problem with the "anonymous loop
body" proposal is in fact very similar to the problems that arise with the
parallel loop proposal. In both case, implementing exit/return/goto is very
painful, because you have to get control, somehow save whatever is supposed to
happen after the loop, do some actions, and then proceed to doing the after the
loop actions. The only place where the "anonymous loop body" is a bit worse is
that a user-defined routine can see some of what happens there; the parallel
loop is all-compiler generated.

...
> If you take out all the manual chunking problems and reductions
> problems out of the picture, whats left?
>
> Not much, as I would say easily more than half of the parallel loop
> problems I have encountered that are not reductions are of the manual
> chunking variety.

Exactly my point.

> I have only two representative problems left that would fit under
> parallel loop syntax.
>
>   1. Find any
>   2. Combining arrays/containers
>
> The Find any problems are the ones that Tucker was talking about where
> a preemptive exit would be needed to break out of the loop.

I think "Find any" per-se is rather unlikely, although they make sense in the
case of an "ensure none" predicate.

The array combination cases are most likely going to trigger the safety checks
(unless we build in some special holes, and even then only a few special forms
would work). The compiler is probably not going to be able to tell that
different iterations write different components unless the array index is of a
special, recognized form. (It definitely will not work with the raw static
Global check, since the entire array is considered a single object, which
necessarily overlaps with itself.)

> I think these sorts problems are actually better served by quantified
> expressions, which as you may recall are actually a form of reduction
> expression.

Makes sense. Certainly "ensure none" is the sort of case that quantified
expressions were designed for.

...
> That leaves combining arrays/containers, however those can be written
> as a reduction expression also.
>
> Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (for I in
> ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I))); Where
> concatenation is the reducing operator.

They have to be written this way (or some other reduction expression), as
otherwise the array objects would be overlapping (making a parallel loop
illegal).

> Perhaps this looks inefficient, though whose is to say how the
> compiler can optimize this. It is otherwise a simple enough expression
> to read or write, I think.
>
> A more visually appealing way to do this would be to allow the keyword
> parallel with the new array aggregate syntax, which eliminates the
> concatenation semantics.
>
> Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (parallel
> for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I)));
>
> See RM 4.3.3 5.1/5 for the array initialization syntax.

Right, and this probably is a better way to write this expression rather than
writing a loop in the first place.

> I have not been able to yet think of any other cases where parallel
> loop syntax would be needed, so it seems to me that we should not
> introduce parallel loop syntax given that there are no use cases for
> parallelism problems that cant be better solved with reduction
> expressions or anonymous loop body syntax combined with parallelism
> library calls.

Agreed, so long as we find a way to make the safety checks that we have in
parallel loops.

****************************************************************

From: Tucker Taft
Sent: Thursday, August 17, 2017  7:43 PM

> ...
>
>> I have only two representative problems left that would fit under
>> parallel loop syntax.
>>
>>  1. Find any
>>  2. Combining arrays/containers
>>
>> The Find any problems are the ones that Tucker was talking about
>> where a preemptive exit would be needed to break out of the loop.
>
> I think "Find any" per-se is rather unlikely, although they make sense
> in the case of an "ensure none" predicate.

I don’t agree, based on my experience with ParaSail.  YMMV.

> The array combination cases are most likely going to trigger the
> safety checks (unless we build in some special holes, and even then
> only a few special forms would work). The compiler is probably not
> going to be able to tell that different iterations write different
> components unless the array index is of a special, recognized form.
> (It definitely will not work with the raw static Global check, since
> the entire array is considered a single object, which necessarily overlaps
> with itself.) ...

This is very well traveled area.  Compilers have been recognizing the special
case of independent loop iterations for decades.  Given:

   parallel for I in A’Range loop
       C(I) := A(I) + B(I);
   end loop;

Many, many compilers are smart enough to understand that this is safe.  I agree
the loop-body procedure construct is very nice, but you still end up wanting to
do the safety checks, and something as simple as the above, whether the user is
driving the “chunking” or the compiler does it by itself, has exactly the same
safety-check requirements.

We have been talking with existing Ada customers recently about planned Ada 202X
enhancements, and this is the one that gets them drooling every time.  They
*really* want parallel loops.  Of course if they are too restrictive, they won’t
solve the problem, but I believe with the proposed reduction expression, the
above simple parallel loop, and the nice loop-body procedure, we will give them
what they need to be quite happy.  I really believe we need all three.

****************************************************************

From: Randy Brukardt
Sent: Thursday, August 17, 2017  9:32 PM

...
> > The array combination cases are most likely going to trigger the
> > safety checks (unless we build in some special holes, and even then
> > only a few special forms would work). The compiler is probably not
> > going to be able to tell that different iterations write different
> > components unless the array index is of a special, recognized form.
> > (It definitely will not work with the raw static Global
> check, since
> > the entire array is considered a single object, which
> necessarily overlaps with itself.) ...
>
> This is very well traveled area.  Compilers have been recognizing the
> special case of independent loop iterations for decades.  Given:
>
>    parallel for I in A'Range loop
>        C(I) := A(I) + B(I);
>    end loop;
>
> Many, many compilers are smart enough to understand that this is safe.

Surely. But that's not the point. The point is that we have to have Ada Legality
Rules which makes this sort of thing legal and portable. The proposed rules
using the Global aspect certainly would not allow the above. Unless you have
been secretly planning some much more complex rules to allow such things, loops
like the above couldn't be legal Ada.

Note that I'm not against allowing loops like the above, but that has to be
portably defined. (The above case: a single dimensioned target that doesn't
overlap anything else, and is indexed directly by the loop parameter, is clearly
safe and could be defined that way, but hardly anything else could be. Is this
case alone enough for the extra complication???)

...
> I agree the loop-body procedure construct is very nice, but you still
> end up wanting to do the safety checks, and something as simple as the
> above, whether the user is driving the "chunking" or the compiler does
> it by itself, has exactly the same safety-check requirements.

Not exactly, as noted above. But I definitely agree that whatever construct is
used needs safety checks.

> We have been talking with existing Ada customers recently about
> planned Ada 202X enhancements, and this is the one that gets them
> drooling every time.  They *really* want parallel loops.  Of course if
> they are too restrictive, they won't solve the problem, but I believe
> with the proposed reduction expression, the above simple parallel
> loop, and the nice loop-body procedure, we will give them what they
> need to be quite happy.  I really believe we need all three.

Probably because no one will be able to grok reduction expressions. :-) It took
me a three hour ARG presentation to get any idea...

People naturally understand parallel loops. At least they think they do, until
they find out that exit doesn't work sensibly. :-)

****************************************************************

From: Brad Moore
Sent: Friday, August 18, 2017  12:01 AM

I have some more thoughts towards answering the question on whether we should
eliminate parallel loop syntax from our proposal.

Firstly, I had said in the previous email that find-first and find-last problems
should be written using parallel libraries and the anonymous loop proposal.

I think these can also be written well enough using reduction expressions.

Consider finding the highest prime number in a range of integers; One could
express this using the following reduction expression.

Highest_Prime_Found : constant Integer :=
   (parallel for I in reverse Start .. Finish =>
      Integer'Max(<Integer'First>, (Is_Prime(I) then I else Integer'First));

A simpler compiler might only apply divide and conquer here, which means each
chunk fully executes. This should still cut the execution time down from the
sequential version by a factor of the number of cores.

A smarter compiler might recognize that when the reduction is 'Max or 'Min, and
the expression value result is of the loop iterator in such a pattern, that once
a higher index is found to be prime, there is no longer a need to keep searching
the lower indices. Indeed a smarter compiler might recognize this as a special
case and insert a completely custom algorithm optimized for find first/find last
problems.

So there appears to be no need to rely on library packages nor parallel loop
syntax for these problems.

Now consider another one of the manual chunking problems where I was suggesting
using barriers would be the best approach when the outer loop must be a
sequential loop that happens to enclose a parallel loop.

Consider a Newton package that simulates the movement of heavenly bodies in a
solar system. A procedure Update_Position can be called that considers the mass,
gravity, direction, and movement of neighbouring bodies to calculate the
position of a body for the next time increment. An Advance function might be
needed to swap internal buffers for the next and current values for the next
time increment.

One could express this using a parallel loop inside a sequential loop that
advances time.

for Current_Time in Now .. Time + 100_000_000 loop

   for parallel all Heavenly_Body in Solar_System'Range loop
      Newton.Update_Position (Heavenly_Body);
   end loop;

   Newton.Advance;
end loop;

This doesn't work well with reduction expressions if the data arrays/containers
are hidden behind an abstraction in a package body.

However, if one were to modify the Update_Position subprogram to be a function
that returns a boolean status to indicate the success of the calculation, then
one can use a reduction expression here. The Newton.Advance function might raise
an exception if the boolean input parameter is false, indicating some failure
occurred, such as two planets colliding, that isn't handled by the simulation.

for Current_Time in Now .. Time + 100_000_000 loop

   Newton.Advance (for parallel all Heavenly_Body in Solar_System'Range
                     => Newton.Update_Position (Heavenly_Body)); end loop;

So, if we did not have parallel loop syntax, it might influence people more to
write functions for parallelism problems, rather than procedures, but I wouldn't
say thats a bad thing necessarily.

A smarter compiler might recognize the pattern of a reduction expression inside
a parallel loop and treat that as a special case using barriers to get optimal
performance, or it may be that the default implementation of the compiler just
works well here. If the programmer finds the performance is poor, he could
always write the problem himself using barriers and use the anonymous loop
procedure with a parallelism library.

The point again though is that I don't think we need to rely on the availability
of parallelism libraries for this problem, nor do we need parallel loop syntax,
unless we feel strongly that one should be able to use a procedure instead of a
function for the Update_Position call.

The one other manual chunking problem I have that is quite different from the
others is to generate a cumulative sum of an array where the result array has
the sum of the corresponding element from the input array plus all values
preceding that value in the input array. This solution involves 3 loops. The
first loop is a parallel loop that gets the sum of each chunk and stores that
value in a partial sum intermediate array.

The second loop is a sequential loop that performs a cumulative sum operation on
the partial sum intermediate array.

The third loop is a parallel loop that takes the corresponding partial result
values as a starting point for each chunk, and then applies the cumulative sum
to each chunk.

Using manual chunking I can write the first loop using a nested reduction
expression where the outer parallel reduction expression concatenates sum values
calculated from the inner reduction expression for each chunk.

However, the third loop seems to pretty much need to be a parallel loop.

declare
    Chunk_Size : constant Natural := Natural'Min(Arr'Length, 10_000);
    Chunks     : constant Natural := Arr'Length / Chunk_Size;

    function Chunk_Start (Chunk : Natural) return Index_Type is
      (Arr'First + (Chunk - 1) * Chunk_Size);

    function Chunk_Finish (Chunk : Natural) return Index_Type is
      (Arr'First + (if Chunk = Chunks then Arr'Last else Chunk * Chunk_Size - 1));

    -- Create the Partial sums in parallel
    Partial    : array (1 .. Chunks) of Element_Type :=
       (parallel for I in 1 .. Chunks => <""> &
           (for J in Chunk_Start (I) .. Chunk_Finish(I) => <0> + Arr(J))); begin

    -- Sequential carry partial sums through array
    for I in 2 .. Partial'Length loop
       Partial (I) := @ + Partial (I - 1);
    end loop;

    for parallel I in 1 .. Chunks loop
      declare
        Sum : Element_Type := (if I = 1 then 0 else Partial (I - 1));
      begin
        for J in Chunk_Start (I) .. Chunk_Finish (I) loop
           Sum := @ + Arr (J);
           Cum (J) := Sum;
        end loop;
      end;
    end loop;
end;

However, as this is a manual chunking problem, I find the code written using
anonymous loop procedures and a library call is quite a bit simpler and easier
to read, and requires less knowhow on the programmer to decide how the loops
should be chunked. The chunking logic is hidden inside the library call.

declare
    package Scan is new Parallel.Loops.Static (Loop_Index => Positive);
    Partial : array (1 .. Scan.Worker_Count) of Element_Type := (others => 0); begin

    -- Get the sum of each chunk
    for (Start, Finish) in Scan.Parallel loop
      declare
         Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id;
      begin
          for I in Start .. Finish loop
             Partial(Worker) := @ + Histogram (I);
          end loop;
       end;
    end loop;

    --  Sequential exclusive scan phase
    for I in 2 .. Worker_Id (Effective_Workers) loop
       Partial (I) := @ + Partial (I - 1);
    end loop;

    for (Start, Finish) in Scan.Parallel loop
      declare
         Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id;
         Sum : Integer := (if Worker = 1 then 0 else Partial (Worker - 1));
      begin
         for I in Start .. Finish loop
            Sum := @ + Arr (I);
            Cum (I) := Sum;
          end loop;
      end;
    end loop;

So I would say this is the one case where a parallel library call with anonymous
loop procedure appears to be the best solution, particularly if it means that we
can simplify the parallelism proposal and not have parallel loop syntax.

Tucker did have a proposal involving parallel array syntax that was designed
specifically for this problem. At the time we were considering using that as the
reduction solution, but since reduction expressions appears to be a better
solution, it seems unlikely that we would want to bring back parallel array
syntax if it only is useful for this one problem.

So in summary, I still think most parallel looping problems can be handled with
reduction expressions.

There may be a few odd cases involving manual chunking where one might want to
use a parallelism library with the anonymous loop procedure.

My sense is that there are not enough of these problems needing parallelism
libraries to warrant standardizing a set of libraries for this, given that
libraries are available, and one can also write Ada wrapper libraries around
OpenMP calls, for example, although standardizing a set of parallelism libraries
could be another thing to consider.

While there are some special cases where parallel loop syntax could be useful
for manual chunking problems, the alternative of using a reduction expression
seems to be a good enough alternative, otherwise using anonymous loop body with
a library call is another viable alternative.

That leaves me still thinking that parallel loop syntax is not worth adding at
this point, which would simplify the AI to just cover reduction expressions and
parallel blocks. But it would be good to get others to weigh in as well, of
course.

****************************************************************

From: Brad Moore
Sent: Tuesday, August 29, 2017  10:01 AM

> ...
>> The discussion about find first, find last and preemptive exits has
>> led to show that the anonymous loop body AI coupled with a
>> parallelism library call is a better solution to these problems than
>> the parallel loop syntax solutions we have been discussing.
> There's a couple of problems using the "anonymous loop body" AI in this way:
> (1) Without some additional mechanism, we'd lose the safety checks of
> the parallel loop. They don't make any sense in the context of the
> "anonymous loop body" proposal per-se. So I think we'd still need a
> "parallel" keyword or something like that to trigger the safety
> checks. (Without them, parallel operations are just a giant race
> condition; no human programmer could avoid
> them.)

I have been thinking about this. I think the safety checks could be provided if
we could allow the global aspect on formal subprograms and formal access to
subprograms, with an addition to the global syntax to indicate a subrange of an
array.

with global => array(x .. y)

Which would indicate that there is  global usage of an array from the indices x
.. y. The array might be a name of an array, or just the "array" keyword which
would mean any array.

This might be used with a parallel loop library such as the following.

with System.Storage_Elements;

generic
    type Loop_Index is range <>;
package Parallel is

    procedure Parallel_Loop
      (Process : not null access
            procedure (Start, Finish : Loop_Index) with
                    global => (Input => array (Start .. Finish),
                                      In_Out => array (Start .. Finish),
                                      Out     => array (Start .. Finish));
       From : Loop_Index := Loop_Index'First;
       To      : Loop_Index := Loop_Index'Last)
    with global => null;

end Parallel;

The idea would be that the compiler would reject any call to Parallel_Loop that
involved a Process callback that had any globals other than array references of
a slice using the parameters of the call.

The compiler would also reject the code if there was an attempt to reference the
array using indices that are outside the range of the specified limits, or not
derived from the specified parameters.

Note the Parallel_Loop library call itself does not involve any globals.

The compiler would not reject calls involving a Process callback that didn't
involve any globals. i.e. It doesn't necessarily expect the actual to involve
globals, but if it does, it cannot involve globals outside of the specification.
This should eliminate the race conditions that we are worried about I think.

example usage of the above:

   -- Anonymous loop body call
   for (Start, Finish) of Parallel_Loop loop

       -- Sequential inner loop
       for I in Start .. Finish loop
          C(I) := A(I) + B(I);           -- OK
       end loop;

       C(1) := 3; -- Illegal, Index is not derived from Start or Finish
   end loop;

> (2) There's no certainty that the "anonymous loop body" proposal even
> makes it into the Standard. Several people have expressed strong
> reservations about that proposal.
>
> As an aside, I'd like to note that the "exit" problem with the
> "anonymous loop body" proposal is in fact very similar to the problems
> that arise with the parallel loop proposal. In both case, implementing
> exit/return/goto is very painful, because you have to get control,
> somehow save whatever is supposed to happen after the loop, do some
> actions, and then proceed to doing the after the loop actions. The
> only place where the "anonymous loop body" is a bit worse is that a
> user-defined routine can see some of what happens there; the parallel loop
> is all-compiler generated.

Actually, this is not a problem for exit because with the anonymous loop body,
the users code appears as a sequential loop nested inside the outer anonynmous
loop. Since the inner loop is a sequential loop, exit works just as it does
today, it only exits the inner sequential loop.

I suspect we'd probably want to make returns and gotos illegal from inside an
anonymous loop body.

So the anonymous loop body needs to have no knowledge that there is parallelism
involved. Its just a library call which happens to hide the parallelism inside
the library call.

****************************************************************

From: Tucker Taft
Sent: Tuesday, August 29, 2017  2:05 PM

Although SPARK only allows names of stand-alone objects in Global annotations,
for Ada we allow any object name, presumably including an indexed component or a
slice.  We also have proposed in AI12-0079 to allow them on formal subprograms
and access-to-subprogram, so I think the cases you mentioned are consistent with
the current AI.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 30, 2017  10:16 PM

How do the static checks work in such a case, as the bounds in question would
typically be non-static? For instance, in Brad's

  with global => array(x .. y)

X and Y probably would be outer loop indexes, or parameters to the current
subprogram.

****************************************************************

From: Tucker Taft
Sent: Wednesday, August 30, 2017  10:26 PM

You just need to prove two slices don’t overlap, which is equivalent to
proving the high bound of one is less than the low bound of the other. This
is the kind of thing that static analysis and more advanced proof tools are
pretty good at doing! This is not significantly harder than eliminating
run-time checks for array indexing when the bounds are not known at
compile-time, which is something that many Ada compilers can do based on other
information available at the point of usage.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 30, 2017  10:40 PM

OK, but again these are Legality Rules, not something that compilers are doing
on their own. The entire set of rules have to be defined formally and required
of all Ada compilers. What "static analysis" and "advanced proof tools" can or
can't do is irrelevant (unless of course we allowed what is legal or not to be
implementation-defined -- when I previously proposed that for exception
contracts, the idea seemed to be as effective as a lead balloon).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 30, 2017  11:02 PM

> I have some more thoughts towards answering the question on whether we
> should eliminate parallel loop

...

Let me throw another classic problem (one that I recently wrote for a hobby
program), the problem of determining correlations and other statistics between
two parallel arrays of data.

The core of the problem is calculating 6 pieces of data: the sum of the
elements of each array, the sum of the squares of the elements of each array,
the product of the matching elements of the arrays, and the square of the
product of the arrays.

The sequential solution is a single loop creating all 6 items:

     for I in 1 .. MAX loop
        A_Sum := @ + A(I);
        A_Sum_Sqr := @ + A(I)*A(I);
        B_Sum := @ + B(I);
        B_Sum_Sqr := @ + B(I)*B(I);
        Prod_Sum := @ + A(I)*B(I);
        Prod_Sum_Sqr := @ := (A(I)*B(I))**2;
     end loop;

MAX is over 3000 for this particular problem, so it would seem that a parallel
solution might add some performance.

These are clearly reduction expressions. However, we have to write 6 separate
reduction expressions, which means 6 times the loop overhead. Also, the above
is full of common subexpressions; it's likely that a compiler would calculate
and load A(I) and B(I) once each, and possibly eliminate some of the other
operations as well (in particular, the A*B product). None of these
optimizations can be done with separate reduction expressions, so we'll end up
with 5 times the memory reading as well (each A(I) and B(I) is read 5 times in
the sequential loop).

Thus, a parallel rewrite into 6 reduction expressions is likely to be
disappointing. With the extra parallel execution overhead, it would likely
take 8 cores to improve the performance over the sequential loop, and the
performance would be worse for 4 cores or less.

This is not a theoretical concern: Janus/Ada doesn't do many floating point
optimizations; recompiling the program with GNAT (which presumably does do
those optimizations) improved the performance by a large factor (more than 10,
as I recall).

How would you parallelize this loop??

****************************************************************

From: Tucker Taft
Sent: Thursday, August 31, 2017  11:02 PM

I do not believe we should eliminate parallel loop syntax.  The question that
relates more directly to this example is whether we need special reduction
syntax beyond the “map/reduce” construct.  This example argues that in cases
where you have multiple accumulators, you might want some way of declaring
them so that you get one per chunk.

Alternatively, we encourage the use of an accumulator “record” for this sort
of computation.  E.g.:

   type Accums is record
       A_Sum, A_Sum_Sqr : Float;
       B_Sum, B_Sum_Sqr : Float;
       Prod_Sum, Prod_Sum_Sqr : Float;
   end record;

   function Update_Accums(Orig : Accums; A_Elem, B_Elem : Float) return Accums is
   begin
       return (A_Sum => Orig.A_Sum + A_Elem,
                  A_Sum_Sqr => Orig.A_Sum_Sqr + A_Elem * A_Elem,
                  B_Sum => Orig.B_Sum + B_Elem,
                  B_Sum_Sqr => Orig.B_Sum_Sqr + B_Elem * B_Elem,
                  Prod_Sum => Orig.Prod_Sum + A_Elem * B_Elem;
                  Prod_Sum_Sqr => Orig.Prod_Sum_Sqr + (A_Elem * B_Elem) **2);
   end Update_Accums;

   Result : constant Accums :=
      (for I in 1..MAX => Update_Accums(<(others => 0.0)>, A(I), B(I));

****************************************************************

From: Tucker Taft
Sent: Thursday, August 31, 2017  3:19 PM

>> You just need to prove two slices don't overlap, which is equivalent
>> to proving the high bound of one is less than the low bound of the
>> other.  This is the kind of thing that static analysis and more
>> advanced proof tools are pretty good at doing!  This is not
>> significantly harder than eliminating run-time checks for array
>> indexing when the bounds are not known at compile-time, which is
>> something that many Ada compilers can do based on other information
>> available at the point of usage.
>
> OK, but again these are Legality Rules, not something that compilers
> are doing on their own. The entire set of rules have to be defined
> formally and required of all Ada compilers. What "static analysis" and
> "advanced proof tools" can or can't do is irrelevant (unless of course
> we allowed what is legal or not to be implementation-defined -- when I
> previously proposed that for exception contracts, the idea seemed to
> be as effective as a lead balloon).

This is a general problem with data race detection, I would say.  My suggestion
is to define what we consider to be a "potential data race" in a way that
run-time detection can be used, but make it clear that compilers are permitted
to give compile-time errors if they can show that the "potential data race"
exists in a given instance.  This is somewhat analogous to the "potentially
blocking operation," but in this case, we require the equivalent of
"detect-blocking."  And of course a compiler can omit the run-time check if it
can prove there is no potential data race.  Anything in between presumably many
compilers would provide warnings.

****************************************************************

From: Randy Brukardt
Sent: Thursday, August 31, 2017  3:38 PM

That's effectively making the checks implementation-defined (whether a
particular piece of code will compile is impl-def). Not sure how well that
will fly. Ada usually requires a run-time check in such instances and does
not allow a compile-time check (the latter because of the mess that comes
up in conditional code). Anyway, food for thought.

****************************************************************

From: Tucker Taft
Sent: Thursday, August 31, 2017  8:17 PM

Perhaps you are right.  No big reason to treat this differently than any other
run-time check.  Most compilers already produce a warning when a run-time check
is certain to fail, and often provide a mode where certain classes of warnings
are treated as errors, so that would effectively have the same net effect.  In
generic bodies, we often define things to raise Program_Error under certain
circumstances, but for compilers that do macro-expansion, almost all of these
end up as compile-time warnings, which are often treated as errors by the user.

****************************************************************

From: Brad Moore
Sent: Tuesday, October 10, 2017  12:15 AM

I am trying to write up my homework for AI12-0119-1 (parallel operations),
with an eye to possibly writing up AI12-0197-4 (Coroutines and Channels),
and have come across some observations and ideas that I think improves from
where we left off at Vienna but is modified to account for some new issues.

The basic change is that instead of parallel "begin ... end" for parallel
blocks, I am planning to substitute "do ... end do".

Eg.

    parallel do
       Foo;
    and
       Bar;
    end do;

My reasoning for this is;

1) The sequence of statements in each branch should just be a
  sequence-of-statements, rather than a handled-sequence-of-statements
  because attempting to put an exception handler in this construct I find
  is very confusing; I recall now that this is one of the main reasons
  why the gang-of-four settled on syntax modeled after the select
  statement rather than a begin statement for parallel blocks.

For example, if one writes;

    parallel begin
       Foo;
    and
       Bar;
    exception
       when others =>
         Put_Line ("Exception Caught!");
    end;

    I think it is very confusing. Is the exception handler only catching
    exceptions for the call to Bar, or is it also catching exceptions for
    the call to Foo? If we say that the exception handlers are not allowed,
    then it forces the programmer to either enclose the construct in a
    block statement or equivalent, or use block statements within each arm
   of the construct.  Either way, I find to be much clearer.

    eg.

     begin

           parallel do
              Foo;
           and
              Bar;
           end do;

     exception
        when others =>
          Put_Line("Exception Caught!");
     end;

or

       parallel
       do
           begin
              Foo;
           exception
              when others =>
                Put_Line ("Bad Foo!");
           end;

        and

           begin
             Bar;
           exception
              when others =>
                 Put_Line ("Bad Bar!");
           end;

        end do;

The use of "do" has the benefit that "begin" from Vienna, had in that you can
remove the keyword parallel and still have a working construct.
The parallel keyword is a modifier. In the previous parallel block proposal
this would not work as parallel was the name of the construct. If you remove
"parallel", you have a bunch of code that needs to be cleaned up before it
will compile.

Secondly, I don't think there is much use in having a declarative region in
this construct. Generally, you need the results to outlive the construct, so
that you can do something useful with the results.
This suggests again that the construct should be enclosed in another construct
such as a block statement.

eg.

    declare
       X : Integer;  -- Do Foo & Bar get their own copies of these?
       Y : Float;
    parallel
    begin
       X := Foo;
    and
       Y := Bar;
    exception  --  Is this handler for both Foo & Bar, or just Bar?
      when others =>
        Put_Line("Exception Caught!);
    end;

    Put_Line (????);  -- !!! I Cant print out the results !!!

In this example, two results have been calculated, but if you want to examine
both these results at the same time, you cannot, because the scope of the
results have ended before you can examine both results.

Also, having a declarative region on a parallel begin is confusing. I think some
users would be confused by this, wondering if each arm of the parallel begin
gets its own local instances of those declarations, or whether they are shared
between all the arms.

By eliminating the declarative region, it eliminates this confusion, and also
not likely to be missed because it is not useful to begin with.

If one instead writes;

declare
   X : Integer;
   Y : Float;
begin

    parallel do
       X := Foo;
    and
       Y := Bar;
    end do;

    Put_Line ("X=" & Integer'Image(X) & ", Y=" & FLoat'Image(Y));

exception
    when others =>
      Put_Line("Exception Caught!");
end;

I find this to be non-ambiguous, and provides the scope needed to examine both
results of the arms.

So it becomes clearer to me that this construct is quite different than a
block statement, and therefore it probably should have its own distinct
syntax, rather than try to make the block statement accommodate this usage,
which feels like a hack to me, and also feels like it would be making a bit
of a mess in the syntax.

Why "do"?  The reserved keyword do is relatively unused in Ada, appearing only
in the selective accept statement. It seems to fit better than begin in terms
of English semantics.

The construct really is a list of things to be done.
   e.g.
       Do X and Y and Z.

   reads better than a list of things to be begun I think.

       Begin X and Y and Z.

   Begin really is more tied to a declarative region, where you
   declare a bunch of things, then you need to specify where to "begin"
   executing. Since this construct doesn't seem to need a declarative
   region, there no need to indicate where the execution begins.

   Also many syntax constructs in Ada have the form
      name
      end name;

      e.g.

      loop
      end loop;

      if
      end if;

      select
      end select;

      case
      end case;

      record
      end record;

      If one sees "end" without looking at the above, the assumtion is that it
      corresponds to a normal block statement.

      If one sees "end do" it provides better feedback that the preceding code
      has different semantics.

      I suspect that the reason "begin" is not paired with
      "end begin" mostly because end begin looks like a weird oxymoron
      that would also be confusing to readers. "end do" does not seem
      to have this problem.

      Finally, I think the coroutine concept of ai-0197 pretty much fall
      out of this for free. If the keyword "parallel" is not present,
      then the semantics could be that each arm gets its own executor,
      similar to when the parallel keyword is present, but each executor
      is tied to the same core, thus each arm of the construct executes
      concurrently, but not in parallel, as only one arm can be executing
      at one time. If you want more parallelism, simply add the parallel
      keyword.

      I dont think channels are needed or anything else, one can use
      existing capabilities in Ada to provide the channels.
      For example, Ada.Containers.Synchronous_Queues could be used to
      provide a channel.

      Here is an example of coroutines involving two producers and one
      consumer.  The first arm is an endless loop that produces integer
      values, the second arm is a bounded loop that produces higher
      integer values, and the third arm is the consumer which will end
      up pulling values from both the other arms.

      The Exit statement of the consumer is used here to terminate the
      construct, which will abort the endless loop of the 1st arm.
      There may be reasons why adding "Exit" to block statements would
      not fit very well with syntax. I suspect there would be less reason
      to disallow Exit in a do construct.

      Alternatively, this could be a goto, return, or exception, which
      is treated as a transfer of control out of the construct, which
      we've already discussed.

declare
   Queue : Integer_Queues.Queue;
begin

    do
       declare
          X : Integer := 0;
       begin
          loop    -- endless loop
             Integer_Queue.Enqueue (X);
             X := X + 1;
          end loop;
       end;

    and

       for I in 1 .. 3 loop
          Integer_Queue.Enqueue (1_000_000 + I);
       end loop;

    and

       declare
          Value : Integer;
       begin
          for I in 1 .. 5 loop
             Integer_Queue.Dequeue (Value);
             Put_Line ("Value=" & Integer'Image (Value));
          end loop;

          Exit; -- or raise exception???
       end;

    end do;

    I've tried this code in Paraffin, using Ada's asynchronous
    transfer of control mechanism, and I see the expected results;

    Value= 1000001
    Value= 1000002
    Value= 1000003
    Value= 0
    Value= 1

The last two values might be output before the first three depending on
which arm gets to execute first. I think we could go for better determistic
behaviour and say that the initial order of execution is top down. Once all
have had their initial start, they proceed in a concurrent fashion dependent
on the behaviour of the "channel" and the blocking of the arm branches.

    Anyway, I think these are compelling reasons for using "do" rather
    than "begin", so I will write my homework up that way, unless someone
    can convince me otherwise before then.

    If we decide to go back to "begin", I dont think it will be a big
    change to go that way. I just wanted to present these ideas earlier
    so that it wont be as much a surprise when we meet in Boston,
    and to possibly receive comment earlier.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, October 10, 2017  12:42 AM

...
> I just wanted to present these ideas earlier so that it wont be as
> much a surprise when we meet in Boston, and to possibly receive
> comment earlier.

With the homework deadline about 40 hours from now, there isn't much time for
comment and AI update. And last minute flurries of mail is an issue for me, too
as it takes time to file e-mail threads. Ergo, at this point, just do it.

(Also, do as I say, not as I do... ;-)

>Finally, I think the coroutine concept of ai-0197 pretty much fall out
>of this for free. If the keyword "parallel" is not present, then the
>semantics could be that each arm gets its own executor, similar to when
>the parallel keyword is present, but each executor is tied to the same
>core, thus each arm of the construct executes concurrently, but not in
>parallel, as only one arm can be executing at one time.

OK, but what is providing the "yield" semantics? Without that, it's just stupid
tasking. And it can't require a lot of work to write, lest it hardly solve
anything that can't already be solved with an existing task.

> I dont think channels are needed or anything else, one can use
> existing capabilities in Ada to provide the channels.
> For example, Ada.Containers.Synchronous_Queues could be used to
> provide a channel.

I don't think that is what Tucker had in mind -- if normal task communication
would work, there'd be no need to propose such an idea in the first place. (Nor
any reason for coroutines, but I digress.) In any case, I want to find out
exactly what he had in mind, and trying to preempt him is not helpful.

> There may be reasons why adding "Exit" to block statements would not
> fit very well with syntax. I suspect there would be less reason to
> disallow Exit in a do construct.

Exit fits fine with the syntax of a block statement. But allowing it is wildly
incompatible, consider the Ada 95 code:

     loop
         begin
             exit when ...;
         end;
     end loop;

If block statements had exit, this exit would exit the block, rather than the
loop. That's clearly not the intention of the writer of the Ada 95 code.

As you note, there's no compatibility problem with "do", but I don't think it is
a particularly good idea to allow that in one construct but not in a similar
one. Is this really necessary? It seems like a rare need, and a goto works fine
(even if clunky).

****************************************************************

From: John Barnes
Sent: Tuesday, October 10, 2017  1:53 AM

i rather like do.

****************************************************************

From: Tullio Vardanega
Sent: Tuesday, October 10, 2017  2:25 AM

So do I.

****************************************************************

From: Brad Moore
Sent: Tuesday, October 10, 2017  9:24 AM

> Ergo, at this point, just
> do it.

Interesting choice of words. If you had said, just begin it, it would probably
lower the chances of me getting it done. ;)

> Finally, I think the coroutine concept of ai-0197 pretty much fall
>> out of this for free. If the keyword "parallel" is not present, then
>> the semantics could be that each arm gets its own executor, similar
>> to when the parallel keyword is present, but each executor is tied to
>> the same core, thus each arm of the construct executes concurrently,
>> but not in parallel, as only one arm can be executing at one time.
> OK, but what is providing the "yield" semantics? Without that, it's
> just stupid tasking. And it can't require a lot of work to write, lest
> it hardly solve anything that can't already be solved with an existing task.

We could have two subtype attributes, 'Consume and 'Yield allowed only in "do"
statements that semantically read and write to an implicit buffer or queue of
that subtype. The scope of the implicit queue would be tied to the scope of the
do statement. This would be expected to work regardless whether the do statement
has the parallel keyword or not. The default length of the queue could be 1, but
perhaps could be controlled by an aspect on the subtype declaration.

e.g.

do
   for I in 1 .. 1_000_000 loop
      Integer'Yield(I);
   end loop;
and
   for I in 1 .. 1_000_000 loop
      Float'Yield(Sqrt(I));
   end loop;
and
   for I in 1 .. 10 loop
      Put_Line ("The Square Root of" & Integer'Image(Integer'Consume) & " is"
                & Float'Image(Float'Consume));
   end loop;
   goto Done;
end do

<<Done>>

****************************************************************

From: Brad Moore
Sent: Wednesday, October 11, 2017  4:01 PM

Here is my homework for this AI. [This is version /04 of the AI - Editor.]

I have applied major rework to the reduction expression construct, since it was
very sketchy to begin will.

I have also renamed parallel blocks to concurrent blocks, using the do ... and
.. end do syntax I described in a recent email. If the parallel keyword is not
present, then the do statement executes concurrently within the same task, which
is safer and more useful than just applying sequential execution.

This will also provide a good starting point for AI12-0197, coroutines, since
most of the construct is already in place.

For parallel loops, the biggest change was to move the parallel keyword from in
the middle of the syntax, to before the "for" keyword.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, October 11, 2017  4:01 PM

> Here is my homework for this AI.

I fixed some minor issues in this AI (without making a separate version):

The reference to AI12-0064-1 is stale (that alternative is abandoned), and the
aspect is named Nonblocking.

The paragraph starting "Any transfer of control..." had two left-over uses of
"parallel sequences", which I replaced by "concurrent sequences" to match the
other changes (including earlier in the same sentence).

The paragraph starting "Note that the same rules..." talks about "parallel
blocks", which was otherwise changed to "concurrent blocks".

There was a missing ) in the first paragraph of the Legality Rules for 4.5.9.

There was a stray "Modify 5.5.2", which was removed.

Some of the formatting was unusually narrow (while it was unusually wide last
time - can't win, I guess).


===

Comment: Examples in the RM need to be complete, typically by depending on
declarations from previous examples. None of the examples in 4.5.9 or 5.6.1 seem
to do that. (A couple might be stand-alone, which is OK, but I didn't check
carefully.) That needs to be fixed.

Comment: You added "reducer" aspects as needed to Ada.Strings and the like. But
don't the containers need something similar? I could do that in AI12-0112-1, but
I'd have to know what is needed (and I'm not the right person to figure that
out).

****************************************************************

From: Brad Moore
Sent: Wednesday, October 11, 2017  11:26 PM

...
> I fixed some minor issues in this AI (without making a separate version):
...
> There was a missing ) in the first paragraph of the Legality Rules for
> 4.5.9.

Thanks for catching these, Randy.

> There was a stray "Modify 5.5.2", which was removed.
>
> Some of the formatting was unusually narrow (while it was unusually
> wide last time - can't win, I guess).

I dont know how wide is good. I had my text editor set at a 72 character
margin by default and tried to stick to that. Would 80 characters be
better?

 ===
>
> Comment: Examples in the RM need to be complete, typically by
> depending on declarations from previous examples. None of the examples
> in 4.5.9 or 5.6.1 seem to do that. (A couple might be stand-alone,
> which is OK, but I didn't check carefully.) That needs to be fixed.
>
> Comment: You added "reducer" aspects as needed to Ada.Strings and the like.
> But don't the containers need something similar? I could do that in
> AI12-0112-1, but I'd have to know what is needed (and I'm not the
> right person to figure that out).

I thought I had gone through the containers in this AI. It should be there.
For instance, see

  Modify A.18.2 (15/2-18/.2)
and so on....

****************************************************************

From: Randy Brukardt
Sent: Wednesday, October 11, 2017  11:41 PM

...
> I dont know how wide is good. I had my text editor set at a
> 72 character margin by default and tried to stick to that.
> Would 80 characters be better?

The usual is either 79 or 80 (depends on whether a program or I am doing it).
Some of the old text seemed to be about 100 characters, and the new was 72.
Tough to compare.

****************************************************************

From: Brad Moore
Sent: Thursday, December 14, 2017  9:45 AM

I have been thinking more about parallelization problems that need what I have
been calling, manual chunking, where the user needs to have more control of
how loops are broken down into chunks.

Previously I had identified a number of such problems, including loops that
involve synchronous barriers, such as gaussian elimination for solving
matrices, or where multiple loops need to operate on an array using the same
chunking, such as for computing a cumulative sum of an array of elements.

Since then, I have identified more generally the need, which basically amounts
to cases where user-provided initialization and/or finalization needs to be
applied to the chunks.

I think a good example for this need is the case of parallel file IO. Consider a
large file of fixed-size records where all records in the file need to be
processed, and where it is desired to to this in parallel.

It would be too inefficient to open and close a file handle for each record, and
then seek to the position of the record to be modified.

What is needed here is for each chunk to open its own file handle, process the
records of that chunk, then close the file handle. It is not reasonable to
expect that the compiler will be able to automatically provide such chunk
initialization and finalization, and I think trying to accomplish this with
syntax using aspects would be too messy. A library approach seems like a better
alternative.

Another example is memory allocation, where when processing the elements of an
array or container, a temporary data structure is needed for the processing.
Again, rather than allocate and deallocate this data structure for every element
of the array or container, it makes better sense to allocate the structure once
for each chunk (whether it be allocated from the stack or the heap), and then
free the data structure once the chunk processing is complete.

It seems that there ought to be a standard library/facility that can be used to
generate the chunking. This makes sense in particular because;
  - Such a library can be non-trivial to write
  - Such a library if provided by the vendor can be better tuned to how the
    vendor implements parallelism for parallel loops, blocks, etc.

I have been prototyping what such a library might look like, and have a working implementation that also ties in to how parallel iteration can be added to all the standard containers.

The basic idea is to provide a special container library that can generate a chunk-vector. The chunk-vector is an abstraction that essentially is a vector of chunk boundaries, where each chunk boundary identifies the start and end cursors of a larger set o
f iteration.

I have two generic libraries for this. One is intended for use when iterating over discrete subtypes, and the other is intended for use when iterating over containers.

The basic skeleton of the discrete chunker is;

generic
   type Loop_Cursor is (<>);
package Ada202x.Containers.Loop_Chunking.Discrete_Iteration is

   type Chunk_Bounds is tagged private;

   function Start  (Chunk : Chunk_Bounds) return Loop_Cursor;
   function Finish (Chunk : Chunk_Bounds) return Loop_Cursor;

   type Chunk_Vector (<>) is tagged limited private
   with
      Constant_Indexing => Constant_Reference,
      Default_Iterator  => Iterate,
      Iterator_Element  => Chunk_Bounds;

   function Chunks
     (From : Loop_Cursor := Loop_Cursor'First;
      To   : Loop_Cursor := Loop_Cursor'Last) return Chunk_Vector;

   type Cursor is private;

   package Chunk_Vector_Iterator_Interfaces is new
     Ada202x.Iterator_Interfaces (Cursor, Has_Element);

   function Iterate (Container : Chunk_Vector)
      return Chunk_Vector_Iterator_Interfaces.Parallel_Iterator'Class;

private
   ...
end Ada202x.Containers.Loop_Chunking.Discrete_Iteration;

The basic skeleton of the container chunker is very similar, differing only in
the generic formal parameters;

generic
   type Loop_Cursor is private;

   with package Container_Iterator_Interfaces is new
     Ada202x.Iterator_Interfaces(Cursor       => Loop_Cursor,
                                 Has_Element  => <>);

package Ada202x.Containers.Loop_Chunking.Container_Iteration is

   type Chunk_Bounds is tagged private;

   function Start  (Chunk : Chunk_Bounds) return Loop_Cursor;
   function Finish (Chunk : Chunk_Bounds) return Loop_Cursor;

   type Chunk_Vector (<>) is tagged limited private
   with
      Constant_Indexing => Constant_Reference,
      Default_Iterator  => Iterate,
      Iterator_Element  => Chunk_Bounds;

   function Chunks
     (Iterator : Container_Iterator_Interfaces.Parallel_Iterator'Class)
      return Chunk_Vector;

   type Cursor is private;

   package Chunk_Vector_Iterator_Interfaces is new
     Ada202x.Iterator_Interfaces (Cursor, Has_Element);

   function Iterate (Container : Chunk_Vector)
      return Chunk_Vector_Iterator_Interfaces.Parallel_Iterator'Class;

private
 ...
end Ada202x.Containers.Loop_Chunking.Container_Iteration;

I've also added the following parallel iterator interface to
Ada.Iterator_Interfaces;

   use type Ada202x.Containers.Count_Type;
   subtype Count_Type is Ada202x.Containers.Count_Type;

   type Cursor_Offset is range -(Count_Type'Last - 1) .. Count_Type'Last - 1;

   type Parallel_Iterator is limited interface and Forward_Iterator;

   function Length
     (Object   : Parallel_Iterator) return Count_Type is abstract;
   -- A count of the number of elements in the container to be iterated over

   function Jump
     (Object   : Parallel_Iterator;
      Position : Cursor;
      Offset   : Cursor_Offset) return Cursor is abstract;
   -- A means to generate a cursor based on an offset number of elements
   -- from another cursor of the container.

   type Reversible_Parallel_Iterator is
      limited interface and Parallel_Iterator and Reversible_Iterator;

A chunk-vector is a container that can be iterated over in parallel using
constructs such as parallel loops and reduction expressions.

To see how this might be used, consider the parallel IO example I described
above, where we have a file of employee records, and we want to update all the
records to give everyone a 2% raise of their current salary.

One could write;

with Ada.Direct_IO;

with Ada202x.Containers.Loop_Chunking.Discrete_Iteration;

procedure Test_Chunking
is
   subtype Last_Name is String (1 .. 12);
   subtype Address is String (1 .. 20);
   type Annual_Salary is delta 0.01 digits 15;

   type Employee is record
      Employee_Name : Last_Name;
      Home_Address  : Address;
      Salary        : Annual_Salary;
   end record;

   package Employee_IO is new Ada.Direct_IO (Element_Type => Employee);

   Data_Filename : constant String := "Employees.dat";

   Record_Count : constant Employee_IO.Positive_Count := Get_Record_Count;

   subtype Loop_Index is Employee_IO.Positive_Count range 1 .. Record_Count;

   package Manual_Chunking is new
      Ada202x.Containers.Loop_Chunking.Discrete_Iteration (Loop_Index);

   procedure Give_Raise
     (From, To    : Loop_Index)
   is
      Employee_File    : Employee_IO.File_Type;
      Current_Employee : Employee;
   begin

      Employee_IO.Open (File => Employee_File,
                        Mode => Employee_IO.Inout_File,
                        Name => Data_Filename,
                        Form => "shared=no");

      Employee_IO.Set_Index (File => Employee_File,
                             To   => From);

      for Position in From .. To loop

         Employee_IO.Read (File => Employee_File,
                           Item => Current_Employee);

         --  2% Raise to everyone!
         Current_Employee.Salary := @ + @ * 0.02;

         Employee_IO.Write (File => Employee_File,
                            Item => Current_Employee);
      end loop;

      Employee_IO.Close (Employee_File);

   end Give_Raise;

begin --  Test_Chunking

   parallel
   for Chunk of Manual_Chunking.Chunks loop
      Give_Raise (From => Chunk.Start,
                  To   => Chunk.Finish);
   end loop;

end Test_Chunking;

One might further enhance this example to use the anonymous loop body syntax, if
that AI is approved.

A similar example could be concocted for manually chunking parallel iteration of
a container.

Further to this, once we've defined the parallel iterator interface, we could
update all the standard containers to use this interface.

For example, I have prototyped the changes to
Ada.Containers.Doubly_Linked_Lists;

The relevant bits are;

with Ada202x.Iterator_Interfaces;

generic
...
package Ada202x.Containers.Doubly_Linked_Lists is

   ...

   package List_Iterator_Interfaces is new
     Ada202x.Iterator_Interfaces (Cursor, Has_Element);

   ...

   function Iterate (Container : List)
      return List_Iterator_Interfaces.Reversible_Parallel_Iterator'Class;

   function Iterate (Container : List; Start : Cursor)
      return List_Iterator_Interfaces.Reversible_Parallel_Iterator'Class;

    ...
private

end Ada202x.Containers.Doubly_Linked_Lists;

If a loop is written without the parallel keyword or reverse keyword, you get a
forward iterator. If you have the reverse keyword, you get the reverse iterator,
or if you have the parallel keyword, you get the parallel iterator semantics.

Note also that when user initialization/finalization of chunks is not needed,
the implementation could fall back on its own chunking strategies, but a vendor
may find that the chunking libraries I've described above are also useful for
implementing its own parallelism for these cases.

Does this seem like ideas worth pursuing? If so, should this be added to the
parallel operations AI (AI12-0119-1), or should this be a separate AI?

***************************************************************

From: Tucker Taft
Sent: Thursday, December 14, 2017  4:56 PM

The paper about Generalized Parallel Iterators from Ada 2016 seems relevant:

   http://www.ada-europe.org/archive/auj/auj-37-2.pdf  (scan to physical page
   31, logical page 95)

The accompanying presentation is here:

    http://www.cister.isep.ipp.pt/ae2016/presentations/taft.pdf

It proposes a syntax that would allow the following:

declare
      package String_Hyp_Objs is new
          Indefinite_Hyper_Objects (String, Identity => “”, Reducer => “&”);
      String_Accum :  String_Hyp_Objs.Accumulator (Count => Num_Chunks);

begin
     for Elem of parallel(C in 1 .. String_Accum.Count) My_Str_Vector loop
         String_Accum.Update (Index => C, Element => Elem);
         -- Concatenate Elem onto end of growing accumulated result
    end loop;

    Put_Line (String_Accum.Reduce);
    -- Put concatenation of all strings from My_Str_Vector end;

Since we have moved the location of the "parallel" reserved word in newer
proposals, we would have to move the part that declares the "chunk index" C,
namely "C in 1..String_Accum.Count"  (using syntax reminiscent of an
entry-family index).  But the basic idea of providing an explicit chunk index
still seems viable.

Providing a chunk index to the body of the loop might enable many of the
capabilities you are suggesting.  Would such a chunk index be enough?

****************************************************************

From: Randy Brukardt
Sent: Thursday, December 14, 2017  5:17 PM

...
> Does this seem like ideas worth pursuing? If so, should this be added
> to the parallel operations AI (AI12-0119-1), or should this be a
> separate AI?

I can't really answer the first part (again, it makes me wonder if there is
sufficient value to the "usual" parallel loop), but the second part is pretty
clear: it has to be a separate AI. If we don't buckle down and finish some
version of the parallel loop stuff ASAP, there is no chance that any of it
will be ready the summer after next (when our work on Ada 2020 completes).

If this really isn't sufficiently mature to standardize (and the fact that you
have a radically different proposal every 2-3 months strongly suggests that),
then we should completely forget it for this cycle and spend our time on other
proposals that actually have a chance to converge. We've only got
18 months to finish this!!!

****************************************************************

From: Tucker Taft
Sent: Thursday, December 14, 2017  5:21 PM

> Does this seem like ideas worth pursuing? If so, should this be added
> to the parallel operations AI (AI12-0119-1), or should this be a separate AI?

I would recommend a separate AI.  I believe we have also recommended splitting
AI12-0119 into two AIs, one about reduction expressions and one about parallel
loop/bock.  Perhaps you could do that sooner rather than later, so we will have
a new AI for reduction expressions that we can reference.  Ed Schonberg at
AdaCore has already implemented a version of reduction expressions, and it is
pretty nice.

****************************************************************

From: Randy Brukardt
Sent: Thursday, December 14, 2017  5:29 PM

Yes, all of the above is true. And with a January 26th homework deadline, all
homework needs to be done sooner rather than later. Hint, Hint!! :-)

BTW, the "new idea" deadline for ARG members is June 15th, so keep that in
mind. After June, we're supposed to ignore new feature ideas from everyone,
including each other. (For the general public, it is January 15th.)

****************************************************************

From: Tucker Taft
Sent: Thursday, December 14, 2017  5:41 PM

> If this really isn't sufficiently mature to standardize (and the fact
> that you have a radically different proposal every 2-3 months strongly
> suggests that), then we should completely forget it for this cycle and
> spend our time on other proposals that actually have a chance to
> converge. We've only got 18 months to finish this!!!

This is not a good outcome from my perspective.  Our prime directive should
be to get parallel blocks and parallel loops done.  So I agree with Randy that
we should finish the basic capabilities that we started, before inventing
anything new.  Splitting the AI into two parts would be a good start, and then
refining the wording of each as needed.

****************************************************************

From: Jeff Cousins
Sent: Tuesday, December 19, 2017  4:48 AM

How active is the “Gang of Four” concept?  Probably naively, I’ve been
assuming that they would have been having e-mail discussions, and possibly
even a meeting, outside of the full ARG.  Or is the correspondence that we’ve
been seeing on the ARG list everything?

****************************************************************

From: Tucker Taft
Sent: Tuesday, December 19, 2017  8:08 AM

No regular "gang-of-four" meetings these days.  I think we are largely past
that stage, and moved into the "AI" stage, where the ARG is the center of
action.

****************************************************************

From: Brad Moore
Sent: Wednesday, December 27, 2017  11:00 PM

Here is a first chunk of my homework.

Here we have what remains of AI12-0119-1. [This is version /05 - ED.]

As you may recall, reduction expression syntax is being separated from this AI,
to be placed in a new so-far unknown AI. Coroutine capabilities are also
similarly deferred to a different AI.

I have hopefully addressed the comments from the last meeting.
Some notable new bits are;

- I have added parallel iterator interfaces to Ada.Iterator_Interfaces to
  facilitate container iteration, and updated all the containers to have
  parallel iterators (in addition to the iterator schemes currently supported).

- I have also attempted to add legality rules to eliminate data races
  in parallel loops and parallel blocks, and a new suppressible check called
  Loop_Parameter_Check, which dynamically checks that the loop parameters
  are not used non-sequentially between tasklets to eliminate that as a source
  of erroneousness.

- I am hoping that the global AI means that most data races can be eliminated as
  a legality check, rather than a dynamic runtime check, or static warning.

- I think the only ones that need dynamic checking are those that update global
  variables that also mention the loop parameter variable.

Cheers, Merry Christmas, and Happy New Year!

***************************************************************

From: Jean-Pierre Rosen
Sent: Thursday, December 28, 2017  12:07 AM

> On the other hand, calls by different
> tasklets of the same task into the same protected object are treated
> as different calls resulting in distinct protected actions; therefore
> synchronization between tasklets can be performed using protected
> operations.

What if a tasklet calls an entry of the enclosing task, and another one accepts
it? It would work with real parallelism, but it would be a sure deadlock if
executed sequentially.

Maybe forbid accepts in parallel statements?

***************************************************************

From: Jeff Cousins
Sent: Thursday, December 28, 2017  2:25 AM

Thanks for doing that Brad, and Happy Christmas and New Year everyone.

***************************************************************

From: Brad Moore
Sent: Thursday, December 28, 2017  11:03 AM

> Maybe forbid accepts in parallel statements?

Good point, I agree that probably a legality rule should be added to forbid
accept statements in parallel blocks and loops.

Note, a similar problem can occur for protected objects, for example if one side
of a parallel block is a producer and another is a consumer. If executing
sequentially, the consumer might block waiting for the producer, which could
deadlock, whereas this would work fine for parallel execution. I think what we'd
want is to fall back to coroutine-like behaviour if executing sequentially where
if a tasklet blocks, it implicitly yields allowing other tasklets to execute.

Just prior to last meeting, the AI dealt more with sequential behaviour for the
case of a parallel block without the parallel keyword. It was considered to be a
coroutine. We decided at last meeting to move the coroutine part of it to a
separate AI, but these AIs are still closely intertwined because even if the
parallel keyword is present we need to consider the case where the
implementation decides to ignore that hint from the programmer (perhaps due to
oversubscription of cores from other parallel statements) and execute the
parallel block sequentially.

I think I need to add some wording to this AI that talks about implicit yields
when tasklets are blocked. The other AI can probably focus more on explicit
yield statements.

***************************************************************

From: Randy Brukardt
Sent: Friday, December 29, 2017  12:35 AM

> > Maybe forbid accepts in parallel statements?
>
> Good point, I agree that probably a legality rule should be added to
> forbid accept statements in parallel blocks and loops.

I thought the intent was to not allow blocking in bodies of parallel blocks and
loops. Otherwise, a tasklet could hardly be a lightweight entity, as it would
require the entire mechanisms of an Ada task (including queuing, priority
inheritance, and the like). While that certainly would be more capable, it also
would be much more dangerous.

We have a mechanism for compile-time enforcement of nonblocking (just finished),
and I had thought part of the reason for defining that mechanism was to use it
in parallel blocks and loops. Otherwise, it seems like it was a lot of work for
minimal gain.

> Note, a similar problem can occur for protected objects, for example
> if one side of a parallel block is a producer and another is a
> consumer. If executing sequentially, the consumer might block waiting
> for the producer, which could deadlock, whereas this would work fine
> for parallel execution. I think what we'd want is to fall back to
> coroutine-like behaviour if executing sequentially where if a tasklet
> blocks, it implicitly yields allowing other tasklets to execute.

Again, I'd expect this sort of thing to be banned. Protected operations still
could be used to get mutual exclusion, but entry calls would not be allowed.

I don't see any reasonable way to reason about the behavior of tasklets if
scenarios like the above are allowed, since neither the number of tasklets nor
how they "chunk" iterations is known to the author of the parallel loop. The
only portable way to write such a loop is to avoid blocking altogether (since
there can be no reliable way to ensure that deadlock can't happen).
Communication between tasklets has to be prevented (outside of reduction
expressions, of course).

> Just prior to last meeting, the AI dealt more with sequential
> behaviour for the case of a parallel block without the parallel
> keyword. It was considered to be a coroutine.
> We decided at last meeting to move the coroutine part of it to a
> separate AI, but these AIs are still closely intertwined because even
> if the parallel keyword is present we need to consider the case where
> the implementation decides to ignore that hint from the programmer
> (perhaps due to oversubscription of cores from other parallel
> statements) and execute the parallel block sequentially.
>
> I think I need to add some wording to this AI that talks about
> implicit yields when tasklets are blocked. The other AI can probably
> focus more on explicit yield statements.

You can't be serious. This is way worse than my original fear; not only do the
supposedly lightweight tasklets have to carry all of the machinery of Ada tasks,
but they also have to act like coroutines, with an entire additional level of
complication.

Assuming that the coroutine proposal actually get fleshed out and into the
Standard, a user could write such a mechanism if they wanted it. But paying for
it all the time seems mad to me, and you still wouldn't be able to reason in any
useful way about how tasklets execute (since you don't know the mapping).

Communication between parallel loop iterations (and between parallel block
"limbs") should be banned. Otherwise, it means that the supposedly "parallel"
operations aren't the independent entities that can actually be run in parallel
in the first place.

I could imagine relaxing such restrictions in the case when the number of
tasklets and their mapping is specified via some future mechanism. (In that
case, one could actually portably reason about the behavior.) But it seems to me
that we have Ada tasks for such needs; parallel blocks and loops are supposed to
be easy-to-use, safe-by-construction facilities to that make it simple to
introduce parallelism into a program. If things are complex enough, a library
like your (Brad) Parafin makes more sense. It can't have safety-by-construction
but can allow more complex structures.

***************************************************************

From: Brad Moore
Sent: Friday, December 29, 2017  12:33 AM

> We have a mechanism for compile-time enforcement of nonblocking (just
> finished), and I had thought part of the reason for defining that
> mechanism was to use it in parallel blocks and loops. Otherwise, it
> seems like it was a lot of work for minimal gain.

One of the later developments of the gang of four was that we determined that it
would be OK to support protected actions in tasklets. That doesn't necessarily
mean we should support that, but that it is an option, so we included it in our
proposal. I'd have to look back, but it might have been that we were more keen
on supporting protected function calls and protected procedure calls, but not so
much blocking calls.

I agree that disallowing blocking calls would simplify the proposal, and would
be OK with doing that if that is the general consensus. We could always add that
in later if we felt it was needed, but it is much harder to disallow something
later once it gets in the standard, as you know.

***************************************************************

From: Randy Brukardt
Sent: Friday, December 29, 2017  6:44 PM

...
> One of the later developments of the gang of four was that we
> determined that it would be OK to support protected actions in
> tasklets. That doesn't necessarily mean we should support that, but
> that it is an option, so we included it in our proposal. I'd have to
> look back, but it might have been that we were more keen on supporting
> protected function calls and protected procedure calls, but not so
> much blocking calls.

Protected subprogram calls can be considered nonblocking assuming the protected
type is declared that way (meaning that static nonblocking checking is applied
to the operations). These are important for mutual exclusion, and so they would
have to be allowed (and they are with suitable declarations, which applies to
everything).

> I agree that disallowing blocking calls would simplify the proposal,
> and would be OK with doing that if that is the general consensus. We
> could always add that in later if we felt it was needed, but it is
> much harder to disallow something later once it gets in the standard,
> as you know.

Right, that's part of my thinking here. If we find some compelling examples
where some limited blocking would work and make sense even if the mapping of
iterations to tasklets is unknown, then we can revisit. But the goal is
safety-by-construction: if the loop compiles, it will execute safely without
deadlocks or races. That's a lot easier if we don't have to worry about blocking
(it's not clear to me that it is even possible if blocking is involved).

And I want tasklets to be a light-weight as possible. Indeed, I would like to
view tasks as being constructed on top of tasklets by adding blocking
capabilities and the like.

***************************************************************

From: Brad Moore
Sent: Saturday, December 30, 2018  1:34 PM

I have been looking more closely at Tuckers paper from the Ada Users Journal on
Container Iteration, and I think it would be worth incorporating some of those
ideas.

 http://www.ada-europe.org/archive/auj/auj-37-2.pdf  (scan to physical page 31,
 logical page 95)

In particular, providing a Split primitive on the Parallel_Iterator_Interface
instead of a Jump function to Jump to an arbitrary cursor relative to another
cursor, gives the container writer more flexibility in determining which cursors
of the container should be used for the chunks, as well as possibly the number
of chunks to be applied.

This also provides a common, better defined interface that can be used by both
the implementation, and the user. The compiler would use the interface under
normal circumstances, but the user could also use the interface when manual
chunking is needed, which would eliminate the need to provide a separate
interface for that purpose.

I think this can be accomplished with some fairly minor tweaks to the existing
AI.

As it stands, I would replace the addition to Ada.Iterator_Interfaces of;

   use type Ada.Containers.Count_Type;
   subtype Count_Type is Ada.Containers.Count_Type;

   type Cursor_Offset is range -(Count_Type'Last - 1) .. Count_Type'Last - 1;

   type Parallel_Iterator is limited interface and Forward_Iterator;

   function Length
     (Object   : Parallel_Iterator) return Count_Type is abstract;

   function Jump
     (Object   : Parallel_Iterator;
      Position : Cursor;
      Offset   : Cursor_Offset) return Cursor is abstract;


with;

   type Chunk_Array is limited interface;

   function Length
     (Object  : Chunk_Array) return Natural is abstract;

   function Start
     (Object : Chunk_Array;
      Index  : Natural) return Cursor is abstract;

   function Finish
     (Object : Chunk_Array;
      Index  : Natural) return Cursor is abstract;

   type Parallel_Iterator is limited interface and Forward_Iterator;

     function Split
       (Object   : Parallel_Iterator) return Chunk_Array'Class is abstract;

The Chunk_Array interface would provide the means to extract the cursors out of
the Split result.

The Length function indicates how many chunks were returned by the Split call,
and Start, Finish return the start and end cursors for a particular chunk.

Then for each of the standard containers, I would add a type definition for the
Chunk_Array.

For example, for Ada.Containers.Doubly_Linked_Lists, I would add;

   type Chunk_Array (<>) is new
     List_Iterator_Interfaces.Chunk_Array with private;

I believe the Chunk_Array definition would be the only additional line needed in
each container.

Then, if a user wanted to manually chunk a list container, one could write
something like;

      Chunks : constant My_Lists.List_Iterator_Interfaces.Chunk_Array'Class :=
          List.Iterate.Split;
   begin

      parallel
      for Chunk in 1 .. Chunks.Length loop
         Inner_Loop :
         for Position in List.Iterate (Start => Chunks.Start(Chunk)) loop
            ...
            exit Inner_Loop when Position = Chunks.Finish (Chunk);
         end loop Inner_Loop;
      end loop;
   end;

Note that no additional syntax support is needed than is already proposed in the
AI.

Does this seem like an improvement?
If so, I will update the AI to include these changes.

***************************************************************

From: Randy Brukardt
Sent: Saturday, December 30, 2018  3:55 PM

Tucker's idea certainly makes more sense than your old idea, which requires
adding indexed access to containers that don't naturally have such access. But I
would have expected the Split interface to at least take a requested number of
chunks. Otherwise, the container has no way of knowing how many chunks would be
optimal (since that depends on the number of tasklets available and other
implementation details). It shouldn't be guessing. And it would seem that a
sensible implementation of the interface you have would be to simply return a
single chunk for every query, since that is easy to implement and makes as much
sense as any other result. I think you'd want to prevent that somehow, and the
best I can think of is to have a suggested number of chunks.

Of course, it might make sense for the container to return some other number of
chunks (if the container doesn't have enough elements for the requested number
of chunks, for instance). So that part of the interface makes sense.

But we definitely don't want the container interface to be tightly tied to the
parallel loop implementation, which it would have to be if there is no number of
chunks in the Split request. Hopefully both things can be implemented separately
(quite possibly by different groups for larger compiler vendors).

P.S. You might want to consider whether the routines of this interface ought to
be declared nonblocking. One can't do that with the existing iterators because
of compatibility, but you have no such concern.

***************************************************************

From: Brad Moore
Sent: Sunday, December 31, 2018  3:05 PM

> Tucker's idea certainly makes more sense than your old idea, which
> requires adding indexed access to containers that don't naturally have such
> access.

I have actually modified Tuck's ideas from his paper in several ways, which I
think are improvements.

In particular, he had Parallel_Interfaces in a generic child package of
Iterator_Interfaces, using a Cursor_Array formal as an incomplete type.

I started off with that but couldn't get it to work, as the Cursor_Array had to
be a private type, which then nobody, (Implementations or Users) could do
anything with the private type. By defining the Chunk_Array interface in the
same place as the iterator, I was able to eliminate the generic formal, allowing
the interface definition to be moved into iterator_interfaces, which means only
one instantiation is needed to get all iterator interfaces, which I think is
more convenient.

He also had the Cursor array being passed into Split as an out parameter. I
think returning the Chunk_Array as a function result will be more convenient to
users.

Lastly, his Split operation was returning an array of cursors, whereas my Split
returns an array of chunks where each chunk has a start and end cursor.

I think this is safer, particularly for user usage, since with an array of
cursors, the user has to be sure to not dereference the cursor representing the
end of the chunk, because that is being used as the start of another chunk.
Also, it eliminates the user having to deal with the end case for the last chunk
where the last cursor would have been a No_Element cursor.

With the current scheme, all cursors from first to last of a chunk are valid and
non overlapping with other chunks. I think it also fits nicer for user usage,
using the Iterate calls that accept a start parameter, where one can exit the
inner loop by comparing to the last cursor of the chunk, as in my example from
the previous email.

> But I would have expected the Split interface to at least take a
> requested number of chunks. Otherwise, the container has no way of
> knowing how many chunks would be optimal (since that depends on the
> number of tasklets available and other implementation details). It
> shouldn't be guessing. And it would seem that a sensible
> implementation of the interface you have would be to simply return a
> single chunk for every query, since that is easy to implement and
> makes as much sense as any other result. I think you'd want to prevent
> that somehow, and the best I can think of is to have a suggested number of
> chunks.

I agree, I have prototyped a new version of Split that looks like;

   function Split
     (Object        : Parallel_Iterator;
      Advised_Split : Natural)
      return Chunk_Array'Class is abstract;

Where Advised_Split is the recommended number of chunks passed in from the
implementation, but can be ignored by the container writer if he/she so chooses.

I also added another function to the Parallel_Iterator interface

   function Iterations
     (Object : Parallel_Iterator)
      return Ada.Containers.Count_Type is abstract;

Because I think the implementation will need to query the iterator to determine
the number of iterations associated with the iterator so that the implementation
can make a better informed decision about determining the number of chunks to
suggest.

I think for user chunking, the user often would prefer to not have to decide the
number of chunks, so I was thinking of adding the following definitions to
System.Multiprocessors.

   type Iteration_Count is range -(2 **63) .. +(2 **63 - 1);
   for Iteration_Count'Size use 64;

   type Chunk_Count is range 0 .. 2 **32 - 1;

   function Advised_Chunk_Count
     (Iterations :  Iteration_Count) return Chunk_Count
     witn Non_Blocking => False;

Which provides a routine that the user can call to query the implementation for
a recommended chunk count to use, given a number of iterations.

I decided to make this routine Non_Blocking => False because the routine might
want to query the processor load to see how many processors are idle.

On Linux, I have found this can be done by reading system files from the /proc
filesystem, but presumably that would be blocking behaviour.


Lastly, I know we last thought another AI would be recommended, but now that
container iterators can be defined this way, it makes it a lot simpler to define
a package that can similarly support chunking for discrete type iteration.

with Ada202x.Iterator_Interfaces;
with Ada202x.Containers;

generic
   type Loop_Index is (<>);
package Ada.Discrete_Chunking is

   pragma Preelaborate;
   pragma Remote_Types;

   function Has_Element (Position : Loop_Index) return Boolean is (True);

   package Chunk_Array_Iterator_Interfaces is new
     Ada202x.Iterator_Interfaces (Loop_Index, Has_Element);

   type Chunk_Array (<>) is new
     Chunk_Array_Iterator_Interfaces.Chunk_Array with private;

   function Iterate
     (From : Loop_Index := Loop_Index'First;
      To   : Loop_Index := Loop_Index'Last)
   return Chunk_Array_Iterator_Interfaces.Parallel_Reversible_Iterator'Class;

private
   ...
end Ada.Discrete_Chunking;

This allows the user to write something like;

   declare
      package Manual_Chunking is new Ada.Discrete_Chunking (Natural);

      Chunks : constant
        Manual_Chunking.Chunk_Array_Iterator_Interfaces.Chunk_Array'Class :=
          Manual_Chunking.Iterate.Split (Advised_Split => 10);
   begin

      parallel
      for Chunk in 1 .. Chunks.Length loop
         for I in Chunks.Start (Chunk) .. Chunks.Finish (Chunk) loop
            ...
         end loop;
      end loop;
   end;

Which is very similar to the Container loop example.


The only thing that would be nice to have with the Chunk_Array type returned by
Split, is if it could be treated as a container so one could replace the loop
above with;

      parallel
      for Chunk of Chunks loop
         for I in Chunks.Start .. Chunks.Finish loop
            ...
         end loop;
      end loop;

But I suspect this is not currently possible to do, at least with something that
returns Interface'Class because I think you'd need to apply the Iterator_Element
aspect to the interface which cannot be done, since the interface doesn't know
the Iterator_Element type. Maybe there is a way to work around this, but I think
its probably good enough to require using "in" iterator loop syntax as above for
these cases instead of "of" syntax. Other than manual chunking, it will be the
implementation using the interface, so the "prettyness" of the syntax is not
needed.

***************************************************************

From: Jeff Cousins
Sent: Monday, January 1, 2018  6:15 PM

Thanks Brad for all that.

Some comments, mostly minor/typos:

!proposal 2nd para

notational -> notional ?

!proposal 5th para

9.10 ([23]{11???})), -> simply 9.10(11) ?

!proposal 6th para

I don’t think that you can say that tasklets are orthogonal to tasks,
particularly as the very next sentence covers how they relate; tasklets are a
sort of tasks--.

Parallel Blocks 3rd para last sentence

Following on from Jean-Pierre’s and Randy’s comments, this would reduce to
something like “The appropriate use of atomic or protected subprogram calls can
be used to avoid erroneous execution.”

Parallel Blocks 4th para

Potentially_Blocking has been superseded by Nonblocking.

Parallel Blocks 5th para

“contgrol” -> “control”

!wording

New paragraph after 5.5 (9/4)

This is out of order as 5.5 (7) follows.

“an global” -> “a global”

Modify to 5.5 (9/4)

“decreasing order {or unless” -> “decreasing order{, or unless”

“Examples after 5.5(20)” -> “Examples after 5.5(21)”

Modify 5.5.1 (6/3)

“decended” -> “descended”

Separate sentences for parallel iterator object and parallel reversible iterator
object would be more consistent wording.

Modify 5.5.2 (4/3)

“iterator_specificaiton” -> “iterator_specification”

Modify 5.5.2 (7/5)

“the nominal subtype of the loop parameter{s are} [is]” -> “the nominal subtype
of the loop parameter{s} is” – parameters may be plural but their subtype is
singular.

Modify 5.5.2 (10/3)

“the denoted iterator object{s} becomes” -> “the denoted iterator object{s}
become[s]”

“5.6.1 Parallel Block Statements -> Insert new paragraph 5.6.1 Parallel Block Statements

? Why the opening quotes, and isn’t all of this part an insertion?  Why the []
around the first paragraph?

Add new paragraph after 11.5 (20)

“elememnt” -> “element”

Modify A.18.2 (230.2/3)

Modify A.18.3 (144.2/3)

Modify A.18.6 (94.2/3)

Modify A.18.9 (113.2/3)

“notes” -> “nodes”

Modify A.18.2 (230.4/3)

Modify A.18.3 (144.4/3)

Modify A.18.6 (94.4/3)

Modify A.18.9 (113.4/3)

Modify A.18.10 (159/3)

“simulataneously” -> “simultaneously”

Modify A.18.10 (159/3)

“or or” -> “or”

**************************************************************

From: Brad Moore
Sent: Monday, January 1, 2018  12:33 AM

Thanks Jeff for your comments.

Since there are now quite a lot of changes being discussed compared to the
version of the AI that was sent out,

I have attached a new version of the AI. [This is version /06 of the AI - ED.]

This version includes the following changes;

1) Addressing Jeff's Comments.
2) Potentially Blocking operations are now disallowed in parallel loop and
   parallel block statements
3) Added Nonblocking aspect to the new routines where appropriate
4) Modified the Parallel Iterator interface to have Tuck's Split function idea,
   instead of the Jump function.
5) Added an Advised_Split parameter to the Split function so that the
   implementation can recommend the number of splits to be applied to the
   iterator, allowing container implementors to override the recommendation if
   desired.
6) Added function in System.Multiprocessors to query the implementation for a
   recommended split count, that can be called by User code if desired.
7) Added a package Ada.Discrete_Chunking, which basically contains a single
   Split function that returns a Chunk_Array that can be used for parallel
   iteration involving manual chunking of discrete types, similar to what is
   also possible using containers.

Happy New Year everyone!

**************************************************************

From: Tucker Taft
Sent: Tuesday, January 2, 2018  9:35 AM

> ...
>
> He also had the Cursor array being passed into Split as an out
> parameter. I think returning the Chunk_Array as a function result will be more
> convenient to users. ...

Convenience to users seems somewhat irrelevant in a routine that is intended to
be called by compiler-generated code.  Furthermore, the OUT parameter avoided
the need for doing dynamic allocation and deallocation on a secondary stack,
something which is painful in some environments, and not something you want to
require every time you do something with a container.   Finally, the size of the
passed-in array provided the specification of the number of chunks desired.

***************************************************************

From: Brad Moore
Sent: Tuesday, January 2, 2018  10:36 AM

> Convenience to users seems somewhat irrelevant in a routine that is
> intended to be called by compiler-generated code.

I agree that your original design made perfect sense for the original intent
outlined in the paper, which was as something called only by compiler-generated
code.

I say the newer design has improvements in the context of now also allowing
users to call the same routines, since there seems to be a number of cases where
that would be needed. Since the user can now be involved, more thought is needed
to making the interface more user-friendly.

> Furthermore, the OUT parameter avoided
> the need for doing dynamic allocation and deallocation on a secondary
> stack, something which is painful in some environments, and not
> something you want to require every time you do something with a container.

Keep in mind that this occurs only for parallel usage, and for parallel usage
for the parallelism to be worthwhile in the first place, I would think the
amount of processing needed to do the allocation and deallocation of the chunk
array would be insignificant compared to the amount of time needed to do the
processing on the elements in the container. How much pain is there to do such
allocations for such environments? I would think regardless how painful, it is
still something that generally needs to be supported in such environments for
other use cases.

>  Finally, the size of
> the passed-in array provided the specification of the number of chunks desired.

Yes, except I thought the idea was also that the container implementor might
want to override the number of desired chunks. As an out parameter, the
container writer is more tied to the number passed in by the implementor. For a
binary tree container for instance, maybe some power of two representing the top
nodes in the tree, makes for a better division of work, and that might be less
or more than the number of elements in the out parameter array. So long as the
number of chunks is less than or equal to the number of elements in the
container, it should work for parallel usage.

***************************************************************

From: Tucker Taft
Sent: Tuesday, January 2, 2018  12:22 PM

> I agree that your original design made perfect sense for the original
> intent outlined in the paper, which was as something called only by
> compiler-generated code.
>
> I say the newer design has improvements in the context of now also
> allowing users to call the same routines, since there seems to be a number
> of cases where that would be needed. Since the user can now be involved, more
> thought is needed to making the interface more user-friendly.

If that is really true, then I think we should consider making improvements on
the syntactic sugar, rather than spending energy trying to make the
compiler-oriented operations friendlier.

>  Furthermore, the OUT parameter avoided
>> the need for doing dynamic allocation and deallocation on a secondary
>> stack, something which is painful in some environments, and not
>> something you want to require every time you do something with a container.
>
> ...
> How much pain is there to do such allocations for such environments? I
> would think regardless how painful, it is still something that generally
> needs to be supported in such environments for other use cases.

I just don't think we want to insert this kind of unnecessary dynamic allocation
inside of compiler-generated code.  We could provide a "user-friendly" version
as well, though functions that return unconstrained arrays are not terribly
friendly, in that you often have to call them as part of a declaration, which
may require the insertion of yet another declare block in the middle of things.

>   Finally, the size of
>> the passed-in array provided the specification of the number of chunks desired.
>
> Yes, except I thought the idea was also that the container implementor
> might want to override the number of desired chunks. As an out
> parameter, the container writer is more tied to the number passed in
> by the implementor. For a binary tree container for instance, maybe
> some power of two representing the top nodes in the tree, makes for a better
> division of work, and that might be less or more than the number of elements
> in the out parameter array.

I agree that some chunks at the end might end up empty, but producing more
chunks than what the caller has specified as a "max" number of chunks seems
undesirable.

***************************************************************

From: Randy Brukardt
Sent: Wednesday, January 10, 2018  10:43 PM

...
> Since there are now quite a lot of changes being discussed compared to
> the version of the AI that was sent out,
>
> I have attached a new version of the AI.

Here's a list of editorial changes that I made to this version (there were
enough so I posted this as a separate version, version /07):

RM section/paragraph numbers don't include a space, and should not be prefixed
by anything (in particular, by "RM"). AARM section/paragraph numbers should
always be prefixed by "AARM". (This is part of the "Kiyoshi consistency rules",
and as such is not negotiable).

Only one space after a period ending a sentence. Text that Tuck wrote is almost
always wrong about this, but don't copy that style!

The rules for concurrent access includes a header and three bullets, and thus
would best be described as "9.10(11-14)".

The AI number of reduction expressions will be AI12-0242-1, so I plugged that
in.

The "rules for shared variables" are at a minimum 9.10(11-14); there is a
specific change to 9.10(13) being recommended, so it seems best to mention both.
"(see 9.10(11-14), and specifically 9.10(13))".

The headers for wording are inconsistent or missing in various places, and the
wording needs to be in paragraph order. Use either "Replace xx(xx)", "Modify
xx(xx)", "Add after xx(xx)", or "Add new subclause xx". Normally, I'd end all of
these with a colon (:) but there is some flexibility here as they're not
mentioned in the "Kiyoshi rules". But we don't want them to vary for different
paragraphs. One thing the "Kiyoshi" rules" does mention is that one never talks
about "paragraphs", one just says "add after".

Some specific issues:

The grammar change at 5.5(3/3) has to be a "Replace" since there are no
modification marks in the following text (and using those in grammar changes is
always questionable, since [] and {} have a syntax meaning).

There is no location at all specified for the Legality Rules. I added a header
to insert them after 5.5(6/5). (That seems like the best place, since we want
the basic definition of a "loop parameter" to occur before we start talking
about the rules for such things.)

The Legality Rules refer to specific paragraph numbers; that's not allowed in
wording.

In those rules: ... *parallel* reserve{d} word.

A change at 5.5(7) has to go before a change at 5.5(9/4). (Jeff noted this.)

"Modify [to] 5.5(9/4)" This section also has quote marks around the text for no
known reason.

The "with Ada.Containers" at 5.5.1(1/3) is not needed with the new
specification, since it doesn't reference anything from that package. I deleted
it.

I added a separate header for the added paragraph after 5.5.2(10/3), and deleted
the insertion marks (they're way too easy to miss). And there were no insertion
marks for the new AARM note.

Changed the header of 5.5.2(13/3) to: "Modify 5.5.2(13/3) [into three
paragraphs]" because that's not obvious to a causal reader (me).

"Section" is what we once might have called a chapter. These added things are
subclauses.

Some of the wording text has rather long lines. 80 character lines is generally
the maximum. The "Iterate" paragraphs are especially bad.

P.S. Brad, if you are going to make changes to this AI, please make sure you
have version /07 first. If it isn't posted yet, please ask me for a copy.

***************************************************************

From: Randy Brukardt
Sent: Wednesday, January 10, 2018  10:42 PM

...
> I have attached a new version of the AI.

Here are my substantive comments on this AI. (See the other message for the
editorial comments, all of which I've already applied.)

> Legality Rules, added after 5.5(6/5):

> A loop_statement with the *parallel* reserve word shall not update
> variables global to the loop unless either the action is sequential
> (see 9.10), or the expression for the update mentions the loop parameter.

An action being "sequential" is a dynamic concept (actually, it is defined for
"erroneous execution", but that is obviously a dynamic concept). We can't base
Legality Rules on dynamic concepts, because they cut through privacy and the
like (thus violating all of the good Ada principles).

I thought the idea was to use Global to define this properly for a Legality
Rule, essentially not allowing any call or other use of a global variable unless
it is "synchronized" (using the meaning of that from Global). Exactly how to
word this isn't obvious to me; perhaps there is some wording defining Global
that could be reused. (The wording would also need to allow the exception you
noted your wording above.)

> In that same wording:

> The sequence_of_statements of a loop_statement with the *parallel*
> reserve word shall not consist of a statement that can invoke a
> potentially blocking operation (see 9.5).

Wrong. "Potentially blocking" is a dynamic concept (there shortly will be an AI
on this mistake). You want to use the Legality Rules defined for Nonblocking
(specifically, 9.5(58-65/5). Unfortunately, there is no term specifically for
this (maybe there should have been). And Standard wording cannot use paragraph
numbers. So perhaps say something like:

The sequence_of_statements of a loop_statement with the *parallel* reserved word
are considered to be contained in a nonblocking program unit for the purposes of
Legality Rules (see 9.5).

AARM Ramification: This meaning that calling subprograms that allow blocking, or
directly blocking statements like an entry call, accept statement, or delay
statement, are not allowed in the body of a parallel loop. This is true even if
the surrounding unit allows blocking.

> Add after 5.5(7):

> When the reserved word parallel is present and a transfer of control
> out of the loop occurs, an attempt is made to cancel further parallel
> execution of the sequence_of_statements that has not yet started.

I thought the intent was that we'd try to cancel further parallel execution of
the sequence_of_statements that HAS started, and to prevent any further
iterations from starting. In any case, we need to be more specific about what
"cancel" means. (Is it "abort", with all of the baggage of that? I'd hope not,
but then precisely what is required to be canceled, and what *can* be
canceled?)

> Add after 5.5.2(10/3):

> For a parallel generalized iterator, the operation Iterations of the
> iterator type is called first to determine the number of iterations
> associated with the iterator.

I note in passing that this doesn't make any sense if we really add "when"
filters, as the only way to know that actual number of iterations is to
partially execute them (assuming the filter is an arbitrary expression). One
probably would have to do chunking based on the number of iterations expected
ignoring any filter, but that of course makes the filter essentially a lie (the
iteration will have to happen regardless of the filter value; only the executed
statements would change. Moreover, the chunking would often be suboptimal in
such a case).

> Modify 5.5.2(13/3) [into three paragraphs]

The new third paragraph is:

> If the loop parameter is a constant (see above), then the indexing
> uses the default constant indexing function for the type of the
> iterable container object for the loop; otherwise it uses the default variable
> indexing function.

The intent is that this applies to both forward/reverse container element
iterators and parallel container element iterators (and no others), but as a
separate paragraph following the parallel container element iterator this is not
obvious. Probably it needs a lead-in like:

For any kind of container element iterator, if the loop parameter is ...

>New subclause 5.5.3 has:

> generic
>    type Loop_Index is (<>);
> package Ada.Discrete_Chunking is
>
>    pragma Preelaborate;
>    pragma Remote_Types;

Use aspects, not pragmas, in new packages. (Otherwise, I'd have to complain that
the unit name was not repeated as it is in all of the other language-defined
packages.) So:

generic
   type Loop_Index is (<>);
package Ada.Discrete_Chunking
   with Preelaborate, Remote_Types is

> New subclause 5.6.1.

The Legality Rules here need reworking just like the ones in the loop cases.

For some reason, the execution of a parallel block is described in a "Static
Semantics" section. Put a "Dynamic Semantics" header in there somewhere!

> Add after 11.5(20)
>
> Loop_Parameter_Check
>  For a parallel loop, check that updates to a component of an array or
>  to an  element of a container via the loop parameter are sequential. See
>  9.10(11).

This implies that there is a dynamic check somewhere. But there is no such check
defined in 5.5 or anywhere else. (This is the pragma Suppress definition, but
the actual check has to be defined, too, in particular as to what exception is
raised. Look up any check in the RM index to see the appropriate form of the
check wording.)

> Add after D.16(5/3)
>
>   type Iteration_Count is range 0 .. implementation-defined;
>   for Iteration_Count'Size use implementation-defined;
>
>   type Split_Count is range 0 .. implementation-defined;
>
>   function Advised_Split_Count
>     (Iterations :  Iteration_Count) return Split_Count
>       witn Non_Blocking => False;

I don't see the point of the Iterations parameter here. An appropriate split
would take an estimate as to how expensive the loop body (or block body, for
that matter) is as well as the number of iterations. As such, it doesn't seem
sensible to have such a function here (in package Multiprocessing). What seems
appropriate here is a recommended number of executors; I'd suggest something
like:

   function Advised_Executors return CPU_Range
       witn Non_Blocking => False;

I used CPU_Range because it doesn't make much sense to have more executors than
execution resources (processors). (Getting rid of the types simplifies this a
lot.)

Additionally, it might make sense to specify the task and/or CPU that is
involved (as I can imagine the answer being different for different tasks). One
hopes that you could call this for some other task as part of setup (and not
just be restricted to querying this for yourself). However, since
System.Multiprocessors seems to be designed to sit below the level of tasks (it
has no task-based queries). So perhaps the more specific version belongs in
Dispatching_Domains (along with Get_CPU).


   function Advised_Executors (T   : Ada.Task_Identification.Task_Id :=
                 Ada.Task_Identification.Current_Task) return CPU_Range
       witn Non_Blocking => False;


That's enough for tonight. :-)

***************************************************************

From: Tucker Taft
Sent: Saturday, January 6, 2018  11:16 AM

Here is some unsolicited "rah, rah" from the sidelines.  I think this shows
that the parallel and reduction features are worth all the effort, and
back-and-forths, we are investing in them.

Begin forwarded message:

From: "Mcbeth, FletcherX"
Subject: Would love to see tasklets, parallel loops and parallel arrays/reductions thereof within Ada 2020
Date: January 5, 2018 at 8:41:06 PM EST

Hello Tucker,

My name is Fletcher McBeth.  I am working with the HPC group here at Intel
Hudson and would very much enjoy receiving any and all updates you may be
aware of towards getting your proposed constructs describing tasklets,
parallel loops and parallel arrays/reductions thereof into the next version
of Ada (hopefully 2020).

Thank you in advance.

***************************************************************

From: Brad Moore
Sent: Wednesday, March 28, 2018  12:03 AM

Here is an update to AI12-0119. [This is version /08 - ED].

It is now much smaller, with manual chunking, container iterators, and
data-race/deadlocking specific items moved into new separate AI's. Otherwise,
the text hasn't changed much except addressed a few typos that were mentioned
at the last teleconference call.

***************************************************************

From: Tucker Taft
Sent: Wednesday, March 28, 2018  9:10 PM

Thanks, Brad.  As I mentioned in my comments on one of the other AIs, probably
should remove the no-data-race, no blocking from this AI, since it is covered
by a separate one.

Separating out these pieces has really helped, and I think the wording is
starting to converge.  However, I was struck by this bit of wording:

  "A parallel block statement requests that two or more sequences of
   statements should execute in parallel with each other."

I find it a bit odd to describe a construct as "requesting" something.

We generally just try to describe what a construct does, as opposes to what it
"requests."

I would suggest you take a look at the introduction to chapter 9.  I suspect
we will need to introduce the notion of "tasklet" somewhere.  I am thinking
that perhaps all of these things should be described somewhere in chapter 9,
even if some of the syntax appears earlier.  If we don't coordinate the
definition of parallel block and tasklet with the introduction to tasks in
chapter 9, the result is probably going to be very confusing, since chapter 9
is where we talk about threads of control.  So I think that argues for having
a separate section somewhere in chapter 9 on tasklets, and in that section
introduce parallel blocks and loops.  And the introduction to chapter 9 will
need to set the stage for the discussion on tasklets, even if we don't go into
the details until a bit later in a section devoted to tasklets.

***************************************************************

From: Randy Brukardt
Sent: Thursday, March 29, 2018  6:15 PM

I'm not sure I agree. He does define the tasklet semantics in chapter 9 (in
9.12, specifically). It probably wouldn't hurt to mention that in the
introduction to chapter 9, too. [Oops, "Clause 9" if we're following current
ISO-speak. :-)]

OTOH, I don't think we want to hide the actual parallel blocks and loops in
Clause 9. Everyone "knows" that writing programs with tasks is hard, and
putting the parallel stuff there says that the same is true of them. Putting
them with the basic statements signals that these are "easy" (there's nothing
easier in Ada than Chapter 5!). Yes, that's marketing more than any major
semantic consideration -- but we all know that marketing and appearances
matter. That does require deferring some of the rules until the tasklet
subclause, and/or forward references. But it would seem very odd to have
basic blocking and looping constructs away from the rest of those constructs.
(And we hope/expect that parallel programming will get to be very basic during
the lifetime of Ada 2020.)

In some sense, this is very similar to your argument that eventually compilers
will handle a lot of parallelism under the covers. A parallel block or loop
itself is just an assertion that the block or loop can be executed in parallel
(without regard to whether that "works" in an as-if optimization, but possibly
with checking that it is OK). Otherwise, they are just another control
structure.

So I think the current organization is approximately right (details to be
determined). This obviously is a "squishy" argument and this organization is
not a big deal -- but we need to consider how this will look in 30 years. We
don't want to hide basic control flow in an otherwise little-used clause
(explicit tasks would be rare in a world where a lot of parallelism is applied
automatically or by simple declarations).

***************************************************************

From: Jeff Cousins
Sent: Tuesday, April 3, 2018  6:19 AM

We agreed to send these offline.

!problem 2nd para
“independent processing where” -> “independent processing, whereas” would
seem better.

Typos – “perforamance” -> “performance”, “mainain” -> “maintain”.

Add after 5.5(7), 5.6.1 Static Semantics 2nd para

It seems odd too talk of cancelling something that hasn’t yet started, maybe
“inhibit” instead?

5.6.1 Static Semantics

It seems odd to describe transfer of control under Static Semantics rather
than Dynamic Semantics – after all, the equivalent wording for parallel loops
is under Dynamic Semantics.

Modify 9.10(13)

Typo – “part of a;” -> “part of a:”.

***************************************************************

From: Tucker Taft
Sent: Monday, April 2, 2018  3:06 PM

Here is an attempt to revise the introduction to "Tasks and Synchronization" to
incorporate mention of "parallel constructs" without being more specific than
talking about "multiple logical threads of control."

Comments welcome.  The vocabulary of this introduction will probably form the
basis for the descriptions of the semantics of parallel constructs, so it seems
important to agree on it first.

I purposely did not show the actual edits, so you can just read it to see if it
makes sense.  The number of edits is actually relatively small.

----------

Modify section 9, Tasks and Synchronization:

The execution of an Ada program consists of the execution of one or more tasks.
Each task represents a separable activity that proceeds independently and
concurrently between the points where it interacts with other tasks.  A single
task, when in the context of a parallel construct, can represent multiple
logical threads of control which can proceed in parallel.

The various forms of task interaction are described in this clause, and include:

1.a
To be honest: The execution of an Ada program consists of the execution of one
or more partitions (see 10.2), each of which in turn consists of the execution
of an environment task and zero or more subtasks.

2
the activation and termination of a task;

3
a call on a protected subprogram of a protected object, providing exclusive
read-write access, or concurrent read-only access to shared data;

4
a call on an entry, either of another task, allowing for synchronous
communication with that task, or of a protected object, allowing for
asynchronous communication with one or more other tasks using that same
protected object;

5
a timed operation, including a simple delay statement, a timed entry call or
accept, or a timed asynchronous select statement (see next item);

6
an asynchronous transfer of control as part of an asynchronous select statement,
where a task stops what it is doing and begins execution at a different point in
response to the completion of an entry call or the expiration of a delay;

7
an abort statement, allowing one task to cause the termination of another task.

8
In addition, tasks can communicate indirectly by reading and updating
(unprotected) shared variables, presuming the access is properly synchronized
through some other kind of task interaction.

Static Semantics
9
The properties of a task are defined by a corresponding task declaration and
task_body, which together define a program unit called a task unit.

Dynamic Semantics
10
Over time, tasks proceed through various states. A task is initially inactive;
upon activation, and prior to its termination it is either blocked (as part of
some task interaction) or ready to run. While ready, a task competes for the
available execution resources that it requires to run.  In the context of a
parallel construct, a single task can utilize multiple processing resources
simultaneously.

10.a/3
Discussion: {AI05-0229-1} The means for selecting which of the ready tasks to
run, given the currently available execution resources, is determined by the
task dispatching policy in effect, which is generally implementation defined,
but may be controlled by aspects, pragmas, and operations defined in the
Real-Time Annex (see D.2 and D.5).

NOTES
11
1  Concurrent execution may be implemented on multicomputers, multiprocessors,
   or with interleaved execution on a single physical processor. On the other
   hand, whenever an implementation can determine that the required semantic
   effects can be achieved when parts of the execution of a single logical
   thread of control are performed by multiple physical processors acting in
   parallel, it may choose to perform them in this way.


***************************************************************

From: Randy Brukardt
Sent: Monday, April 2, 2018  11:01 PM

> I purposely did not show the actual edits, so you can just read it to
> see if it makes sense.  The number of edits is actually relatively
> small.

As far as I can tell, you only added one sentence to the end of the first
paragraph. I couldn't find anything else changed (but I didn't try a
word-for-word comparison). Based on your earlier discussion, I was expecting
more than that.

> Modify section 9, Tasks and Synchronization:
>
> The execution of an Ada program consists of the execution of one or
> more tasks. Each task represents a separable activity that proceeds
> independently and concurrently between the points where it interacts
> with other tasks.  A single task, when in the context of a parallel
> construct, can represent multiple logical threads of control which can
> proceed in parallel.

This one new sentence is fine. But it seems strange that "threads of control"
are never mentioned again. I realize that so long as we don't allow blocking in
parallel constructs, there isn't much overlap -- but that makes me wonder why
you were so concerned about rewriting this. Ergo, I have to think I'm missing
something (or you are).

***************************************************************

From: Tucker Taft
Sent: Tuesday, April 3, 2018  8:15 AM

The changes were subtle, and it is interesting that you didn't notice them.  The
first change was to describe a task as a "separable activity" rather than as a
"logical thread of control."  Now that it can represent multiple threads of
control, we needed a different way to describe a task -- hence, "separable
activity."  Furthermore, the term "task" is omitted completely from the note
about using multiple physical processors, and the term "logical thread of
control" appears there as the thing that can be implemented using multiple
physical processors.

So the key shift here is that parallel constructs result in additional logical
threads of control, and that a single task can represent multiple logical
threads of control.

...
> As far as I can tell, you only added one sentence to the end of the
> first paragraph. I couldn't find anything else changed (but I didn't
> try a word-for-word comparison). Based on your earlier discussion, I
> was expecting more than that.

Here is one additional sentence, at the end of paragraph (10):

"...  In the context of a parallel construct, a single task can utilize multiple
processing resources simultaneously. "


...
> This one new sentence is fine. But it seems strange that "threads of
> control" are never mentioned again.

Threads of control are mentioned in the Note, and the term will presumably
appear elsewhere when we talk about the dynamic semantics of parallel
constructs.

> I realize that so long as we don't allow blocking in parallel
> constructs, there isn't much overlap -- but that makes me wonder why
> you were so concerned about rewriting this. Ergo, I have to think I'm
> missing something (or you are).

There is a shift in vocabulary, which is subtle but important in my view.  A
task is no longer considered a single logical thread of control.  Instead,
parallel constructs can result in multiple logical threads of control in a
single task, while a task is considered, as mentioned above, a "separable
activity."

***************************************************************

From: Tullio Vardanega
Sent: Tuesday, April 3, 2018  11:42 AM

I can see the angle in the new opening paragraph, and the taxonomy that it
implies. A task is a program entity that represents an activity whose execution
can progress independently of (hence, separable from) its outside environment.
That progress is realized by logical threads of control that can be assigned to
physical processors. In the context of a parallel construct, a single separable
activity may be assigned as many independent logical threads of control as the
available physical processors.

***************************************************************

From: John Barnes
Sent: Tuesday, April 3, 2018  11:58 AM

Hmm.

I haven't been following this in detail but I can see that (deo volente) I might
have to rewrite my wretched book some day.

I am not sure about

In the context of a parallel construct, a single task can utilize multiple
processing resources simultaneously.

I could read this to mean that the task is inside some parallel construct rather
than the other way around.

***************************************************************

From: Tucker Taft
Sent: Tuesday, April 3, 2018  12:10 PM

How about:

While the body of a task is executing a parallel construct, the task can utilize
multiple processing resources simultaneously.

***************************************************************

From: John Barnes
Sent: Tuesday, April 3, 2018  4:07 PM

That would be crystal clear.

***************************************************************

From: Randy Brukardt
Sent: Tuesday, April 3, 2018  6:13 PM

> The changes were subtle, and it is interesting that you didn't notice
> them.  The first change was to describe a task as a "separable
> activity" rather than as a "logical thread of control."  Now that it
> can represent multiple threads of control, we needed a different way
> to describe a task -- hence, "separable activity."  Furthermore, the
> term "task" is omitted completely from the note about using multiple
> physical processors, and the term "logical thread of control"
> appears there as the thing that can be implemented using multiple
> physical processors.

I see. So the big deal is to redefine tasks (I didn't pay much attention to them
as they aren't changed, at least in how we think of them). I wouldn't have
expected that is necessary, but it's clear if you want to do that, you need to
do that immediately and not as some sort of bolt on.

In any case, what you had read well and made sense. Indeed, I think it makes
more sense even for tasks themselves than the original wording (which probably
was overspecified) did.

***************************************************************

From: Tucker Taft
Sent: Tuesday, April 3, 2018  9:12 PM

> That would be crystal clear.

Probably should adjust the last sentence in paragraph (1) similarly:

"... A single task, when in the context of a parallel construct, can represent
multiple logical threads of control which can proceed in parallel."

Might want to be instead:

"... A single task, when executing a parallel construct, can represent multiple
logical threads of control that can proceed in parallel."

***************************************************************

From: Tucker Taft
Sent: Wednesday, April 4, 2018  2:11 PM

Indeed - something like that.

***************************************************************

From: Tucker Taft
Sent: Sunday, June 10, 2018  8:49 PM

Here is a significant revision of the Parallel Operations AI. [This is version
/09 of the AI - Editor.] I dropped the use of terms like "tasklet" and
"executor" and used "logical thread of control" more widely.  I tried to
simplify the rules as far as I could, to avoid over-specification, and
because simple is usually good.  Comments welcome!

Note that the !proposal is essentially unchanged, and only a bit was added to
the !discussion.  So I recommend folks start with the revised !wording rather
than wading through the other material which really hasn't changed much.

***************************************************************

From: Randy Brukardt
Sent: Monday, June 11, 2018  8:41 PM

> Here is a significant revision of the Parallel Operations AI.

As usual, I fixed a number of typos in it while posting: a number of colons were
missing (you added many that were missing, but missed a few), a couple of
spelling errors (like "modfiy"), and I got rid of the extra spaces after periods
that some of us were taught to include on a typewriter, but never have been used
in AIs.

> I dropped the use of terms like "tasklet" and "executor" and used
> "logical thread of control" more widely.

The wording wasn't long enough already? ;-)

I guess I would have expected to use some short term for this, as it is going to
need to be frequently referenced in AIs, discussions, and (especially)
educational material. I'm already exhausted of writing "logical thread of
control", and there isn't even a pronouncable acronym for it (LToC isn't that).
There's 18 occurrences in this AI, and I presume there will be more in the
related AIs.

I realize that it is clearer in some sense, 'cause there's no chance of
confusion, as there might be just calling it a "thread". Maybe just shorten it
to "logical thread" except in 9(1)?

In any case, if we're not going to use "tasklet", the !proposal needs to be
rewritten to purge the term. It ought not be used outside of the "reject
options" discussion (which you already have).

> I tried to simplify the rules as far as I could, to avoid
> over-specification, and because simple is usually good.
> Comments welcome!

Overall the wording looks good. I didn't see anything that didn't make sense,
but I didn't try very hard to look for omissions.

> Note that the !proposal is essentially unchanged, and only a bit was
> added to the !discussion.  So I recommend folks start with the revised
> !wording rather than wading through the other material which really
> hasn't changed much.

I usually use a file compare to read updated AIs, not always the best idea but
it works great when a lot of the discussion material is little changed.

> We don't define such rules in this AI.

Shouldn't we at least mention that such rules are proposed in another AI?
(AI12-0267-1 in particular).

It might also make sense to mention the reduction expression AIs (AI12-0242-1,
AI12-0262-1) in the discussion on that topic (rather than just leading the
reader hanging).

***************************************************************

From: Tucker Taft
Sent: Tuesday, June 12, 2018  8:34 AM

> Here is a significant revision of the Parallel Operations AI.
>
> As usual, I fixed a number of typos in it while posting: a number of
> colons were missing (you added many that were missing, but missed a
> few), a couple of spelling errors (like "modfiy"), and I got rid of
> the extra spaces after periods that some of us were taught to include
> on a typewriter, but never have been used in AIs.

Good luck taking that second space out of my fingers' typing reflexes...

>> I dropped the use of terms like "tasklet" and "executor" and used
>> "logical thread of control" more widely.
>
> The wording wasn't long enough already? ;-)
>
> I guess I would have expected to use some short term for this, as it
> is going to need to be frequently referenced in AIs, discussions, and
> (especially) educational material. I'm already exhausted of writing
> "logical thread of control", and there isn't even a pronouncable
> acronym for it (LToC isn't that). There's 18 occurrences in this AI,
> and I presume there will be more in the related AIs.

For now, I think it is worth being explicit.  Once we all agree on the term,
we can consider an "official" short hand.

> I realize that it is clearer in some sense, 'cause there's no chance
> of confusion, as there might be just calling it a "thread". Maybe just
> shorten it to "logical thread" except in 9(1)?

That seems reasonable, but again, I'd rather wait on that until we get a
chance to review this AI together.

> In any case, if we're not going to use "tasklet", the !proposal needs
> to be rewritten to purge the term. It ought not be used outside of the
> "reject options" discussion (which you already have).

That makes sense.  Just not a priority at the moment until we all agree on
the new terminology.

>> I tried to simplify the rules as far as I could, to avoid
>> over-specification, and because simple is usually good.
>> Comments welcome!
>
> Overall the wording looks good. I didn't see anything that didn't make
> sense, but I didn't try very hard to look for omissions.

Very glad to hear that.  Thanks for the review.  I hope others will have time
to review it as well before the meeting...

>> Note that the !proposal is essentially unchanged, and only a bit was
>> added to the !discussion.  So I recommend folks start with the
>> revised !wording rather than wading through the other material which
>> really hasn't changed much.
>
> I usually use a file compare to read updated AIs, not always the best
> idea but it works great when a lot of the discussion material is
> little changed.

Unfortunately, I often end up re-formatting paragraphs a bit to make
line-lengths reasonable and most file compares aren't good at handling
that. The side-by-side comparison used by "gerrit" seems pretty good at
that -- I don't know if there are other such tools available for more
stand-alone use.

>> We don't define such rules in this AI.
>
> Shouldn't we at least mention that such rules are proposed in another AI?
> (AI12-0267-1 in particular).

I suppose, but I am still working on that one, and predicting what it will
look like is a bit difficult until I at least have a first draft!

> It might also make sense to mention the reduction expression AIs
> (AI12-0242-1, AI12-0262-1) in the discussion on that topic (rather
> than just leading the reader hanging).

I suppose.  Again, it just wasn't a priority yet.

***************************************************************

From: Randy Brukardt
Sent: Tuesday, June 12, 2018  6:24 PM

...
> > As usual, I fixed a number of typos in it while posting: a number of
> > colons were missing (you added many that were missing, but missed a
> > few), a couple of spelling errors (like "modfiy"), and I got rid of
> > the extra spaces after periods that some of us were taught to
> > include on a typewriter, but never have been used in AIs.
>
> Good luck taking that second space out of my fingers' typing
> reflexes...

I've pretty much given up on it, but since I was mentioning changes "for
the record", I figured I better mention that.

...
> > In any case, if we're not going to use "tasklet", the !proposal
> > needs to be rewritten to purge the term. It ought not be used
> > outside of the "reject options" discussion (which you already have).
>
> That makes sense.  Just not a priority at the moment until we all
> agree on the new terminology.

Understood. However, I'd hope that we can wrap up this AI at our upcoming
meeting (since many other things depend on it, it's hard to make progress
without it being done -- and it appears mostly done to me), which would put
that update into the hands of our overworked and overbudget editor. Which I'd
like to avoid. :-)

...
> >> Note that the !proposal is essentially unchanged, and only a bit
> >> was added to the !discussion.  So I recommend folks start with the
> >> revised !wording rather than wading through the other material
> >> which really hasn't changed much.
> >
> > I usually use a file compare to read updated AIs, not always the
> > best idea but it works great when a lot of the discussion material
> > is little changed.
>
> Unfortunately, I often end up re-formatting paragraphs a bit to make
> line-lengths reasonable and most file compares aren't
> good at handling that.   The side-by-side comparison used by
> "gerrit" seems pretty good at that -- I don't know if there are other
> such tools available for more stand-alone use.

I tend to reformat both sides to match when I do that -- which is why the
result often ends up with different line breaks than you sent. (If I have
to edit something, I tend to start with the reformatted version and not
your original text.)

> >> We don't define such rules in this AI.
> >
> > Shouldn't we at least mention that such rules are proposed in another AI?
> > (AI12-0267-1 in particular).
>
> I suppose, but I am still working on that one, and predicting what it
> will look like is a bit difficult until I at least have a first draft!

Sure, but it's the cross reference that's valuable, not the details.

> > It might also make sense to mention the reduction expression AIs
> > (AI12-0242-1, AI12-0262-1) in the discussion on that topic (rather
> > than just leading the reader hanging).
>
> I suppose.  Again, it just wasn't a priority yet.

Same here.

***************************************************************

From: Randy Brukardt
Sent: Tuesday, June 12, 2018  7:40 PM

I just noticed that you have bounded error text like:

  During the execution of one of the iterations of a parallel loop, it
  is a bounded error to invoke an operation that is potentially blocking
  (see 9.5).

But the consequences of the Bounded Error are missing here. We always have
to say what happens with a bounded error. We need a sentence like:

If the bounded error is detected, Program_Error is raised. If not detected,
blocking of the logical thread of control may occur as defined elsewhere in
this standard, possibly leading to deadlock.

The second bounded error needs some similar text.

***************************************************************

From: Tucker Taft
Sent: Saturday, June 23, 2018  6:31 PM

Here is an update. [This is version /10 of the AI - Editor.]

***************************************************************

Questions? Ask the ACAA Technical Agent