CVS difference for ai12s/ai12-0119-1.txt

Differences between 1.6 and version 1.7

--- ai12s/ai12-0119-1.txt	2017/08/08 03:14:54	1.6
+++ ai12s/ai12-0119-1.txt	2017/08/31 04:10:03	1.7
@@ -4179,3 +4179,862 @@
 useful at all.
 
 ****************************************************************
+
+From: Tucker Taft
+Sent: Monday, August 14, 2017  8:46 AM
+
+> ...
+> With this sort of structure, I don't really see any issue with having
+> exit work deterministically. The only requirement is to ensure that
+> any iterations to the left of the iteration with the exit (that is, with lower
+> indexes) run to completion. There certainly isn't any requirement to
+> start new tasklets after an exit; you just have to assign chunks to
+> tasklets in a left-to-right order (and it would be a lot of extra work
+> to do anything else). Thus, if an exit occurs, any chunk not ever
+> assigned to a tasklet can be immediately discarded (it necessarily
+> comes after the exit). Only tasklets logically before the exit would
+> need to continue. ...
+
+I am not convinced we should imply there is any ordering of iterations in a
+parallel loop. This just adds complexity to the implementation with little
+benefit in my mind.  If you have a “many-core” machine, then it is quite
+possible that all of the iterations (or at least all of the chunks) get going
+simultaneously, so what value is there to promising ordering in some particular
+cases?  As I suggested in an earlier e-mail, if you really want the
+lowest-indexed element that satisfies your predicate, then you wouldn’t want to
+use “normal” chunking, but rather you would want to bias things to have multiple
+cores looking at the early part of the range first.  In any case, are there
+really that many cases where you want the lowest-indexed item?  More likely in
+my view is that you would either want any item, or you would want all items that
+satisfy the predicate.  If you want all items, then of course “exit” would not
+be used, and you would build up a list of all desired items.  If you want any
+item, then the first one you find should “win”.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Monday, August 14, 2017  9:23 PM
+
+At the risk of continuing to beat this same horse...
+
+> ... If you
+> have a "many-core" machine, then it is quite possible that all of the
+> iterations (or at least all of the chunks) get going simultaneously,
+> so what value is there to promising ordering in some particular cases?
+
+The benefit is obvious: loop algorithms work as they are written. No mental
+gymnastics needed.
+
+> ...  As I suggested in an
+> earlier e-mail, if you really want the lowest-indexed element that
+> satisfies your predicate, then you wouldn't want to use "normal"
+> chunking, but rather you would want to bias things to have multiple
+> cores looking at the early part of the range first.
+
+That was a terrible idea:
+(1) It inverts the looping algorithm, obscuring the actual operative code;
+(2) It forces the use of complex code to get parallel operation, defeating
+    the purpose of making getting better performance easy;
+(3) It assumes a particular chunking mechanism, which makes the code much less
+    portable (at least if performance matters).
+
+To elaborate on (3): A good compiler will tailor the chunking to the iteration
+being executed (specifically whether it is balanced so the iterations probably
+take the same amount of time, or if they are obviously unbalanced, and whether
+or not there are any explicit exits), the hardware (the number of cores and how
+they are mapped to Ada tasks), and possibly do that at runtime (especially as
+the tasks that are running aren't known until the moment the loop is launched).
+An algorithm using what Brad called "manual chunking" is going to end up
+fighting that compiler-generated chunking, and that is going to scrub off
+performance. You'd be better off using a library like Paraffin to implement such
+a loop -- but in that case, there is no longer enough value to the syntactic
+parallel loop to bother with defining it.
+
+But an algorithm like the one you propose makes a lot of sense as the
+implementation generated by the *compiler* when the code contains explicit
+exits. It can do that with all of the knowledge needed and adding no portability
+issues; there's surely no requirement that a chunk run consecutive iterations!
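+
+For instance (just a sketch, with hypothetical First, Last, and Predicate):
+the generated code could give each of N workers every N-th index, starting at
+the low end, so that all of the cores examine early indices first:
+
+   Workers : constant := 4;                -- assumed core count
+   First_Match : array (0 .. Workers - 1) of Integer :=
+      (others => Integer'Last);            -- per-worker result
+
+   task type Searcher (Id : Natural);      -- Id in 0 .. Workers - 1
+
+   task body Searcher is
+      I : Integer := First + Id;           -- start near the low end
+   begin
+      while I <= Last loop
+         if Predicate (I) then
+            First_Match (Id) := I;         -- lowest match on this stride
+            exit;
+         end if;
+         I := I + Workers;                 -- stride to this worker's next index
+      end loop;
+   end Searcher;
+
+After the workers complete, the minimum over First_Match is the lowest-indexed
+element satisfying the predicate -- determined without assigning consecutive
+iterations to any chunk.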
+
+> ... In any case, are there really that many cases where you want the
+> lowest-indexed item?
+
+Yes, this is pretty much the only thing I search for. Usually, I need to be able
+to restart the search if the found item eventually turns out to not work for
+some complex reason. That would be difficult if some random result was returned.
+
+> More likely in my view is
+> that you would either want any item, or you would want all items that
+> satisfy the predicate.  If you want all items, then of course "exit"
+> would not be used, and you would build up a list of all desired items.
+
+Which you can't do with a parallel loop; you have to use a reduction expression
+for that.
+
+> If you want any item, then the first you find should "win".
+
+The only case where it makes sense to me to want "any item" is when you are
+looking for the absence of something (in such a case there shouldn't be any
+match). That's pretty rare in my view.
+
+I still think that "exit" should produce a deterministic answer; there's no
+problem if you don't need a deterministic answer -- don't use "exit" in that
+case. The goal here is to make it as easy as possible to write a parallel loop.
+If you have to know lots of unusual patterns in order to get anything to work,
+then only experts will be able to write parallel code. But that's already true;
+it's not the least bit hard for an Ada real-time expert to write parallel
+algorithms (as Brad has proved). The whole point of these features is to bring
+these programming capabilities to the rest of us. If we fail to do that, then
+these features become essentially useless (and a massive amount of work at
+that).
+
+****************************************************************
+
+From: Brad Moore
+Sent: Tuesday, August 15, 2017  10:12 AM
+
+> I want to step back a minute and think a bit about the uses for these
+> constructs and our actual goals for them. I'm getting concerned that
+> there can be no real uses of the parallel loop construct.
+
+I have been thinking about this also, and I believe we should probably
+consider dropping parallel loop syntax from the proposal.
+
+The discussion about find-first, find-last, and preemptive exits has shown
+that the anonymous loop body AI coupled with a parallelism library call is a
+better solution to these problems than the parallel loop syntax solutions we
+have been discussing.
+
+Once you come to that realization, a bit of a slippery slope opens up: if you
+need parallelism libraries to solve certain problems, they are probably also
+the best way to solve other manual chunking problems.
+
+This includes what I was calling the "parallel all" loops, where an outer
+sequential loop encloses an inner nested loop that is to be executed in
+parallel. The best performance for these sorts of problems, of which I have
+encountered several, comes from using manual chunking to push the parallelism
+out to a new outermost loop that encloses the sequential loop, with barriers
+to transition between the parts of the algorithm that need to run sequentially
+and the parts that need to run in parallel.
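+
+The shape of such a solution, roughly (just a sketch, using the existing
+Ada.Synchronous_Barriers package; Update, Advance, the step count, and the
+hand-picked chunk bounds are all placeholders):
+
+with Ada.Synchronous_Barriers; use Ada.Synchronous_Barriers;
+
+procedure Barrier_Sketch is
+   Workers : constant := 4;                    -- assumed core count
+   Steps   : constant := 1_000;                -- sequential time steps
+   Barrier : Synchronous_Barrier (Workers);    -- releases when all arrive
+
+   procedure Update (I : Positive) is null;    -- stand-in parallel work
+   procedure Advance is null;                  -- stand-in sequential work
+
+   task type Worker (First, Last : Positive);  -- this worker's chunk
+
+   task body Worker is
+      Notified : Boolean;
+   begin
+      for T in 1 .. Steps loop
+         for I in First .. Last loop           -- parallel phase
+            Update (I);
+         end loop;
+         Wait_For_Release (Barrier, Notified); -- wait for all chunks
+         if Notified then                      -- exactly one task is notified
+            Advance;                           -- sequential phase, run once
+         end if;
+         Wait_For_Release (Barrier, Notified); -- hold the rest until done
+      end loop;
+   end Worker;
+
+   W1 : Worker (1, 25);     -- manual chunking: bounds fixed by hand
+   W2 : Worker (26, 50);
+   W3 : Worker (51, 75);
+   W4 : Worker (76, 100);
+begin
+   null;   -- the environment task just awaits worker completion
+end Barrier_Sketch;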
+
+It felt admittedly a bit like trying to force a square peg into a round hole to
+graft the "parallel all" syntax onto the parallel loop syntax.
+
+If you take all the manual chunking problems and reduction problems out of
+the picture, what's left?
+
+Not much: I would say easily more than half of the parallel loop problems I
+have encountered that are not reductions are of the manual chunking variety.
+
+I have only two representative problems left that would fit under parallel loop
+syntax.
+
+  1. Find any
+  2. Combining arrays/containers
+
+The Find any problems are the ones that Tucker was talking about where a
+preemptive exit would be needed to break out of the loop.
+
+I think these sorts of problems are actually better served by quantified
+expressions, which, as you may recall, are a form of reduction expression.
+
+A quantified expression is a Boolean expression that a compiler should stop
+evaluating once it has found a case that determines the value of the
+expression. This also applies to parallel quantified expressions. In this
+case, the programmer doesn't even need to write an exit statement; it is
+expected that the compiler will do this implicitly.
+
+Consider determining if a number is a prime number.
+
+Is_Prime : constant Boolean :=
+   (parallel for all I in 2 .. Integer (Sqrt (Float (N))) => N mod I /= 0);
+
+Whether the parallel keyword is present or not, a compiler should stop further
+evaluation once it finds a factor of N.
+
+Sometimes you want a non-Boolean value as the result of what you are looking
+for. A quantified expression works here also, except that a side effect of the
+function is needed to store the result. As an output, you still need a Boolean
+result as well, to indicate whether you found what you were looking for.
+
+Consider finding any prime value within a range of integers.
+
+Prime_Number : Integer := 0 with Atomic;
+
+function Test_And_Store_Prime (N : Integer) return Boolean is
+begin
+  if Is_Prime (N) then
+     Prime_Number := N;
+     return True;
+  else
+     return False;
+  end if;
+end Test_And_Store_Prime;
+
+Prime_Found : constant Boolean :=
+    (parallel for all I in Start .. Finish => Test_And_Store_Prime (I));
+
+if Prime_Found then
+   Put_Line (Integer'Image (Prime_Number) & " is a prime number");
+end if;
+
+There will be some naysayers who dislike seeing a side-effect in a quantified
+expression, but I think for this particular problem it feels good enough to use
+here.
+
+That leaves combining arrays/containers; however, those can be written as a
+reduction expression also.
+
+Array_Result : array (ArrA'First .. ArrA'Last) of Integer :=
+   (for I in ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I)));
+
+where concatenation is the reducing operator.
+
+Perhaps this looks inefficient, though who is to say how the compiler might
+optimize it. It is otherwise a simple enough expression to read or write, I
+think.
+
+A more visually appealing way to do this would be to allow the keyword parallel
+with the new array aggregate syntax, which eliminates the concatenation
+semantics.
+
+Array_Result : array (ArrA'First .. ArrA'Last) of Integer :=
+    (parallel for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I)));
+
+See RM 4.3.3(5.1/5) for the array initialization syntax.
+
+I have not yet been able to think of any other cases where parallel loop
+syntax would be needed, so it seems to me that we should not introduce it,
+given that there are no use cases for parallelism problems that can't be
+better solved with reduction expressions or with anonymous loop body syntax
+combined with parallelism library calls.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, August 16, 2017  12:22 AM
+
+...
+> The discussion about find-first, find-last, and preemptive exits has
+> shown that the anonymous loop body AI coupled with a parallelism
+> library call is a better solution to these problems than the parallel
+> loop syntax solutions we have been discussing.
+
+There are a couple of problems with using the "anonymous loop body" AI in this way:
+(1) Without some additional mechanism, we'd lose the safety checks of the
+    parallel loop. They don't make any sense in the context of the "anonymous
+    loop body" proposal per-se. So I think we'd still need a "parallel" keyword
+    or something like that to trigger the safety checks. (Without them, parallel
+    operations are just a giant race condition; no human programmer could avoid
+    them.)
+(2) There's no certainty that the "anonymous loop body" proposal even makes it
+    into the Standard. Several people have expressed strong reservations about
+    that proposal.
+
+As an aside, I'd like to note that the "exit" problem with the "anonymous loop
+body" proposal is in fact very similar to the problems that arise with the
+parallel loop proposal. In both cases, implementing exit/return/goto is very
+painful, because you have to get control, somehow save whatever is supposed to
+happen after the loop, do some actions, and then proceed to the after-the-loop
+actions. The only place where the "anonymous loop body" is a bit worse is
+that a user-defined routine can see some of what happens there; the parallel
+loop is all-compiler generated.
+
+...
+> If you take all the manual chunking problems and reduction
+> problems out of the picture, what's left?
+>
+> Not much, as I would say easily more than half of the parallel loop
+> problems I have encountered that are not reductions are of the manual
+> chunking variety.
+
+Exactly my point.
+
+> I have only two representative problems left that would fit under
+> parallel loop syntax.
+>
+>   1. Find any
+>   2. Combining arrays/containers
+>
+> The Find any problems are the ones that Tucker was talking about where
+> a preemptive exit would be needed to break out of the loop.
+
+I think "Find any" per-se is rather unlikely, although they make sense in the
+case of an "ensure none" predicate.
+
+The array combination cases are most likely going to trigger the safety checks
+(unless we build in some special holes, and even then only a few special forms
+would work). The compiler is probably not going to be able to tell that
+different iterations write different components unless the array index is of a
+special, recognized form. (It definitely will not work with the raw static
+Global check, since the entire array is considered a single object, which
+necessarily overlaps with itself.)
+
+> I think these sorts of problems are actually better served by quantified
+> expressions, which as you may recall are actually a form of reduction
+> expression.
+
+Makes sense. Certainly "ensure none" is the sort of case that quantified
+expressions were designed for.
+
+...
+> That leaves combining arrays/containers, however those can be written
+> as a reduction expression also.
+>
+> Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (for I in
+> ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I))); Where
+> concatenation is the reducing operator.
+
+They have to be written this way (or some other reduction expression), as
+otherwise the array objects would be overlapping (making a parallel loop
+illegal).
+
+> Perhaps this looks inefficient, though who is to say how the
+> compiler might optimize it. It is otherwise a simple enough expression
+> to read or write, I think.
+>
+> A more visually appealing way to do this would be to allow the keyword
+> parallel with the new array aggregate syntax, which eliminates the
+> concatenation semantics.
+>
+> Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (parallel
+> for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I)));
+>
+> See RM 4.3.3 5.1/5 for the array initialization syntax.
+
+Right, and this probably is a better way to write this expression rather than
+writing a loop in the first place.
+
+> I have not yet been able to think of any other cases where parallel
+> loop syntax would be needed, so it seems to me that we should not
+> introduce parallel loop syntax given that there are no use cases for
+> parallelism problems that can't be better solved with reduction
+> expressions or anonymous loop body syntax combined with parallelism
+> library calls.
+
+Agreed, so long as we find a way to make the safety checks that we have in
+parallel loops.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Thursday, August 17, 2017  7:43 PM
+
+> ...
+>
+>> I have only two representative problems left that would fit under
+>> parallel loop syntax.
+>>
+>>  1. Find any
+>>  2. Combining arrays/containers
+>>
+>> The Find any problems are the ones that Tucker was talking about
+>> where a preemptive exit would be needed to break out of the loop.
+>
+> I think "Find any" per-se is rather unlikely, although they make sense
+> in the case of an "ensure none" predicate.
+
+I don’t agree, based on my experience with ParaSail.  YMMV.
+
+> The array combination cases are most likely going to trigger the
+> safety checks (unless we build in some special holes, and even then
+> only a few special forms would work). The compiler is probably not
+> going to be able to tell that different iterations write different
+> components unless the array index is of a special, recognized form.
+> (It definitely will not work with the raw static Global check, since
+> the entire array is considered a single object, which necessarily overlaps
+> with itself.) ...
+
+This is a very well traveled area.  Compilers have been recognizing the special
+case of independent loop iterations for decades.  Given:
+
+   parallel for I in A’Range loop
+       C(I) := A(I) + B(I);
+   end loop;
+
+Many, many compilers are smart enough to understand that this is safe.  I agree
+the loop-body procedure construct is very nice, but you still end up wanting to
+do the safety checks, and something as simple as the above, whether the user is
+driving the “chunking” or the compiler does it by itself, has exactly the same
+safety-check requirements.
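+
+That is, the user-driven version of the same loop, written against an
+anonymous-loop-body library (a sketch; Parallel_Loop here is a placeholder for
+such a library routine), needs the identical checks:
+
+   for (Start, Finish) of Parallel_Loop loop
+      for I in Start .. Finish loop
+         C(I) := A(I) + B(I);
+      end loop;
+   end loop;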
+
+We have been talking with existing Ada customers recently about planned Ada 202X
+enhancements, and this is the one that gets them drooling every time.  They
+*really* want parallel loops.  Of course if they are too restrictive, they won’t
+solve the problem, but I believe with the proposed reduction expression, the
+above simple parallel loop, and the nice loop-body procedure, we will give them
+what they need to be quite happy.  I really believe we need all three.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, August 17, 2017  9:32 PM
+
+...
+> > The array combination cases are most likely going to trigger the
+> > safety checks (unless we build in some special holes, and even then
+> > only a few special forms would work). The compiler is probably not
+> > going to be able to tell that different iterations write different
+> > components unless the array index is of a special, recognized form.
+> > (It definitely will not work with the raw static Global
+> check, since
+> > the entire array is considered a single object, which
+> necessarily overlaps with itself.) ...
+>
+> This is a very well traveled area.  Compilers have been recognizing the
+> special case of independent loop iterations for decades.  Given:
+>
+>    parallel for I in A'Range loop
+>        C(I) := A(I) + B(I);
+>    end loop;
+>
+> Many, many compilers are smart enough to understand that this is safe.
+
+Surely. But that's not the point. The point is that we have to have Ada Legality
+Rules which make this sort of thing legal and portable. The proposed rules
+using the Global aspect certainly would not allow the above. Unless you have
+been secretly planning some much more complex rules to allow such things, loops
+like the above couldn't be legal Ada.
+
+Note that I'm not against allowing loops like the above, but that has to be
+portably defined. (The above case, a single-dimensioned target that doesn't
+overlap anything else and is indexed directly by the loop parameter, is clearly
+safe and could be defined that way, but hardly anything else could be. Is this
+case alone enough for the extra complication???)
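+
+To illustrate (hypothetical examples; only the first fits that special form):
+
+   parallel for I in A'Range loop
+      C(I) := A(I);         -- target indexed directly by the loop parameter
+   end loop;
+
+   parallel for I in A'First .. A'Last - 1 loop
+      C(I + 1) := A(I);     -- equally safe, but needs a more general rule
+   end loop;
+
+   parallel for I in A'Range loop
+      C(P(I)) := A(I);      -- safe only if P never repeats a value
+   end loop;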
+
+...
+> I agree the loop-body procedure construct is very nice, but you still
+> end up wanting to do the safety checks, and something as simple as the
+> above, whether the user is driving the "chunking" or the compiler does
+> it by itself, has exactly the same safety-check requirements.
+
+Not exactly, as noted above. But I definitely agree that whatever construct is
+used needs safety checks.
+
+> We have been talking with existing Ada customers recently about
+> planned Ada 202X enhancements, and this is the one that gets them
+> drooling every time.  They *really* want parallel loops.  Of course if
+> they are too restrictive, they won't solve the problem, but I believe
+> with the proposed reduction expression, the above simple parallel
+> loop, and the nice loop-body procedure, we will give them what they
+> need to be quite happy.  I really believe we need all three.
+
+Probably because no one will be able to grok reduction expressions. :-) It took
+me a three hour ARG presentation to get any idea...
+
+People naturally understand parallel loops. At least they think they do, until
+they find out that exit doesn't work sensibly. :-)
+
+****************************************************************
+
+From: Brad Moore
+Sent: Friday, August 18, 2017  12:01 AM
+
+I have some more thoughts towards answering the question on whether we should
+eliminate parallel loop syntax from our proposal.
+
+Firstly, I had said in the previous email that find-first and find-last problems
+should be written using parallel libraries and the anonymous loop proposal.
+
+I think these can also be written well enough using reduction expressions.
+
+Consider finding the highest prime number in a range of integers. One could
+express this using the following reduction expression.
+
+Highest_Prime_Found : constant Integer :=
+   (parallel for I in reverse Start .. Finish =>
+      Integer'Max (<Integer'First>, (if Is_Prime (I) then I else Integer'First)));
+
+A simpler compiler might only apply divide and conquer here, which means each
+chunk fully executes. This should still cut the execution time down from the
+sequential version by a factor of the number of cores.
+
+A smarter compiler might recognize that when the reduction is 'Max or 'Min
+and the expression's value comes from the loop iterator in such a pattern,
+once a higher index is found to be prime there is no longer any need to keep
+searching the lower indices. Indeed, such a compiler might recognize this as
+a special case and insert a completely custom algorithm optimized for
+find-first/find-last problems.
+
+So there appears to be no need to rely on either library packages or parallel
+loop syntax for these problems.
+
+Now consider another of the manual chunking problems, where I was suggesting
+that barriers would be the best approach: the case where the outer loop must
+be a sequential loop that encloses a parallel loop.
+
+Consider a Newton package that simulates the movement of heavenly bodies in a
+solar system. A procedure Update_Position can be called that considers the mass,
+gravity, direction, and movement of neighbouring bodies to calculate the
+position of a body for the next time increment. An Advance subprogram might be
+needed to swap the internal buffers holding the current and next values for
+the next time increment.
+
+One could express this using a parallel loop inside a sequential loop that
+advances time.
+
+for Current_Time in Now .. Time + 100_000_000 loop
+
+   for parallel all Heavenly_Body in Solar_System'Range loop
+      Newton.Update_Position (Heavenly_Body);
+   end loop;
+
+   Newton.Advance;
+end loop;
+
+This doesn't work well with reduction expressions if the data arrays/containers
+are hidden behind an abstraction in a package body.
+
+However, if one were to modify the Update_Position subprogram to be a function
+that returns a Boolean status to indicate the success of the calculation, then
+one can use a reduction expression here. The Newton.Advance subprogram might
+raise an exception if its Boolean input parameter is False, indicating that
+some failure occurred (such as two planets colliding) that isn't handled by
+the simulation.
+
+for Current_Time in Now .. Time + 100_000_000 loop
+
+   Newton.Advance (for parallel all Heavenly_Body in Solar_System'Range
+                     => Newton.Update_Position (Heavenly_Body));
+end loop;
+
+So, if we did not have parallel loop syntax, it might influence people to
+write functions rather than procedures for parallelism problems, but I
+wouldn't say that's necessarily a bad thing.
+
+A smarter compiler might recognize the pattern of a reduction expression
+inside a sequential loop and treat that as a special case, using barriers to
+get optimal performance, or it may be that the compiler's default
+implementation just works well here. If the programmer finds the performance
+is poor, he could always write the problem himself using barriers, via the
+anonymous loop procedure with a parallelism library.
+
+The point again though is that I don't think we need to rely on the availability
+of parallelism libraries for this problem, nor do we need parallel loop syntax,
+unless we feel strongly that one should be able to use a procedure instead of a
+function for the Update_Position call.
+
+The one other manual chunking problem I have that is quite different from the
+others is to generate a cumulative sum of an array where the result array has
+the sum of the corresponding element from the input array plus all values
+preceding it in the input array. The solution involves three loops. The
+first loop is a parallel loop that gets the sum of each chunk and stores that
+value in a partial sum intermediate array.
+
+The second loop is a sequential loop that performs a cumulative sum operation on
+the partial sum intermediate array.
+
+The third loop is a parallel loop that takes the corresponding partial result
+values as a starting point for each chunk, and then applies the cumulative sum
+to each chunk.
+
+Using manual chunking I can write the first loop using a nested reduction
+expression where the outer parallel reduction expression concatenates sum values
+calculated from the inner reduction expression for each chunk.
+
+However, the third loop seems to pretty much need to be a parallel loop.
+
+declare
+    Chunk_Size : constant Natural := Natural'Min(Arr'Length, 10_000);
+    Chunks     : constant Natural := Arr'Length / Chunk_Size;
+
+    function Chunk_Start (Chunk : Natural) return Index_Type is
+      (Arr'First + (Chunk - 1) * Chunk_Size);
+
+    function Chunk_Finish (Chunk : Natural) return Index_Type is
+      (if Chunk = Chunks then Arr'Last
+       else Arr'First + Chunk * Chunk_Size - 1);
+
+    -- Create the Partial sums in parallel
+    Partial    : array (1 .. Chunks) of Element_Type :=
+       (parallel for I in 1 .. Chunks => <""> &
+           (for J in Chunk_Start (I) .. Chunk_Finish (I) => <0> + Arr (J)));
+begin
+
+    -- Sequential carry partial sums through array
+    for I in 2 .. Partial'Length loop
+       Partial (I) := @ + Partial (I - 1);
+    end loop;
+
+    for parallel I in 1 .. Chunks loop
+      declare
+        Sum : Element_Type := (if I = 1 then 0 else Partial (I - 1));
+      begin
+        for J in Chunk_Start (I) .. Chunk_Finish (I) loop
+           Sum := @ + Arr (J);
+           Cum (J) := Sum;
+        end loop;
+      end;
+    end loop;
+end;
+
+However, as this is a manual chunking problem, I find the code written using
+anonymous loop procedures and a library call is quite a bit simpler and easier
+to read, and requires less know-how from the programmer to decide how the loops
+should be chunked. The chunking logic is hidden inside the library call.
+
+declare
+    package Scan is new Parallel.Loops.Static (Loop_Index => Positive);
+    Partial : array (1 .. Scan.Worker_Count) of Element_Type := (others => 0);
+begin
+
+    -- Get the sum of each chunk
+    for (Start, Finish) in Scan.Parallel loop
+      declare
+         Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id;
+      begin
+         for I in Start .. Finish loop
+            Partial (Worker) := @ + Arr (I);
+         end loop;
+      end;
+    end loop;
+
+    --  Sequential exclusive scan phase
+    for I in 2 .. Partial'Last loop
+       Partial (I) := @ + Partial (I - 1);
+    end loop;
+
+    for (Start, Finish) in Scan.Parallel loop
+      declare
+         Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id;
+         Sum : Element_Type := (if Worker = 1 then 0 else Partial (Worker - 1));
+      begin
+         for I in Start .. Finish loop
+            Sum := @ + Arr (I);
+            Cum (I) := Sum;
+         end loop;
+      end;
+    end loop;
+end;
+
+So I would say this is the one case where a parallel library call with anonymous
+loop procedure appears to be the best solution, particularly if it means that we
+can simplify the parallelism proposal and not have parallel loop syntax.
+
+Tucker did have a proposal involving parallel array syntax that was designed
+specifically for this problem. At the time we were considering using that as the
+reduction solution, but since reduction expressions appears to be a better
+solution, it seems unlikely that we would want to bring back parallel array
+syntax if it only is useful for this one problem.
+
+So in summary, I still think most parallel looping problems can be handled with
+reduction expressions.
+
+There may be a few odd cases involving manual chunking where one might want to
+use a parallelism library with the anonymous loop procedure.
+
+My sense is that there are not enough of these problems to warrant
+standardizing a set of parallelism libraries, given that third-party
+libraries are available and that one can also write Ada wrapper libraries
+around OpenMP calls, for example. Still, standardizing a set of parallelism
+libraries could be another thing to consider.
+
+While there are some special cases where parallel loop syntax could be useful
+for manual chunking problems, a reduction expression usually seems good
+enough; otherwise, an anonymous loop body with a library call is a viable
+alternative.
+
+That leaves me still thinking that parallel loop syntax is not worth adding at
+this point, which would simplify the AI to just cover reduction expressions and
+parallel blocks. But it would be good to get others to weigh in as well, of
+course.
+
+****************************************************************
+
+From: Brad Moore
+Sent: Tuesday, August 29, 2017  10:01 AM
+
+> ...
+>> The discussion about find-first, find-last, and preemptive exits has
+>> shown that the anonymous loop body AI coupled with a
+>> parallelism library call is a better solution to these problems than
+>> the parallel loop syntax solutions we have been discussing.
+> There's a couple of problems using the "anonymous loop body" AI in this way:
+> (1) Without some additional mechanism, we'd lose the safety checks of
+> the parallel loop. They don't make any sense in the context of the
+> "anonymous loop body" proposal per-se. So I think we'd still need a
+> "parallel" keyword or something like that to trigger the safety
+> checks. (Without them, parallel operations are just a giant race
+> condition; no human programmer could avoid
+> them.)
+
+I have been thinking about this. I think the safety checks could be provided
+if we allowed the global aspect on formal subprograms and on
+access-to-subprogram parameters, with an addition to the global syntax to
+indicate a subrange of an array.
+
+with global => array(x .. y)
+
+This would indicate that there is global usage of an array over the indices
+x .. y. The array might be the name of a specific array, or just the "array"
+keyword, which would mean any array.
+
+This might be used with a parallel loop library such as the following.
+
+generic
+    type Loop_Index is range <>;
+package Parallel is
+
+    procedure Parallel_Loop
+      (Process : not null access
+            procedure (Start, Finish : Loop_Index) with
+                    global => (Input  => array (Start .. Finish),
+                               In_Out => array (Start .. Finish),
+                               Output => array (Start .. Finish));
+       From : Loop_Index := Loop_Index'First;
+       To   : Loop_Index := Loop_Index'Last)
+    with global => null;
+
+end Parallel;
+
+The idea would be that the compiler would reject any call to Parallel_Loop
+whose Process callback had any globals other than references to an array
+slice indexed by the parameters of the call.
+
+The compiler would also reject the code if there were an attempt to reference
+the array using indices that are outside the specified limits, or that are
+not derived from the specified parameters.
+
+Note the Parallel_Loop library call itself does not involve any globals.
+
+The compiler would not reject calls involving a Process callback that doesn't
+involve any globals. That is, it doesn't require the actual to involve
+globals, but if it does, they cannot fall outside the specification. This
+should eliminate the race conditions that we are worried about, I think.
+
+Example usage of the above:
+
+   -- Anonymous loop body call
+   for (Start, Finish) of Parallel_Loop loop
+
+       -- Sequential inner loop
+       for I in Start .. Finish loop
+          C(I) := A(I) + B(I);           -- OK
+       end loop;
+
+       C(1) := 3; -- Illegal, Index is not derived from Start or Finish
+   end loop;
+
+> (2) There's no certainty that the "anonymous loop body" proposal even
+> makes it into the Standard. Several people have expressed strong
+> reservations about that proposal.
+>
+> As an aside, I'd like to note that the "exit" problem with the
+> "anonymous loop body" proposal is in fact very similar to the problems
+> that arise with the parallel loop proposal. In both cases, implementing
+> exit/return/goto is very painful, because you have to get control,
+> somehow save whatever is supposed to happen after the loop, do some
+> actions, and then proceed to the after-the-loop actions. The
+> only place where the "anonymous loop body" is a bit worse is that a
+> user-defined routine can see some of what happens there; the parallel loop
+> is all-compiler generated.
+
+Actually, this is not a problem for exit, because with the anonymous loop
+body the user's code appears as a sequential loop nested inside the outer
+anonymous loop. Since the inner loop is a sequential loop, exit works just as
+it does today: it only exits the inner sequential loop.
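+
+For example (a sketch; A, Target, and Process are placeholders):
+
+   for (Start, Finish) of Parallel_Loop loop
+      for I in Start .. Finish loop
+         exit when A (I) = Target;   -- leaves only this chunk's inner loop
+         Process (A (I));
+      end loop;
+   end loop;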
+
+I suspect we'd probably want to make returns and gotos illegal from inside an
+anonymous loop body.
+
+So the anonymous loop body needs no knowledge that there is parallelism
+involved. It's just a library call that happens to hide the parallelism
+inside the library.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Tuesday, August 29, 2017  2:05 PM
+
+Although SPARK only allows names of stand-alone objects in Global annotations,
+for Ada we allow any object name, presumably including an indexed component or a
+slice.  We also have proposed in AI12-0079 to allow them on formal subprograms
+and access-to-subprogram, so I think the cases you mentioned are consistent with
+the current AI.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, August 30, 2017  10:16 PM
+
+How do the static checks work in such a case, as the bounds in question would
+typically be non-static? For instance, in Brad's
+
+  with global => array(x .. y)
+
+X and Y probably would be outer loop indexes, or parameters to the current
+subprogram.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Wednesday, August 30, 2017  10:26 PM
+
+You just need to prove two slices don’t overlap, which is equivalent to
+proving the high bound of one is less than the low bound of the other. This
+is the kind of thing that static analysis and more advanced proof tools are
+pretty good at doing! This is not significantly harder than eliminating
+run-time checks for array indexing when the bounds are not known at
+compile-time, which is something that many Ada compilers can do based on other
+information available at the point of usage.
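+
+(Concretely: slices A (X1 .. Y1) and A (X2 .. Y2) are disjoint exactly when
+Y1 < X2 or Y2 < X1, or when either slice is empty.)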
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, August 30, 2017  10:40 PM
+
+OK, but again these are Legality Rules, not something that compilers are doing
+on their own. The entire set of rules has to be defined formally and required
+of all Ada compilers. What "static analysis" and "advanced proof tools" can or
+can't do is irrelevant (unless of course we allowed what is legal or not to be
+implementation-defined -- when I previously proposed that for exception
+contracts, the idea seemed to be as effective as a lead balloon).
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, August 30, 2017  11:02 PM
+
+> I have some more thoughts towards answering the question on whether we 
+> should eliminate parallel loop
+
+...
+
+Let me throw out another classic problem (one that I recently wrote code for
+in a hobby program): determining correlations and other statistics between
+two parallel arrays of data.
+
+The core of the problem is calculating 6 pieces of data: the sum of the
+elements of each array, the sum of the squares of the elements of each array,
+the sum of the products of the matching elements of the two arrays, and the
+sum of the squares of those products.
+
+The sequential solution is a single loop creating all 6 items:
+
+     for I in 1 .. MAX loop
+        A_Sum := @ + A(I);
+        A_Sum_Sqr := @ + A(I)*A(I);
+        B_Sum := @ + B(I);
+        B_Sum_Sqr := @ + B(I)*B(I);
+        Prod_Sum := @ + A(I)*B(I);
        Prod_Sum_Sqr := @ + (A(I)*B(I))**2;
+     end loop;
+
+MAX is over 3000 for this particular problem, so it would seem that a parallel
+solution might add some performance.
+
+These are clearly reduction expressions. However, we have to write 6 separate
+reduction expressions, which means 6 times the loop overhead. Also, the above
+is full of common subexpressions; it's likely that a compiler would calculate
+and load A(I) and B(I) once each, and possibly eliminate some of the other
+operations as well (in particular, the A*B product). None of these
+optimizations can be done with separate reduction expressions, so we'll end up
+with 5 times the memory reading as well (each A(I) and B(I) is read 5 times in
+the sequential loop).
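+
+Spelled out in the reduction-expression notation used earlier in this thread
+(a sketch only; assuming Float data and Brad's <initial-value> marker):
+
+     A_Sum        : constant Float :=
+        (parallel for I in 1 .. MAX => <0.0> + A(I));
+     A_Sum_Sqr    : constant Float :=
+        (parallel for I in 1 .. MAX => <0.0> + A(I)*A(I));
+     B_Sum        : constant Float :=
+        (parallel for I in 1 .. MAX => <0.0> + B(I));
+     B_Sum_Sqr    : constant Float :=
+        (parallel for I in 1 .. MAX => <0.0> + B(I)*B(I));
+     Prod_Sum     : constant Float :=
+        (parallel for I in 1 .. MAX => <0.0> + A(I)*B(I));
+     Prod_Sum_Sqr : constant Float :=
+        (parallel for I in 1 .. MAX => <0.0> + (A(I)*B(I))**2);
+
+Each of the six makes its own full pass over the data, which is exactly where
+the extra loop overhead and memory traffic come from.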
+
+Thus, a parallel rewrite into 6 reduction expressions is likely to be
+disappointing. With the extra parallel execution overhead, it would likely
+take 8 cores to improve the performance over the sequential loop, and the
+performance would be worse with 4 cores or fewer.
+
+This is not a theoretical concern: Janus/Ada doesn't do many floating point
+optimizations; recompiling the program with GNAT (which presumably does do
+those optimizations) improved the performance by a large factor (more than 10,
+as I recall).
+
+How would you parallelize this loop??
+
+****************************************************************
