# CVS difference for ai12s/ai12-0119-1.txt

Differences between version 1.5 and version 1.6

```--- ai12s/ai12-0119-1.txt	2017/06/12 22:36:00	1.5
+++ ai12s/ai12-0119-1.txt	2017/08/08 03:14:54	1.6
@@ -1351,14 +1351,2831 @@
>      (for parallel X of Arr => Integer'Min(<Integer'Last>,  X));

Min : Integer :=
-      (for parallel X of Arr => Integer'Min(<Integer'Last'Initial,  X));
+      (for parallel X of Arr => Integer'Min(Integer'Last'Initial,  X));

or

Min : Integer :=
-      (for parallel X of Arr => Integer'Min(<Initial(Integer'Last),  X));
+      (for parallel X of Arr => Integer'Min(Initial(Integer'Last),  X));

Maybe there is a better idea out there (neither of these look very promising),
but whatever we do needs to look and feel like Ada.
+
+****************************************************************
+
+Sent: Tuesday, June 13, 2017  9:21 AM
+
+> First, a bit of a rant. It's disconcerting that we get a wildly
+> different proposal at each meeting. There's no opportunity to get the
+> details right if we change the basic proposal completely between every
+> meeting.
+>
+> In Pittsburgh, we spent a significant amount of time figuring out the
+> right model for reducers. I see here that all of the conclusions of
+> that meeting have completely been discarded in favor of what appears
+> on first glance the reducer-de-jour. Why should we even spend time
+> trying to figure out this wildly different proposal when it seems
+> obvious that whatever we decide will be discarded all over again next time?
+
+I started with that proposal, but the more I tried to apply it to real problems,
+the more problems I saw.
+
+To describe the problem in a nutshell, for parallel reduction problems there are
+two mostly independent algorithms to be expressed, which I call the "what" and
+the "how".
+
+The "what" answers the question: What is the application doing?
+
+- This includes the application logic that mostly remains the same regardless
+  whether parallelism is being applied or not. It also includes an expression of
+  the desire that the application code should be optimized to run in parallel.
+
+The "how" answers the question: How is the application code implemented?
+e.g., how does an optimization such as parallelism happen? How is chunking
+performed, and how are the results reduced and combined? With parallelism,
+this is complex enough that it can be viewed as a separate algorithm being
+overlaid on top of the application's algorithm.
+
+I found that syntax that tries to express both "how" and "what" ends up being
+messy, and too constraining.
+
+One problem I can see is that using array syntax to hold chunk results
+implies that there is an array of chunk results that exists at some point in
+time and can be iterated over and examined. In the parallelism libraries I've
+been working with, that isn't the case: the chunk results come and go and get
+combined during the computation. There is typically never a point in time where
+all the chunk results exist.
+
+Having an array of chunk results implies that the reduction does not occur until
+after the looping has completed for all the chunks. This eliminates a
+parallelism possibility where the reduction occurs in parallel during the
+computation of the loop, rather than sequentially after the looping over all the
+chunks is complete.
+
+There are some algorithms where reducers can be written that depend on the
+reduction occurring in parallel during the loop processing.
+
+If we look at the example from the minutes, to sum the elements of an array we
+have:
+
+    My_Sum := 0;
+
+    parallel [(Num_CPUs*2)]
+       Sum : array (<>) of Integer := 0;
+    for I in 1 .. 1_000_000 loop
+       Sum (<>) := @ + A(I);
+    then
+       for S of Sum loop
+          My_Sum := @ + S;
+       end loop;
+    end parallel loop;
+
+The middle loop expands into two loops:
+    for Chunk_Index in Sum'Range loop
+       for I in Split(1..1000000, Chunk_Index, Num_CPUs*2) loop
+          Sum(Chunk_Index) := @ + A(I);
+       end loop;
+    end loop;
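The two-phase semantics of the expansion above (per-chunk accumulation, then a sequential combine) can be modeled outside Ada. Here is a rough Python sketch; `split` is a hypothetical stand-in for the Split function in the expansion, and all names are illustrative only:

```python
def split(first, last, chunk_index, num_chunks):
    """Return the subrange of first..last handled by chunk_index (0-based)."""
    total = last - first + 1
    base, extra = divmod(total, num_chunks)
    lo = first + chunk_index * base + min(chunk_index, extra)
    size = base + (1 if chunk_index < extra else 0)
    return range(lo, lo + size)

def chunked_sum(a, num_chunks):
    # Phase 1: per-chunk partial results (the part that could run in parallel,
    # one tasklet per chunk; shown sequentially here).
    partial = [0] * num_chunks
    for chunk in range(num_chunks):
        for i in split(0, len(a) - 1, chunk, num_chunks):
            partial[chunk] += a[i]
    # Phase 2: sequential reduction over the chunk results, as the "then"
    # part of the proposed syntax does.
    total = 0
    for s in partial:
        total += s
    return total
```

Note how the model makes the objection above concrete: the `partial` array exists in full before any combining happens, which is exactly the constraint the array-of-chunk-results syntax suggests.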
+
+The parallel array Sum, the "then" keyword, and the explanation of the
+middle loop all demonstrate the intent that the reduction occurs after the main
+loop processing.
+
+The point of parallelism is to make the application's logic run as fast as
+possible. Why are we attempting to go with syntax that injects sequential
+processing when it isn't needed? It would be better if more parallelism could be
+applied where possible, and here it is possible.
+
+Having an array of chunk results also implies that the chunking is more static
+in nature, suggesting that the computation of chunks occurs before the loop is
+executed. This rules out other implementation possibilities where the chunking
+is more dynamic in nature, such as some of the work-stealing algorithms, where
+chunks can be dynamically resized as workers become free and steal work from
+chunks being processed by other workers. Why are we choosing syntax that
+restricts implementors to a limited range of possible approaches?
+
+If the chunking is computed before the loop begins executing, then there can be
+quite a large number of chunk results when there are a lot of loop iterations to
+be chunked, as the chunks should be smaller to allow for better load balancing
+between the cores. The number of chunks would be a calculation similar to:
+
+     Number_of_Cores * x, where x is some factor based on the number of iterations.
+
+This could increase storage needs, for the case where the reduction result
+is a larger data type such as a large matrix.
+
+With the parallelism libraries I have been working with, using dynamic chunking,
+there only needs to be a single chunk result for each core.
+
+These are just some of the problems I see with the design we came up with last
+meeting.
+
+I also find that there is a lot we are expecting the reader to understand:
+- The concept of the parallel arrays
+- That chunking is being applied
+- That each chunk is going to a different element in the array
+- That a second loop that looks like a nested loop is not actually nested, but
+  is executed after the outer loop, etc.
+
+
+Being able to express the above problem as an expression such as:
+
+    (for I of Arr => <0> + I)
+
+seems more useful: it is a lot simpler for the reader to parse, allows
+implementors more freedom, allows for better parallelism, etc.
+
+It also seems like an expression is a better tool for the job, when it comes to
+reduction problems.
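A reduction expression such as `(for I of Arr => <0> + I)` can be read as a fold of "+" over the elements, seeded with the value in the box. A minimal Python sketch of that sequential meaning (the names are illustrative, not part of any proposal):

```python
from functools import reduce

def reduction(arr, op, initial):
    # Fold op over arr left-to-right, starting from the boxed initial value.
    # This is only the *sequential* meaning; when op is associative, an
    # implementation is free to chunk the work and combine partial results
    # in parallel without changing the answer.
    return reduce(op, arr, initial)
```

For example, `reduction([1, 2, 3], operator.add, 0)` corresponds to the Ada expression above with Arr = (1, 2, 3).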
+
+The point of looping is to apply some processing multiple times.
+The point of an expression is to produce a result.
+
+I would argue that the point of reduction problems is mainly to produce a
+result. Looping is needed as part of the implementation to produce the result,
+but looping is not the main goal here. It is secondary.
+
+Further, this reduction expression syntax can be compatible with whatever
+parallel loop syntax we might decide to go with, including what we came up with
+at the last ARG meeting. However, this syntax raises the question of whether we
+even need reduction syntax to be combined with loop syntax at all.
+
+> I want to at least feel like we are making progress toward a solution,
+> rather than starting from nearly ground zero with each proposal. End
+> of mini-rant.
+
+I feel like this is considerable progress toward a solution. It's not that this
+is just another alternative that more or less accomplishes the same thing as
+what was proposed during the ARG meeting. To me, this feels like a new idea that
+is a significant improvement over what we had before.
+
+> ...
+>> Special syntax, "<value>" is used to indicate where this feedback
+>> occurs, which also encloses the initial value to be used for the
+>> first iteration.
+>
+> Regardless of the merits of the proposal itself (which I haven't
+> thought about), I don't think angle brackets is going to fly as syntax
+> in Ada. There would be substantial confusion with relational
+> operators, which would likely make such text unparsable. (We discussed
+> this extensively with some of the assignment target proposals; it
+> didn't fly there and I don't see what could have changed in 8 months.)
+
+But we already have angle brackets in Ada; <> box notation appears in several
+places in the syntax. This is just another case of box, except that we stick a
+value in between.  Tucker has already indicated several other possible syntax
+replacements here, so if we don't like the box notation, it shouldn't be too
+hard to come up with something else.
+
+I think we could potentially just use <> also, without providing an initial
+value, and just say that there has to be at least one iteration, and the initial
+value is the value of the first iteration.
+
+> You could use square or curly brackets for this, but in any case, such
+> syntax doesn't seem very Ada-like.
+
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Tuesday, June 13, 2017 12:59 PM
+
+...
+> One problem I could see, is that by using array syntax to hold chunk
+> results, implies that there is an array of chunk results that exists
+> in time that can be iterated over and examined. In the parallelism
+> libraries I've been working with, that isn't the case, the chunk
+> results come and go and get combined during the computation. There
+> typically never is a point in time where all the chunk results exist.
+
+Arrays in this context are just syntax; there's no reason that the array has to
+be materialized. So I find this concern mainly one of reading too much into the
+syntax. One hopes that the array *isn't* materialized, and the actual semantic
+rules allow that.
+
+However, we want the syntax to be as easy as possible to understand by a normal
+sequential programmer, and they certainly will understand the use of arrays in
+this context.
+
+...
+> Being able to express the above problem as an expression such as;
+>
+>     (for I of Arr => <0> + I)
+>
+> Seems more useful, a lot simpler to parse for the reader, allows
+> implementors more freedom, allows for better parallelism, etc
+
+This seems very limited to me, as it limits what can be parallelized with
+reductions to things that can be written as a single expression. It seems as
+though people will end up writing helper functions (which necessarily reduce
+parallelism, as the compiler can't see the contents) or extremely complex single
+expressions (hard to read, hard to debug).
+
+It also seems to indicate (along with the thought to add anonymous subprograms,
+lets, and the like) that some people want to force Ada into being a functional
+language. Since there is a vocal part of the Ada community that finds functional
+languages to be evil, that could be a very controversial direction. (I'm
+undecided on this point.)
+
+> > ...
+> >> Special syntax, "<value>" is used to indicate where this feedback
+> >> occurs, which also encloses the initial value to be used for the
+> >> first iteration.
+> >
+> > Regardless of the merits of the proposal itself (which I haven't
+> > thought about), I don't think angle brackets is going to fly as
+> > syntax in Ada. There would be substantial confusion with relational
+> > operators, which would likely make such text unparsable. (We
+> > discussed this extensively with some of the assignment target
+> > proposals; it didn't fly there and I don't see what could have
+> > changed in
+> 8 months.)
+>
+> But we already have angle brackets in Ada, <> box notation appears in
+> several places in the syntax. This is just another case of box, except
+> that we stick a value in between.
+
+Umm, no. <> is a lexical symbol. If you try to put "something in between", you
+have (at least) three lexical symbols: "<" expression ">", and that has to be
+parsed to be understood.
+
+> Tucker has already indicated several other possible syntax
+> replacements here, so if we dont like the box notation, it shouldn't
+> be too hard to come up with something else.
+>
+> I think we could potentially just use <> also, without providing an
+> intial value, and just say that there has to be at least one
+> iteration, and the intial value is the value of the first iteration.
+
+Almost anything is better than "<expr>". Maybe not "+expr+". :-)
+
+****************************************************************
+
+Sent: Thursday, June 15, 2017  2:04 AM
+
+>> One problem I could see, is that by using array syntax to hold chunk
+>> results, implies that there is an array of chunk results that exists
+>> in time that can be iterated over and examined. In the parallelism
+>> libraries I've been working with, that isn't the case, the chunk
+>> results come and go and get combined during the computation. There
+>> typically never is a point in time where all the chunk results exist.
+>
+> Arrays in this context is just a syntax, there's no reason that it has
+> to be materialized. So I find this concern mainly one about reading
+> too much into the syntax. One hopes that the array *isn't*
+> materialized, and the actual semantic rules allow that.
+
+It's not just about the storage layout though. If it looks like an array, people
+will tend to use it that way. This includes doing slices, copying, bit
+operations (if it's an array of bits), etc.
+
+I can imagine cases where one will want to iterate through the array after the
+loop processing, to perhaps do intermediate processing and adjustments to the
+values before applying the reduction. Nested functions may be needed to do
+this. For example, one might want to apply data smoothing to the values. The
+more one tries to use this as an array, the greater the danger, I think, that a
+real array ends up having to be the implementation.
+
+> However, we want the syntax to be as easy as possible to understand by
+> a normal sequential programmer, and they certainly will understand the
+> use of arrays in this context.
+
+That part is pretty easy to understand, since it looks like existing Ada.
+However, the chunking, the meaning of (<>), and the overall construct, I would
+think, will be quite foreign to sequential programmers and existing Ada
+programmers.
+
+> ...
+>> Being able to express the above problem as an expression such as;
+>>
+>>     (for I of Arr => <0> + I)
+>>
+>> Seems more useful, a lot simpler to parse for the reader, allows
+>> implementors more freedom, allows for better parallelism, etc
+>
+> This seems very limited to me, as it limits what can be parallelized
+> with reductions to things that can be written as a single expression.
+> It seems as thought people will end up writing helper functions (which
+> necessarily reduce parallelism as the compiler can't see the contents)
+> or extremely complex single expressions (hard to read, hard to debug).
+
+Similar to the rewriting of array syntax, I would think the compiler could also
+do some rearranging of at least certain helper functions, such as inline
+functions, to further optimize what actually gets generated by the compiler, if
+it is needed.
+
+> It also seems to indicate (along with the thought to add anonymous
+> subprograms, lets, and the like) that some people want to force Ada into
+> being a functional language. Since there is a vocal part of the Ada
+> community that finds functional languages to be evil, that could be a very
+> controversial direction. (I'm undecided on this point.)
+
+I think an important point to consider is whether this syntax is appealing
+enough to add to Ada. It seems to be useful for things other than parallelism;
+for example, it is convenient to be able to express factorial as an expression
+function, where parallelism likely isn't needed or wanted.
+
+It doesn't necessarily need to be the only way to write reductions, just as
+quantified expressions can also be expressed in Ada in other ways, such as via
+looping and function calls. We can also have a parallel loop solution, such as
+the Pittsburgh proposal.
+
+I think it is important to discuss the parallel reduction syntax however, because:
+  a) If we do want to support this, it likely will impact the syntax proposal
+     for parallel loop reductions.
+  b) Even if we decide not to support this (at least not now), it offers some
+     ideas that still could impact the parallel loop proposal.
+
+In particular, I'm referring to the idea of the Reducer aspect and Identity
+aspect that can be applied to function declarations. This is useful to separate
+the "what" from the "how" as I described in my previous email, where the "what"
+syntax shows what application code is to be parallelized, and the "how" syntax,
+which is separate, provides details on how this is to happen.
+
+The reducer for "+" is always going to be "+" itself, so why force the
+programmer to write this out every time one wants to add numbers in parallel,
+when it can be baked into the standard in the form of aspects applied to the
+function declarations?
+
+This could eliminate the need for both the "then" clause and the parallel
+array, which could result in a much simpler syntax construct involving only
+adding the word parallel to a loop. This also makes sense because adding
+parallelism, I think, in a lot of cases will be a later step in the development
+process of an application - the optimisation phase.
+
+In that case you have existing sequential code to be parallelized, and it is
+desirable to make only minimal changes to the code to get the parallelism.
+
+Further, one might find that adding the parallelism doesn't help performance, or
+may even make things worse, in which case the programmer will want to back out
+those changes. It would be desirable to be able to undo changes with minimal
+effort. For example, just adding or removing the keyword "parallel" would be
+appealing.
+
+If we take the sequential code:
+
+   declare
+      Sum : Integer := 0;
+   begin
+      for I in 1 .. 1_000_000 loop
+         Sum := @ + I;
+      end loop;
+
+      Put_Line ("Sum:" & Integer'Image (Sum));
+   end;
+
+And convert it to parallel just by adding the parallel keyword
+
+   declare
+      Sum : Integer := 0;
+   begin
+      for I in parallel 1 .. 1_000_000 loop
+         Sum := @ + I;
+      end loop;
+
+      Put_Line ("Sum:" & Integer'Image (Sum));
+   end;
+
+I think that is closer to what programmers would want to be able to do.
+
+We could say that as written, the compiler would reject this, because the Global
+aspect could indicate that this is a data race for parallel operations.
+
+We could then have a Boolean aspect, "Reduction", that can be applied to a
+variable declaration to indicate that the variable is involved in a parallel
+reduction. This would then allow the compiler to accept the loop, and provides
+better feedback to the user that parallel reduction is occurring.
+
+i.e.
+
+   declare
+      Sum : Integer with Reduction := 0;
+   begin
+      for I in parallel 1 .. 1_000_000 loop
+         Sum := @ + I;
+      end loop;
+
+      Put_Line ("Sum:" & Integer'Image (Sum));
+   end;
+
+
+The compiler and the reader can see that the operation being applied to the
+global result is
+
+   function "+" (L, R : Integer) return Integer with Identity => 0;
+
+Both can also see that this is a reducer function, because it accepts two
+parameters of the same type and returns a result of that same type, and because
+it has the Identity aspect. So this shows how the results of the chunks are
+combined. The compiler now has the necessary inputs to be able to implement the
+parallelism, and the reader, who is mostly only concerned with the application
+logic, can see that the code is to run in parallel. In many cases the
+programmer will not care how this happens, but if he does, he can look up the
+reducer function for the loop, as well as the identity value for that function.
+
+I would think the sequential programmer would prefer to see the above solution
+for parallel loops, over the syntax proposed in Pittsburgh.
+
+This also is more harmonious with the parallel reduction expression syntax as
+currently written in the AI.
+
+Note that, written as a reduction expression, this isn't a data race, because
+the result is computed before it is assigned to the global variable; in that
+case, the Reduction aspect would not be needed on the Sum declaration, though it
+could be optionally added as a confirming aspect.
+
+i.e. the above code could be replaced with:
+
+   declare
+      Sum : constant Integer := (for I in parallel 1 .. 1_000_000 => <> + I);
+   begin
+      Put_Line ("Sum:" & Integer'Image (Sum));
+   end;
+
+or even just;
+
+   Put_Line ("Sum:" & Integer'Image (for I in parallel 1 .. 1_000_000 => <> + I));
+
+****************************************************************
+
+Sent: Saturday, June 17, 2017 11:53 PM
+
+I think we can further simplify and clarify the Reduction Expression syntax, and
+at the same time eliminate some problems.
+
+Specifically, we can eliminate the Reducer aspect and Identity aspect by saying
+that the top level operation of the reduction expression must be a reducer
+function. (i.e. a function that takes two arguments of the same type, and
+returns a result of that same type)
+
+In that case, we would not need to know of an identity value.
+Rather, the result of the first iteration of any chunk effectively becomes the
+initial value for that chunk. The accumulation back into the same result only
+occurs if there are at least two iterations for a chunk.
+
+The expression inside the box is then not the initial value, but instead is the
+default value, whose only purpose is to provide the result of the expression if
+the reduction expression iterates over a null range, i.e. 0 iterations.
+
+A <> with no initial value would signify that there is no default value for the
+expression, which implies that an exception would be raised if a null range were
+specified.
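A Python sketch of this "<> with no default value" rule: the first element seeds the accumulator, and reducing over a null range is an error. (Python's functools.reduce without an initializer behaves the same way; the explicit version below makes the rule visible. Names are illustrative.)

```python
def reduction_no_default(values, op):
    it = iter(values)
    try:
        acc = next(it)          # the first iteration supplies the initial value
    except StopIteration:
        # The proposal suggests raising an exception for a null range;
        # modeled here as a plain error.
        raise ValueError("reduction over a null range with no default value")
    for v in it:
        acc = op(acc, v)        # accumulate only if there are 2+ iterations
    return acc
```

This matches the set-intersection case discussed next: no "universal set" identity is ever needed, since the first set seeds the result.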
+
+This actually is needed for operations such as set intersection, where the
+intersection of 0 sets is not mathematically defined, or at least it is
+difficult to produce a value for, if we wanted it to mean the universe.
+
+consider:
+
+We want to find the intersection of animals that can be found in a set of zoos.
+
+Set_Array(Vienna_Zoo).Include("Gorilla");
+Set_Array(Vienna_Zoo).Include("Giraffe");
+...
+Set_Array(Calgary_Zoo).Include("Giraffe");
+Set_Array(Calgary_Zoo).Include("Buffalo");
+
+
+Common_Animals : constant Set := (for Set of Set_Array => Intersection(<>, Set));
+
+It would be pretty much impossible to create an initial value for a zoo that
+contains all known animals.
+
+Since that case is only needed when there are zero zoos being examined, we do
+not need to supply that value. Also, in this case the programmer knows that
+there are one or more zoos in the array, so the default value will not be
+needed, and an exception would not be raised.
+
+Probably it would make sense for the exception raised to be Constraint_Error,
+if one did try to iterate over a null range with a Reduction expression that
+did not provide a default value.
+
+This solves another problem. Yesterday Erhard brought up the example of raising
+a matrix to the power of N.
+
+Now that the Identity aspect is not needed, the Identity Matrix is only needed
+as a default value for the reduction expression. This can now be trivially
+supplied by the programmer by a function call.
+
+eg.
+
+   Matrix_To_The_Nth_Power : constant Matrix := (for I in 1 .. N => <Identity (A)> * A);
+
+where Identity (A) is a function of a Matrix abstraction that returns the
+identity matrix for the matrix A.
+
+If we know that N is of subtype Positive, we can further simplify this to:
+
+   Matrix_To_The_Nth_Power : constant Matrix := (parallel for I in 1 .. N => <> * A);
+
+since the Identity matrix would not be needed to evaluate the expression.
+
+You may ask: what about the cases in the previous examples, as currently written
+in the AI, where the top level operation is not a reducer function?
+
+The answer is that it should be trivial to rewrite the expression so that a
+reducer function is used.
+
+
+    Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => Append(<>, Letter));
+
+One could write:
+
+    Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => <> & To_Unbounded_String ("" & Letter));
+
+Does this seem like a way forward?
+
+****************************************************************
+
+Sent: Sunday, June 18, 2017  12:26 AM
+
+Another problem that would benefit from this simplification is bitwise "and" for
+modular integer types.
+
+It otherwise would be difficult to produce an initial value for modular integer
+types where the modulus is not a power of two. Attempting to set all bits to 1
+could result in a value that is outside the range of values for the type.
+
+With this simplification, it should not be a problem anymore, since we are only
+"and"ing together supplied values that are all valid, and the result will also
+be valid and belong to the subtype.
+
+e.g.
+
+   (for Value of Arr => <> and Value)
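A quick Python sketch of why no identity value is needed here: if the first element seeds the accumulator, every intermediate result is the "and" of in-range values and therefore stays in range, even when the modulus is not a power of two. (Modulus 10 is an arbitrary illustration.)

```python
MODULUS = 10  # not a power of two; "all bits set" would be out of range

def and_reduce(values):
    assert all(0 <= v < MODULUS for v in values)
    acc = values[0]          # seeded by the first iteration, not an identity
    for v in values[1:]:
        acc &= v             # x & y is never larger than min(x, y), so in range
    return acc
```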
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Sunday, June 18, 2017  2:02 AM
+
+I would rather leave it the way we discussed yesterday, as these changes in my
+view are making the construct more complex from the user’s point of view.  It is
+easy enough for the compiler to recognize special cases.  It is a pain for the
+user to have to obey additional restrictions.
+
+****************************************************************
+
+Sent: Sunday, July  9, 2017  1:23 AM
+
+I have been thinking about iteration schemes for for loops, and I think there
+are 5 that we'd want to support in Ada. By iteration scheme, I am referring to
+the combination of chunking and iteration direction.
+
+The 5 are;
+
+- Forward Sequential
+- Reverse Sequential
+- Forward Parallel
+- Reverse Parallel
+- All     Parallel
+
+
+- Forward Sequential
+Already exists in Ada (a for loop without the reverse keyword). Sequential and
+Forward are both implicit. Semantically, this is a single chunk, executed by a
+single tasklet, with iteration from lowest to highest.
+
+- Reverse Sequential
+Already exists in Ada (a for loop with the reverse keyword). Sequential is
+implicit, Reverse is explicit. Semantically, also a single chunk, executed by a
+single tasklet, with iteration from highest to lowest.
+
+- Forward Parallel
+Forward is implicit, parallel is explicit.
+These are the for loops with the parallel keyword we were discussing in Vienna.
+Semantically, multiple chunks, each corresponding to a tasklet, where iteration
+within each chunk is from lowest to highest.
+
+- Reverse Parallel
+Both Reverse and Parallel are explicit.
+A for loop with both the parallel keyword and the reverse keyword.
+Semantically, multiple chunks, each corresponding to a tasklet, where iteration
+within each chunk is from highest to lowest.
+
+- All Parallel
+Both All and Parallel are explicit.
+A for loop with the parallel keyword and the "all" keyword.
+Semantically, each iteration is a separate chunk, or alternatively viewed,
+there are no chunks at all. There is no order to the iteration; any iteration
+can occur before any other.
+
+The case for reverse parallel would admittedly be quite rare, just as reverse
+sequential loops are quite rare today in Ada; however, it is not hard to come up
+with an example.
+
+Consider finding the highest index of an array whose element satisfies some
+property of interest. If chunking is involved, it makes sense to search
+in reverse order, because once you've found a candidate, you know that there is
+no need to search further within the chunk, and processing can be abandoned.
+With forward iteration, one cannot be sure that one has found the chunk's final
+candidate without searching the entire chunk.
+
+Eg.
+
+    --  Note: We do not want exit to pro-actively terminate processing in
+    --  other chunks here. We just want to exit processing for the
+    --  current chunk if any of the exit conditions hold true.
+    --  Updater is a protected object in this example.
+
+    --  Find maximum array index whose element satisfies some property
+
+    for I in reverse parallel Arr'Range loop
+
+       -- This first check causes other chunks to stop processing
+       -- if a higher chunk has already found a result
+       if I < Updater.Value then
+          exit;
+
+       -- This second check abandons the current chunk if a
+       -- result is found, since any other finds in the same
+       -- chunk will have a lower index
+       elsif Satisfies (Arr (I)) then
+          Updater.Update (I);
+          exit;
+       end if;
+    end loop;
+
+    Put_Line ("The answer is" & Integer'Image (Updater.Value));
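A sequential Python sketch of the chunk-wise reverse search above: chunks are examined from the highest index range down, and within a chunk the scan also runs from high to low, so the first hit in a chunk is that chunk's best answer and the rest of the chunk can be abandoned. (In the parallel version the chunks would run concurrently; the chunking helper and names here are illustrative only.)

```python
def highest_satisfying(arr, satisfies, num_chunks):
    n = len(arr)
    bounds = [(k * n // num_chunks, (k + 1) * n // num_chunks)
              for k in range(num_chunks)]
    best = -1
    for lo, hi in reversed(bounds):        # highest chunk first
        if hi - 1 < best:                  # a higher chunk already found a result
            break
        for i in reversed(range(lo, hi)):  # high-to-low within the chunk
            if satisfies(arr[i]):
                best = max(best, i)
                break                      # nothing lower in this chunk can win
    return best
```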
+
+The all parallel case is useful when the programmer wants to do manual chunking.
+This is useful for instance in problems such as generating a cumulative array
+sum of another array, as two loops are needed and generally both loops should be
+chunked the same way.
+
+Another useful case for all parallel loops is when one wants to mix sequential
+processing within the parallel processing, for instance if one wants to output
+some results at the end of each iteration through the chunks. Consider the
+following example to compute a 7 day weather forecast for various cities in the
+world, as well as an average of all these values for each day.
+
+    declare
+       type Cities is (London,
+                       Tokyo,
+                       New_York,
+                       Beijing,
+                       Calgary,
+                       Paris,
+                       Los_Angeles);
+
+       Today : constant Calendar.Time := Calendar.Clock;
+       Forecast : array (Cities) of Float := (others => 0.0);
+
+       Done_Parallelism, Done_Sequential : Synchronous_Barrier
+         (Release_Threshold => Forecast'Length);
+
+       function Max_Temp
+          (City : Cities; Day : Calendar.Time) return Float is
+       begin
+         ...
+          --  Presumably some complex computer intensive calculation....
+          return Result;
+       end Max_Temp;
+
+       Notified : Boolean := False;
+       use type Calendar.Arithmetic.Day_Count;
+    begin
+
+       Day_Loop :
+       for Day in 1 .. 7 loop -- Compute 7 day forecast
+
+          City_Loop :
+          for City in all parallel Cities'Range loop
+
+             -- Calculated in parallel for each city
+             Forecast (City) :=
+               Max_Temp (City,
+                         Today + Calendar.Arithmetic.Day_Count (Day));
+
+             Synchronous_Barriers.Wait_For_Release (Done_Parallelism,
+                                                    Notified);
+
+             -- Transitioning from parallel to sequential
+             if Notified then
+
+                Output_Loop :
+                for City in Cities'Range loop
+                   Put (Cities'Image (City) &
+                          Float'Image (Forecast (City)) & ' ');
+
+                   -- Note use of parallel reduction expression
+
+                   Put ("World Avg" &
+                          Float'Image
+                          ((for Temp of parallel Forecast =>
+                            <0.0> + Temp)
+                           / Float (Forecast'Length)));
+                end loop Output_Loop;
+             end if;
+
+            -- Transitioning from Sequential back to Parallel
+             Synchronous_Barriers.Wait_For_Release (Done_Sequential,
+                                                    Notified);
+          end loop City_Loop;
+       end loop Day_Loop;
+
+    end;
+
+The synchronous barriers are used to transition between parallel and sequential
+processing, and then back to parallel processing, for each day of the
+forecast. For this to work, each iteration of City_Loop must iterate in parallel
+with the others, since the Release_Threshold of the barriers needs to be known
+in advance. The calculation of the forecast for each city for each day is done
+in parallel, but presumably the weather calculations are such that one cannot
+proceed to the next day until all the weather forecasting is complete for the
+previous day for all the cities.
+
+I know in Vienna, we had taken a straw poll, which favored putting the keyword
+parallel before the word "for" in the for loop. While this is a useful data
+point, if we want to support "all parallel" loops, I find it makes a lot
+more sense to have "all parallel" together after the word "in" or "of".
+
+e.g.
+
+    for I in all parallel 1 .. 10 loop
+
+or
+
+    for Element of all parallel List loop
+
+rather than:
+
+    all parallel for I in 1 .. 10 loop
+
+or
+
+    parallel all for I in 1 .. 10 loop
+
+
+as it reads more like an English sentence, similar to reverse parallel loops:
+
+    for I in reverse parallel 1 .. 10 loop
+
+rather than:
+
+    parallel for I in reverse 1 .. 10 loop
+
+The other point, which was not made in Vienna, is that there is a case for
+saying that the syntax for specifying the iteration scheme should be grouped
+together in the same place in the syntax.
+
+Having parallel at the beginning of the loop statement, and reverse in the
+middle of the loop statement, means the reader has to look in two places to
+determine the iteration scheme.
+
+To recap some of my other arguments that were made in Vienna, which might
+resonate better with the above considerations:
+
+1. for ... loop  identifies the control construct which is the most important
+   part of the syntax. That the iterations occur in parallel, reverse, forward,
+   etc. is secondary, which suggests that parallel is not the most important part
+   of the syntax and therefore should not appear first.
+
+2. This point was not specifically made in Vienna, but the parallel block is
+   also a control structure, so in that case, it makes sense that parallel is
+   the first word.
+
+3. In Vienna, we decided that since parallel for loops with parallel at the
+   front of the syntax would have some ambiguity with parallel blocks, we felt
+   we had to add the "begin" keyword to the parallel blocks. I did not like that
+   we were being forced to alter our other syntax on account of the choice of
+   parallel keyword placement in for loops. In the examples I have used I have
+   found that any declarations typically need to have greater scope than the
+   parallel block, so it is not clear to me that having a begin declaration area
+   as part of the parallel block is particularly useful.
+
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Sunday, July  9, 2017  8:08 AM
+
+...
+> Both Reverse and Parallel are explicit.
+> A for loop with both the parallel keyword, and the reverse keyword.
+> Semantically, multiple chunks, each corresponding to a tasklet where
+> iteration within each chunk is from highest to lowest.
+>
+
+Given where we are placing the “parallel” word in the latest proposal discussed
+by the ARG (at the beginning), there seems no particular reason to disallow
+“reverse.”  But I don’t find your example very compelling:
+
+...
+> Consider finding the highest index of an array whose element
+> satisfies some property of interest. If chunking is involved, it makes
+> sense to want to search in reverse order, because once you've found a
+> candidate, you know that there is no need to search further within the
+> chunk, and processing can be abandoned. …
+
+>
+> Eg.
+>
+>   --  Note: We do not want exit to pro-actively terminate processing in
+>   --  other chunks here. We just want to exit processing for the
+>   --  current chunk if any of the exit conditions hold true.
+>   --  Updater is a protected object in this example.
+>
+>   --  Find maximum array index whose element satisfies some property
+>
+>   for I in reverse parallel Arr'Range loop
+>
+>      -- This first check causes other chunks to stop processing
+>      -- if a higher chunk has already found a result
+>      if I < Updater.Value then
+>         exit;
+>
+>      -- This second check abandons the current chunk if a
+>      -- result is found, since any other finds in the same
+>      -- chunk will have a lower index
+>      elsif Satisfies (Arr (I)) then
+>         Updater.Update (I);
+>         exit;
+>      end if;
+>   end loop;
+>
+>   Put_Line ("The answer is" & Integer'Image (Updater.Value));
+
+The semantics for exiting a loop early as currently proposed is that if any
+iteration invokes an “exit,” all other iterations are abandoned.  So you
+wouldn’t end up with the highest index overall.  You would end up with the one
+that is the highest within its chunk, but of which chunk would be somewhat
+random.  And along the way you have added a synchronized operation in each
+iteration (if I < Updater.Value) which seems highly likely to slow things down
+significantly.
+
+One place where "reverse parallel" definitely makes sense is in the proposed
+map/reduce construct, where the “parallel” is really about doing the “map” part
+in parallel, and then doing the “reduction” part in some kind of parallel tree,
+preserving order.  So “reverse” has a well-defined meaning.  I don’t see
+“reverse" having much meaning in the more conventional   parallel iteration
+schemes.  It just produces a bit of a bias within each chunk, but has no
+particularly well-defined overall semantic effect.
+
+...
+> - All Parallel
+> Both all and Parallel are explicit.
+> A for loop with the parallel keyword, and the "all" keyword.
+> Semantically, each iteration is a separate chunk, or alternatively
+> viewed as there not being any chunks. There is no order to the
+> iteration, any iteration can occur before any other.
+
+I would say using “all” to signify a universal quantified expression as well as
+this new idea is asking for confusion.
+
+...
+> The all parallel case is useful when the programmer wants to do manual
+> chunking. This is useful for instance in problems such as generating a
+> cumulative array sum of another array, as two loops are needed and
+> generally both loops should be chunked the same way.
+...
+> The synchronous barriers are used to transition between parallel and
+> sequential processing, and then back to parallel processing for
+> each day of the forecast. For this to work, each iteration in
+> City_Loop must iterate in parallel with each other since the
+> Release_Threshold of the barriers needs to be known in advance. The
+> calculation of each forecast for each city for each day is done in
+> parallel, but presumably, the weather calculations are such that one
+> cannot proceed to the next day until all the weather forecasting is
+> complete for the previous day for all the cities.
+
+I guess I understand the semantics, but it just feels too complicated to me.
+For something like this, two separate parallel loops, or an explicit array of
+tasks might be more appropriate.  Trying to synchronize across tasklets is going
+to be complex, and the semantics will be subject to a lot of caveats, I suspect.
+Tasklets can generally use run-to-completion semantics, and that clearly won’t
+work here with barriers in the middle.  If you want a barrier, you can end the
+parallel loop, and then start a new one, which creates a natural, highly
+visible, easy-to-understand barrier.
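+
+(A rough sketch of that alternative, reusing names from the weather example
+above -- Cities, Forecast, Max_Temp, and Today are assumed from that posting.
+The end of the first parallel loop is the implicit barrier; the output pass
+then runs sequentially before the next day begins.)
+
+    for Day in 1 .. 7 loop
+       parallel for City in Cities'Range loop
+          -- Parallel phase; the end of this loop is the barrier
+          Forecast (City) :=
+            Max_Temp (City, Today + Calendar.Arithmetic.Day_Count (Day));
+       end loop;
+
+       for City in Cities'Range loop
+          -- Sequential phase between the two parallel phases
+          Put (Cities'Image (City) & Float'Image (Forecast (City)) & ' ');
+       end loop;
+    end loop;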
+
+> I know in Vienna, we had taken a straw poll, which favored putting the
+> keyword parallel before the word "for" in the for loop. While this is
+> a useful data point, I think if we want to support "all parallel"
+> loops, I find it makes a lot more sense to have all parallel together
+> after the word "in" or "of”.  …
+
+Personally I would rather not re-open this decision.  We have already gone round
+and round on a number of these issues.  The usual rule is that only someone who
+voted *for* the chosen approach can re-open the discussion.
+
+****************************************************************
+
+Sent: Monday, July 10, 2017  2:24 PM
+
+>> … - Reverse Parallel
+>> Both Reverse and Parallel are explicit.
+>> A for loop with both the parallel keyword, and the reverse keyword.
+>> Semantically, multiple chunks, each corresponding to a tasklet where
+>> iteration within each chunk is from highest to lowest.
+>>
+>
+> Given where we are placing the “parallel” word in the latest proposal
+> discussed by the ARG (at the beginning), there seems no particular reason to
+> disallow “reverse.”  But I don’t find your example very compelling:
+
+I will try to see if I can come up with some more compelling examples.
+But I agree that reverse parallel loops are not likely to be used much.
+It might be that the best use case is finding the max occurrence of something,
+similar to the example I provided, which is not a common problem, but might be
+common enough to allow the syntax to cover that case.
+
+...
+> The semantics for exiting a loop early as currently proposed is that if any
+> iteration invokes an “exit,” all other iterations are abandoned.
+
+Not quite, the AI adds this note:
+
+"Note that aborting a tasklet need not be preemptive, but should prevent the
+initiation of further nested parallel blocks or parallel loops."
+
+So if the loop is running its chunks non-preemptively, then presumably the code
+would work as I suggested.
+
+I think there are two cases to consider.
+1) A high number of iterations with small processing per iteration
+      eg. (incrementing integer elements of a large array)
+2) A small number of iterations with large processing per iteration.
+      eg. (Calculating weather for a handful of cities)
+
+
+In case 1, the majority of time is being spent doing the iteration. I suspect
+that preemptive tasklet abortion in that case will introduce significant
+overhead and interfere too much with the parallelism. This is better suited for
+the non-preemptive case.
+
+In case 2, the actual iteration is insignificant, and there is much to be
+gained by preemptively aborting other tasklets.
+
+I can imagine we might want to give the programmer control of this decision via
+an aspect or pragma, such as Preemptive_Aborts but otherwise let the compiler
+decide how to deal with such transfer of control.
+
+> One place where "reverse parallel" definitely makes sense is in the proposed
+> map/reduce construct, where the “parallel” is really about doing the “map” part
+> in parallel, and then doing the “reduction” part in some kind of parallel tree,
+> preserving order.
+
+Agreed.
+
+...
+>> - All Parallel
+>> Both all and Parallel are explicit.
+>> A for loop with the parallel keyword, and the "all" keyword.
+>> Semantically, each iteration is a separate chunk, or alternatively
+>> viewed as there not being any chunks. There is no order to the
+>> iteration, any iteration can occur before any other.
+>
+> I would say using “all” to signify a universal quantified expression as well
+> as this new idea is asking for confusion.
+
+I don't think this is a real problem, just as it isn't for using "all"
+for dereferencing access objects, because the surrounding context is different.
+As in English, "all" typically describes what follows:
+
+for all X ...
+
+    -- meaning for all values of X
+
+for X in all parallel ....
+
+     -- meaning for all parallel iterations
+
+Also, quantified expressions and loop statements appear in very different
+contexts.
+
+> ...
+>> The synchronous barriers are used to transition between parallel and
+>> sequential processing, and then back to to parallel processing for
+>> each day of the forecast. For this to work, each iteration in
+>> City_Loop must iterate in parallel with each other since the
+>> Release_Threshold of the barriers needs to be known in advance. The
+>> calculation of each forecast for each city for each day is done in
+>> parallel, but presumably, the weather calculations are such that one
+>> cannot proceed to the next day until all the weather forecasting is
+>> complete for the previous day for all the cities.
+>
+> I guess I understand the semantics, but it just feels too complicated to
+> me.  For something like this, two separate parallel loops, or an explicit
+> array of tasks might be more appropriate.  Trying to synchronize across
+> tasklets is going to be complex, and the semantics will be subject to a lot
+> of caveats, I suspect.  Tasklets can generally use run-to-completion
+> semantics, and that clearly won’t work here with barriers in the middle.
+> If you want a barrier, you can end the parallel loop, and then start a new
+> one, which creates a natural, highly visible, easy-to-understand barrier.
+
+To do this as two separate loops, you need to store much more state. In this
+case, you'd need a two dimensional array and store all the values for all days
+of the forecast. Using barriers, you only need to track the current day
+forecast, so a single dimensional array works here.
+
+There are other similar problems that benefit from the use of barriers.
+
+One is the n-body problem, where the masses of heavenly bodies and direction of
+movement are tracked so that the orbital paths can be predicted. The technique
+is to use small increments of time and then update the location of all the
+heavenly bodies taking gravitational pulls into account, and the position of the
+heavenly bodies can then be output for each successive time increment.
+
+Using barriers, only the current location of each body needs to be tracked.
+Using your suggestion of two separate loops, one would have to store all the
+updated movements in memory before outputting the results. This could
+potentially and literally require an astronomical amount of storage.
+
+Using barriers really is the only reasonable solution for this, that I can see.
+
+Yet another example of this sort of problem comes from linear algebra: solving
+matrices using Gaussian elimination.  As each column is processed, barriers
+allow one to find the pivot point for the next column. I don't think I can
+imagine how to solve this by breaking it into separate loops, not that I have
+tried, but it seems like it would be very difficult. I think using barriers is
+the best solution for that problem when solving in parallel.
+
+I also think in the AI we already talk about calling protected objects from
+tasklets:
+
+" calls by different tasklets of the same task into the same protected object
+are treated as different calls resulting in distinct protected actions;
+therefore synchronization between tasklets can be performed using non-blocking
+protected operations. Note that this is consistent with the current standard
+which already supports multiple concurrent calls by a single task in the
+presence of the asynchronous transfer of control capability (RM 9.7.4 (23))."
+
+So I think the current AI design suggests that barriers should work as I
+describe in the example.
+
+>> I know in Vienna, we had taken a straw poll, which favored putting
+>> the keyword parallel before the word "for" in the for loop. While
+>> this is a useful data point, I think if we want to support "all
+>> parallel" loops, I find it makes a lot more sense to have all
+>> parallel together after the word "in" or "of”.  …
+>
+> Personally I would rather not re-open this decision.  We have already gone
+> round and round on a number of these issues.  The usual rule is that only
+> someone who voted *for* the chosen approach can re-open the discussion.
+
+I wouldn't ordinarily have suggested reexamining this issue, but I would hope
+that it should be OK for anyone to open a discussion if new ideas and arguments
+are brought to light that were not considered as inputs in the original
+decision.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Monday, July 10, 2017  2:43 PM
+
+>>> ...
+>> The semantics for exiting a loop early as currently proposed is that if any
+>> iteration invokes an “exit,” all other iterations are abandoned.
+>
+> Not quite, the AI adds this note:
+>
+> "Note that aborting a tasklet need not be preemptive, but should
+> prevent the initiation of further nested parallel blocks or parallel loops."
+>
+> So if the loop is running its chunks non-preemptively, then presumably
+> the code would work as I suggested.
+
+That is certainly not something to be relied upon.  And even if things run
+non-preemptively, there might be tasklets that have never been started, so
+certainly those would not start after an “exit.”
+
+> I think there are two cases to consider.
+> 1) A high number of iterations with small processing per iteration
+>     eg. (incrementing integer elements of a large array)
+> 2) A small number of iterations with large processing per iteration.
+>     eg. (Calculating weather for a handful of cities)
+>
+>
+> In case 1, the majority of time is being spent doing the iteration. I
+> suspect that preemptive tasklet abortion in that case will introduce
+> significant overhead and interfere too much with the parallelism. This
+> is better suited for the non-preemptive case.
+>
+> In case 2, the actual iteration is insignificant, and there is much to
+> be gained by preemptively aborting other tasklets.
+>
+> I can imagine we might want to give the programmer control of this
+> decision via an aspect or pragma, such as Preemptive_Aborts but
+> otherwise let the compiler decide how to deal with such transfer of
+> control.
+
+This is adding yet more complexity to the model.  And as pointed out above,
+if you have more tasklets than cores, some of them might not be started at
+all before the “exit.”
+>
+>>> ...
+>>>
+>> I guess I understand the semantics, but it just feels too complicated to
+>> me.  For something like this, two separate parallel loops, or an explicit
+>> array of tasks might be more appropriate.  Trying to synchronize across
+>> tasklets is going to be complex, and the semantics will be subject to a
+>> lot of caveats, I suspect.  Tasklets can generally use run-to-completion
+>> semantics, and that clearly won’t work here with barriers in the middle.
+>> If you want a barrier, you can end the parallel loop, and then start a new
+>> one, which creates a natural, highly visible, easy-to-understand barrier.
+>
+> To do this as two separate loops, you need to store much more state.
+
+You should factor in the amount of space needed to store the stacks of all of
+the temporarily suspended tasklets.  You are requiring that all but one tasklet
+be suspendible until they all reach the barrier.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Monday, July 10, 2017  3:16 PM
+
+Just a few words on this, because I don't think it pays to get bogged down in
+detail...
+
+...
+> So if the loop is running its chunks non-preemptively, then presumably
+> the code would work as I suggested.
+>
+> I think there are two cases to consider.
+> 1) A high number of iterations with small processing per iteration
+>       eg. (incrementing integer elements of a large array)
+> 2) A small number of iterations with large processing per iteration.
+>       eg. (Calculating weather for a handful of cities)
+
+This is the crux of the issue. There clearly are multiple termination approaches
+that make sense for a parallel loop, and those are mostly independent of the
+form of the loop itself. (And there's more than two!)
+
+   Preemptive vs. not preemptive
+   Order significant vs. not significant
+
+Specifically, one possibility is the proposed rule that an exit of any tasklet
+terminates the others. Another possibility is that the exit effectively works
+the same as a sequential loop -- meaning that an exit terminates any tasklets
+with indices above the exiting one, and the tasklets with indices below
+continue (this would work well for the search application among others). Since
+the latter is likely to have a higher overhead, one probably would want the
+option to select the termination mechanism.
+
+There are other details that might be worth controlling.
+
+My overall point is that it is unlikely that the details could be controlled
+solely with keywords; there are just too many. We need something like an aspect
+specification for these loops so that the details can be controlled without
+introducing loads of keywords (which will be hard to remember).
+
+And the default should be that semantics are as close as reasonable to
+sequential loops; many cases can use simpler termination but I think the default
+should be fewest surprises.
+
+...
+> >> I know in Vienna, we had taken a straw poll, which favored putting
+> >> the keyword parallel before the word "for" in the for loop. While
+> >> this is a useful data point, I think if we want to support "all
+> >> parallel" loops, I find it makes a lot more sense to have all
+> >> parallel together after the word "in" or "of".  .
+> >
+> > Personally I would rather not re-open this decision.  We
+> have already gone round and round on a number of these issues.  The
+> usual rule is that only someone who voted *for* the chosen approach
+> can re-open the discussion.
+>
+> I wouldn't ordinarily have suggested reexamining this issue, but I
+> would hope that it should be OK for anyone to open a discussion if new
+> ideas and arguments are brought to light that were not considered as
+> inputs in the original decision.
+
+There are always going to be "new ideas" for an AI like this one that is at the
+bleeding edge. Allowing unlimited reopening only ensures that nothing ever will
+be decided and no progress ever will be made. That's certainly the case for this
+AI in the last year or so. Keep in mind that to stay on schedule this thing has
+to be mostly frozen in little over a year. Going back to square one for every
+decision ensures that there is almost no chance of that happening. (The only
+reason IMHO that we didn't completely change parallel blocks at this last
+meeting is that we barely talked about them.)
+
+****************************************************************
+
+Sent: Monday, July 11, 2017  9:18 AM
+
+>>>> ...
+>>> The semantics for exiting a loop early as currently proposed is that if
+>>> any iteration invokes an “exit,” all other iterations are abandoned.
+>>
+>> Not quite, the AI adds this note:
+>>
+>> "Note that aborting a tasklet need not be preemptive, but should
+>> prevent the initiation of further nested parallel blocks or parallel loops."
+>>
+>> So if the loop is running its chunks non-preemptively, then
+>> presumably the code would work as I suggested.
+>
+> That is certainly not something to be relied upon.  And even if things run
+> non-preemptively, there might be tasklets that have never been started, so
+> certainly those would not start after an “exit.”
+
+I think if we did support some sort of non-preemptive mode, probably the most
+sensible behaviour for that mode is to just let the tasklets complete on their
+own, without any attempt to shut them down midstream. Any tasklets that hit the
+exit, or other transfer of control would terminate early, but otherwise tasklets
+generated at a POP would run to completion.
+
+...
+>> I can imagine we might want to give the programmer control of this
+>> decision via an aspect or pragma, such as Preemptive_Aborts but
+>> otherwise let the compiler decide how to deal with such transfer of
+>> control.
+>
+> This is adding yet more complexity to the model.  And as pointed out above, if
+> you have more tasklets than cores, some of them might not be started at all
+> before the "exit."
+
+I think there are a number of controls that the real time community might want,
+for example. I think all of these should be the subject of a different AI
+though, as they can be added later on their own merit, including an explicit
+non-preemptive mode for tasklet termination, which I think should probably allow
+starting after an "exit", (or continuing after an exit if tasklets get initially
+created to cover the full iterations of the for loop, which tends to be how I
+look at things.)
+
+...
+>>>>
+>>> I guess I understand the semantics, but it just feels too complicated to
+>>> me.  For something like this, two separate parallel loops, or an explicit
+>>> array of tasks might be more appropriate.  Trying to synchronize across
+>>> tasklets is going to be complex, and the semantics will be subject to a
+>>> lot of caveats, I suspect.  Tasklets can generally use run-to-completion
+>>> semantics, and that clearly won’t work here with barriers in the middle.
+>>> If you want a barrier, you can end the parallel loop, and then start a
+>>> new one, which creates a natural, highly visible, easy-to-understand
+>>> barrier.
+>>
+>> To do this as two separate loops, you need to store much more state.
+>
+> You should factor in the amount of space needed to store the stacks of all
+> of the temporarily suspended tasklets.  You are requiring that all but one
+> tasklet be suspendible until they all reach the barrier.
+
+Yes, but that storage is bounded, and the programmer can always do manual
+chunking if necessary to bring the number of chunks down to the number of
+available cores, or a small multiple of that. The simulation can run
+indefinitely in theory without having to worry about running out of storage.
+This tasklet stack storage becomes insignificant for the n-body problem,
+compared to the real bounds of N * state, where N is the number of bodies, and
+state is the amount of state variables for each body.
+
+With separate loops, however, the storage needed for the simulation would be
+unbounded, in the sense that the longer the simulation runs, the more storage
+is needed, and you need much more of it.
+
+The needs become N * state * time-increments, where time-increments are the
+number of simulation steps that are being computed. The notion of running a
+simulation for an unspecified amount of time becomes problematic, because of the
+risk of running out of storage.
+
+I think this barrier pattern could be commonly applied to a lot of
+simulation/modeling scenarios.
+
+****************************************************************
+
+Sent: Tuesday, July 11, 2017  9:45 AM
+
+> This is the crux of the issue. There clearly are multiple termination
+> approaches that make sense for a parallel loop, and those are mostly
+> independent of the form of the loop itself. (And there's more than
+> two!)
+>
+>     Preemptive vs. not preemptive
+>     Order significant vs. not significant
+
+In the gang of four, we had discussed the idea of declaring a local subtype for
+a loop iterator, and then applying aspects to that declaration to provide
+specific controls.
+
+eg.
+
+declare
+   subtype Loop_Iterator is Integer range 1 .. 100000
+     with Executors => 10, Chunk_Size => 100,
+          Load_Sensitivity, Preemptive_Termination => False;
+begin
+   parallel for I in Loop_Iterator'Range loop
+      ....
+   end loop;
+end;
+
+Adding these sort of specific controls should be a separate AI, and can be added
+later independent of this AI.
+
+We could say that the "all parallel" notion could be a similar aspect that is
+specified this way, but to me at least it seems like this would be an
+inconsistency, since all other iteration schemes are built right into the loop
+syntax. I think the "all parallel" scheme is useful enough that it ought to be
+visible at the same level as the parallel keyword, and the reverse keyword,
+regardless of whether these two keywords appear adjacent to each other, or
+separated as we have coming out of Vienna.
+
+The question is how best to do this in the syntax. It seemed to fit nicely in
+with the version of the AI going into Vienna. It doesn't seem to fit very well
+with the version of the AI we'll have coming out of Vienna, but maybe we'll
+either live with that, or come up with a better idea.
+
+...
+> Specifically, one possibility is the proposed rule that an exit of any
+> tasklet terminated the others. Another possibility is that the exit
+> effectively works the same as a sequential loop -- meaning that an
+> exit terminates any tasklets with indicies above the exiting one, and
+> the tasklets with indicies below continue (this would work well for
+> the search application among others). Since the latter is likely to
+> have a higher overhead, one probably would want the option to select
+> the termination mechanism.
+>
+> There's other details that might be worth controlling.
+>
+> My overall point is that it is unlikely that the details could be
+> controlled solely with keywords, there are just too many. We need
+> something like an aspect specification for these loops so that the
+> details can be controlled without introducing loads of keywords (which will be
+> hard to remember).
+
+I think the aspect specification like I described above would cover most of
+these, but "all parallel" feels like it should be more integrated with the loop
+syntax, both for consistency and because it affects semantic behaviour. I think
+there are enough problems that would need to rely on these semantics. Most other
+controls such as number of executors, or chunk size do not really affect the
+semantics. They just impact the speed of the processing.
+
+> And the default should be that semantics are as close as reasonable to
+> sequential loops; many cases can use simpler termination but I think
+> the default should be fewest surprises.
+
+Agreed.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Tuesday, July 11, 2017  9:55 PM
+
+>>> So if the loop is running its chunks non-preemptively, then
+>>> presumably the code would work as I suggested.
+>>
+>> That is certainly not something to be relied upon.  And even if things run
+>> non-preemptively, there might be tasklets that have never been started, so
+>> certainly those would not start after an “exit.”
+>
+> I think if we did support some sort of non-preemptive mode, probably the most
+> sensible behaviour for that mode is to just let the tasklets complete on their
+> own, without any attempt to shut them down midstream. Any tasklets that hit
+> the exit, or other transfer of control would terminate early, but otherwise
+> tasklets generated at a POP would run to completion.  ...
+
+I believe this is creating two significantly different semantics for a single
+syntactic construct, which is bad news in my view.  You are heading toward
+“programming by pragma” (or "by aspect"), I fear.  That is largely what OpenMP
+does, and I believe we should avoid it.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, July 19, 2017  8:36 PM
+
+...
+> > And the default should be that semantics are as close as reasonable
+> > to sequential loops; many cases can use simpler termination but I
+> > think the default should be fewest surprises.
+>
+> Agreed.
+
+(I think this is the only thing that I wrote that Brad agreed with, but don't
+look behind that curtain! :-)
+
+I had the thought that any loop exits have to be explicitly in the loop body
+(they can't be in some subprogram that is called). Thus, we could make the
+termination semantics depend upon that. (The same applies to AI12-0189-1, the
+loop body as anonymous procedure, BTW).
+
+The basic idea is that one could depend on the loop index value determined at
+the point of an exit; if the loop finishes normally or exits abnormally, you
+couldn't have such a dependence. Thus if there is no exit, the termination is a
+bit simpler (not much, really); if there is, we require the loop to get the same
+loop index as the similar sequential loop.
+
+You could use an exception for a non-local exit, but that is so ugly I think it
+would be OK to ignore it for the purposes of termination semantics.
+
+Thus a loop like the following would work in parallel:
+
+             parallel for I in 1 .. Limit loop
+                if DB(I).Name = Name_to_Find then
+                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
+                   exit; -- Done looking.
+                -- else continue
+                end if;
+             end loop;
+
+P.S. Unrelated aside: I have to wonder about the wisdom of considering atomic
+objects task-safe, since Ada doesn't have any form of atomic update. (The only
+safe atomic writing operation is initialization.) Almost any use of an atomic
+object would have a race condition. I originally wrote the above My_PO use as:
+          if Result > I then
+              Result := I;
+          end if;
+with Result an atomic object initialized to Integer'Last (something I do
+sequentially all the time) -- but the above is a race condition and definitely
+is NOT task-safe. The same would be true for any other test of an atomic object
+followed by a write of that object -- but that is the natural sequential code.
+
+It seems better (given the current facilities) to insist on a PO here (since the
+natural code in a PO would do the right thing).
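+
+For concreteness, the "natural code in a PO" might look like the following
+sketch (the names My_PO and Save_Found_Index match the example above; the rest
+is illustrative):
+
+   protected My_PO is
+      procedure Save_Found_Index (I : Integer);
+      function Found_Index return Integer;
+   private
+      Result : Integer := Integer'Last;
+   end My_PO;
+
+   protected body My_PO is
+      procedure Save_Found_Index (I : Integer) is
+      begin
+         if I < Result then -- test and update both happen under the PO lock,
+            Result := I;    -- so there is no race between tasklets
+         end if;
+      end Save_Found_Index;
+
+      function Found_Index return Integer is
+      begin
+         return Result;
+      end Found_Index;
+   end My_PO;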
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Wednesday, July 19, 2017  9:06 PM
+
+...
+> >>> So if the loop is running its chunks non-preemptively, then
+> >>> presumably the code would work as I suggested.
+> >> That is certainly not something to be relied upon.  And
+> even if things run non-preemptively, there might be tasklets that have
+> never been started, so certainly those would not start after an
+> "exit."
+> >
+> > I think if we did support some sort of non-preemptive mode,
+> probably the most sensible behaviour for that mode is to just let the
+> tasklets complete on their own, without any attempt to shut them down
+> midstream. Any tasklets that hit the exit, or other transfer of
+> control would terminate early, but otherwise tasklets generated at a
+> POP would run to completion.  ...
+>
+> I believe this is creating two significantly different semantics for a
+> single syntactic construct, which is bad news in my view.  You are
+> heading toward "programming by pragma"
+> (or "by aspect"), I fear.  That is largely what OpenMP does, and I
+> believe we should avoid it.
+
+This is a real concern, but I have to wonder if we really have any choice.
+
+If our primary goal is (as I hope it is) to make it as easy as possible to
+convert sequential code into parallel code, the semantics have to be as close as
+possible to the equivalent sequential loop. That means that there can't be any
+erroneous execution from aborts, the loop index has to be meaningful at a normal
+exit, and so on. That's going to be more expensive than the straight
+abort-everything-on-any-sort-of-completion (I'm skeptical that it would be much
+more expensive in usual cases, but I also know that I know just enough to be
+dangerous in this area). So people who need absolute performance may need some
+other semantics (including abort, for instance).
+
+I suspect that Brad's non-preemptive semantics make the most sense for the
+default semantics for parallel loops (probably not for parallel blocks; the
+problems are very different). That means that anyone that wants preemptive
+semantics is going to have to use a pragma or aspect. But I don't think that is
+a huge problem semantically; the difference only matters if the loop has a bug
+already, or depends on the loop index at exit (in which case the preemptive
+semantics is plain wrong and we could even make it illegal to use it in that
+case as noted in my other message).
+
+Anyway, back to real work...
+
+****************************************************************
+
+Sent: Thursday, July 20, 2017  12:16 AM
+
+> ...
+>> > And the default should be that semantics are as close as reasonable
+>> > to sequential loops; many cases can use simpler termination but I
+>> > think the default should be fewest surprises.
+>>
+>> Agreed.
+>
+> (I think this is the only thing that I wrote that Brad agreed with,
+> but don't look behind that curtain! :-)
+
+I disagree. :-)
+I'd have to check, but my overall sense was that I was agreeing with more of
+what you said.
+
+> I had the thought that any loop exits have to be explicitly in the
+> loop body (they can't be in some subprogram that is called). Thus, we
+> could make the termination semantics depend upon that. (The same
+> applies to AI12-0189-1, the loop body as anonymous procedure, BTW).
+>
+> The basic idea is that one could depend on the loop index value
+> determined at the point of an exit; if the loop finishes normally or
+> exits abnormally, you couldn't have such a dependence. Thus if there
+> is no exit, the termination is a bit simpler (not much, really); if
+> there is, we require the loop to get the same loop index as the similar sequential loop.
+>
+> You could use an exception for a non-local exit, but that is so ugly I
+> think it would be OK to ignore it for the purposes of termination semantics.
+>
+> Thus a loop like the following would work in parallel:
+>
+>             parallel for I in 1 .. Limit loop
+>                if DB(I).Name = Name_to_Find then
+>                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
+>                   exit; -- Done looking.
+>                -- else continue
+>                end if;
+>             end loop;
+
+Maybe I am not fully understanding this idea, or maybe these semantics are the
+ones I had in mind all along, and I have not considered the alternate
+semantics that you might otherwise have assumed?
+
+> P.S. Unrelated aside: I have to wonder about the wisdom of considering
+> atomic objects task-safe, since Ada doesn't have any form of atomic update.
+> (The only safe atomic writing operation is initialization.) Almost any
+> use of an atomic object would have a race condition. I originally
+> wrote the above My_PO use as:
+>          if Result > I then
+>              Result := I;
+>          end if;
+> with Result an atomic object initialized to Integer'Last (something I
+> do sequentially all the time) -- but the above is a race condition and
+> definitely is NOT task-safe. The same would be true for any other test
+> of an atomic object followed by a write of that object -- but that is
+> the natural sequential code.
+>
+> It seems better (given the current facilities) to insist on a PO here
+> (since the natural code in a PO would do the right thing).
+
+Actually, if I were to write this, I was thinking that a PO would be used to
+update a global volatile/atomic variable, but otherwise the global variable
+could be read/tested directly.
+
+Something like;
+
+    parallel for I in 1 .. Limit loop
+       if DB(I).Name = Name_to_Find and then I < Result then
+          My_PO.Save_Found_Index (I); -- Save the lowest index found in Result.
+          exit; -- Done looking.
+    -- else continue
+       end if;
+    end loop;
+
+The extra check "and then I < Result" would potentially provide an
+improvement/short cut on performance, assuming it is cheaper to test an
+atomic/volatile integer, than to always have to access the PO.
+
+On a further aside, I recognise that one would ordinarily consider atomic as the
+safe choice for such a variable declaration, but I found that declaring the
+global variable to be volatile rather than atomic provided significant
+performance improvements. I'm not sure if that is just an artefact of the
+combination of compilers and target platforms I was using, or Ada semantics
+somehow demands this, but if the type of the object is such that it can be read
+atomically without putting memory fences or guards around the access, eg. such
+as via a single machine instruction, then I found I'd get consistently correct
+results, with improved performance, over declaring the global as atomic. Ada
+doesn't distinguish, but I wonder if there are certain data types that when
+declared to be volatile, are also atomic. We already know that variables
+declared to be atomic are also volatile in Ada. Or maybe there are certain volatile types
+that have some useful properties of atomic, but are somehow otherwise
+lighter-weight than full atomic?
+
+****************************************************************
+
+Sent: Thursday, July 20, 2017  12:37 AM
+
+> I suspect that Brad's non-preemptive semantics make the most sense for
+> the default semantics for parallel loops (probably not for parallel
+> blocks; the problems are very different). That means that anyone that
+> wants preemptive semantics is going to have to use a pragma or aspect.
+
+Not necessarily a pragma or aspect. See below.
+
+> But I don't think that
+> is a huge problem semantically; the difference only matters if the
+> loop has a bug already, or depends on the loop index at exit (in which
+> case the preemptive semantics is plain wrong and we could even make it
+> illegal to use it in that case as noted in my other message).
+
+I think the non-preemptive semantics probably also are less likely to introduce
+erroneous or unexpected results more generally. But that is just my overall
+sense.
+
+I agree here that the non-preemptive behaviour might be the best
+default.
+
+Also, I am not advocating a programming by pragma view. The goal should be that
+for most cases (> 90-99%), no pragmas would be necessary.
+
+Note, we currently have other cases in Ada where two very different semantics
+can be invoked with respect to preemptive and non-preemptive behaviour, and I
+think we should try to be consistent with that already existing syntax if
+possible.
+
+In particular, we have;
+   requeue
+and
+   requeue with abort
+
+I am now thinking that we could use a "with abort" on a parallel loop, to get
+the preemptive semantics, but otherwise the semantics would be non-preemptive.
+
+eg.
+
+parallel with abort
+  for I in 1 .. limit loop
+     Long_Calculation (I);
+  end loop;
+
+Here, for example, an exception raised in Long_Calculation would be expected to
+preempt other parallel iterations involving the same call.
+
+This might also be a good way to tie in the "all" for the "all parallel" loops
+that I was describing in recent emails, with the current parallel keyword
+placement. A "with all" would imply that any iteration can be expected to
+potentially execute in parallel with any other iteration (i.e., no chunking).
+
+parallel with all
+   for I in 1 .. limit loop
+     Long_Calculation (I);
+   end loop;
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, July 20, 2017  1:32 AM
+
+...
+> > Thus a loop like the following would work in parallel:
+> >
+> >             parallel for I in 1 .. Limit loop
+> >                if DB(I).Name = Name_to_Find then
+> >                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
+> >                   exit; -- Done looking.
+> >                -- else continue
+> >                end if;
+> >             end loop;
+>
+> Maybe I am not fully understanding this idea, or maybe these semantics
+> are the ones I had in my mind all the time, and have not considered
+> the alternate semantics that you might otherwise have assumed?
+
+I think you and I agree on this point, but Tucker seems to be using strict
+parallel block semantics for this. And that wouldn't work for this loop, since
+at the first exit, all of the other tasklets would be aborted (by the parallel
+block semantics). So it would be non-deterministic which exit was ultimately
+executed (assuming there is more than one).
+
+It's pretty clear to me that "exit" is not like an exception, and should be
+treated differently. (Not as certain about gotos, since ones that go clear out of
+the loop may or may not be like an exit.) A block doesn't have anything like exit
+(although it could be emulated with a goto).
+
+...
+> Actually, if I were to write this, I was thinking that a PO would be
+> used to update a global volatile/atomic variable, but otherwise the
+> global variable could be read/tested directly.
+>
+> Something like;
+>
+>     parallel for I in 1 .. Limit loop
+>        if DB(I).Name = Name_to_Find and then I < Result then
+>           My_PO.Save_Found_Index (I); -- Save the lowest index found in Result.
+>           exit; -- Done looking.
+>        -- else continue
+>        end if;
+>     end loop;
+>
+> The extra check "and then I < Result" would potentially provide an
+> improvement/short cut on performance, assuming it is cheaper to test
+> an atomic/volatile integer, than to always have to access the PO.
+
+A PO is supposed to be a monitor, and it's rather abstraction-breaking to put the
+monitored object outside of it. (Yes, you sometimes have to do that, but I don't
+recommend it if you don't have to do so.)
+
+> On a further aside, I recognise that one would ordinarily consider
+> atomic as the safe choice for such a variable declaration, but I found
+> that declaring the global variable to be volatile rather than atomic
+> provided significant performance improvements. I'm not sure if that is
+> just an artefact of the combination of compilers and target platforms
+> I was using, or Ada semantics somehow demands this, but if the type of
+> the object is such that it can be read atomically without putting
+> memory fences or guards around the access, eg. such as via a single
+> machine instruction, then I found I'd get consistently correct
+> results, with improved performance, over declaring the global as
+> atomic. Ada doesn't distinguish, but I wonder if there are certain
+> data types that when declared to be volatile, are also atomic. We
+> already know that variables declared to be atomic are also volatile in
+> Ada. Or maybe there are certain volatile types that have some useful
+> properties of atomic, but are somehow otherwise lighter-weight than
+> full atomic?
+
+I think you just got lucky. My understanding is that fences are needed to ensure
+memory consistency across cores. So you can only see problems if you have two
+tasks running simultaneously on two different cores -- and the OS might try to
+avoid this particular situation. I don't think there are any memory reads or
+writes that work in that situation without a fence; besides, ALL atomic objects
+use a single instruction on most implementations.
+
+Volatile is definitely not enough for any data that needs to be synchronized.
+Only atomic or a PO synchronizes data (you can use one of those to synchronize
+volatile data). And I don't think experimental results mean much here, because
+some of these situations are pretty rare -- but Murphy would say that they'll
+only strike at the worst time.
+
+And Atomic isn't really enough, either, since real problems always require some
+sort of update.
+
+****************************************************************
+
+From: Jeffrey Cousins
+Sent: Thursday, July 20, 2017  11:44 AM
+
+>I know in Vienna, we had taken a straw poll, which favored putting the
+>keyword parallel before the word "for" in the for loop. While this is a
+>useful data point, I think if we want to support "all parallel" loops, I
+>find it makes a lot more sense to have all parallel together after the word
+>"in" or "of".
+> eg
+>    for I in all parallel 1 .. 10 loop
+>
+> or
+>
+>    for Element of all parallel List loop
+
+
+Much as I sympathise with the logic of Brad’s argument, I think we should stick
+with what was agreed at Vienna for the time being, whilst we sort out what
+behaviour we want.
+
+Terminating loops early seems to be a particular sticking point.
+
+Only yesterday we had a case where one of our apps has M tasks checksumming N
+files, where N >> M.  Currently when the abort button is clicked, no new files
+are checksummed, but checksumming the files already in progress is allowed to
+complete.  (Non-pre-emptive).  The user would like the current checksumming to
+be aborted immediately (pre-emptive; maybe he wants to checksum a different set
+of files a.s.a.p.).
+
+But I think it equally likely that there will be a case where finishing the
+current processing cleanly is the imperative.
+
+So I think we need to provide both options (somehow).
+
+
+>Note, we currently have other cases in Ada where two very different
+>semantics can be invoked with respect to preemptive and non-preemptive behaviour,
+>and I think we should try to be consistent with those already existing syntax
+>if possible.
+>
+>In particular, we have;
+>   requeue
+>and
+>   requeue with abort
+>
+>I am now thinking that we could use a "with abort" on a parallel loop, to get
+>the preemptive semantics, but otherwise the semantics would be non-preemptive.
+>
+>eg.
+>
+>parallel with abort
+>  for I in 1 .. limit loop
+>     Long_Calculation (I);
+>  end loop;
+
+No doubt someone will point out holes in it, but I quite like this suggestion.
+
+****************************************************************
+
+From: Erhard Ploedereder
+Sent: Thursday, July 20, 2017  4:31 PM
+
+> If our primary goal is (as I hope it is) to make it as easy as
+> possible to convert sequential code into parallel code, the semantics
+> have to be as close as possible to the equivalent sequential loop.
+
+No. Particularly, if the syntax distinguishes the "new" semantics (or else, why
+introduce new language features?)
+
+
+Canonical example:
+> That means that .... the loop index has to be meaningful at a normal exit, and so on.
+
+>             parallel for I in 1 .. Limit loop
+>                 if DB(I).Name = Name_to_Find then
+>                    My_PO.Save_Found_Index (I); -- Save the lowest index found.
+>                    exit; -- Done looking.
+>                 -- else continue
+>                 end if;
+>              end loop;
+
+NO, again, to the comment in the example. Consider an array with duplicates.
+Canonical semantics gives you the lowest index indeed. If you want that, do not
+write "parallel"!
+
+If you do write "parallel":
+A parallel loop surely should be allowed to give you some found index but
+whether it is the lowest is completely non-deterministic (I hope).
+
+(Execution-wise, I am open to both models of "last one wins" or "exit terminates
+the parallel execution".)
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, July 20, 2017  6:26 PM
+
+> > If our primary goal is (as I hope it is) to make it as easy as
+> > possible to convert sequential code into parallel code, the
+> > semantics have to be as close as possible to the equivalent sequential loop.
+>
+> No. Particularly, if the syntax distinguishes the "new"
+> semantics (or else, why introduce new language features?)
+
+It's much easier to convert from one to the other if the semantics doesn't
+change much, if at all. Indeed, I'm not sure there is any point to having a
+parallel loop construct at all if this isn't true, because the existing Ada
+facilities are just fine (as demonstrated by Brad's libraries) for people who
+are already comfortable with writing parallel code. It's the majority of
+programmers who don't understand concepts like race conditions that need help.
+
+> Canonical example:
+> > That means that .... the loop index has to be meaningful at a normal
+> > exit, and so on.
+>
+> >             parallel for I in 1 .. Limit loop
+> >                 if DB(I).Name = Name_to_Find then
+> >                    My_PO.Save_Found_Index (I); -- Save the lowest index found.
+> >                    exit; -- Done looking.
+> >                 -- else continue
+> >                 end if;
+> >              end loop;
+>
+> NO, again, to the comment in the example. Consider an array with
+> duplicates. Canonical semantics gives you the lowest index indeed. If
+> you want that, do not write "parallel"!
+
+So you think it is OK to make it impossible to write a significant proportion of
+real problems in parallel? You're essentially saying that this search has to run
+4 (or 8 or whatever) times slower, or the programmer has to create their own
+
+> If you do write "parallel":
+> A parallel loop surely should be allowed to give you some found index
+> but whether it is the lowest is completely non-deterministic (I hope).
+>
+> (Execution-wise, I am open to both models of "last one wins"
+> or "exit terminates the parallel execution".)
+
+I don't think that the cost of the model I'm suggesting is significantly
+different than either of these. (I again note that I know just enough about such
+things to be dangerous, so there might be some magic solution that I cannot
+imagine.) For each of the models, you have between 2 and 4 possible outcomes
+when a tasklet finishes a chunk (normal completion, exit, goto, exception; some
+of these might be treated the same in some models, but surely the first has to
+be treated differently in all models).
+
+So the only difference between a model that is index-aware and one that is not
+is that the indexes have to be known when the chunk finishes. They're obviously
+known when the chunk starts, so in the worst case, the starting index has to be
+saved with the other tasklet information. That's hardly a hardship, and the loop
+termination code is essentially the same in any of these models.
+
+Ergo, I don't see much reason to make the semantics non-deterministic.
+Non-determinism is evil; sometimes you can't avoid it, but it should never be
+employed if there is an option. And in this case, there is an option.
+
+****************************************************************
+
+Sent: Friday, July 21, 2017 12:30 AM
+
+> I suspect that Brad's non-preemptive semantics make the most sense for
+> the default semantics for parallel loops (probably not for parallel
+> blocks; the problems are very different).
+
+A parallel block is effectively just a parallel loop with a small number of
+iterations that has been unrolled. One should be able to convert source code
+from either construct to the other (at least for simpler loops with a small
+number of iterations).
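+
+For example (using the working syntax from this thread), a two-iteration
+parallel loop such as
+
+   parallel for I in 1 .. 2 loop
+      Process (I);
+   end loop;
+
+corresponds to the unrolled parallel block
+
+   parallel
+      Process (1);
+   and
+      Process (2);
+   end parallel;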
+
+If anything, I see an even greater need for non-preemptive semantics for
+parallel blocks.
+
+A parallel loop is typically applying similar processing iteratively. A parallel
+block can also be that (for example, calculating Fibonacci recursively), but it
+can also be suitable for completely independent processing. In one branch you
+might want to exit the processing of that branch early, but (I think more
+typically) without affecting the processing in the other branches.
+
+eg.
+
+Note: I am not using the exact syntax that we arrived at in Vienna, only because
+I cannot remember exactly what it was, until I see the minutes.
+
+parallel
+
+   exit when Do_A_Step1 = Failure;
+   exit when Do_A_Step2 = Failure;
+   exit when Do_A_Step3 = Failure;
+   exit when Do_A_Step4 = Failure;
+   Result := Do_A_Step5;
+
+and
+   Do_B;
+end parallel;
+
+-- Activity B has completed here, regardless of whether Activity A finished
+-- successfully.
+
+If one wants the processing for activity A to kill the processing for activity
+B, then I think the programmer ought to explicitly ask for those semantics. The
+default should be the more deterministic semantics.
+
+Again, the "with abort" clause seems like a consistent and natural way to do
+this.
+
+parallel with abort
+
+   exit when Do_A_Step1 = Failure;
+   exit when Do_A_Step2 = Failure;
+   exit when Do_A_Step3 = Failure;
+   exit when Do_A_Step4 = Failure;
+   Result := Do_A_Step5;
+
+and
+   Do_B;
+end parallel;
+
+-- A failure in activity A, may have caused activity B to terminate early.
+
+****************************************************************
+
+Sent: Thursday, July 20, 2017  12:40 AM
+
+>> I know in Vienna, we had taken a straw poll, which favored putting
+>> the keyword parallel before the word "for" in the for loop. While
+>> this is a useful data point, I think if we want to support "all
+>> parallel" loops, I find it makes a lot more sense to have all
+>> parallel together after the word "in" or "of".
+>>
+>> eg
+>>    for I in all parallel 1 .. 10 loop
+>>
+>> or
+>>
+>>    for Element of all parallel List loop
+>
+> Much as I sympathise with the logic of Brad’s argument, I think we
+> should stick with what was agreed at Vienna for the time being, whilst
+> we sort out what behaviour we want.
+> Terminating loops early seems to be a particular sticking point.
+
+I have actually changed my view here. I am now in favour of having the parallel
+keyword for loops at the beginning, as we discussed in Vienna. The reason is
+that I find that if we do want to specify controls on the parallelism, such as
+"with abort", or "with all", or anything else we come up with, applying these to
+the keyword parallel is the best way I've seen so far to do this, in my opinion.
+If "parallel" is located in the middle of the loop syntax, I don't think this
+would work (too ugly and messy), but when parallel is at the beginning, this
+provides a "header" that describes the parallelism without interfering with the
+loop logic.
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, July 20, 2017  1:22 PM
+
+> If anything, I see an even greater need for non-preemptive semantics
+> for parallel blocks.
+>
+> A parallel loop is typically applying similar processing iteratively.
+> A parallel block can also be that, (for example calculating Fibonacci
+> recursively), but it can also be suitable for completely independent
+> processing. In one branch you might want to exit the processing of
+> that branch early, but (I think more typically) without affecting the
+> processing in the other branches.
+
+Well, we don't have "exit" for blocks (it would be very incompatible); I suppose
+we could have it only for parallel blocks but that seems like asking for trouble
+with the forces of consistency. One could use a goto for that effect, but that
+seems less likely than an exit from a loop.
+
+So, in the block case I was thinking mainly about exceptions; and exceptions
+usually (not always) represent bugs. I don't think it makes much sense to
+continue buggy code -- indeed I thought that it made sense to use some sort of
+"terminate as soon as practical" semantics for exceptions in all cases. It's
+only exit (and similar gotos) that it makes sense to treat non-preemptively.
+
+(And I was thinking that "non-preemptive" simply meant that the tasklet ran to
+the next task dispatching point, not that it necessarily ran to completion. A
+queued and blocked tasklet might never finish otherwise, especially in the
+exception case.)
+
+...
+> Again, the "with abort" clause seems like a consistent and natural way
+> to do this.
+>
+> parallel with abort
+>
+
+Yes, that syntax seems OK. Better than some of your other recent ideas. ;-)
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Tuesday, July 25, 2017  4:36 PM
+
+I am quite strongly against having multiple interpretations for "exit" from a
+parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
+have some convincing to do.  The interpretation I recommend is that an “exit” or
+“return” (or “raise”) within a parallel loop causes all other iterations to be
+terminated as soon as possible.  For a “return,” there is a “race” to see which
+iteration reaches a return first, and that is resolved using some kind of
+synchronization, and then only the “return” that “wins” is allowed to proceed to
+evaluate its return expression (presuming it is returning from a function).
+Similarly, if an iteration would propagate an exception, then again, a
+synchronization is performed against other operations that can terminate the
+loop, and the winner gets to determine the net effect, be it a simple exit, an
+exception, a return, or (if we permit it) a goto.  Essentially any operation
+that would normally terminate a loop goes through a “two-phase” commit, where
+the first phase requests the “right” to decide the fate of the loop, and then
+the winner determines what happens next.
+
+I have not seen any convincing (to me!) argument to allow unstarted iterations
+to continue processing, as this seems to defeat one of the whole points of a
+parallel “divide-and-conquer” approach.  I can see some arguments to not
+terminating iterations that are already executing, if there is danger of them
+leaving some object visible outside the iteration in an abnormal state.  In
+ParaSail, if a loop includes an operation that could result in an early exit, it
+cannot manipulate a variable that is declared outside the loop unless it is
+protected.  This doesn’t quite work in Ada since any parallel loop could have an
+iteration die due to the failure of a run-time check.  But to some extent that
+is already a problem because we recognize that the failure of a run-time check
+can always leave certain variables in an abnormal state.  For a parallel loop, a
+run-time check failure could leave any variable manipulated by any iteration in
+an abnormal state.  But if you look at 11.6(6/3), even a “regular” loop that
+dies due to a failed run-time check can have the same effect, since the Ada RM
+has always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9 —
+aka RM 9(11)) and instruction re-ordering.  So we can effectively ignore failed
+run-time checks, and only worry about explicit “raise” as relevant to this
+discussion.  So what it effectively means is that the action of propagating a
+“raise” out of an iteration is just like a “return” or an “exit” in terms of
+having to “compete" to decide which terminating construct gets to determine the
+“fate” of the parallel loop.
+
+The bottom line to me is that we should terminate parallel loops as soon as
+possible, be it from an exit, return, raise (or goto or requeue for that
+matter), using a two-phase approach that first determines which one will
+actually decide the fate of the loop as a whole, and then a second step that
+actually does the deed.
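+
+One way to picture the two-phase arbitration (purely illustrative, in terms of
+a protected object) is that each terminating construct races to claim the fate
+of the loop, and only the winner performs its transfer of control:
+
+   protected Loop_Fate is
+      procedure Try_Claim (Won : out Boolean);
+   private
+      Decided : Boolean := False;
+   end Loop_Fate;
+
+   protected body Loop_Fate is
+      procedure Try_Claim (Won : out Boolean) is
+      begin
+         Won := not Decided; -- the first claimant wins
+         Decided := True;
+      end Try_Claim;
+   end Loop_Fate;
+
+An iteration reaching an exit, return, or raise would first call Try_Claim;
+the winner signals the other tasklets to terminate and then performs its
+transfer of control, while any loser simply stops.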
+
+****************************************************************
+
+From: Tullio Vardanega
+Sent: Wednesday, July 26, 2017  7:43 AM
+
+I think I understand the semantics as Tuck proposes and I find it to make sense
+in the specific context of parallel computation. To me, wanting exactly the same
+source code to operate identically within and without a parallel construct
+requires writing it to that end, which means understanding the implications of
+the most demanding side. In a sense, this is no different (only more complex)
+than it was for concurrency: the code that you embed in concurrent threads must
+be amenable to concurrent semantics else problems occur. I suppose that what
+matters most here is that the programmer has the means (of syntax and
+semantics) to ensure that the externally visible effect of a parallel construct
+"stabilizes" before it can be used outside of it. To my reading, Tuck's
+"two-phase commit" notion does that quite fine, including the
+"as-soon-as-possible" part of it.
+
+****************************************************************
+
+Sent: Thursday, July 27, 2017  12:26 PM
+
+> I am quite strongly against having multiple interpretations for "exit" from a
+> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
+> have some convincing to do.
+
+Unfortunately, I am not yet convinced. I think if we want to have only one
+interpretation for Exit, then that interpretation should be non-preemptive.
+Let's see if I can convince you.
+
+First, however, I have some findings to report that should be of interest.
+
+I have added preemptive termination to most of my Paraffin libraries (except the
+Ravenscar ones) so that if any exception occurs, it preemptively terminates
+
+I found that adding this (using asynchronous transfer of control) did not add
+any significant performance hit to any of the libraries with GNAT at least. This
+applies even to problems such as adding the sum of a large range of integers,
+which I had mentioned previously might be a case where preemptive abort might
+
+So, for exceptions, I think it makes sense to have preemptive termination,
+generally. However, I think exit, return, and goto should be handled
+differently.
+
+Here are some arguments:
+
+1. I think it is quite a common problem to be searching for the first or last
+   occurrence of something in an array or container. For example, consider an
+   array or container of events stored in chronological order.
+
+I should think answering questions such as the following would be common:
+   When is the first time that condition X occurred? or When was the last time
+   that condition Y occurred?
+
+I find it quite objectionable if we have to say, sorry, but you cannot do such
+searches efficiently in native Ada using parallelism. You should either use
+libraries such as Brad Moore's Paraffin libraries, or call out to another
+language such as C or C++ and write that code in OpenMP.
+
+If Exit is preemptive, then we cannot continue searching in other tasklets to
+find better or correct answers. We effectively have to not use Exit at all, and
+search the full range of iterations, which can be a waste of time if we already
+know the answer.
+
+We can use exceptions as the mechanism for preemptive abort. If a programmer
+wants preemptive termination, they could raise their own exception in the loop,
+and write an exception handler at the appropriate level to handle that. We could
+even define a language defined exception, such as
+Parallelism_Termination_Exception, which could be defined as an exception that
+is implicitly handled prior to the completion of the loop, and can be explicitly
+raised by the user if preemptive termination is desired.
+
+We still need a mechanism for non-preemptive termination to solve problems such
+as the above, however, and exit, return, and gotos could provide that mechanism.
+
+2. Having Exit and Return treated effectively as an exception, causing
+   preemptive termination, is very different from sequential Ada, and likely to
+   introduce a lot of unexpected errors into users' programs. For instance, I
+   found that adding the preemptive termination to Paraffin introduced at least
+   one bug where one task might be waiting on an entry while the aborted code
+   that would have ultimately serviced that entry was terminated, leaving the
+   task waiting on the entry indefinitely. Another example could be where a loop
+   iteration creates dependent tasks, and aborting the loop iteration might
+   shortcut interactions with those dependent tasks, leaving those tasks in a
+   state where they cannot terminate, and the aborted iteration left in an
+   abort-deferred state.
+
+   I think when the user's code explicitly uses a select then abort statement,
+   the user is fully aware of the asynchronous transfer of control, and can take
+   care to ensure that the preemptive behaviour makes sense. I think it would be
+   undesirable however to have the compiler implicitly introducing such
+   asynchronous transfers of control, where the user is likely in many cases to
+   not even be aware of this, which could lead to difficult to reproduce
+   failures in deployed code.
+
+3. I think we should also be supporting parallelism in Ravenscar compliant code.
+   I have Paraffin libraries that do this, which shows that it can be done.
+   However, in Ravenscar, I believe all termination needs to be non-preemptive,
+   including exceptions, since Ravenscar has the restriction
+   No_Select_Statements.
+
+It seems pretty clear that preemptive aborting behaviour is not supported in a
+simpler Ravenscar runtime. I think this is OK, as you still get the benefits of
+divide and conquer with non-preemptive termination, just maybe not as responsive
+for early exits, but that is traded off with better deterministic behaviour
+which is generally desired in Ravenscar. The main point here is that if we have
+Exit, goto, and return as non-preemptive, then we have consistent, single
+semantic interpretations for these constructs for all of Ada. If we make these
+preemptive, then we are actually forcing two interpretations if we want these to
+work in Ravenscar also.
+
+Another suggestion: if we really do want preemptive exit, return, or goto, then
+perhaps we need a different flavour of syntax for those.
+
+I had previously suggested that a "with abort" clause could be added to the
+parallel keyword of the loop. Rather than do that, I think it would be better to
+attach that clause to the actual Exit, return, or goto statement, as in;
+
+parallel
+    for I in 1 .. N loop
+       if Arr (I) > X then
+          exit with abort;
+       end if;
+    end loop;
+
+It is clearer to tie the "with abort" closer to where the preemptive
+termination is desired.
+
+
+   The interpretation I recommend is that an “exit” or “return” (or
+“raise”) within a parallel loop causes all other iterations to be
+terminated as soon as possible.  For a “return,” there is a “race” to
+see which iteration reaches a return first, and that is resolved  using
+some kind of synchronization, and then only the “return” that “wins” is
+allowed to proceed to evaluate its return expression (presuming it is
+returning from a function).  Similarly, if an iteration would propagate
+an exception, then again, a synchronization is performed against other
+operations that can terminate the loop, and the winner gets to determine
+the net effect, be it a simple exit, an exception, a return, or (if we
+permit it) a goto.  Essentially any operation that would normally
+terminate a loop goes through a “two-phase” commit, where the first
+phase requests the “right” to decide the fate of the loop, and then the
+winner determines what happens next.
+
+A two phase commit is pretty much the only possibility that I see.
+Paraffin has always had such a two phase commit for handling exceptions,
+except until just recently, the exception handling was non-preemptive.
+If multiple exceptions were raised in different tasklets, the
+implementation decides which one wins, prior to re-raising that
+exception and propagating it out into the user's code. I think the
+two-phase commit idea is orthogonal to this discussion about how Exit
+statements are treated.
+
+> I have not seen any convincing (to me!) argument to allow unstarted iterations
+> to continue processing, as this seems to defeat one of the whole points of a
+> parallel "divide-and-conquer" approach.
+
+You do get benefits of divide and conquer regardless whether termination
+is preemptive or not. The preemptive termination can be viewed as an
+additional optimization that can be added on top of the divide and
+conquer optimization, but only if it does not introduce errors!
+
+
+   I can see some arguments to not terminating iterations that are
+already executing, if there is danger of them leaving some object
+visible outside the iteration in an abnormal state.  In ParaSail, if a
+loop includes an operation that could result in an early exit, it cannot
+manipulate a variable that is declared outside the loop unless it is
+protected.  This doesn’t quite work in Ada since any parallel loop could
+have an iteration die due to the failure of a run-time check.  But to
+some extent that is already a problem because we recognize that the
+failure of a run-time check can always leave certain variables in an
+abnormal state.  For a parallel loop, a run-time check failure could
+leave any variable manipulated by any iteration in an abnormal state.
+But if you look at 11.6(6/3), even a “regular” loop that dies due to a
+failed run-time check can have the same effect, since the Ada RM has
+always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9
+— aka RM 9(11)) and instruction re-ordering.  So we can effectively
+ignore failed run-time checks, and only worry about explicit “raise” as
+relevant to this discussion.  So what it effectively means is that the
+action of propagating a “raise” out of an iteration is just like a
+“return” or an “exit” in terms of having to “compete" to decide which
+terminating construct gets to determine the “fate” of the parallel loop.
+
+I don't think this last sentence follows from the previous, particularly
+in the case of Exit. In Paraffin, an Exit statement is simply treated as
+an Exit statement in the user's code. It causes that tasklet to exit,
+without affecting any others, with no more special treatment than other
+sequential statements in Ada. There is no need to decide which exit wins
+prior to leaving the loop, as there is no result or exception being
+propagated out of the loop. The implementation requires no special
+treatment of this.
+Exceptions do require special treatment, since a specific exception
+should be propagated out of the loop.
+
+For a return statement, there is a stronger case for preemptive
+behaviour, but I think considering the other arguments, it should be
+left non-preemptive to side with Exit statements, rather than side with
+the preemptive nature of exceptions.
+
+Gotos should probably not be too relevant to this discussion, if we even
+support them at all. I suspect this feature would be very rarely used or
+useful in a parallel loop, but they are similar to return statements.
+
+>
+> The bottom line to me is that we should terminate parallel loops as soon as
+> possible, be it from an exit, return, raise (or goto or requeue for that
+> matter), using a two-phase approach that first determines which one will
+> actually decide the fate of the loop as a whole, and then a second step that
+> actually does the deed.
+
+The bottom line to me is that the user should be able to write programs
+that execute as fast as possible. Sometimes terminating loops as soon as
+possible comes at a greater expense than being able to use a more
+optimal algorithm. The user should be able to choose either option.
+
+Below I have execution times showing a case where the non-preemptive
+algorithm beats the preemptive one.
+
+A two-phase approach is recommended for sure, but Exits do not need to
+be involved in the second stage of the commit. They do not contribute to
+any result associated with the loop. Return results do need to be
+considered in the second phase, but I would think that any exception
+raised should trump any return result.
+
+Here are some more results comparing performances using preemptive vs
+non-preemptive termination.
+
+I have a simple program that scans an array of 600_000_000 integers
+looking for the value in the highest index of the array that satisfies
+some property. In this case, the element in Arr'Last is the one that
+satisfies the query.
+
+Using a reverse parallel loop with non-preemptive exit, the elapsed time
+for the loop is 0.13 seconds.
+
+The parallel loop looks like;
+
+      Maximum_Index : Index := Index'First with Volatile;
+
+      parallel
+       for I in reverse Start .. Finish loop
+          if I < Maximum_Index then
+             exit;
+          elsif Satisfies (Arr (I)) and then I > Maximum_Index then
+             Maximum_Index := I;
+          end if;
+       end loop;
+
+If I take out the reverse keyword, I still get the correct result, but
+the elapsed time becomes 1.4 seconds, since with forward iteration the
+tasklet handling the highest chunk has to iterate through that full
+chunk, as each iteration improves upon the previous value.
+
+If we assume that Exit must be preemptive, then we cannot use Exit,
+because it gives us an incorrect result, so I have to rewrite the loop
+as follows, assuming that reverse parallel loops are still allowed:
+
+     parallel
+       for I in reverse Start .. Finish loop
+          if Satisfies (Arr (I)) and then I > Maximum_Index then
+             Maximum_Index := I;
+          end if;
+       end loop;
+
+This produces an elapsed time of 1.3 seconds.
+
+If we say that reverse parallel loops are not allowed, then I have to
+remove the reverse keyword, and the elapsed time becomes 3.8 seconds.
+
+So in this particular example, it appears that the decision to not
+support non-preemptive Exit results in a program that runs 9.8 times
+slower on my computer than the version that does allow non-preemptive
+Exit, for reverse parallel loops.  For the non-reverse loops, the
+preemptive version runs 2.7 times slower.
+
+The preemptive forward loop runs 29.2 times slower than the
+non-preemptive reverse parallel loop.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Thursday, July 27, 2017  4:29 PM
+
+>> I am quite strongly against having multiple interpretations for "exit" from a
+>> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
+>> have some convincing to do.
+>
+> Unfortunately, I am not yet convinced. I think if we want to have only
+> one interpretation for Exit, then that interpretation should be
+> non-preemptive. Lets see if I can convince you.
+> ...
+>
+> Here are some arguments
+>
+> 1. I think it is quite a common problem to be searching for the first
+> or last occurrence of something in an array or container. For example,
+> consider an array or container of events stored in chronological order.
+>
+> I should think answering question such as;  When is the first time
+> that conditions X occurred?, or when was the last time that conditions
+> Y occurred?
+>
+> I find it quite objectionable if we have to say, sorry, but you cannot
+> do such searches efficiently in native Ada using parallelism. You
+> should either use libraries such as Brad Moore's Paraffin libraries,
+> or call out to another language such as C or C++ and write that code
+> in OpenMP.
+
+There are many ways to search efficiently, including in parallel. If you want
+the lowest-numbered element that satisfies a particular criteria, then that
+should be explicit in the algorithm, rather than something that is hidden in an
+assumption about how non-preemptive chunked parallel iteration works.  It seems
+unwise to make requirements about the order in which things are done, e.g.
+whether lower-numbered or higher-numbered iterations are processed first,
+whether "exit" ends one chunk but not all chunks, whether exit stops new chunks
+from starting, etc.  This could be seen as defeating the purpose of parallel
+iteration, and forcing the semantics to be restricted in what seem somewhat
+unnatural ways just for this particular case.
+
+> If Exit is preemptive, then we cannot continue searching in other
+> tasklets to find better or correct answers. We effectively have to not
+> use Exit at all, and search the full range of iterations, which can be
+> a waste of time if we already know the answer. ...
+
+In my view, "exit" means the entire loop is done, and trying to imply that under
+certain circumstances it means "well not really if there is a better answer yet
+to be found" seems like it would require a compiler that can guess your
+intentions.  I suspect what would be most efficient for such a problem is a more
+explicit divide and conquer, not something where you are implicitly relying on
+chunking to accomplish what you want.
+
+For example, if you want the item with the lowest index which satisfies some
+predicate, you would want all of the cores looking at the first few items in the
+array, and then if nothing is found there, you would have them all start looking
+at the next few items in the array.  It would be less efficient to break the
+array into big chunks, and have half of the cores looking at the last half of
+the array.
+
+Here is an example algorithm that would devote all of the cores to search the
+beginning of the array first.  This is just one example, but I suspect it would
+be faster than one that made various assumptions about chunking and assumed that
+"exit" didn't really mean “exit” when there were better answers waiting.  My
+point is that having multiple meanings for "exit" adds a lot of complexity
+without providing sufficient general benefit.
+
+   Assume we have "N" cores, and an array A where we are trying to find the
+   lowest-indexed element of A that satisfies the predicate P.
+
+function Find_First(A : Elem_Arr;
+                    P : access function(E : Elem_Type) return Boolean)
+   return Integer is
+   N : constant Positive := Num_CPUs;
+   Lowest_Index : array(1..N) of Integer := (others => A'Last+1);
+   Found : Integer := A'Last+1 with Atomic;
+       --  Atomic flag used to stop other threads early
+begin
+   parallel for I in 1 .. N loop
+      -- find A(J) with lowest valued J satisfying P(A(J)) by looking
+      -- at A(A'First+I-1), A(A'First + I-1 + N), A(A'First + I-1 + 2*N), ...
+      declare
+         J : Integer := A'First + I - 1;
+      begin
+         while J <= A'Last and then J < Found loop
+             if P(A(J)) then
+                 -- This is the lowest-numbered element
+                 -- satisfying P with index of form A'First + I-1 + K*N
+                 Lowest_Index(I) := J;
+                 Found := J;  --  Stop others as soon as possible.
+                    -- Note that there is a benign race condition here, since
+                    -- we are merely trying to get other threads to stop if further
+                    -- searching would be a waste of time.  "Found" is not presumed
+                    -- to be the overall lowest value.
+                 exit;
+             end if;
+             J := J + N;
+          end loop;
+      end;
+   end loop;
+
+   -- Use proposed "map/reduce" construct to find overall minimum index
+   return (for I in 1 .. N => Integer'Min(<A'Last+1>, Lowest_Index(I)));
+end Find_First;
+
+****************************************************************
+
+Sent: Tuesday, August 1, 2017  2:08 AM
+
+> There are many ways to search efficiently, including in parallel. If you want
+> the lowest-numbered element that satisfies a particular criteria, then that
+> should be explicit in the algorithm, rather than something that is hidden in
+> an assumption about how non-preemptive chunked parallel iteration works.  It
+> seems unwise to make requirements about the order in which things are done,
+> e.g. whether lower-numbered or higher-numbered iterations are processed first,
+> whether “exit” ends one chunk but not all chunks, whether exit stops new chunks
+> from starting, etc.  This could be seen as defeating the purpose of parallel
+> iteration, and forcing the semantics to be restricted in what seem somewhat
+> unnatural ways just for this particular case.
+
+I see your point, but I also think treating an exit like an exception is
+asking for trouble. Your arguments have moved me somewhat though. I now feel
+that if we cannot have both preemptive and non-preemptive exits, returns, and
+gotos, then we shouldn't allow them to transfer control out of a
+parallel loop at all. I think they introduce too much in the way of confusion,
+unexpected behaviour, and semantic problems.
+
+If one really wants to preemptively terminate a parallel loop, they can always
+raise an exception and handle it at the appropriate level. Otherwise I think
+gotos, returns, and exits should only be allowed to exit loops that do not have
+the parallel keyword. This also simplifies the parallelism proposal, I should
+think.
+
+> For example, if you want the item with the lowest index which satisfies some
+> predicate, you would want all of the cores looking at the first few items in
+> the array, and then if nothing is found there, you would have them all start
+> looking at the next few items in the array.  It would be less efficient to
+> break the array into big chunks, and have half of the cores looking at the
+> last half of the array.
+
+That can be tricky to program, though, without introducing too much overhead
+from the cores needing to interact with each other to get the next search
+region, and it depends on the probability of finding things in the bottom half
+of the array. If what you are looking for is rare enough, the chances increase
+that it will be found in the upper half of the array. But yes, it could be an
+alternate strategy that might work well enough in certain situations.
+
+> Here is an example algorithm that would devote all of the cores to search the
+> beginning of the array first.  This is just one example, but I suspect it
+> would be faster than one that made various assumptions about chunking and
+> assumed that "exit" didn't really mean "exit" when there were better answers
+> waiting.  My point is that having multiple meanings for "exit" adds a lot of
+> complexity without providing sufficient general benefit.
+
+This approach can be summarized as manual chunking, where the programmer writes
+an outer parallel loop that executes a number of chunks in any order, and an
+inner sequential loop that calculates the chunk boundaries, where you can then
+use sequential semantics including reverse loops, with non-preemptive exits.
+
+I think given the syntax we are proposing, manual chunking is likely the only
+solution that works for this problem, if we do not have non-preemptive
+exits and want to write all the code inline. A point to mention is that
+manual chunking perhaps can eliminate the need for reverse parallel loops,
+since one can use the inner sequential loop to provide the reverse searching.
+
+I think I have a better solution for this problem though, because there are a
+number of criticisms I see here.
+
+For starters, I dislike that the programmer is having to make decisions about
+how many chunks are needed, and the values of the chunk boundaries.
+
+The reasons are;
+
+   1) Choosing the number of cores as the number of chunks is likely too coarse
+      a chunk size. If one core finishes its chunk early, there are no
+      opportunities to steal or seek work from the other cores. It is better to
+      choose a smaller chunk size to provide better load balancing, but a
+      programmer's choice here is likely not going to be as good as the one a
+      compiler or library writer might make.
+
+   2) Manual chunking in this manner is a form of static parallelism, where the
+      decisions on chunk boundaries are made before loop execution begins. One
+      ideally shouldn't necessarily have to commit to static chunking decisions,
+      as dynamic chunking algorithms are possible that might be beneficial in
+      cases where the amount of time to evaluate a single iteration can vary
+      wildly between iterations.
+
+   3) Having to write the code to convert back and forth from chunk index to
+      actual index is messy and error prone. It is more difficult to both read
+      and write.
+
+
+I think this particular problem is best solved using a library approach, in
+combination with the anonymous loop body AI.
+
+A library approach can provide the outer parallel loop that corresponds to the
+outer loop of the manual chunking solution, without the programmer having to
+deal with decisions about chunking, or converting from chunk numbers to real
+index values. A library also can provide more sophisticated parallelism
+strategies that might be too difficult for a programmer to write from scratch.
+
+If we take a library approach to find the maximum index that satisfies some
+condition, using anonymous loop bodies from AI we could write;
+
+       declare
+         package Search is new
+           Parallel.Loops.Static_Reduce (Loop_Index  => Index,
+                                         Result_Type => Index,
+                                         Reducer     => Index'Max,
+                                         Identity    => Index'First,
+                                         Result      => Maximum_Index);
+        begin
+         for (Start, Finish, Maximum_Index) of Search.Parallel_Loop loop
+            for I in reverse Start .. Finish loop
+               if I < Maximum_Index then
+                  exit;
+               elsif Satisfies (Arr (I)) and then I > Maximum_Index then
+                  Maximum_Index := I;
+               end if;
+            end loop;
+         end loop;
+       end;
+
+Note, the Maximum_Index of the Parallel_Loop anonymous procedure is an in out
+parameter. The value coming in is the best value so far for that executor.
+
+Otherwise to do this with explicit manual chunking, I would write;
+
+         declare
+            Number_Of_Chunks : constant Positive := 50;
+            Iterations : constant Positive
+              := Positive (Arr'Last - Arr'First + 1);
+            Chunk_Size : constant Positive :=
+              Iterations / Number_Of_Chunks;
+            Results : array (1 .. Number_Of_Chunks) of Index :=
+               (others => Index'First);
+         begin
+            parallel
+              for I in 1 .. Number_Of_Chunks loop
+                 for J in reverse Arr'First +
+                    (Index (I) - 1) * Index (Chunk_Size)
+                 .. (if I = Number_Of_Chunks then Arr'Last
+                     else Arr'First + Index (I * Chunk_Size) - 1) loop
+
+                    if Satisfies (Arr (J)) then
+                       Results (I) := J;
+                       exit;
+                    end if;
+                 end loop;
+              end loop;
+
+           Maximum_Index :=
+              (for Element of Results => Index'Max(<Index'First>, Element));
+
+         end;
+
+I can tell you that the amount of time it took me to write the second version
+involved quite a bit more time and mental power than the previous version. I
+think the previous one is also a lot more readable, because it provides a higher
+level of abstraction. The code only focuses on the actual algorithm, whereas the
+second version mixes the algorithm with a simplistic chunking algorithm.
+
+Note: The exit in the last example is non-preemptive, since the inner loop is
+not a parallel loop. If you look at that exit statement, one might worry for a
+moment whether that is a preemptive exit or a non-preemptive exit. If we
+disallow exits to transfer control out of a parallel loop, then suddenly that
+exit feels "safer" to me, since all exits would not have preemptive abortive
+semantics.
+
+Otherwise, I can imagine one might want to scan all the source code in a system
+looking for preemptive exits when doing code reviews, and feel the need to
+provide extra analysis to show that any found are safe.
+
+****************************************************************
+
+From: Tucker Taft
+Sent: Tuesday, August 1, 2017  12:45 PM
+
+I think you have confused the issue of "exit" meaning "exit" with the issue of
+whether "abort" semantics or "as soon as possible" are used to terminate the
+other iterations.
+
+I can understand the concern about using "abort" semantics, but after an exit
+by one iteration, when any other iteration hits a “natural” stopping point, or
+hasn't started at all, it should terminate.
+
+So let's separate these two issues, if possible.  What I mostly object to is the
+notion of giving meaning to multiple exits from the same parallel loop.  I don’t
+mind waiting a bit for iterations that are already running, for the loop to
+actually terminate.  But trying to define what it means to have multiple exits
+is a mistake in my view.  I believe the two-phase approach is implementable, and
+I believe should be used for any kind of "early" termination of a parallel loop,
+be it exception propagation, exit, return, goto, or requeue.  That is, if an
+iteration wants to terminate the loop, it "requests" a termination, and then if
+it “wins” the race, it determines the fate.  How quickly the loop actually
+finishes may vary, but it should be as soon as possible, and certainly no
+later than necessary.
+
+So no need to bring up the "bogey" man of "abort."  I could imagine some aspect
+that might affect how quickly termination happens, but it shouldn’t affect the
+basic semantics of the loop, which are “terminate as soon as possible.”
+
+****************************************************************
+
+Sent: Wednesday, August 2, 2017  9:03 AM
+
+> So let's separate these two issues, if possible.
+
+OK, I can buy this. We really now have 3 potential ways to exit a loop though,
+which I would call hard preemption, soft preemption, and non-preemption:
+
+   1) Hard preemption is the abortive exit, likely involving asynchronous
+      transfer of control.
+
+   2) Soft preemption is a polled exit that only occurs on iteration
+      boundaries (grain or chunk boundaries most likely).
+
+   3) Non-preemption wouldn't be directly supported in syntax, but if one does
+      manual chunking, either inline or via a library call, an exit exits a
+      sequential inner loop and is then non-preemptive.
+
+The searching example of the previous email represents a loop that must involve
+manual chunking, resulting in a non-preemptive exit.
+
+I think hard preemption should probably be reserved for exceptions, which would
+allow breaking out of a loop like the following on a quad-core machine, which
+with only soft preemption would likely not result in any savings in execution
+time;
+
+     parallel
+        for I in 1 .. 4 loop
+            if Flip_Coin then
+               raise Get_Me_Out_Of_Here;
+            else
+               Do_12_Hour_Calculation;
+            end if;
+        end loop;
+
+All parallel loops involving chunking are actually nested loops at the
+implementation level where the outer loop is the parallel loop that is provided
+by the implementation, and the inner sequential loop is a callout to user code.
+So in this sense, an exit from the inner loop should be treated as a
+non-preemptive exit. But we want to create the illusion that there is only a
+single loop at the syntax level, so an exit would involve soft preemption to
+more closely mirror the sequential behaviour that once an exit statement is
+executed, there are no further iterations. Parallel execution is not exactly
+like that, as execution likely continues after an exit, and further iterations
+can be started, particularly if a tasklet involves multiple iterations and soft
+preemption polling occurs at tasklet boundaries. But we want to at least try to
+approximate the sequential behaviour as best we safely can.
+
+I actually have these three exit strategies implemented in Paraffin now. I had
+forgotten that I had implemented soft preemption some time ago, where a special
+exception can be raised from a user callback and is handled by the library which
+sets a flag that is polled at grain/tasklet boundaries to effect early
+termination.
+
+Any other exception raised in a Paraffin library call results in a hard
+preemptive abort involving asynchronous transfer of control, which is a feature
+I added only recently.
+
+Any exit statement in a user provided callback is treated as a regular exit
+statement in Ada, which is just a non-preemptive exit, which would also work
+well for manual chunking examples involving the anonymous loop body procedure
+AI.
+
+So I am happy now with this treatment of exit, return, and gotos if we say it
+has the soft-preemption model.
+
+I do think we probably want to distinguish between exceptions and the other
+forms of transfer of control out of the loop (exit, gotos, returns, requeues),
+and consider treating exceptions as the hard-abortive case, for exceptions such
+as Constraint_Error at least, but probably all exceptions for consistency's
+sake.
+
+****************************************************************
+
+From: Jeff Cousins
+Sent: Thursday, August 3, 2017  11:16 AM
+
+It's good to see this progressing even if it's taking a lot of to and fro.
+Tucker's ideas sound desirable, but are the others from the implementors' side
+happy that the "some kind of synchronization" won't be an undue implementation
+burden?
+
+****************************************************************
+
+From: Randy Brukardt
+Sent: Thursday, August 3, 2017  8:54 PM
+
+> It's good to see this progressing even if it's taking a lot of to and fro.
+> Tucker's ideas sound desirable, but are the others from the
+> implementors' side happy that the "some kind of synchronization" won't
+> be an undue implementation burden?
+
+To answer your question directly, I don't think there is any alternative. What
+Tucker describes is a bit more complex than the model I had in mind, but they're
+pretty similar. That is, there is a loop "manager" (for the lack of a better
+term) that determines the chunks, creates (or more likely, requests existing)
+tasklets for execution, gives the tasklets chunks, and then waits for the
+tasklets to finish their chunk (or complete some other way, such as exit or
+propagate an exception). That waiting implies some sort of synchronization, but
+it seems to me to be the cheapest kind available. The manager would generally be
+asleep waiting for tasklets, and would only wake up when one completes - it
+would then either assign another chunk (if there aren't any, just sleeping
+waiting for the others), or if the loop needs to complete (exit or exception),
+wait for the other tasklets to reach a termination point (or abort them - in
+Janus/Ada, these are effectively the same but probably not in most
+implementations).
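+
+Purely as an illustration of that structure (every name below is invented;
+this is a sketch of the model, not a proposed implementation), the manager's
+main loop might be outlined as:
+
+   loop
+      Wait_For_A_Tasklet_To_Complete;   -- the only synchronization needed
+      if Exit_Taken or else Exception_Pending then
+         Discard_Unassigned_Chunks;
+         Wait_For_Remaining_Tasklets;   -- or abort them, where that is cheap
+         exit;
+      elsif Chunks_Remain then
+         Assign_Next_Chunk (To => Completed_Tasklet); -- left-to-right order
+      elsif No_Tasklets_Running then
+         exit;                          -- normal completion of the loop
+      end if;
+   end loop;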
+
+I definitely would allow iteration termination at any task dispatching point
+(especially for tasklets that are blocked), lest a blocked tasklet prevent loop
+termination indefinitely.
+
+With this sort of structure, I don't really see any issue with having exit work
+deterministically. The only requirement is to ensure that any iterations to the
+left of the iteration containing the exit (that is, with lower indexes) run to
+completion. There certainly isn't any requirement to start new tasklets after an
+exit; you just have to assign chunks to tasklets in a left-to-right order (and
+it would be a lot of extra work to do anything else). Thus, if an exit occurs,
+any chunk not ever assigned to a tasklet can be immediately discarded (it
+necessarily comes after the exit). Only tasklets logically before the exit would
+need to continue.
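+
+As a sketch of what that buys the user: a first-match search written in the
+proposed parallel loop syntax (which is still in flux, so take the syntax and
+semantics below as illustrative only) could look like:
+
+   Found : Natural := 0;
+
+   parallel for I in Arr'Range loop
+      if Arr (I) = Key then
+         Found := I;
+         exit;  -- iterations to the left run to completion; unassigned
+                -- chunks to the right are discarded, so (assuming the
+                -- leftmost exit wins) Found ends up as the first match
+      end if;
+   end loop;
+
+(The write to Found would itself need to pass whatever parallel safety checks
+are adopted; the point here is only the deterministic meaning of exit.)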
+
+Exceptions are different, of course: they terminate everything as soon as
+possible, since the loop is not going to provide an answer. Those are rather
+different use cases.
+
+----
+
+I want to step back a minute and think a bit about the uses for these constructs
+and our actual goals for them. I'm getting concerned that there can be no real
+uses of the parallel loop construct.
+
+We've already eliminated reduction loops from using the parallel loop (they have
+their own construct now). That's a substantial number of the loops that I've
+written that would be sensibly parallel.
+
+Almost all of the other loops that I've written or maintained are searches of
+some sort. Depending upon the rules adopted, it's quite likely that many of
+those could not use the parallel loop construct. First of all, most of the
+search loops I write are intentionally not going to execute long, and many are
+walking through linked lists; probably the parallel overhead would be far more
+than any possible savings from parallel execution. (The cost to walk the list to
+find chunks would be more than the comparison costs.) More of the loops wouldn't
+work if a non-deterministic solution for exit is adopted; you wouldn't be able
+to find the *first* match or the *next* match as opposed to some random match.
+Coding this using a non-deterministic loop is very expensive - you want to avoid
+any further iterations above any match but continue to run the ones below. The
+only way I could imagine doing it would require checking on every iteration
+whether something has been found below; but that would require synchronization,
+so a lock or fence would be needed -- most likely costing more than the
+iteration itself.
+It probably wouldn't be worth doing, meaning that one would have to search the
+entire space regardless of where or when any matches are found. And, depending
+on the likelihood of matches in the first third of the data, that probably
+means a parallel solution would perform worse than a sequential one.
+
+There must be some searches that would work with non-deterministic exit, but I
+can't think of any at the moment.
+
+So, if a loop is neither a reduction nor a search (that is, no early exits), what
+else could it do? Something like a matrix multiply comes to mind, but that is
+very likely to violate the static safety check on the loop contents. (The
+compiler probably can't prove that the array elements being written are
+independent for each iteration, and certainly any fully static check based on
+Global alone cannot do that - you need information about how the array is
+indexed.) So, what can such a loop do? Is there enough to justify the extensive
+work on this feature (as opposed to parallel blocks and parallel reductions)??
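+
+Concretely, the classic triple loop (again using the still-tentative parallel
+syntax, purely as a sketch):
+
+   parallel for I in A'Range (1) loop
+      for J in B'Range (2) loop
+         C (I, J) := 0.0;
+         for K in A'Range (2) loop
+            C (I, J) := C (I, J) + A (I, K) * B (K, J);
+         end loop;
+      end loop;
+   end loop;
+
+Each iteration writes only row I of C, so the iterations really are
+independent, but a check based on Global alone sees only "writes C" in every
+iteration and would have to reject the loop.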
+
+----
+
+Going back to first principles. The most important thing about these constructs
+is that they tell the compiler that the user doesn't care about the details of
+which iteration runs first, or last, or in between. Because of exceptions (and
+finalization), an Ada compiler cannot parallelize any code itself (unless of
+course it can prove the absence of such things, but that's hard and unusual).
+Most of the time, the programmer doesn't really care about the state of the loop
+when an exception is raised; the loop has a bug and we're not going to use the
+results of the loop at all. OTOH, a "normal" exit (whether via an exit
+statement, return statement, or a goto) usually just means that we have the
+needed results and don't need to continue.
+
+We're also looking to make checks to ensure that these constructs are "safe",
+that is free of race conditions and conflicts. The vast majority of programmers
+are not going to be able to successfully avoid these things (even if they know
+to look for them) -- I know I can't. Ideally, I want to have a programming
+construct that gives the benefit of parallel execution without any of the pain.
+
+For all of these reasons, I much prefer a model where exceptions and exits are
+treated differently, and exits give the same answer as a sequential loop. In
+such a model, almost all searching loops can be written in parallel. In the
+"first exit wins" scenario, only a few searching loops could be written in
+parallel. Since most other loops are reductions or would be prohibited by the
+parallel safety checks, that model seems necessary for parallel loops to be
+useful at all.

****************************************************************
```

Questions? Ask the ACAA Technical Agent