!standard 5.5.2(2/3)          18-01-25 AI12-0251-1/01
!standard 5.5.2(5/4)
!standard 5.5.2(7/3)
!class Amendment 18-01-25
!status work item 18-01-25
!status received 18-01-13
!priority Low
!difficulty Medium
!subject Explicit chunk definition for parallel loops
!summary
** TBD.
!problem
The currently proposed facilities of parallel loops (AI12-0119-1) and reduction attributes (AI12-0242-1) handle many loop formulations that can be parallelized.
However, existing loops are not easy to convert into the reduction attribute form, especially if the loop has multiple accumulators. They could be written as multiple reduction expressions, but that could require recalculating parts of the expression multiple times, harming performance (which is the point of parallelizing in the first place). They also could be written using a protected object to model the accumulators, but that would force some serialization of the loop (again, harming performance).
Another approach is needed.
!proposal
Provide a chunk index to a parallel loop, vaguely analogous to an entry family index. This is provided right after the keyword "parallel" in the parallel loop syntax.
For instance:
declare
   Partial_Sum : array (1 .. 2*Num_CPUs) of Integer := (others => 0);
begin
   parallel (Chunk in Partial_Sum'Range) for I in Arr'Range loop
      declare
         Z : Integer;
      begin
         Z := some_complicated_computation;
         ... -- Complex machinations

         Partial_Sum(Chunk) := @ + Arr(I)**2 * Z;

         ... -- Other stuff that also happens in this
         ... -- very complicated loop
      end;
   end loop;

   Sum := Partial_Sum'Reduce("+", 0);
end;
!wording
** TBD.
!discussion
This approach also provides some tuning capabilities for parallel loops, by letting the user vary the number of chunks. Smaller numbers of chunks (one chunk per executor) may provide the best performance for iterations that take about the same amount of time regardless of inputs; larger numbers of chunks may provide better performance for iterations that take widely varying amounts of time.
---
There should be a function somewhere in Ada.Multiprocessing to give an estimate for a number of chunks to use, based on an estimate of the number of iterations, how conditional the iteration code is (the "variance"), which task is involved, and possibly other considerations. End users don't have enough information to know a good place to start. ("2*Num_CPUs" is unlikely to be a good choice for loops with a lot of variance.)
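Purely as an illustration, such a function might be declared something like the following; all of the names below, including the package name, are placeholders and not part of this proposal:
   with Ada.Task_Identification;
   package Ada.Multiprocessing is
      type Variance_Level is (None, Low, High);
      -- Suggest a number of chunks for a parallel loop, given an estimate
      -- of the iteration count, how conditional the loop body is, and the
      -- task that will execute the loop.
      function Suggested_Chunk_Count
        (Iterations : Natural;
         Variance   : Variance_Level := Low;
         Within     : Ada.Task_Identification.Task_Id :=
                        Ada.Task_Identification.Current_Task)
        return Positive;
   end Ada.Multiprocessing;
A user could then write, for instance:
   Partial_Sum : array (1 .. Suggested_Chunk_Count (Arr'Length)) of Integer := (others => 0);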
!ASIS
[Not sure. The syntax change might require some additional queries. - Editor.]
!ACATS test
ACATS B- and C-Tests are needed to check that the new capabilities are supported.
!appendix

From: Tucker Taft
Sent: Saturday, January 13, 2018  11:46 AM

Ada 83 was already a pretty good language for what I would call "hybrid"
functional/imperative programming, because you can have a sequence of
declarations that progressively compute a desired result.  Often it was only
because you needed to enter a loop that you had to do any significant amount of
computing after the "begin."  In today's Ada, with conditional expressions,
quantified expressions, the recently approved iterated component association,
and our newly proposed container aggregates, the need for explicit ifs and loops
after the "begin" is continuing to decrease.  I don't see problems with this
evolution toward providing a balance between functional and imperative features.
Clearly there is a resurgence of interest in functional programming these days,
but going all the way to "pure functional" programming, with "monads" instead of
assignment statements, and recursion only with no iteration, is not something
anyone in this discussion is suggesting for Ada.  But I do think it is important
to "complete" the functional primitives, if only to give the desired power to
Ada's contract-based programming.

I would claim that a reduction expression is the last remaining major component
needed to complete the "iterated" operations.  As I tried to show in my examples
drawn from the ParaSail compiler and static analyzer, there are many
applications of a reduction expression -- it isn't just a "gee-gaw" notation,
only used by functional-programming zealots.  And it isn't just for parallelism
-- it is a common sequential programming paradigm as well, and it will often be
the right way to define, in a postcondition, what is the final value of one
output of the subprogram, even though the subprogram might have multiple outputs
and might do many other things.

Parallelism is a somewhat separate issue in my mind, and where (and whether) it
is best inserted depends heavily on the amount of computing done within a single
iteration, the independence of those computations, and the number of iterations.
The advantage of the notations like quantified expressions, container
aggregates, and reduction expressions, is that they are self-contained, and can
often be analyzed holistically by the compiler to determine whether and where to
insert parallelism.  I believe we also need some explicit notations for
parallelism, but potentially even more important in the context of Ada is to
have sufficient annotations to allow the compiler to both check the safety of
explicit parallelism, and also safely insert implicit parallelism.  Hence the
importance of the Global and Nonblocking annotations.

As far as explicit parallelism, I now believe the right way to handle reductions
in that context is to provide a chunk index to a parallel loop, vaguely
analogous to an entry family index.  With our current parallel syntax, I would
propose that we allow the introduction of a chunk index as follows:

declare
     Partial_Sum : array (1 .. 2*Num_CPUs) of Integer := (others => 0); begin
     parallel (Chunk in Partial_Sum'Range) for I in Arr'Range loop
          declare
                 Z : Integer;
          begin
                 Z := some_complicated_computation;
                ...  -- Complex machinations

                 Partial_Sum(Chunk) := @ + Arr(I)**2 * Z;

                ...  --  Other stuff that also happens in this
                ...  ---  very complicated loop ...
          end;
     end loop;

     Sum := Partial_Sum'Reduce("+", 0);
end;

---

The semantics of the chunk index is that its range determines the limit on the
degree of parallelism of the loop, and provides the index to the body of the
loop.  That is the total extent of the feature.  There are no magic values, no
use of "<>", etc.

As seen above, we have chosen to declare a "Partial_Sum" array before the
parallel loop whose index range becomes the range of the chunk index.  We have
made the partial sum a multiple of the number of CPUs, but the programmer is
free to choose the limit of parallelism any way they want.  After the loop we
have "reduced" this array to an overall Sum.  Both of these are totally under
control of the user -- there is no compiler magic being performed here.  Of
course, the compiler might have some smarts in deciding whether to parallelize
the 'Reduce operation, but that is implicit parallelism, and depends on the
compiler to decide whether it is worth doing.

****************************************************************

From: Brad Moore
Sent: Sunday, January 14, 2018  11:25 PM

> Ada 83 was already a pretty good language for what I would call "hybrid"
> functional/imperative programming, because you can have a sequence of
> declarations that progressively compute a desired result.  Often it
> was only because you needed to enter a loop that you had to do any
> significant amount of computing after the "begin."  In today's Ada,
> with conditional expressions, quantified expressions, the recently
> approved iterated component association, and our newly proposed
> container aggregates, the need for explicit ifs and loops after the
> "begin" is continuing to decrease.  I don't see problems with this
> evolution toward providing a balance between functional and imperative
> features.  Clearly there is a resurgence of interest in functional
> programming these days, but going all the way to "pure functional"
> programming, with "monads" instead of assignment statements, and recursion
> only with no iteration, is not something anyone in this discussion is suggesting
> for Ada. But I do think it is important to "complete" the functional
> primitives, if only to give the desired power to Ada's contract-based
> programming.

I agree.

> I would claim that a reduction expression is the last remaining major
> component needed to complete the "iterated" operations.  As I tried to
> show in my examples drawn from the ParaSail compiler and static
> analyzer, there are many applications of a reduction expression -- it
> isn't just a "gee-gaw" notation, only used by functional-programming
> zealots.  And it isn't just for parallelism
> -- it is a common sequential programming paradigm as well, and it will
> often be the right way to define, in a postcondition, what is the
> final value of one output of the subprogram, even though the
> subprogram might have multiple outputs and might do many other things.

I agree here too.

> Parallelism is a somewhat separate issue in my mind, and where (and
> whether) it is best inserted depends heavily on the amount of
> computing done within a single iteration, the independence of those
> computations, and the number of iterations.  The advantage of the
> notations like quantified expressions, container aggregates, and
> reduction expressions, is that they are self-contained, and can often
> be analyzed holistically by the compiler to determine whether and
> where to insert parallelism.  I believe we also need some explicit
> notations for parallelism, but potentially even more important in the
> context of Ada is to have sufficient annotations to allow the compiler
> to both check the safety of explicit parallelism, and also safely insert
> implicit parallelism.  Hence the importance of the Global and Nonblocking
> annotations.

So far so good.

> As far as explicit parallelism, I now believe the right way to handle
> reductions in that context is to provide a chunk index to a parallel
> loop, vaguely analogous to an entry family index.  With our current
> parallel syntax, I would propose that we allow the introduction of a chunk index as follows:
>
> declare
>     Partial_Sum : array (1 .. 2*Num_CPUs) of Integer := (others => 0);
> begin
>     parallel (Chunk in Partial_Sum'Range) for I in Arr'Range loop
>          declare
>                 Z : Integer;
>          begin
>                 Z := some_complicated_computation;
>                ...  -- Complex machinations
>
>                 Partial_Sum(Chunk) := @ + Arr(I)**2 * Z;
>
>                ...  --  Other stuff that also happens in this
>                ...  ---  very complicated loop ...
>          end;
>     end loop;
>
>     Sum := Partial_Sum'Reduce("+", 0); end;

I am still digesting this. While it is interesting, I am worried about adding
chunking management as one more thing the user has to think about in order to add
parallelism, in cases where the user would rather not have to think about this.
There are a number of reduction problems where manual/user chunking is needed,
and this might be a good solution for those, but if we want to support
user chunking this way, I think a few things are still missing. There are times
when the user would want to know more details about the chunk, e.g., what is the
first and last index of the chunk.
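
[For illustration, here is a sketch of the kind of manual chunking being
described, where the user computes the first and last index of each chunk by
hand. The names (Num_CPUs, Arr, Sum) are invented, and the basic "parallel for"
syntax of AI12-0119-1 is assumed. - Editor.]

   declare
      Num_Chunks : constant Positive := 2 * Num_CPUs;
      Partial    : array (1 .. Num_Chunks) of Integer := (others => 0);
   begin
      parallel for Chunk in 1 .. Num_Chunks loop
         declare
            -- Divide Arr'Range into Num_Chunks roughly equal slices.
            Size  : constant Natural := (Arr'Length + Num_Chunks - 1) / Num_Chunks;
            First : constant Integer := Arr'First + (Chunk - 1) * Size;
            Last  : constant Integer := Integer'Min (Arr'Last, First + Size - 1);
         begin
            for I in First .. Last loop
               Partial(Chunk) := @ + Arr(I);
            end loop;
         end;
      end loop;

      Sum := Partial'Reduce("+", 0);
   end;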

There are a couple other issues I see as well. Looking at Randy's monster
parallelism example, it involves 6 outputs where each result is a reduction. So,
with this syntax, one would need to declare 6 partial result arrays, which is
OK, but getting to be a bit of manipulation of the existing code. Then, you need
to mention one of these results in the iterator syntax, but there are 6
different result arrays. Would you need to list all 6 of these, perhaps comma
separated, or just pick one? I'm guessing you'd want to just pick one, but it
feels a bit wrong to not mention them all. What if the arrays are declared with
different numbers of elements? I presume you'd want a rule that says that all
arrays referenced in the loop with the chunk parameter would be required to have
the same number of elements with the same indexing. Otherwise this syntax would not
make sense.

Randy liked the old version of the AI, I believe, because all you needed to do
was add the parallel keyword to the loop, then possibly add "parallel" to the
declarations of reduction results. That makes it quite a bit easier to convert
existing loops. It might even be enough to just add "parallel" to the loop,
without adding "parallel" to reduction variables, since with the global aspect
and non-blocking aspect, the compiler will tell you about any problems with
global reduction results.

Getting back to Randy's monster example. One thing I hadn't tried before was
writing the code in a form that could be applied as a single 'Reduce call.

I just tried that, and I found that it wasn't too difficult to do, as most of
the code is still intact. It took a little bit of time to set this up, but
really not that much effort, and I found the results satisfying.

I also tested this for performance, and found it comparable to the parallel loop
form.

For 80_000_000 records, with optimization turned off I got the following (the
new result is the last one, for a single 'Reduce):


Multiple operations in a sequential Loop

{Elapsed = 00:00:01.51
Negative_Count= 0
Sample_Sum= 1.99965350799589E+07
Sample_Sqr_Sum= 1.33299985832170E+07
Product_Sum= 1.99965350799589E+07
Diff_Sum= 2.00046739200409E+07
Diff_Sqr_Sum= 1.33381374236478E+07


Multiple parallel loops

{Elapsed = 00:00:01.52
Negative_Count= 0
Sample_Sum= 1.99965350797951E+07
Sample_Sqr_Sum= 1.33299985832050E+07
Product_Sum= 1.99965350797952E+07
Diff_Sum= 2.00046739202052E+07
Diff_Sqr_Sum= 1.33381374236355E+07


Single parallel loops

{Elapsed = 00:00:00.43
Negative_Count= 0
Sample_Sum= 1.99965350797950E+07
Sample_Sqr_Sum= 1.33299985832051E+07
Product_Sum= 1.99965350797950E+07
Diff_Sum= 2.00046739202050E+07
Diff_Sqr_Sum= 1.33381374236362E+07


Single parallel, Reducer+Combiner

{Elapsed = 00:00:00.51
Negative_Count= 0
Sample_Sum= 1.99965350797948E+07
Sample_Sqr_Sum= 1.33299985832051E+07
Product_Sum= 1.99965350797948E+07
Diff_Sum= 2.00046739202052E+07
Diff_Sqr_Sum= 1.33381374236362E+07

A few hundredths of a second slower, probably because of an additional procedure call.

However, when I rerun with optimization turned on, I get;

************* Parallel Framework Test *************
  Physical Processors= 4

Multiple operations in a sequential Loop

{Elapsed = 00:00:00.65
Negative_Count= 0
Sample_Sum= 1.99965350799589E+07
Sample_Sqr_Sum= 1.33299985832170E+07
Product_Sum= 1.99965350799589E+07
Diff_Sum= 2.00046739200409E+07
Diff_Sqr_Sum= 1.33381374236478E+07


Multiple parallel loops

{Elapsed = 00:00:01.19
Negative_Count= 0
Sample_Sum= 1.99965350797948E+07
Sample_Sqr_Sum= 1.33299985832052E+07
Product_Sum= 1.99965350797952E+07
Diff_Sum= 2.00046739202051E+07
Diff_Sqr_Sum= 1.33381374236363E+07


Single parallel loops

{Elapsed = 00:00:00.20
Negative_Count= 0
Sample_Sum= 1.99965350797952E+07
Sample_Sqr_Sum= 1.33299985832052E+07
Product_Sum= 1.99965350797952E+07
Diff_Sum= 2.00046739202048E+07
Diff_Sqr_Sum= 1.33381374236363E+07


Single parallel, Reducer+Combiner

{Elapsed = 00:00:00.21
Negative_Count= 0
Sample_Sum= 1.99965350797952E+07
Sample_Sqr_Sum= 1.33299985832052E+07
Product_Sum= 1.99965350797952E+07
Diff_Sum= 2.00046739202048E+07
Diff_Sqr_Sum= 1.33381374236362E+07

Which is essentially the same speed (1/100 of a second slower in this run).

It's also interesting to see how the code would look. The user basically needs
to create a composite object.

e.g.

      type Composite is
         record
            Negative_Count : Natural;
            Sample_Sum     : Long_Float;
            Sample_Sqr_Sum : Long_Float;
            Product_Sum    : Long_Float;
            Diff_Sum       : Long_Float;
            Diff_Sqr_Sum   : Long_Float;
         end record;

      -- Reduce is only needed if we want to run Randy's code in parallel

      procedure Reduce (L, R : in out Composite)
        with Associative;

      procedure Reduce (L, R : in out Composite) is
      begin
         L := (Negative_Count => L.Negative_Count + R.Negative_Count,
               Sample_Sum     => L.Sample_Sum     + R.Sample_Sum,
               Sample_Sqr_Sum => L.Sample_Sqr_Sum + R.Sample_Sqr_Sum,
               Product_Sum    => L.Product_Sum    + R.Product_Sum,
               Diff_Sum       => L.Diff_Sum       + R.Diff_Sum,
               Diff_Sqr_Sum   => L.Diff_Sqr_Sum   + R.Diff_Sqr_Sum);
      end Reduce;

      -- A secondary object used for the 'Reduce because the array index is needed
      -- to look up data in a secondary array. Normally, this would not be needed
      type Item is
         record
            Index : Natural;
            Score : Long_Float;
         end record;

      -- The combiner procedure that is passed into 'Reduce
      procedure Include (L     : in out Composite;
                         Data  : Item)
        with Reducer => Reduce;

      -- NOTE: This code looks very much like Randy's original, as it was
      -- mostly cut and pasted. However, since we now have a composite record type,
      -- it was nice to use a record aggregate here.
      -- NOTE ALSO: Randy's debugging code is still intact. So debugging parallel
      -- loops vs debugging 'Reduce attributes seems to be a non-issue.
      procedure Include (L     : in out Composite;
                         Data  : Item)
      is
         Score : Long_Float renames Data.Score;
         I     : Natural    renames Data.Index;
      begin
         L := (Negative_Count => L.Negative_Count + (if Score < 0.0 then 1 else 0),
               Sample_Sum     => L.Sample_Sum + Score,
               Sample_Sqr_Sum => L.Sample_Sqr_Sum + Score**2,
               Product_Sum    => L.Product_Sum +
                 Actual_Data(I).Weighted_Worse_Ratio * Score,
               Diff_Sum       => L.Diff_Sum +
                 Actual_Data(I).Weighted_Worse_Ratio - Score,
               Diff_Sqr_Sum   => L.Diff_Sqr_Sum +
                 (Actual_Data(I).Weighted_Worse_Ratio -
                      Sample_Data(I).Score)**2);

         if Result_Debug or else
           (Limited_Calc_Debug and then
              (I <= Debug_Calc_Limit
               or else I >= Actual_Len - Debug_Calc_Limit))
         then
            Ada.Text_IO.Put_Line ("Item " & Actual_Data(I).Item_Name
                                  & "; cnt="
                                  & Natural'Image(Actual_Count));
            Ada.Text_IO.Put("Sample Score="); LFlt_IO.Put
              (Sample_Data(I).Score, 10,6,0);
            Ada.Text_IO.Put("; Actual Score="); LFlt_IO.Put
              (Actual_Data(I).Weighted_Worse_Ratio, 10,6,0);
            Ada.Text_IO.New_Line;
            --Ada.Text_IO.Put("Sample Sum="); LFlt_IO.Put
            --  (Sample_Sum, 10,6,0);
            --Ada.Text_IO.Put("; Sample Sum**2="); LFlt_IO.Put
            --  (Sample_Sqr_Sum, 10,6,0);
            --Ada.Text_IO.New_Line;
            --Ada.Text_IO.Put("Actual Sum="); LFlt_IO.Put
            --  (Actual_Sum, 10,6,0);
            --Ada.Text_IO.Put("; Actual Sum**2="); LFlt_IO.Put
            --  (Actual_Sqr_Sum, 10,6,0);
            --Ada.Text_IO.New_Line;
            --Ada.Text_IO.Put("Product Sum="); LFlt_IO.Put
            --  (Product_Sum, 10,6,0);
            --Ada.Text_IO.Put("; Diff Sum=");
            --LFlt_IO.Put (Diff_Sum, 10,6,0);
            --Ada.Text_IO.Put("; Diff Sum**2="); LFlt_IO.Put
            --  (Diff_Sqr_Sum, 10,6,0);
            --Ada.Text_IO.New_Line;
            -- else no trace
         end if;
      end Include;

      -- Needed for the 'Reduce attribute.
      Identity : constant Composite := (Negative_Count => 0,
                                        Sample_Sum     => 0.0,
                                        Sample_Sqr_Sum => 0.0,
                                        Product_Sum    => 0.0,
                                        Diff_Sum       => 0.0,
                                        Diff_Sqr_Sum   => 0.0);


With this in place, Randy's example can then be expressed as:

      Result := (for I in Sample_Data'Range when Sample_Data(I).Check_Item =>
                    Item'(Index => I, Score => Sample_Data(I).Score))
                 'Reduce(Include, Identity);

Note that I needed to construct a temporary Item object, since there should
only be two parameters to a combiner. Also, the when clause was nice, as it
eliminated the if statement just inside the original loop.

So writing Randy's example as a single 'Reduce felt like a good solution. The
loop seems simpler.

Maybe we don't need to have reduction support in parallel loops after all. Or at
least we could consider leaving it out for now, and adding it later if needed.

****************************************************************

From: Tucker Taft
Sent: Monday, January 15, 2018  1:03 PM

> There are a couple other issues I see as well. Looking at Randy's
> monster parallelism example, it involves 6 outputs where each result
> is a reduction. So, with this syntax, one would need to declare 6
> partial result arrays, which is OK, but getting to be a bit of manipulation of
> the existing code. Then, you need to mention one of these results in the
> iterator syntax, but there are 6 different result arrays. Would you need to list
> all 6 of these, perhaps comma separated, or just pick one?

Just pick one.  Or not mention any of them.  They clearly all need to have the
same range, or you are making trouble for yourself.

> I'm guessing you'd want to just pick one, but it feels a bit wrong to not
> mention them all.

As I said, fine to not mention any of them.  Define the desired chunk-index
range as a named subtype, and then use that everywhere it is needed.

> What if the arrays are declared with different numbers of elements? I
> presume you'd want a rule that says that all arrays referenced in the
> loop with the chunk parameter would be required to have the same number
> elements with the same indexing. Otherwise this syntax would not make sense.

I think you are making this more complex than necessary.  The programmer could
choose to use an array of records, indexed by the chunk index.  In any case,
that is a "programming problem," not something the compiler needs to be involved
with, so long as the compiler provides the chunk index.
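
[For illustration, the array-of-records approach might look something like the
following, reusing Num_CPUs and Arr from the earlier example; the accumulators
Sum and Sqr_Sum are invented and assumed to be declared elsewhere. - Editor.]

   declare
      type Accumulators is record
         Sum     : Integer := 0;
         Sqr_Sum : Integer := 0;
      end record;
      Partial : array (1 .. 2*Num_CPUs) of Accumulators;
   begin
      parallel (Chunk in Partial'Range) for I in Arr'Range loop
         Partial(Chunk).Sum     := @ + Arr(I);
         Partial(Chunk).Sqr_Sum := @ + Arr(I)**2;
      end loop;

      -- Combine the per-chunk accumulators sequentially.
      Sum     := 0;
      Sqr_Sum := 0;
      for P of Partial loop
         Sum     := Sum + P.Sum;
         Sqr_Sum := Sqr_Sum + P.Sqr_Sum;
      end loop;
   end;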

> Randy liked the old version of the AI, I believe, because all you
> needed to do was add the parallel keyword to the loop, then possibly add "parallel" to the declarations of reduction results. That seems quite a bit easier to convert existing loops.
> It might even be enough to just add "parallel" to the loop, without
> adding "parallel" to reduction variables, since with the global aspect and
> non-blocking aspect, the compiler will tell you about any problems with global
> reduction results.

I now believe this is just too much "magic."  The difficulty of implicitly
connecting the "parallel" arrays to the parallel loop that is being chunked
seems to result in too much hidden semantics.  It all works nicely in a
self-contained expression like the proposed reduction expression, but once you
have the declarations and the loop physically separated, reconnecting them is
very hard.  We have tried lots of different syntaxes to connect the
"accumulators" with the loop, and it all feels a bit magical and/or forced.

The advantage of just having a well-defined chunk index in the parallel loop is
that the connection to the particular accumulators is completely separated from
the syntax, allowing the programmer maximum flexibility while minimizing the
implementation effort and/or magic.

****************************************************************

From: Randy Brukardt
Sent: Monday, January 15, 2018  10:20 PM

> I now believe this is just too much "magic."  The difficulty of
> implicitly connecting the "parallel" arrays to the parallel loop that
> is being chunked seems to result in too much hidden semantics.  It all
> works nicely in a self-contained expression like the proposed
> reduction expression, but once you have the declarations and the loop
> physically separated, reconnecting them is very hard.  We have tried
> lots of different syntaxes to connect the "accumulators" with the
> loop, and it all feels a bit magical and/or forced.
>
> The advantage of just having a well-defined chunk index in the
> parallel loop is that the connection to the particular accumulators is
> completely separated from the syntax, allowing the programmer maximum
> flexibility while minimizing the implementation effort and/or magic.

My primary objection to having an explicit chunk index is that the average
programmer won't have a clue as to how to figure out the  number of chunks
needed (especially in any portable sort of way). The best value depends at least
on the amount of code per iteration, the variance in the amount of code per
iteration, the number of iterations, and the parallelism overhead (which most
likely can be modeled as some fixed amount + some variable amount*the number of
chunks).

In the general case, even the compiler doesn't have enough information to
determine this (the number of iterations might not be known until runtime, and parts
of the loop body are likely to be declared in other unit bodies, meaning they
might not even be compiled yet). (Obviously, some special cases would be easier,
but these are much more likely for the 'Reduce attribute than normal loops.)

I know Brad tried to provide a function in System.Multiprocessing for this
purpose, but that is missing the very important data about the cost of an
iteration and variance of iteration executions.

The variance matters because if each iteration takes about the same amount of
CPU time to execute, then one chunk per executor [CPU] is optimal -- that
minimizes overhead. If the variance between iteration times is high (as if most
of the execution is conditional), then there is value to more chunks, as that
guards against one chunk taking a lot longer than the others -- after all, the
loop isn't done until all of the chunks finish.

The cost of an iteration matters because if the iteration is short enough, then
a single chunk (and no parallelism) is likely best, as that avoids all of the
parallelism overhead.

Most users would simply end up choosing something as a WAG (Wild-A**ed-Guess) or
because they read it in a book -- neither seems very good for performance or
portability.

The clear advantage of Tucker's proposal is that it is simple enough to keep
from overwhelming us, which has to be a criterion here. So I'm cautiously in
favor.

****************************************************************

From: Randy Brukardt
Sent: Monday, January 15, 2018  11:03 PM

...
> I would claim that a reduction expression is the last remaining major
> component needed to complete the "iterated"
> operations.  As I tried to show in my examples drawn from the ParaSail
> compiler and static analyzer, there are many applications of a
> reduction expression -- it isn't just a "gee-gaw" notation, only used
> by functional-programming zealots.

Perhaps this is still "old guy-think", but to me, iteration is purely an
imperative operation. The functional guys had to add a bunch of operations like
"summation" to get sensible performance, and then they had to invent "Reduce" in
order to prevent a crazy proliferation of such operations. There is nothing
"natural" about Reduce!

Indeed, no one ever thought about Reduce until someone showed that it is
relatively easy to parallelize, and that an interesting class of problems can be
mapped (with substantial effort) into Reduce. (Supposedly, the Google empire is
built on this insight -- demonstrating that there is nothing obvious about this
idea.)

The problem is that Reduce is a wildly unnatural operation to an imperative
programmer; it's a difficult transition to think about this as some sort of
operation. Indeed, while I think the current formulation makes the most sense
from an Ada perspective, it still seems like a "gee-gaw" to me -- one may use
the operation, but only because it exists. (BTW, I view array slicing,
concatenation, and array ordering in this same category -- these operations only
really make sense for strings, but an over-generalization makes them available
for all array types -- and since they exist, they get used). It's the old "I
have a power-screwdriver in my toolkit, so let's make everything use screws".
[Modernizing an old adage!].

Yes, you can write a lot of things as a Reduce, but the only reason that you do
that is (A) you have it for parallelism, (B) you got used to using it for that
purpose, and (C) having a cool new gadget, you used it to do everything. That
certainly makes sense in ParaSail, which was designed to make everything
parallel by default, but it makes a lot less sense in Ada (which is mainly an
imperative language, and I'd expect it to stay that way).

What the Reduce operation really says to me is that we cannot sanely support
parallelism without resorting to creeping functionalism. And if true, then we're
fighting a losing battle with Ada in any case, and probably ought not waste more
time on it.

> I would claim that a reduction expression is the last remaining major
> component needed ...

Sorry, I've heard that one before. Quantified expressions were the only thing
really needed for contracts. But to me it looks like an effort for creeping
functionalism -- and as you admit, Ada is never going to be good at that (we'd
need monads and lambdas and loads of other things that would clobber performance
without any real gain). Let's be honest here, and realize that there never will
be a "last thing" when it comes to adding more functional features.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 16, 2018  12:36 AM

...
> As far as explicit parallelism, I now believe the right way to handle
> reductions in that context is to provide a chunk index to a parallel
> loop, vaguely analogous to an entry family index. ...

Administrative question: Is this a new AI, or something that you want to add to
the existing AI12-0119-1 (or the existing AI12-0242-1 on reduction expressions)?
I think I'd vote for a new AI, even though that will probably make more work
for me in the long run -- I'd like to get the most basic stuff finished ASAP, freeing
time to be spent on the more advanced ideas without having to worry if we finish
our critical features in time.

I'd like to know soon, so I can know where to file this thread before the
upcoming phone^H^H^H^H^Hgoogle Hangups meeting.

****************************************************************

From: Brad Moore
Sent: Tuesday, January 23, 2018  9:54 AM

> ...
>>
>> The advantage of just having a well-defined chunk index in the
>> parallel loop is that the connection to the particular accumulators
>> is completely separated from the syntax, allowing the programmer
>> maximum flexibility while minimizing the implementation effort and/or
>> magic.
>
> My primary objection to having an explicit chunk index is that the
> average programmer won't have a clue as to how to figure out the
> number of chunks needed (especially in any portable sort of way). The
> best value depends at least on the amount of code per iteration, the
> variance in the amount of code per iteration, the number of
> iterations, and the parallelism overhead (which most likely can be
> modeled as some fixed amount + some variable amount*the number of chunks).

I found with Paraffin that a good determination for the number of chunks can be
made just based on the number of iterations. All my libraries have optional
inputs to provide this, and tuning can sometimes give a bit of an edge in
performance, but I found it best to not bother with this, and leave the
determination of chunk number to the default calculation based on iterations.

The amount of code per iteration, and parallelism overhead are factors for
determining whether the parallelism is worthwhile or not, but that generally is
independent from determining a good chunk size value.

Variance of code per iteration can be important, but the chunk calculation I use
generally gives you small enough granularity that it works well with these sorts
of problems as well.

The calculation I came up with is;

         Natural'Max
           (1,
            Natural'Min
              (Iterations / Number_Of_Executors / 8,
               10_000_000));

This is why I was thinking that a library routine to return the number of
chunks, given a number of iterations, could be sufficient, rather than building
syntax around this.

It's a trivial enough library routine that we could consider leaving it to 3rd
party library writers to provide, though a standard library routine would
give the implementation some input, and might tie in better with the target CPU
or parallelism implementation, and provide a more complete solution for
parallelism.
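
[For reference, Brad's calculation could be packaged as a library function
roughly like the following; the function name is invented, and Ada 2012's
System.Multiprocessors.Number_Of_CPUs is used only as a default executor
count. - Editor.]

   with System.Multiprocessors;
   function Default_Chunk_Count
     (Iterations          : Natural;
      Number_Of_Executors : Positive :=
        Positive (System.Multiprocessors.Number_Of_CPUs))
     return Positive is
   begin
      -- Brad's heuristic: roughly 8 chunks per executor, but never
      -- fewer than 1 chunk nor more than 10_000_000 chunks.
      return Natural'Max
        (1,
         Natural'Min
           (Iterations / Number_Of_Executors / 8,
            10_000_000));
   end Default_Chunk_Count;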

Or the other alternative is something like Tucker proposed where we provide new
iteration syntax to iterate over chunks.

> In the general case, even the compiler doesn't have enough information to
> determine this (the number of iterations can be determined at runtime, and
> parts of the loop body are likely to be declared in other unit bodies,
> meaning they might not even be compiled yet).

You generally want to apply the parallelism at the highest level you can
(in the outermost loop). Applying parallelism inside nested loops gets
trickier, because the parallelism overhead is multiplied by the outer loop.

Given that a good chunk number can be determined based only on number of
iterations, determining this at runtime should be pretty easy to do.

> I know Brad tried to provide a function in System.Multiprocessing for this
> purpose, but that is missing the very important data about the cost of an
> iteration and variance of iteration executions.

I actually don't think that is very important. It might be important in
determining whether or not parallelism should be applied, but with explicit
parallelism, there is no need to determine that. The programmer decides.

> The variance matters because if each iteration takes about the same amount
> of CPU time to execute, then one chunk per executor [CPU] is optimal -- that
> minimizes overhead. If the variance between iteration times is high (as if
> most of the execution is conditional), then there is value to more chunks,
> as that guards against one chunk taking a lot longer than the others --
> after all, the loop isn't done until all of the chunks finish.

A simple calculation like the one I have above is good because regardless of the
amount of code per iteration, the number of chunks is small enough to not be a
noticeable amount of overhead to iterate through, yet generates small enough
chunks to allow enough granularity to account for variances between iteration
times.

> The cost of an iteration matters because if the iteration is short enough,
> then a single chunk (and no parallelism) is likely best, as that avoids all
> of the parallelism overhead.

Yes, but again, explicit parallelism lets the user decide. If the compiler can't
determine this for itself for implicit parallelism, it likely will default to
sequential anyway.

> Most users would simply end up choosing something as a WAG
> (Wild-A**ed-Guess) or because they read it in a book -- neither seems very
> good for performance or portability.

Which is why I was suggesting a library routine to calculate this.

****************************************************************

From: Tucker Taft
Sent: Tuesday, January 23, 2018  11:51 AM

You seem to be considering only cases where you know the number of iterations in
advance, and where the chunks each end up with an identical number of iterations.  I
am not sure either of those assumptions is going to be true for container
iterators, or other more general iterators.

****************************************************************

From: Edmond Schonberg
Sent: Tuesday, January 23, 2018  12:33 PM

From what I can gather from the literature on existing multicore programming
tools, portability and performance are essentially incompatible; if you want
performance you need to tailor the program to the target (no. of threads,
processors, etc.) and none of this can be done without good profiling tools.  At
the programming level therefore you need to be able to specify explicitly the
chunk value, and/or whatever other parameters are target-specific. Otherwise the
parallel constructs being proposed will not provide the performance improvement
a user may expect.  The compiler will certainly not be able to choose the right
values any better. So there must be some syntactic way of describing those
parameters.  I guess pragmas are the easiest language tools at hand...

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 23, 2018  4:09 PM

...
> > My primary objection to having an explicit chunk index is that the
> > average programmer won't have a clue as to how to figure out the
> > number of chunks needed (especially in any portable sort of way).
> > The best value depends at least on the amount of code per iteration,
> > the variance in the amount of code per iteration, the number of
> > iterations, and the parallelism overhead (which most likely can be
> > modeled as some fixed amount + some variable amount*the number of chunks).
>
> I found with Paraffin that a good determination for the number of
> chunks can be made just based on the number of iterations. All my
> libraries have optional inputs to provide this, and tuning can
> sometimes give a bit of an edge in performance, but I found it best to
> not bother with this, and leave the determination of chunk number to
> the default calculation based on iterations.

For portability, we're really looking for PGP (Pretty Good Performance) rather
than perfection. The best possible performance almost certainly will require
tuning to the target machine (not just the processor, but the entire hardware
setup), and that's clearly not portable at all.

> The amount of code per iteration, and parallelism overhead are factors
> for determining whether the parallelism is worthwhile or not, but that
> generally is independent from determining a good chunk size value.
>
> Variance of code per iteration can be important, but the chunk
> calculation I use generally gives you small enough granularity that it
> works well with these sorts of problems as well.
>
> The calculation I came up with is;
>
>          Natural'Max
>            (1,
>             Natural'Min
>               (Iterations / Number_Of_Executors / 8,
>                10_000_000));

I couldn't figure out whether this meant (Iterations / Number_Of_Executors) / 8
or Iterations / (Number_Of_Executors / 8) -- division is not associative. I
suspect that it means (Iterations / (Number_of_Executors * 8)). But it seems far
too large either way -- you're going to be doing an awful lot of stopping and
starting of chunks (which ought to be rather expensive). I'm very surprised that
this would even provide PGP, unless of course the amount of code in each
iteration is already very large. [In that case, loop overhead would not be
significant.] But that's not the common case for loops -- at least not my loops.

In any event, I suspect that you are looking at typical code that you write,
which hardly is representative of the entire universe of code (which includes
things that you or I would never write). And perhaps the behavior of various
compilers differs a lot for Ada tasking.

It's clear, for instance, that if the iteration variance is essentially zero (no
conditionals in the loop), that the number of chunks should equal the number of
executors (that minimizes overhead). That is a common enough case to include it
in the calculations.

It's also clear that if the amount of code in an iteration is relatively small,
so that the cost of looping overhead is significant (which is true of many of the
loops I write), then the number of chunks needs to be minimized.

But maybe you're right, in that it isn't worth parallelizing code unless it is
lengthy enough that it must have a significant variance and substantial size. In
that case, your formula here makes as much sense as anything else.

> This is why I was thinking that a library routine to return the number
> of chunks, given a number of iterations could be sufficient, rather
> than building syntax around this.

I wasn't looking for syntax! I was looking for an answer to give PGP for
portable code, and to provide a first cut.


> Its a trivial enough library routine, that we could consider leaving
> it to 3rd party library writers to provide this, though a standard
> library routine would give the implementation some input, and might
> tie in better with the target CPU, or parallelism implementaiton, and
> provide a more complete solution for parallelism.

??? I'm not talking about a "general solution for parallelism"; I didn't even
expect the compiler to use such a routine. (It surely can do better if it can
estimate the code size and variance.) But I thought that we would at least need
to have some way to determine how many executors are really available (if some
other task is also executing a parallel loop or block, it may make sense to take
that into account as well) [thus the Task_Id needs to be passed into such a
routine], and probably some indication of the variance (no variance is common
enough to special-case it).

> Or the other alternative is something like Tucker proposed where we
> provide new iteration syntax to iterate over chunks.

That's the ONLY thing I am talking about here!!! His proposal required
determining the number of chunks to use, and that needs some portable solution
for PGP. Tuning is a separate topic (although I would hope that his proposal
would provide about 98% of what is needed).

> > In the general case, even the compiler doesn't have enough information to
> > determine this (the number of iterations can be determined at runtime, and
> > parts of the loop body are likely to be declared in other unit
> > bodies, meaning they might not even be compiled yet).
>
> You generally want to apply the parallelism at the highest level you can.
> (In the outermost loop). Applying parallelism overhead inside nested loops gets
> trickier, because that overhead is multiplied by the outer loop.

Who said anything about "inside nested loops"?? I said that the compiler can't
determine this information because it is in other units, possibly not even
compiled yet.

> Given that a good chunk number can be determined based only on number
> of iterations, determining this at runtime should be pretty easy to
> do.

This is your unsupported assertion, one that no one else appears to believe.

> > I know Brad tried to provide a function in System.Multiprocessing for this
> > purpose, but that is missing the very important data about the cost of an
> > iteration and variance of iteration executions.
>
> I actually dont think that is very important. It might be important in
> determining whether or not parallelism should be applied, but with
> explicit parallelism, there is no need to determine that. The
> programmer decides.

I'm assuming that the overhead of executing a parallel chunk is rather high
compared to "typical" code. As such, running too many chunks increases the
overhead with no appreciable benefit.

If the iteration code is very large, this doesn't apply, and that seems to be
the sort of cases that you tried with Paraffin. That makes sense, because using a
library like Paraffin isn't that easy to set up, and you'll only do it if you're
certain that there will be a benefit.

But parallel loops will be relatively easy to apply (just add "parallel" and
assuming Tucker's explicit chunk mechanism, change any accumulators into arrays);
the code in the loop will often be a lot smaller. (Even easier if we're talking
about Parallel_Reduce or the raw parallel, but in that case there is no need to
talk about the user determining the number of chunks.)

> > The variance matters because if each iteration takes about the same amount
> > of CPU time to execute, then one chunk per executor [CPU] is optimal -- that
> > minimizes overhead. If the variance between iteration times is high (as if
> > most of the execution is conditional), then there is value to more chunks,
> > as that guards against one chunk taking a lot longer than the others
> > -- after all, the loop isn't done until all of the chunks finish.
>
> A simple calculation like the one I have above is good because regardless of
> the amount of code per iteration, the number of chunks is small enough to not
> be a noticeable amount of overhead to iterate through, yet generates small enough
> chunks to allow enough granularity to account for variances between
> iteration times.

Your calculation seems to yield 312 chunks for 10000 iterations on a 4-core
machine. That seems like way too many; I wouldn't use more than 16 in any
circumstance (the variance of real code isn't that wide). You're getting way too
much overhead; you probably can't tell that if the iterations are rather
expensive, but for more typical code you'll be eaten up by overhead.

...
> > Most users would simply end up choosing something as a WAG
> > (Wild-A**ed-Guess) or because they read it in a book -- neither seems very
> > good for performance or portability.
>
> Which is why I was suggesting a library routine to calculate this.

I agree, but I want the routine to take more information so it can make a much
more useful guess. Just taking the number of iterations only will work if the
iteration is quite expensive and the variance is high -- probably a reasonably
common situation, and perhaps the best one can do if nothing else is available.
(Tucker also points out that the number of iterations isn't always determinable,
although it usually is.) But there user knows if their code contains
conditionals or not (I'd just make that a Boolean, as the user cannot make any
useful estimate beyond that -- I don't even know a good way to characterize it).
And some estimate for the code size for an iteration, perhaps in terms of MOPs
(memory operations). [Just Small-Medium-Large amounts of code probably is
enough, but that might be too weird.] The implementation will have an estimate
for overhead, so that doesn't need to be provided.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 23, 2018  4:16 PM

> You seem to only be considering cases where you know the number of
> iterations in advance, and the chunks each end up with an identical
> number of iterations.  I am not sure either of those assumptions are
> going to be true for container iterators, or other more general
> iterators,

I'm sort of with Brad on this one.

Containers have the Length function, so one can tell the maximum number of
iterations rather easily. Exits are non-deterministic in any case, so it's
pretty much impossible to trust anything about them; best to forget that.

And the containers have to return roughly equal-sized chunks. (Exactly equal
doesn't even mean anything unless the loop body has no variance [conditionals]
-- normal iterations will take different amounts of time to run.) Otherwise, no
estimates of any sort can be usefully made.

We're trying to determine portable PGP (Pretty Good Performance) here, anyway;
optimal performance clearly requires tuning that isn't relevant to my concern
(which is how to determine the number of chunks to use in portable code --
picking random numbers is not the answer!).

****************************************************************

From: Randy Brukardt
Sent: Tuesday, January 23, 2018  4:22 PM

...
> From what I can gather from the literature on existing multicore
> programming tools, portability and performance are essentially
> incompatible; if you want performance you need to tailor the program
> to the target (no. of threads, processors,
> etc.) and none of this can be done without good profiling tools.

We're not looking for ultimate performance from these, but rather "Pretty Good
Performance". And note that the Ada runtime has the information about be
processors and the implementation to at least make a starting point. That should
be possible portably.

At some future point, we're going to talk about further tuning parameters
(although I think Tucker's explicit chunk stuff provides the needed hooks for
users to tweak).

...
> At the programming level therefore you need to be able to specify
> explicitly the chunk value, snd/or whatever other parameters are
> target-specific. Otherwise the parallel constructs being proposed will
> not provide the performance improvement a user may expect.  The
> compiler will certainly not be able to choose the right values any
> better.
> So there must be some syntactic way of describing those parameters.  I
> guess pragmas are the easiest language tools at hand...

Tuning parameters are a separate, future AI. The current discussion (at least,
what *I* think the current discussion is about) is the parameters to use for
portable code, in particular using explicit chunk indexes. Users shouldn't have
to guess a starting point for portable code, nor for tuning.

I suspect that people will try to parallelize a lot of stuff that doesn't
execute long enough to really benefit (especially on desktop targets). I don't
think there is anything we can do at the language level (and not much at the
implementer level, either, other than education) to give them the performance
they want as opposed to what makes sense.

****************************************************************

From: Edmond Schonberg
Sent: Tuesday, January 23, 2018  7:42 PM

It depends on who you are talking to. We have customers that want to use Ada on
GPUs, that have compute-intensive applications (fluid dynamics, financial
computations), that are currently using other languages for this, and for whom
PGP will be a total disappointment. And they don't expect their code to run on
laptops!  If, as was said repeatedly, parallelism is all about performance, then
PGP is beside the point: if you don't provide tuning mechanisms you might as
well run sequentially (which is certainly portable :-)!

****************************************************************

From: Tucker Taft
Sent: Wednesday, January 24, 2018  2:45 AM

>> ...
>>
> It depends who you are talking to. We have customers that want to use
> Ada on GPU’s. that have computer-intensive applications (fluid
> dynamics, financial
> computations) that are currently using other languages for this, and
> for whom PGP will be a total disappointment,  And they dont expect
> their code to run on laptops!  If, as was said repeatedly, parallelism
> is all about performance, then PGP is besides the point: if you don’t
> provide tuning mechanisms you might as well run sequentially  (which
> is certainly portable :-)!

It is all about performance, but there is a long-term and a short-term.  My
favorite example is the C programming language, which allowed users to specify
which variables they wanted the compiler to keep in a register.  I believe you
can still say "register int i;" but the only effect now is that a "register"
declaration means that you can no longer use the "&" operator with the variable.
The compiler is much better at register allocation than are programmers.

Today, it is likely that some programmers are better than the compiler at doing
"CPU" allocation, but that will continue to shift over time.  This is one reason
I am against producing parallel versions of self-contained things like
quantified expressions or reduction expressions.  On the other hand, at a
high-level loop level, the compiler is probably not as smart as some
programmers, but probably is already smarter than many programmers, as far as
CPU allocation.  So we should have reasonable defaults, for the "average"
programmer, but still accommodate the more sophisticated programmers who are
still "smarter" than the compiler w.r.t. CPU allocation.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 24, 2018  2:47 PM

...
> We have customers that
> want to use Ada on GPU's. that have computer-intensive applications
> (fluid dynamics, financial
> computations) that are currently using other languages for this, and
> for whom PGP will be a total disappointment,  And they dont expect
> their code to run on laptops!

Of course there will be people that expect 8x performance gains on a quad core.
No help there. And surely there are people that will need to tune their code to
absolute performance -- but that's not the point of these features. The point is
safe, portable parallelism, by definition that isn't going to be all things to
all people. (There's a lot of things I want that I'm never going to get, after
all.)

And quite likely they are going to want to run some version of their code on a
laptop -- almost all developers do initial testing on such machines before
running it in the final hardware environment. The smaller the change is between
environments, the better.

In any case, this particular thread was about Tucker's proposed way of writing
traditional loops that have some sort of reduction (which is essentially all of
them). Converting a sequential loop with multiple outputs into reduction form
requires turning the loop inside out and either introducing  additional (tuple)
types and operations, or converting into multiple reductions which may have to
repeat expensive calculations. Neither seems ideal.

It appears that Tucker's solution also provides at least some tuning parameters,
which is an added bonus (but not the primary goal).

We do need to discuss tuning features at some point, but if we try to do
everything at once there will be a guarantee that we will finish nothing on
time. (Of course, if Tucker keeps updating AI12-0240-1 rather than his
critical path AIs like AI12-0079-1, nothing will get done anyway. ;-)

****************************************************************

From: Edmond Schonberg
Sent: Wednesday, January 24, 2018  3:28 PM

> Of course there will be people that expect 8x performance gains on a
> quad core. No help there. And surely there are people that will need
> to tune their code to absolute performance -- but that's not the point
> of these features. The point is safe, portable parallelism, by
> definition that isn't going to be all things to all people. (There's a
> lot of things I want that I'm never going to get, after all.)

My suspicion is that "safe, portable parallelism" will not be of interest to
anyone because it will not deliver on the implicit promise that it will improve
performance.

> And quite likely they are going to want to run some version of their
> code on a laptop -- almost all developers do initial testing on such
> machines before running it in the final hardware environment. The
> smaller the change is between environments, the better.

In that case what about zero changes, no new keywords, and leave the tuning to
TBD pragmas that can be target-specific?

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 24, 2018  4:26 PM

> > Of course there will be people that expect 8x performance gains on a
> > quad core. No help there. And surely there are people that will need
> > to tune their code to absolute performance -- but that's not the
> > point of these features. The point is safe, portable parallelism, by
> > definition that isn't going to be all things to all people. (There's
> > a lot of things I want that I'm never going to get, after all.)
>
> My suspicion is that "safe, portable parallelism" will not be of
> interest to anyone because it will not deliver on the implicit promise
> that it will improve performance.

Well, I don't expect parallelism to improve performance (tuning or not) for most
code on most targets. I can imagine an environment where it would (a bare
dedicated machine in an environment where power consumption was not an issue),
but that is not a practical target.

But it's clear that there are cases, not that common, where parallelization will
help massively: extreme numbers of iterations, very expensive iterations, and
certain patterns that can be mapped to special hardware. The first two cases can
be dealt with portably, and that's the point here.

> > And quite likely they are going to want to run some version of their
> > code on a laptop -- almost all developers do initial testing on such
> > machines before running it in the final hardware environment. The
> > smaller the change is between environments, the better.
>
> In that case what about zero changes, no new keywords, and leave the
> tuning to TBD pragmas that can be target-specific?

If the postulate is that Ada 2012 is good enough (using something like Paraffin),
then I see no point in tuning pragmas, because one can tune everything sensible
via the parameters to the library (the number of executors, number of chunks,
and so on). About the only case where selecting the processor would make sense
is a heterogeneous environment (but Ada has never made any effort to support
such environments, so anything done there would necessarily be
implementation-defined).

But one needs keywords and the like to get safety. Ada has always been about
safety more than extreme performance, but one is completely on one's own when
writing tasks. The compiler gives absolutely no help in preventing deadlocks or
race conditions; as a consequence, every program I've ever written using Ada
tasks is a deadlock waiting to happen.

If writing safe parallel code is impossible in Ada, then Ada literally has no
future, because clearly there will be new languages and techniques that do
provide such safety, and there will be no reason to use languages that don't.
Moreover, one will want to add parallel execution to existing code that wasn't
written with parallel execution in mind. If that isn't possible safely and a
full rewrite is needed, why wouldn't one find a newer language that does
promise safety?

I, for instance, would like to have a loop something like:

            parallel for S of Current_Unit_Subprograms loop
                Optimize_Subprogram (S.Start, S.Finish);
            end loop;

Eliminating all race conditions from Optimize_Subprogram would certainly require
compiler help to point out issues; there is no way to look at 20,000 lines of
code and find everything. And I surely would want to know if there is some
lingering memory usage that needs protection.

Concluding a long-winded message: these features are 100% about adding safety to
parallel programming in Ada. There is no need to add features to add performance
to existing code (those necessarily would be target-specific, so those are
implementation-defined if they exist at all). There might be a need for
performance tuning for these new features, but I at least want to finish the
basic definition before worrying too much about tuning. (If almost no existing
code can be converted to safe parallel loops, tuning is irrelevant.)

****************************************************************

From: Tucker Taft
Sent: Wednesday, January 24, 2018  3:04 PM

> ... (Of course, if Tucker keeps updating AI12-0240-1 rather than his
> critical path AIs like AI12-0079-1, nothing will get done anyway. ;-)

My own feeling was that phone meetings were not particularly good for debating
the fine points of detailed wording, but rather would be most useful to evaluate
new proposals.  So that is why I have focused on new proposals.  I'm not worried
about us getting the last bits of wording right for AIs where the intent is
already agreed upon.  I am more worried about AIs where we might not agree yet
on the intent.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 24, 2018  3:48 PM

> My own feeling was that phone meetings were not particularly good for
> debating the fine points of detailed wording, but rather would be most
> useful to evaluate new proposals.  So that is why I have focused on
> new proposals.  I'm not worried about us getting the last bits of
> wording right for AIs where the intent is already agreed upon.  I am
> more worried about AIs where we might not agree yet on the intent.

I rather disagree; phone meetings are best for wordsmithing since it isn't
necessary to diagram on a whiteboard like it is for syntax arguments. For Ada
2012, we used phone meetings to finalize almost all of the wording, so it's
amazing to me to think that it suddenly has gotten difficult to do what we've
done many times before.

In any case, AI12-0079-1 is nowhere near complete enough for me to be able to
annotate the containers with it, a job I need to start ASAP. For instance, we
have no solution in the AI yet to the problem of implementation-defined helper
packages. We haven't discussed how to handle atomic writes in "synchronized"
mode (these can easily be race conditions, which shouldn't be allowed in
general, lest our goal of preventing unsynchronized access be completely
undermined). There are probably other issues that aren't mentioned in the AI.

Moreover, you haven't really done much of an update for the last couple of
meetings. It seems like it isn't progressing, and it is the most core thing we
are working on.

I was expecting that we would spend a bit of time on the 4 BIs (none of which
should require much discussion, but maybe some wordsmithing), a bit of time on
some of the new proposals (to determine whether or not we want to pursue them,
and find a victim, er, ARG author for the AI), and the bulk of the time on
AI12-0079-1 and AI12-0119-1 in the hopes of getting some of our core work done.
[Remember that AI12-0119-1 depends on AI12-0079-1.] (We can't finish all of the
AIs for Ada 2020 in the last week, as I simply couldn't get them all into the
Standard in the two weeks or so of editorial time the schedule allows.) And then
we would look at anything else as time permits.

****************************************************************

From: Brad Moore
Sent: Saturday, January 27, 2018  10:20 AM

> From what I can gather from the literature on existing multicore 
> programming tools, portability and performance are essentially 
> incompatible; if you want performance you need to tailor the program 
> to the target (no. of threads, processors, etc.) and none of this can 
> be done without good profiling tools.

In general, I think profiling tools can be helpful for pointing out places
in the program where one should look for Parallelism Opportunities (POPs).

In many cases, this will be obvious, for instance if a lengthy number-crunching
routine is being called. The best place to look for POPs is to start at the
top (the top-level code of the routine doing the number crunching, for
example).

The higher up in the code the parallelism is applied, the better the chance
that the parallelism overhead will be overcome and the parallelism will be
worthwhile.
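
For example, a minimal sketch (the names Data_Sets and Crunch are invented for
illustration, and the loop uses the parallel loop syntax proposed in
AI12-0119-1): rather than parallelizing loops buried inside the number
crunching, apply the parallelism once, at the outermost loop over the data:

   --  Parallelism applied at the top level, once per data set,
   --  rather than inside Crunch itself:
   parallel for D of Data_Sets loop
      Crunch (D);   --  the lengthy number-crunching routine
   end loop;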

...
> At the programming level therefore you need to be able to specify
> explicitly the chunk value, and/or whatever other parameters are 
> target-specific. Otherwise the parallel constructs being proposed will 
> not provide the performance improvement a user may expect.

I think the situation where the programmer is disappointed with the
performance result is mostly the case where the amount of work to be performed
was not enough to overcome the parallelism overhead in the first place.
Attempting to apply tuning to that situation is likely not going to change the
outcome.

It is relatively easy for the programmer to determine if the parallelism
helped or not.

I typically just calculate the elapsed time across the POP, print it out in a
debugging statement, and compare against the time when I turn off the
parallelism. Otherwise, I use no special profiling tools.

Given a situation where the parallelism is actually worthwhile, using a
reasonable default for chunk size will probably give not just pretty good
results, but good or better ones.

I gave a formula for this in a previous email, but that is actually the
formula I use for minimum grain size, which applies to some of my libraries
that do load balancing. All my library routines start off with the same chunk
size, using a different calculation based on the number of available
executors, but the load-balancing ones will attempt to steal or seek work from
other executors, in pieces no smaller than the grain size (in iterations),
once they complete the work that has been assigned to them.

I always set the chunk size based on the number of executors (threads), and
divide the iterations evenly amongst the executors. A good default for number
of executors is the number of available cores, but I find that using a number
3 to 4 times the number of cores gives optimal performance, though the
difference between optimal and 1 * number of cores is in the land of
diminishing returns. 

...
> The compiler will certainly not be able to choose the right values any better.
> So there must be some syntactic way of describing those parameters.  I 
> guess pragmas are the easiest language tools at hand...

Setting the number of executors probably is a good candidate for a
configuration pragma, while also allowing it to be overridden in a unit or
subprogram. In most cases, setting it once as a configuration parameter is
enough, or relying on a system default (number of available CPUs * some low
multiple) if the pragma is not specified.
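
Purely as an illustration of the idea (the pragma name Default_Executors is
invented here, and is not part of any actual proposal), that might look
something like:

   --  Hypothetical configuration pragma (e.g. in gnat.adc), applying
   --  partition-wide:
   pragma Default_Executors (8);

   --  Hypothetically overridden for one unit that benefits from more:
   package Heavy_Crunching is
      pragma Default_Executors (16);   --  again, purely illustrative
      procedure Crunch_Everything;
   end Heavy_Crunching;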

****************************************************************

From: Brad Moore
Sent: Saturday, January 27, 2018  10:46 AM

> It is all about performance, but there is a long-term and a 
> short-term.  My favorite example is the C programming language, which 
> allowed users to specify which variables they wanted the compiler to 
> keep in a register.  I believe you can still say "register int i;" but the
> only effect now is that a "register" declaration means that you can no
> longer use the "&" operator with the variable.  The compiler is much better
> at register allocation than are programmers.

I believe there is still value in C's register keyword though. Since you
cannot use the "&" operator, the compiler knows that the variable cannot be
aliased, which could, in theory, allow the compiler to apply further
optimisations that it might not otherwise be able to apply. It also documents
the programmer's intent that the code that utilises that variable needs to be
optimised for performance.

We are just now talking about adding some related features to Ada, such as in
AI12-0240, where we are attempting to prevent unknown aliasing of names, as
well as optimization features such as parallelism.

****************************************************************

From: Brad Moore
Sent: Saturday, January 27, 2018  12:24 PM

...
>> The calculation I came up with is;
>> 
>>          Natural'Max
>>            (1,
>>             Natural'Min
>>               (Iterations / Number_Of_Executors / 8,
>>                10_000_000));
> 
> I couldn't figure out whether this meant (Iterations / Number_Of_Executors)
> / 8 or Iterations / (Number_Of_Executors / 8) -- division is not
> associative. I suspect that it means (Iterations / (Number_of_Executors *
> 8)). But it seems far too large either way -- you're going to be doing an
> awful lot of stopping and starting of chunks (which ought to be rather
> expensive). I'm very surprised that this would even provide PGP, unless of
> course the amount of code in each iteration is already very large. [In that
> case, loop overhead would not be significant.] But that's not the common
> case for loops -- at least not my loops.
> 

Sorry, but this is the calculation I use for determining the minimum grain
size, where a grain is the smallest number of iterations that can be stolen
from another executor, for my library routines that apply load balancing. A
chunk might get dynamically broken down into smaller chunks, but no smaller
than a grain.

For determining chunk size, I use the following calculation, which is also
based on the number of available executors and the number of iterations.

Number_Of_Workers := Positive'Min (Iterations,
                                   Positive (Available_Executors) +
                                     (Iterations mod Positive (Available_Executors)));

Chunk_Size := Iterations / Number_Of_Workers;

Note: The number of available executors does not mean the number of available
CPUs. It means the number of available worker threads, which typically is a
small multiple of the number of available CPUs. (The multiple I use is
typically 1, 2, or 3.) Using 1 is a good default; using 3 typically gives
optimal performance, but not significantly better than using 1.
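
As a worked instance of that calculation (the iteration and executor counts
here are just assumed for illustration):

   --  Iterations = 1_000_000, Available_Executors = 8:
   --    Iterations mod 8  = 0
   --    Number_Of_Workers = Positive'Min (1_000_000, 8 + 0)  = 8
   --    Chunk_Size        = 1_000_000 / 8  = 125_000
   --
   --  Iterations = 1_000_000, Available_Executors = 12:
   --    Iterations mod 12 = 4
   --    Number_Of_Workers = Positive'Min (1_000_000, 12 + 4) = 16
   --    Chunk_Size        = 1_000_000 / 16 = 62_500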

> In any event, I suspect that you are looking at typical code that you write,
> which hardly is representative of the entire universe of code (which
> includes things that you or I would never write). And perhaps the behavior
> of various compilers differs a lot for Ada tasking.

I've tried this with many different problems: those with even work loads,
uneven work loads, large iteration counts, and small iteration counts; with 3
different Ada compiler vendors; on different OSes (Windows, Linux, Raspbian (a
Linux variant), Android (a Linux variant), Mac OS) and different targets
(Intel, AMD quad-core, dual-core, 8-core, Raspberry Pi, my Android cell
phone); and I've compared different library strategies: my own dynamic load
balancing (work seeking, work stealing) and static non-load-balancing. I've
also compared against libraries written with OpenMP and Cilk, using my own
bindings to these platforms, as well as writing programs in C and C++ calling
OpenMP and Cilk directly.

I always get good results for problems that are worth parallelizing, using
the same calculation for chunk size.

So I feel pretty confident these results can be extended to other environments
I haven't tried yet.

> It's clear, for instance, that if the iteration variance is essentially zero
> (no conditionals in the loop), that the number of chunks should equal the
> number of executors (that minimizes overhead). That is a common enough case
> to include it in the calculations.

The number of chunks is always equal to the number of executors, both in my
libraries and in my use of other platforms. The real question is how to
determine the number of executors, which is typically a small multiple of the
number of available cores.

Even for the case where the number of executors = the number of available
cores, it is typical that all worker threads do not progress at exactly the
same rate and some executors will finish early, not fully utilizing all the
cores. Using a higher number of executors allows better granularity
to help ensure that the cores have better utilization.

> It's also clear that if the amount of code in an iteration is relatively
> small that the cost of looping overhead is significant (which is true of
> many of the loops I write), then the number of chunks needs to be minimized.
> 
> But maybe you're right, in that it isn't worth parallelizing code unless it
> is lengthy enough that it must have a significant variance and substantial
> size. In that case, your formula here makes as much sense as anything else.

For about the smallest amount of code per iteration you can imagine, say
adding integers, I find that you need roughly a million iterations to hit the
break-even point for overcoming the parallelism overhead. As the amount of
code per iteration increases, the break-even point drops, as you'd expect.

> 
>> Given that a good chunk number can be determined based only
>> on number of iterations, determining this at runtime should
>> be pretty easy to do.
> 
> This is your unsupported assertion, one that no one else appears to believe.

Perhaps the assertion can be supported with a couple of examples.

First, consider a loop that simply adds the integers from 1 to N.
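
(In sketch form, with Sum a 64-bit integer accumulator initialized to zero,
the sequential loop is simply:

            for I in 1 .. N loop
               Sum := Sum + Long_Long_Integer (I);
            end loop;

where N = 1_000_000_000 in the runs below.)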

Using the default calculation, where the number of executors = the number of
cores and the number of chunks = the number of executors:

For 1_000_000_000 iterations, on 64-bit integers and 8-byte floating point,
I get the following results using various libraries with GNAT on Linux, AMD quad-core.

************* Parallel Framework Test *************
  Physical Processors= 4
  Workers = 4
  Effective_Workers = 4
  Load_Sensitive=FALSE
  Grain_Size= 10000

*** Integer Sum Reduction Tests ***

Sequential Sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:02.28
Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.69
OpenMP Work Static sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76
OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.64
OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.66
Cilk Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:01.07
Ordered Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.70
Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.70
Ordered Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.62
Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.66
Ordered Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.65
Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.65
OpenMP Static Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.66
OpenMP Dynamic Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.67
OpenMP Guided Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.67
Pooled Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.84
Pooled Work Sharing Ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.72
Unbounded Pooled Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
Unbounded Pooled Work Sharing ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
Pooled Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
Pooled Work Seeking ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.64
Pooled Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.73
Pooled Work Stealing ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.78
Pooled Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.86
Pooled Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
Work Sharing Recursive sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.72
Work Seeking Recursive sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.98
Non Generic Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.77
Non Generic OpenMP Static sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.79
Non Generic OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.77
Non Generic OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
Non Generic Cilk Static sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.82
Non Generic Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76
Non Generic Work Seeking sum of 1000000000 integers using Task Attributes in parallel is  500000000500000000, Elapsed = 00:00:00.77
Non Generic Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.92
Non Generic Pooled Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.81
Non Generic Pooled Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.78
Non Generic Pooled Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.78

*** Floating Point Addition Reduction Tests ***

Sequential Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:09.47
Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57
Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.76
OpenMP Static Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.48
OpenMP Dynamic Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.56
OpenMP Guided Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.58
Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.46
Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.51
Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.64
Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.72
Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.70
Pooled Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57
Pooled Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.92
Pooled_Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.49
Pooled_Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.50
Pooled Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.52
Pooled Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.49
Pooled Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.52
Pooled Work Seeking Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57

*** Non Reducing Array Manipulation Tests ***

Sequential Manipulation bits of bit array is Done, Elapsed = 00:00:00.25
Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.10
Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.08
Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06
OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.08
Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.09
Pooled Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08
Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06
Pooled Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.06
Work Seeking Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08
Non Generic Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06
Non Generic Cilk static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.02
Non Generic Pooled Work Sharing Manipulation of bit array is Done, Elapsed = 00:00:00.07
Non Generic Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Pooled Work Stealing Manipulation of bit array is Done, Elapsed = 00:00:00.07


This sped up the processing pretty consistently by a factor of about 3.5-3.6
on 4 cores for 64-bit integer addition, and by about 3.7 for 8-byte floating
point.

After playing around with tuning, using my suggested number of executors for
optimal performance, I found I got pretty much the same result with about 12
executors (3 x the number of cores).

When I rerun this example with 12 executors, I get:



************* Parallel Framework Test *************
  Physical Processors= 4
  Workers = 12
  Effective_Workers = 12
  Load_Sensitive=FALSE
  Grain_Size= 10000

*** Integer Sum Reduction Tests ***

Sequential Sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:02.25
Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.65
OpenMP Work Static sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.66
OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.65
Cilk Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:01.11
Ordered Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.67
Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.62
Ordered Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.69
Ordered Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.73
Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
OpenMP Static Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
OpenMP Dynamic Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
OpenMP Guided Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.74
Pooled Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.70
Pooled Work Sharing Ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.68
Unbounded Pooled Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.74
Unbounded Pooled Work Sharing ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.75
Pooled Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.69
Pooled Work Seeking ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.64
Pooled Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.69
Pooled Work Stealing ordered sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.67
Pooled Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.74
Pooled Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.71
Work Sharing Recursive sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.75
Work Seeking Recursive sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.96
Non Generic Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.80
Non Generic OpenMP Static sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.83
Non Generic OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76
Non Generic OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76
Non Generic Cilk Static sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.95
Non Generic Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.79
Non Generic Work Seeking sum of 1000000000 integers using Task Attributes in parallel is  500000000500000000, Elapsed = 00:00:00.81
Non Generic Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.77
Non Generic Pooled Work Sharing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76
Non Generic Pooled Work Seeking sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76
Non Generic Pooled Work Stealing sum of 1000000000 integers in parallel is  500000000500000000, Elapsed = 00:00:00.76

*** Floating Point Addition Reduction Tests ***

Sequential Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:09.47
Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.58
Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57
OpenMP Static Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.53
OpenMP Dynamic Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.40
OpenMP Guided Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.62
Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.84
Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.59
Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.68
Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.70
Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.78
Pooled Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.74
Pooled Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.53
Pooled_Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.60
Pooled_Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.50
Pooled Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.53
Pooled Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.64
Pooled Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.68
Pooled Work Seeking Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.81

*** Non Reducing Array Manipulation Tests ***

Sequential Manipulation bits of bit array is Done, Elapsed = 00:00:00.23
Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08
Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06
Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06
OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.08
Pooled Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.07
Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06
Pooled Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.07
Work Seeking Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08
Non Generic Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.10
Non Generic OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Cilk static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.02
Non Generic Pooled Work Sharing Manipulation of bit array is Done, Elapsed = 00:00:00.08
Non Generic Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07
Non Generic Pooled Work Stealing Manipulation of bit array is Done, Elapsed = 00:00:00.07


This is pretty much the same result, even though the chunk size is about 3 times smaller.


Now consider another example: for this one, the sum, for each I in 1 .. N, of
the sum of the integers from 1 to I.

In other words, the sequential loop is:

            for I in 1 .. N loop
               for J in 1 .. I loop
                  Sum := Sum + J;
               end loop;
            end loop;

Here the amount of work per iteration is much higher, and very lop-sided:
the higher the iteration number in the outer loop, the more iterations are
performed in the inner loop (outer iteration I does I inner iterations, so the
total inner-loop count is N*(N+1)/2). One would expect the executors doing the
lower-numbered iterations to complete much earlier than those doing the
higher-numbered ones.


If I run the program with the number of executors = the number of cores, and
the number of chunks = the number of executors, using a much smaller number of
iterations (50_000 iterations), I get:

************* Parallel Framework Test *************
  Physical Processors= 4
  Workers = 4
  Effective_Workers = 4
  Load_Sensitive=FALSE
  Grain_Size= 3125

*** Integer Sum Reduction Tests ***

Sequential Sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:02.99
Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.50
OpenMP Work Static sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.98
OpenMP Dynamic Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.97
OpenMP Guided Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.91
Cilk Work Stealing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.88
Ordered Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.47
Work Seeking sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.09
Ordered Work Seeking sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.10
Work Stealing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.43
Ordered Work Stealing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.36

Here the load balancing strategies and the 3rd party strategies do quite a bit better than my simple
work-sharing library, which does no load balancing. But notice that all the results are "PGP" (Pretty Good Parallelism).

Now let's try the optimal number of executors, which I found to be about 12. (Anything higher doesn't improve the results noticeably.)

************* Parallel Framework Test *************
  Physical Processors= 4
  Workers = 12
  Effective_Workers = 12
  Load_Sensitive=FALSE
  Grain_Size= 1041

*** Integer Sum Reduction Tests ***

Sequential Sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:03.00
Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.95
OpenMP Work Static sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.97
OpenMP Dynamic Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.92
OpenMP Guided Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.91
Cilk Work Stealing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.93
Ordered Work Sharing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.94
Work Seeking sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.92
Ordered Work Seeking sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.94
Work Stealing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:00.82
Ordered Work Stealing sum of 50000 integers in parallel is  20834583350000, Elapsed = 00:00:01.02

All the results are noticeably better, though not dramatically so. Even the
non-load-balancing one had a time comparable to the ones that did load
balancing, and also note that the best time for this run was my Paraffin Work
Stealing library, which beat both the Cilk and OpenMP versions here.

Depending on the problem, one library might have a slight edge over another,
but the important point is that they all come in around the same ballpark, and
all are quite a lot better than the sequential version.

> 
>> > I know Brad tried to provide a function in System.Multiprocessing for this
>> > purpose, but that is missing the very important data about the cost of an
>> > iteration and variance of iteration executions.
>> 
>> I actually don't think that is very important. It might be
>> important in determining
>> whether or not parallelism should be applied, but with
>> explicit parallelism,
>> there is no need to determine that. The programmer decides.
> 
> I'm assuming that the overhead of executing a parallel chunk is rather high
> compared to "typical" code. As such, running too many chunks increases the
> overhead with no appreciable benefit.

But also no noticeable detriment (assuming the number of chunks is only 3-4
times the number of cores); the overhead of iterating over chunks is not where
the time is spent. It is the work that is done during each chunk that matters.
Also, any extra overhead in iterating over a higher number of chunks is likely
offset by better granularity and processor utilization.

****************************************************************

From: Tucker Taft
Sent: Wednesday, March 28, 2018  3:17 PM

[The first part of this mail is relevant to AI12-0251-2, find it there. - ED]

And for AI12-0251-1, I would suggest that rather than only being allowed to
specify an explicit chunk index, you could also just specify a single
expression in the parentheses, which would give an upper bound on the number
of tasklets that may be created to execute the parallel loop.  Hence the
syntax would be either:

   parallel (expression) for ... loop ...  --  specify upper bound on number of tasklets

or:

   parallel (identifier IN discrete_subtype_definition) for ... loop ...  -- give index to each tasklet
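
As a sketch of how the two forms might be used (the array Arr and the
per-chunk counter array Hits are invented for the example):

   --  Form 1: cap the number of tasklets used for the loop at four:
   parallel (4) for I in Arr'Range loop
      Arr (I) := Arr (I) * 2;
   end loop;

   --  Form 2: give each tasklet a chunk index, here used to select a
   --  per-chunk accumulator declared outside the loop:
   parallel (Chunk in 1 .. 8) for I in Arr'Range loop
      Hits (Chunk) := Hits (Chunk) + (if Arr (I) > 0 then 1 else 0);
   end loop;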

***************************************************************

Questions? Ask the ACAA Technical Agent