!standard 5.5(3/3)                                   18-11-16  AI12-0251-1/03
!standard 5.5(5)
!standard 5.5(6/5)
!standard 5.5(9/4)
!class Amendment 18-01-25
!status Amendment 1-2012 18-11-16
!status ARG Approved 5-0-2  18-10-21
!status work item 18-01-25
!status received 18-01-13
!priority Low
!difficulty Medium
!subject Explicit chunk definition for parallel loops

!summary

The iterations of a parallel loop are grouped into "chunks," optionally with a
chunk parameter identifying the chunk associated with a given iteration. This
supports a roll-your-own parallel map/reduce, as well as providing control over
the level of parallelism used to execute a given parallel loop.

!problem

The currently proposed facilities of parallel loops (AI12-0119-1) and reduction
attributes (AI12-0242-1) handle many loop formulations that can be parallelized.
However, existing loops are not easy to convert into the reduction attribute
form, especially if the loop has multiple accumulators. They could be written as
multiple reduction expressions, but that could require recalculating parts of
the expression multiple times, harming performance (and performance is the point
of parallelizing in the first place). They could also be written using a
protected object to model the accumulators, but that would force some
serialization of the loop iterations (again, harming performance).

Another approach is needed.

!proposal

Provide a chunk index to a parallel loop, vaguely analogous to an entry family
index. This is provided right after the keyword "parallel" in the parallel loop
syntax. For instance:

   declare
      Partial_Sum : array (1 .. 2*Num_CPUs) of Integer := (others => 0);
   begin
      parallel (Chunk in Partial_Sum'Range) for I in Arr'Range loop
         declare
            Z : Integer;
         begin
            Z := some_complicated_computation;
            ... -- Complex machinations

            Partial_Sum(Chunk) := @ + Arr(I)**2 * Z;

            ... -- Other stuff that also happens in this
            ... -- very complicated loop ...
         end;
      end loop;

      Sum := Partial_Sum'Reduce("+", 0);
   end;

To merely control the level of parallelism, a simpler version just specifies the
maximum number of logical threads of control to be created to execute the loop:

   parallel (2*Num_CPUs) for I in Arr'Range loop
      A(I) := B(I) + C(I);
   end loop;

Here we don't need the chunk index, but we want to set an upper bound on the
number of logical threads of control that the implementation might devote to the
execution of the various iterations of the loop.

!wording

[Author's note: these wording changes and additions are relative to the
ARG-approved AI12-0119-1/13, dated 2018-08-31.]

Replace 5.5(3/3):

  iteration_scheme ::= while condition
     | for loop_parameter_specification
     | for iterator_specification
     | parallel [(chunk_specification)]
         for loop_parameter_specification

  chunk_specification ::=
      /integer_/simple_expression
    | defining_identifier in discrete_subtype_definition

Add after 5.5(5):

Name Resolution Rules

In a chunk_specification that is an /integer_/simple_expression, the
/integer_/simple_expression is expected to be of any integer type.

Add after 5.5(6/5):

In a chunk_specification that has a discrete_subtype_definition, the
chunk_specification declares a /chunk parameter/ object whose subtype (and
nominal subtype) is that defined by the discrete_subtype_definition.

Modify 5.5(9/4):

For the execution of a loop_statement with the iteration_scheme [being for]
{including a} loop_parameter_specification, the loop_parameter_specification is
first elaborated. This elaboration creates the loop parameter and elaborates
the discrete_subtype_definition.
If the discrete_subtype_definition defines a subtype with a null range, the
execution of the loop_statement is complete. Otherwise, the
sequence_of_statements is executed once for each value of the discrete subtype
defined by the discrete_subtype_definition that satisfies the predicates of the
subtype (or until the loop is left as a consequence of a transfer of control).
Prior to each such iteration, the corresponding value of the discrete subtype is
assigned to the loop parameter. If the reserved word parallel is present (a
/parallel/ loop), [each] {the} iteration{s are partitioned into one or more
/chunks/, each with its own} [is a] separate logical thread of control (see
clause 9), {and} [with] its own copy of the loop parameter; otherwise {a single
logical thread of control performs the loop. Within each logical thread of
control,} the values are assigned to the loop parameter in increasing order
unless the reserved word reverse is present, in which case the values are
assigned in decreasing order.

{If a chunk_specification is present in a parallel loop, it determines the
maximum number of logical threads of control devoted to the parallel loop. The
chunk_specification is elaborated after the loop_parameter_specification is
elaborated, and before any iterations are performed. If the chunk_specification
is an /integer_/simple_expression, this elaboration evaluates the expression,
and the value of the expression determines the maximum number of logical
threads of control devoted to the parallel loop. If a
discrete_subtype_definition is present, this elaboration elaborates the
discrete_subtype_definition, and creates one or more logical threads of
control, each having its own copy of the chunk parameter initialized with a
distinct value from the discrete subtype defined by the
discrete_subtype_definition. Each logical thread of control handles a distinct
subrange of the values of the subtype of the loop parameter such that all
values are covered with no overlaps. If there is a chunk parameter, its
assigned value increases with increasing values of the loop parameter.

When the chunk_specification is elaborated, a check is made that the determined
maximum number of logical threads of control is greater than zero. If this
check fails, Program_Error is raised.

Whether or not a chunk_specification is present in a parallel loop, the total
number of iterations of the loop represents an upper bound on the number of
logical threads of control devoted to the loop.}

Delete the AARM note following 5.5(9/4):

[AARM Implementation note: Each iteration of a parallel loop is a separate
logical thread of control. Normally these logical threads of control will be
grouped into "chunks" for the purpose of execution to reduce the cost of
creating large numbers of (physical) threads of control.]

!discussion

In AI12-0119 we admitted the possibility of chunks, but the model was always
one logical thread of control per iteration. Here we make chunks explicit in
the model, with one logical thread of control per chunk. This is important to
make it clear that all of the iterations associated with a single value of the
chunk parameter are part of a single logical thread of control, so there is no
data race between the iterations of a single chunk. This means that the
iterations of a single chunk can all safely accumulate into an element of an
array indexed by the chunk parameter.

This approach also provides some tuning capabilities for parallel loops, by
letting the user vary the number of chunks.
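As an illustration of this roll-your-own approach (not proposed wording; the
names below are invented for the example), a loop with several accumulators
might keep one record of partial results per chunk and combine them
sequentially afterwards:

   declare
      type Accumulators is record
         Count : Natural    := 0;
         Sum   : Long_Float := 0.0;
      end record;
      Partial : array (1 .. 2*Num_CPUs) of Accumulators;
      Total   : Accumulators;
   begin
      parallel (Chunk in Partial'Range) for I in Data'Range loop
         if Data(I) > 0.0 then
            Partial(Chunk).Count := @ + 1;        -- no data race: one logical
            Partial(Chunk).Sum   := @ + Data(I);  -- thread of control per chunk
         end if;
      end loop;
      for P of Partial loop  -- sequential combining step
         Total.Count := @ + P.Count;
         Total.Sum   := @ + P.Sum;
      end loop;
   end;

The length of the Partial array (that is, the number of chunks) is then the
main tuning knob.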
Smaller numbers of chunks (e.g. one chunk per available processor) may provide the best performance for iterations that take about the same amount of time regardless of inputs; larger numbers of chunks may provide better performance for iterations that take widely varying amounts of time. Within each chunk, the iterations are performed in sequence, and increasing chunk parameter values correspond to increasing loop parameter values, so that reduction using a non-commutative operation can be performed reliably. Note that in the presence of a chunk parameter, it might be reasonable to allow the "reverse" keyword to be present on the loop, which would control the order of the iterations within a given chunk. However, this is likely to cause confusion, because the execution order of the chunks is arbitrary -- there is no promise that the chunks are executed in any particular order relative to one another, forward or reverse. Hence, we don't recommend allowing "reverse" even in the presence of a chunk parameter. --- Perhaps there should be a function somewhere in Ada.Multiprocessing to give an estimate for a number of chunks to use, based on an estimate of the number of iterations, how conditional the iteration code is (the "variance"), which task is involved, and possibly other considerations. End users don't have enough information to know a good place to start. ("2*Num_CPUs" is unlikely to be a good choice for loops with a lot of variance.) !example (See Proposal.) !corrigendum 5.5(3/3) @drepl @xcode<@fa@fa< ::= >@ft<@b> @fa | @ft<@b> @fa | @ft<@b> @fa> @dby @xcode<@fa@fa< ::= >@ft<@b> @fa | @ft<@b> @fa | @ft<@b> @fa | @ft<@b> @fa | @ft<@b> [(@fa)] @ft<@b> @fa> @xcode<@fa@fa< ::= >@ft<@i>@fa | @fa @ft<@b> @fa> !corrigendum 5.5(5) @dinsa If a @fa has a @i@fa, then the @fa shall be repeated after the @b; otherwise, there shall not be an @fa after the @fa. @dinst @s8<@i> In a @fa that is an @i@fa, the @i@fa is expected to be of any integer type. !corrigendum 5.5(6/4) @dinsa A @fa declares a @i, which is an object whose subtype (and nominal subtype) is that defined by the @fa. @dinst In a @fa that has a @fa, the @fa declares a @i object whose subtype (and nominal subtype) is that defined by the @fa. !corrigendum 5.5(9/4) @drepl For the execution of a @fa with the @fa being @b @fa, the @fa is first elaborated. This elaboration creates the loop parameter and elaborates the @fa. If the @fa defines a subtype with a null range, the execution of the @fa is complete. Otherwise, the @fa is executed once for each value of the discrete subtype defined by the @fa that satisfies the predicates of the subtype (or until the loop is left as a consequence of a transfer of control). Prior to each such iteration, the corresponding value of the discrete subtype is assigned to the loop parameter. These values are assigned in increasing order unless the reserved word @b is present, in which case the values are assigned in decreasing order. @dby For the execution of a @fa with the @fa including a @fa, the @fa is first elaborated. This elaboration creates the loop parameter and elaborates the @fa. If the @fa defines a subtype with a null range, the execution of the @fa is complete. Otherwise, the @fa is executed once for each value of the discrete subtype defined by the @fa that satisfies the predicates of the subtype (or until the loop is left as a consequence of a transfer of control). Prior to each such iteration, the corresponding value of the discrete subtype is assigned to the loop parameter. 
If the reserved word parallel is present (a @i), the iterations are partitioned into one or more @i, each with its own separate logical thread of control (see clause 9), and its own copy of the loop parameter; otherwise a single logical thread of control performs the loop. Within each logical thread of control, the values are assigned in increasing order unless the reserved word @b is present, in which case the values are assigned in decreasing order. If a @fa is present in a parallel loop, it determines the maximum number of logical threads of control devoted to the parallel loop. The @fa is elaborated after the @fa is elaborated, and before any iterations are performed. If the @fa is an @i@fa, this elaboration evaluates the expression, and the value of the expression determines the maximum number of logical threads of control devoted to the parallel loop. If a @fa is present, this elaboration elaborates the @fa, and creates one or more logical threads of control, each having its own copy of the chunk parameter initialized with a distinct value from the discrete subtype defined by the @fa. Each logical thread of control handles a distinct subrange of the values of the subtype of the loop parameter such that all values are covered with no overlaps. If there is a chunk parameter, its assigned value increases with increasing values of the loop parameter. When the @fa is elaborated, a check is made that the determined maximum number of logical threads of control is greater than zero. If this check fails, Program_Error is raised. Whether or not a @fa is present in a parallel loop, the total number of iterations of the loop represents an upper bound on the number of logical threads of control devoted to the loop. !ASIS [Not sure. The syntax change might require some additional queries. - Editor.] !ACATS test ACATS B- and C-Tests are needed to check that the new capabilities are supported. !appendix From: Tucker Taft Sent: Saturday, January 13, 2018 11:46 AM Ada 83 was already a pretty good language for what I would call "hybrid" functional/imperative programming, because you can have a sequence of declarations that progressively compute a desired result. Often it was only because you needed to enter a loop that you had to do any significant amount of computing after the "begin." In today's Ada, with conditional expressions, quantified expressions, the recently approved iterated component association, and our newly proposed container aggregates, the need for explicit ifs and loops after the "begin" is continuing to decrease. I don't see problems with this evolution toward providing a balance between functional and imperative features. Clearly there is a resurgence of interest in functional programming these days, but going all the way to "pure functional" programming, with "monads" instead of assignment statements, and recursion only with no iteration, is not something anyone in this discussion is suggesting for Ada. But I do think it is important to "complete" the functional primitives, if only to give the desired power to Ada's contract-based programming. I would claim that a reduction expression is the last remaining major component needed to complete the "iterated" operations. As I tried to show in my examples drawn from the ParaSail compiler and static analyzer, there are many applications of a reduction expression -- it isn't just a "gee-gaw" notation, only used by functional-programming zealots. 
And it isn't just for parallelism -- it is a common sequential programming
paradigm as well, and it will often be the right way to define, in a
postcondition, what is the final value of one output of the subprogram, even
though the subprogram might have multiple outputs and might do many other
things.

Parallelism is a somewhat separate issue in my mind, and where (and whether) it
is best inserted depends heavily on the amount of computing done within a
single iteration, the independence of those computations, and the number of
iterations. The advantage of the notations like quantified expressions,
container aggregates, and reduction expressions, is that they are
self-contained, and can often be analyzed holistically by the compiler to
determine whether and where to insert parallelism. I believe we also need some
explicit notations for parallelism, but potentially even more important in the
context of Ada is to have sufficient annotations to allow the compiler to both
check the safety of explicit parallelism, and also safely insert implicit
parallelism. Hence the importance of the Global and Nonblocking annotations.

As far as explicit parallelism, I now believe the right way to handle
reductions in that context is to provide a chunk index to a parallel loop,
vaguely analogous to an entry family index. With our current parallel syntax, I
would propose that we allow the introduction of a chunk index as follows:

   declare
      Partial_Sum : array (1 .. 2*Num_CPUs) of Integer := (others => 0);
   begin
      parallel (Chunk in Partial_Sum'Range) for I in Arr'Range loop
         declare
            Z : Integer;
         begin
            Z := some_complicated_computation;
            ... -- Complex machinations

            Partial_Sum(Chunk) := @ + Arr(I)**2 * Z;

            ... -- Other stuff that also happens in this
            ... --- very complicated loop ...
         end;
      end loop;

      Sum := Partial_Sum'Reduce("+", 0);
   end;

---

The semantics of the chunk index is that its range determines the limit on the
degree of parallelism of the loop, and provides the index to the body of the
loop. That is the total extent of the feature. There are no magic values, no
use of "<>", etc.

As seen above, we have chosen to declare a "Partial_Sum" array before the
parallel loop whose index range becomes the range of the chunk index. We have
made the partial sum a multiple of the number of CPUs, but the programmer is
free to choose the limit of parallelism any way they want. After the loop we
have "reduced" this array to an overall Sum. Both of these are totally under
control of the user -- there is no compiler magic being performed here. Of
course, the compiler might have some smarts in deciding whether to parallelize
the 'Reduce operation, but that is implicit parallelism, and depends on the
compiler to decide whether it is worth doing.

****************************************************************

From: Brad Moore
Sent: Sunday, January 14, 2018  11:25 PM

> Ada 83 was already a pretty good language for what I would call "hybrid"
> functional/imperative programming, because you can have a sequence of
> declarations that progressively compute a desired result. Often it
> was only because you needed to enter a loop that you had to do any
> significant amount of computing after the "begin." In today's Ada,
> with conditional expressions, quantified expressions, the recently
> approved iterated component association, and our newly proposed
> container aggregates, the need for explicit ifs and loops after the
> "begin" is continuing to decrease.
I don't see problems with this > evolution toward providing a balance between functional and imperative > features. Clearly there is a resurgence of interest in functional > programming these days, but going all the way to "pure functional" > programming, with "monads" instead of assignment statements, and recursion > only with no iteration, is not something anyone in this discussion is suggesting > for Ada. But I do think it is important to "complete" the functional > primitives, if only to give the desired power to Ada's contract-based > programming. I agree. > I would claim that a reduction expression is the last remaining major > component needed to complete the "iterated" operations. As I tried to > show in my examples drawn from the ParaSail compiler and static > analyzer, there are many applications of a reduction expression -- it > isn't just a "gee-gaw" notation, only used by functional-programming > zealots. And it isn't just for parallelism > -- it is a common sequential programming paradigm as well, and it will > often be the right way to define, in a postcondition, what is the > final value of one output of the subprogram, even though the > subprogram might have multiple outputs and might do many other things. I Agree here too. > Parallelism is a somewhat separate issue in my mind, and where (and > whether) it is best inserted depends heavily on the amount of > computing done within a single iteration, the independence of those > computations, and the number of iterations. The advantage of the > notations like quantified expressions, container aggregates, and > reduction expressions, is that they are self-contained, and can often > be analyzed holistically by the compiler to determine whether and > where to insert parallelism. I believe we also need some explicit > notations for parallelism, but potentially even more important in the > context of Ada is to have sufficient annotations to allow the compiler > to both check the safety of explicit parallelism, and also safely insert > implicit parallelism. Hence the importance of the Global and Nonblocking > annotations. So far so good. > As far as explicit parallelism, I now believe the right way to handle > reductions in that context is to provide a chunk index to a parallel > loop, vaguely analogous to an entry family index. With our current > parallel syntax, I would propose that we allow the introduction of a chunk index as follows: > > declare > Partial_Sum : array (1 .. 2*Num_CPUs) of Integer := (others => 0); > begin > parallel (Chunk in Partial_Sum'Range) for I in Arr'Range loop > declare > Z : Integer; > begin > Z := some_complicated_computation; > ... -- Complex machinations > > Partial_Sum(Chunk) := @ + Arr(I)**2 * Z; > > ... -- Other stuff that also happens in this > ... --- very complicated loop ... > end; > end loop; > > Sum := Partial_Sum'Reduce("+", 0); end; I am still digesting this, while it is interesting, I am worried about adding chunking management as one more thing the user has to think about to add parallelism in cases where the user would rather not have to think about this. There are a number of reduction problems where manual/user chunking is needed, and this might be a good solution for that, but if we want to support user/chunking this way, I think a few things are still missing. There are times when the user would want to know more details about the chunk. e.g. What is the first and last index of the chunk. There are a couple other issues I see as well. 
Looking at Randy's monster parallelism example, it involves 6 outputs where each result is a reduction. So, with this syntax, one would need to declare 6 partial result arrays, which is OK, but getting to be a bit of manipulation of the existing code. Then, you need to mention one of these results in the iterator syntax, but there are 6 different result arrays. Would you need to list all 6 of these, perhaps comma separated, or just pick one? I'm guessing you'd want to just pick one, but it feels a bit wrong to not mention them all. What if the arrays are declared with different numbers of elements? I presume you'd want a rule that says that all arrays referenced in the loop with the chunk parameter would be required to have the same number elements with the same indexing. Otherwise this syntax would not make sense. Randy liked the old version of the AI, I believe, because all you needed to do was add the parallel keyword to the loop, then possibly add "parallel" to the declarations of reduction results. That seems quite a bit easier to convert existing loops. It might even be enough to just add "parallel" to the loop, without adding "parallel" to reduction variables, since with the global aspect and non-blocking aspect, the compiler will tell you about any problems with global reduction results. Getting back to Randy's monster example. One thing I didn't try before, was writing the code in the form that could be applied as a single 'Reduce call. I just tried that, and I found that it wasn't too difficult to do, as most of the code is still intact. It took a little bit of time to set this up, but really not that much effort, and I found the results satisfying. I also tested this for performance, and found it comparable to the parallel loop form. For 80_000_000 records, with optimization turned off I got; (The new result is the last one, for a single 'Reduce) Multiple operations in a sequential Loop {Elapsed = 00:00:01.51 Negative_Count= 0 Sample_Sum= 1.99965350799589E+07 Sample_Sqr_Sum= 1.33299985832170E+07 Product_Sum= 1.99965350799589E+07 Diff_Sum= 2.00046739200409E+07 Diff_Sqr_Sum= 1.33381374236478E+07 Multiple parallel loops {Elapsed = 00:00:01.52 Negative_Count= 0 Sample_Sum= 1.99965350797951E+07 Sample_Sqr_Sum= 1.33299985832050E+07 Product_Sum= 1.99965350797952E+07 Diff_Sum= 2.00046739202052E+07 Diff_Sqr_Sum= 1.33381374236355E+07 Single parallel loops {Elapsed = 00:00:00.43 Negative_Count= 0 Sample_Sum= 1.99965350797950E+07 Sample_Sqr_Sum= 1.33299985832051E+07 Product_Sum= 1.99965350797950E+07 Diff_Sum= 2.00046739202050E+07 Diff_Sqr_Sum= 1.33381374236362E+07 Single parallel, Reducer+Combiner {Elapsed = 00:00:00.51 Negative_Count= 0 Sample_Sum= 1.99965350797948E+07 Sample_Sqr_Sum= 1.33299985832051E+07 Product_Sum= 1.99965350797948E+07 Diff_Sum= 2.00046739202052E+07 Diff_Sqr_Sum= 1.33381374236362E+07 A few seconds slower, probably because of an additional procedure call. 
However, when I rerun with optimization turned on, I get;

************* Parallel Framework Test *************
Physical Processors= 4

Multiple operations in a sequential Loop {Elapsed = 00:00:00.65
 Negative_Count= 0
 Sample_Sum= 1.99965350799589E+07
 Sample_Sqr_Sum= 1.33299985832170E+07
 Product_Sum= 1.99965350799589E+07
 Diff_Sum= 2.00046739200409E+07
 Diff_Sqr_Sum= 1.33381374236478E+07

Multiple parallel loops {Elapsed = 00:00:01.19
 Negative_Count= 0
 Sample_Sum= 1.99965350797948E+07
 Sample_Sqr_Sum= 1.33299985832052E+07
 Product_Sum= 1.99965350797952E+07
 Diff_Sum= 2.00046739202051E+07
 Diff_Sqr_Sum= 1.33381374236363E+07

Single parallel loops {Elapsed = 00:00:00.20
 Negative_Count= 0
 Sample_Sum= 1.99965350797952E+07
 Sample_Sqr_Sum= 1.33299985832052E+07
 Product_Sum= 1.99965350797952E+07
 Diff_Sum= 2.00046739202048E+07
 Diff_Sqr_Sum= 1.33381374236363E+07

Single parallel, Reducer+Combiner {Elapsed = 00:00:00.21
 Negative_Count= 0
 Sample_Sum= 1.99965350797952E+07
 Sample_Sqr_Sum= 1.33299985832052E+07
 Product_Sum= 1.99965350797952E+07
 Diff_Sum= 2.00046739202048E+07
 Diff_Sqr_Sum= 1.33381374236362E+07

Which is essentially the same speed (1/100 of a second slower in this run).

It's also interesting to see how the code would look. The user basically needs
to create a composite object. e.g.

type Composite is record
   Negative_Count : Natural;
   Sample_Sum     : Long_Float;
   Sample_Sqr_Sum : Long_Float;
   Product_Sum    : Long_Float;
   Diff_Sum       : Long_Float;
   Diff_Sqr_Sum   : Long_Float;
end record;

-- Reduce is only needed if we want to run Randy's code in parallel
procedure Reduce (L, R : in out Composite)
  with Associative;

procedure Reduce (L, R : in out Composite) is
begin
   L := (Negative_Count => L.Negative_Count + R.Negative_Count,
         Sample_Sum     => L.Sample_Sum + R.Sample_Sum,
         Sample_Sqr_Sum => L.Sample_Sqr_Sum + R.Sample_Sqr_Sum,
         Product_Sum    => L.Product_Sum + R.Product_Sum,
         Diff_Sum       => L.Diff_Sum + R.Diff_Sum,
         Diff_Sqr_Sum   => L.Diff_Sqr_Sum + R.Diff_Sqr_Sum);
end Reduce;

-- A secondary object used for the 'Reduce because the array index is needed
-- to look up data in a secondary array. Normally, this would not be needed
type Item is record
   Index : Natural;
   Score : Long_Float;
end record;

-- The combiner function that is passed into 'Reduce
procedure Include (L : in out Composite; Data : Item)
  with Reducer => Reduce;

-- NOTE: This code looks very much like Randy's original, as it was
-- mostly cut and pasted. However, since we now have a composite record type,
-- it was nice to use a record aggregate here.

-- NOTE ALSO: That Randy's debugging code is still intact. So debugging parallel
-- loops vs debugging 'Reduce attributes seems to be a non-issue.
procedure Include (L : in out Composite; Data : Item) is
   Score : Long_Float renames Data.Score;
   I     : Natural renames Data.Index;
begin
   L := (Negative_Count => L.Negative_Count + (if Score < 0.0 then 1 else 0),
         Sample_Sum     => L.Sample_Sum + Score,
         Sample_Sqr_Sum => L.Sample_Sqr_Sum + Score**2,
         Product_Sum    => L.Product_Sum +
                             Actual_Data(I).Weighted_Worse_Ratio * Score,
         Diff_Sum       => L.Diff_Sum +
                             Actual_Data(I).Weighted_Worse_Ratio - Score,
         Diff_Sqr_Sum   => L.Diff_Sqr_Sum +
                             (Actual_Data(I).Weighted_Worse_Ratio -
                              Sample_Data(I).Score)**2);

   if Result_Debug or else
      (Limited_Calc_Debug and then
         (I <= Debug_Calc_Limit or else
          I >= Actual_Len - Debug_Calc_Limit)) then
      Ada.Text_IO.Put_Line ("Item " & Actual_Data(I).Item_Name &
         "; cnt=" & Natural'Image(Actual_Count));
      Ada.Text_IO.Put("Sample Score="); LFlt_IO.Put
         (Sample_Data(I).Score, 10,6,0);
      Ada.Text_IO.Put("; Actual Score="); LFlt_IO.Put
         (Actual_Data(I).Weighted_Worse_Ratio, 10,6,0);
      Ada.Text_IO.New_Line;
      --Ada.Text_IO.Put("Sample Sum="); LFlt_IO.Put
      --   (Sample_Sum, 10,6,0);
      --Ada.Text_IO.Put("; Sample Sum**2="); LFlt_IO.Put
      --   (Sample_Sqr_Sum, 10,6,0);
      --Ada.Text_IO.New_Line;
      --Ada.Text_IO.Put("Actual Sum="); LFlt_IO.Put
      --   (Actual_Sum, 10,6,0);
      --Ada.Text_IO.Put("; Actual Sum**2="); LFlt_IO.Put
      --   (Actual_Sqr_Sum, 10,6,0);
      --Ada.Text_IO.New_Line;
      --Ada.Text_IO.Put("Product Sum="); LFlt_IO.Put
      --   (Product_Sum, 10,6,0);
      --Ada.Text_IO.Put("; Diff Sum=");
      --LFlt_IO.Put (Diff_Sum, 10,6,0);
      --Ada.Text_IO.Put("; Diff Sum**2="); LFlt_IO.Put
      --   (Diff_Sqr_Sum, 10,6,0);
      --Ada.Text_IO.New_Line;
   -- else no trace
   end if;
end Include;

-- Needed for the 'Reduce attribute.
Identity : constant Composite :=
  (Negative_Count => 0,
   Sample_Sum     => 0.0,
   Sample_Sqr_Sum => 0.0,
   Product_Sum    => 0.0,
   Diff_Sum       => 0.0,
   Diff_Sqr_Sum   => 0.0);

With this in place, Randy's example can then be expressed as;

   Result := (for I of Sample_Data'Range when Sample_Data(I).Check_Item =>
                 Data'(Sample_Data(I).Score, I))
             'Reduce(Include, Identity)

Note, I needed to construct a temporary Data object, since there should only be
two parameters to a combiner function. Also, the when clause meant was nice to
eliminate the if statement just inside the original loop.

So writing Randy's example as a single 'Reduce felt like a good solution. The
loop seems simpler. Maybe we don't need to have reduction support in parallel
loops after all. Or at least we consider leaving it out for now, and adding it
later if needed.

****************************************************************

From: Tucker Taft
Sent: Monday, January 15, 2018  1:03 PM

> There are a couple other issues I see as well. Looking at Randy's
> monster parallelism example, it involves 6 outputs where each result
> is a reduction. So, with this syntax, one would need to declare 6
> partial result arrays, which is OK, but getting to be a bit of manipulation of
> the existing code. Then, you need to mention one of these results in the
> iterator syntax, but there are 6 different result arrays. Would you need to list
> all 6 of these, perhaps comma separated, or just pick one?

Just pick one. Or not mention any of them. They clearly all need to have the
same range, or you are making trouble for yourself.

> I'm guessing you'd want to just pick one, but it feels a bit wrong to not
> mention them all.

As I said, fine to not mention any of them. Define the desired chunk-index
range as a named subtype, and then use that everywhere it is needed.

> What if the arrays are declared with different numbers of elements?
I > presume you'd want a rule that says that all arrays referenced in the > loop with the chunk parameter would be required to have the same number > elements with the same indexing. Otherwise this syntax would not make sense. I think you are making this more complex than necessary. The programmer could choose to use an array of records, indexed by the chunk index. In any case, that is a "programming problem," not something the compiler needs to be involved with, so long as the compiler provides the chunk index. > Randy liked the old version of the AI, I believe, because all you > needed to do was add the parallel keyword to the loop, then possibly add > "parallel" to the declarations of reduction results. That seems quite a bit > easier to convert existing loops. > > It might even be enough to just add "parallel" to the loop, without > adding "parallel" to reduction variables, since with the global aspect and > non-blocking aspect, the compiler will tell you about any problems with global > reduction results. I now believe this is just too much "magic." The difficulty of implicitly connecting the "parallel" arrays to the parallel loop that is being chunked seems to result in too much hidden semantics. It all works nicely in a self-contained expression like the proposed reduction expression, but once you have the declarations and the loop physically separated, reconnecting them is very hard. We have tried lots of different syntaxes to connect the "accumulators" with the loop, and it all feels a bit magical and/or forced. The advantage of just having a well-defined chunk index in the parallel loop is that the connection to the particular accumulators is completely separated from the syntax, allowing the programmer maximum flexibility while minimizing the implementation effort and/or magic. **************************************************************** From: Randy Brukardt Sent: Monday, January 15, 2018 10:20 PM > I now believe this is just too much "magic." The difficulty of > implicitly connecting the "parallel" arrays to the parallel loop that > is being chunked seems to result in too much hidden semantics. It all > works nicely in a self-contained expression like the proposed > reduction expression, but once you have the declarations and the loop > physically separated, reconnecting them is very hard. We have tried > lots of different syntaxes to connect the "accumulators" with the > loop, and it all feels a bit magical and/or forced. > > The advantage of just having a well-defined chunk index in the > parallel loop is that the connection to the particular accumulators is > completely separated from the syntax, allowing the programmer maximum > flexibility while minimizing the implementation effort and/or magic. My primary objection to having an explicit chunk index is that the average programmer won't have a clue as to how to figure out the number of chunks needed (especially in any portable sort of way). The best value depends at least on the amount of code per iteration, the variance in the amount of code per iteration, the number of iterations, and the parallelism overhead (which most likely can be modeled as some fixed amount + some variable amount*the number of chunks). In the general case, even the compiler doesn't have enough information to determine this (the number of iterations can be determined at runtime, and parts of the loop body are likely to be declared in other unit bodies, meaning they might not even be compiled yet). 
(Obviously, some special cases would be easier, but these are much more likely for the 'Reduce attribute than normal loops.) I know Brad tried to provide a function in System.Multiprocessing for this purpose, but that is missing the very important data about the cost of an iteration and variance of iteration executions. The variance matters because if each iteration takes about the same amount of CPU time to execute, then one chunk per executor [CPU] is optimal -- that minimizes overhead. If the variance between iteration times is high (as if most of the execution is conditional), then there is value to more chunks, as that guards against one chunk taking a lot longer than the others -- after all, the loop isn't done until all of the chunks finish. The cost of an iteration matters because if the iteration is short enough, then a single chunk (and no parallelism) is likely best, as that avoids all of the parallelism overhead. Most users would simply end up choosing something as a WAG (Wild-A**ed-Guess) or because they read it in a book -- neither seems very good for performance or portability. The clear advantage of Tucker's proposal is that it is simple enough to keep from overwhelming us, which has to be a criteria here. So I'm cautiously in favor. **************************************************************** From: Randy Brukardt Sent: Monday, January 15, 2018 11:03 PM ... > I would claim that a reduction expression is the last remaining major > component needed to complete the "iterated" > operations. As I tried to show in my examples drawn from the ParaSail > compiler and static analyzer, there are many applications of a > reduction expression -- it isn't just a "gee-gaw" notation, only used > by functional-programming zealots. Perhaps this is still "old guy-think", but to me, iteration is purely an imperative operation. The functional guys had to add a bunch of operations like "summation" to get sensible performance, and then they had to invent "Reduce" in order to prevent a crazy proliferation of such operations. There is nothing "natural" about Reduce! Indeed, no one ever thought about Reduce until someone showed that it is relatively easy to parallelize, and that an interesting class of problems can be mapped (with substantial effort) into Reduce. (Supposedly, the Google empire is built on this insight -- demonstrating that there is nothing obvious about this idea.) The problem is that Reduce is a wildly unnatural operation to an imperative programmer; it's a difficult transition to think about this as some sort of operation. Indeed, while I think the current formulation makes the most sense from an Ada perspective, it still seems like a "gee-gaw" to me -- one may use the operation, but only because it exists. (BTW, I view array slicing, concatenation, and array ordering in this same category -- these operations only really make sense for strings, but an over-generalization makes them available for all array types -- and since they exist, they get used). It's the old "I have a power-screwdriver in my toolkit, so let's make everything use screws". [Modernizing an old adage!]. Yes, you can write a lot of things as a Reduce, but the only reason that you do that is (A) you have it for parallelism, (B) you got used to using it for that purpose, and (C) having a cool new gadget, you used it to do everything. 
That certainly makes sense in Parasail, which was designed to make everything parallel by default, but it makes a lot less sense in Ada (which is mainly an imperative language, and I'd expect it to stay that way). What the Reduce operation really says to me is that we cannot sanely support parallelism without resorting to creeping functionalism. And if true, then we're fighting a losing battle with Ada in any case, and probably ought not waste more time on it. > I would claim that a reduction expression is the last remaining major > component needed ... Sorry, I've heard that one before. Quantified expressions was the only thing really needed for contracts. But to me it looks like an effort for creeping functionalism -- and as you admit, Ada is never going to be good at that (we'd need monads and lambdas and loads of other things that would clobber performance without any real gain). Let's be honest here, and realize that there never will be a "last thing" when it comes to adding more functional features. **************************************************************** From: Randy Brukardt Sent: Tuesday, January 16, 2018 12:36 AM ... > As far as explicit parallelism, I now believe the right way to handle > reductions in that context is to provide a chunk index to a parallel > loop, vaguely analogous to an entry family index. ... Administrative question: Is this a new AI, or something that you want to add to the existing AI12-0119-1 (or the existing AI12-0242-1 on reduction expressions)? I think I'd vote for a new AI, even though that will probably make me more work in the long run -- I'd like to get the most basic stuff finished ASAP, freeing time to be spent on the more advanced ideas without having to worry if we finish our critical features in time. I'd like to know soon, so I can know where to file this thread before the upcoming phone^H^H^H^H^Hgoogle Hangups meeting. **************************************************************** From: Brad Moore Sent: Tuesday, January 23, 2018 9:54 AM > ... >> >> The advantage of just having a well-defined chunk index in the >> parallel loop is that the connection to the particular accumulators >> is completely separated from the syntax, allowing the programmer >> maximum flexibility while minimizing the implementation effort and/or >> magic. > > My primary objection to having an explicit chunk index is that the > average programmer won't have a clue as to how to figure out the > number of chunks needed (especially in any portable sort of way). The > best value depends at least on the amount of code per iteration, the > variance in the amount of code per iteration, the number of > iterations, and the parallelism overhead (which most likely can be > modeled as some fixed amount + some variable amount*the number of chunks). I found with Paraffin that a good determination for the number of chunks can be made just based on the number of iterations. All my libraries have optional inputs to provide this, and tuning can sometimes give a bit of an edge in performance, but I found it best to not bother with this, and leave the determination of chunk number to the default calculation based on iterations. The amount of code per iteration, and parallelism overhead are factors for determining whether the parallelism is worthwhile or not, but that generally is independent from determining a good chunk size value. 
Variance of code per iteration can be important, but the chunk calculation I use generally gives you small enough granularity that it works well with these sorts of problems as well. The calculation I came up with is; Natural'Max (1, Natural'Min (Iterations / Number_Of_Executors / 8, 10_000_000)); This is why I was thinking that a library routine to return the number of chunks, given a number of iterations could be sufficient, rather than building syntax around this. Its a trivial enough library routine, that we could consider leaving it to 3rd party library writers to provide this, though a standard library routine would give the implementation some input, and might tie in better with the target CPU, or parallelism implementaiton, and provide a more complete solution for parallelism. Or the other alternative is something like Tucker proposed where we provide new iteration syntax to iterate over chunks. > In the general case, even the compiler doesn't have enough information to > determine this (the number of iterations can be determined at runtime, and > parts of the loop body are likely to be declared in other unit bodies, > meaning they might not even be compiled yet). You generally want to apply the parallelism at the highest level you can. (In the outermost loop). Applying parallelism overhead inside nested loops gets trickier, because that overhead is multiplied by the outer loop. Given that a good chunk number can be determined based only on number of iterations, determining this at runtime should be pretty easy to do. > I know Brad tried to provide a function in System.Multiprocessing for this > purpose, but that is missing the very important data about the cost of an > iteration and variance of iteration executions. I actually dont think that is very important. It might be important in determining whether or not parallelism should be applied, but with explicit parallelism, there is no need to determine that. The programmer decides. > The variance matters because if each iteration takes about the same amount > of CPU time to execute, then one chunk per executor [CPU] is optimal -- that > minimizes overhead. If the variance between iteration times is high (as if > most of the execution is conditional), then there is value to more chunks, > as that guards against one chunk taking a lot longer than the others -- > after all, the loop isn't done until all of the chunks finish. A simple calculation like the one I have above is good because regardless of the amount of code per iteration, the number of chunks is small enough to not be a noticeable amount of overhead to iterate through, yet generates small enough chunks to allow enough granularity to account for variances between iteration times. > The cost of an iteration matters because if the iteration is short enough, > then a single chunk (and no parallelism) is likely best, as that avoids all > of the parallelism overhead. Yes, but again, explicit parallelism lets the user decide. If the compiler cant determine this for itself for implicit parallelism, it likely will default to sequential anyways. > Most users would simply end up choosing something as a WAG > (Wild-A**ed-Guess) or because they read it in a book -- neither seems very > good for performance or portability. Which is why I was suggesting a library routine to calculate this. 
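[Editor's note: purely for illustration, the calculation described above could
be packaged as a small library function roughly along the following lines; the
function name and profile here are invented, and this is not proposed wording.]

   with System.Multiprocessors;

   function Suggested_Chunk_Count (Iterations : Natural) return Positive is
      Executors : constant Positive :=
        Positive (System.Multiprocessors.Number_Of_CPUs);
   begin
      --  Brad's formula: at least 1, at most 10_000_000, otherwise roughly
      --  an eighth of an even per-executor split of the iterations.
      return Positive'Max
        (1, Natural'Min (Iterations / Executors / 8, 10_000_000));
   end Suggested_Chunk_Count;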
**************************************************************** From: Tucker Taft Sent: Tuesday, January 23, 2018 11:51 AM You seem to only be considering cases where you know the number of iterations in advance, and the chunks each end up with an identical number of iterations. I am not sure either of those assumptions are going to be true for container iterators, or other more general iterators, **************************************************************** From: Edmond Schonberg Sent: Tuesday, January 23, 2018 12:33 PM From what I can gather from the literature on existing multicore programming tools, portability and performance are essentially incompatible; if you want performance you need to tailor the program to the target (no. of threads, processors, etc.) and none of this can be done without good profiling tools. At the programming level therefore you need to be able to specify explicitly the chunk value, snd/or whatever other parameters are target-specific. Otherwise the parallel constructs being proposed will not provide the performance improvement a user may expect. The compiler will certainly not be able to choose the right values any better. So there must be some syntactic way of describing those parameters. I guess pragmas are the easiest language tools at hand... **************************************************************** From: Randy Brukardt Sent: Tuesday, January 23, 2018 4:09 PM ... > > My primary objection to having an explicit chunk index is that the > > average programmer won't have a clue as to how to figure out the > > number of chunks needed (especially in any portable sort of way). > > The best value depends at least on the amount of code per iteration, > > the variance in the amount of code per iteration, the number of > > iterations, and the parallelism overhead (which most likely can be > > modeled as some fixed amount + some variable amount*the number of chunks). > > I found with Paraffin that a good determination for the number of > chunks can be made just based on the number of iterations. All my > libraries have optional inputs to provide this, and tuning can > sometimes give a bit of an edge in performance, but I found it best to > not bother with this, and leave the determination of chunk number to > the default calculation based on iterations. For portability, we're really looking for PGP (Pretty Good Performance) rather than perfection. The best possible performance almost certainly will requiring tuning to the target machine (not just the processor, but the entire hardware setup), and that's clearly not portable at all. > The amount of code per iteration, and parallelism overhead are factors > for determining whether the parallelism is worthwhile or not, but that > generally is independent from determining a good chunk size value. > > Variance of code per iteration can be important, but the chunk > calculation I use generally gives you small enough granularity that it > works well with these sorts of problems as well. > > The calculation I came up with is; > > Natural'Max > (1, > Natural'Min > (Iterations / Number_Of_Executors / 8, > 10_000_000)); I couldn't figure out whether this meant (Iterations / Number_Of_Executors) / 8 or Iterations / (Number_Of_Executors / 8) -- division is not associative. I suspect that it means (Iterations / (Number_of_Executors * 8)). But it seems far too large either way -- you're going to be doing an awful lot of stopping and starting of chunks (which ought to be rather expensive). 
I'm very surprised that this would even provide PGP, unless of course the amount of code in each iteration is already very large. [In that case, loop overhead would not be significant.] But that's not the common case for loops -- at least not my loops. In any event, I suspect that you are looking at typical code that you write, which hardly is representative of the entire universe of code (which includes things that you or I would never write). And perhaps the behavior of various compilers differs a lot for Ada tasking. It's clear, for instance, that if the iteration variance is essentially zero (no conditionals in the loop), that the number of chunks should equal the number of executors (that minimizes overhead). That is a common enough case to include it in the calculations. It's also clear that if the amount of code in an iteration is relatively small that the cost of looping overhead is significant (which is true of many of the loops I write), then the number of chunks needs to be minimized. But maybe you're right, in that it isn't worth parallelizing code unless it is lengthy enough that it must have a significant variance and substantial size. In that case, your formula here makes as much sense anything else. > This is why I was thinking that a library routine to return the number > of chunks, given a number of iterations could be sufficient, rather > than building syntax around this. I wasn't looking for syntax! I was looking for an answer to give PGP for portable code, and to provide a first cut. > Its a trivial enough library routine, that we could consider leaving > it to 3rd party library writers to provide this, though a standard > library routine would give the implementation some input, and might > tie in better with the target CPU, or parallelism implementaiton, and > provide a more complete solution for parallelism. ??? I'm not talking about a "general solution for parallelism"; I didn't even expect the compiler to use whatever routine. (It surely can do better if it can estimate the code size and variance.) But I thought that we would at least need to have some way to determine how many executors are really available (if some other task is also executing a parallel loop or block, it may make sense to take that into account as well) [thus the Task_Id needs to be passed into such a routine], and probably some indication of the variance (no variance is common enough to special-case it). > Or the other alternative is something like Tucker proposed where we > provide new iteration syntax to iterate over chunks. That's the ONLY thing I am talking about here!!! His proposal required determining the number of chunks to use, and that needs some portable solution for PGP. Tuning is a separate topic (although I would hope that his proposal would provide about 98% of what is needed). > > In the general case, even the compiler doesn't have enough information to > > determine this (the number of iterations can be determined at runtime, and > > parts of the loop body are likely to be declared in other unit > > bodies, meaning they might not even be compiled yet). > > You generally want to apply the parallelism at the highest level you can. > (In the outermost loop). Applying parallelism overhead inside nested loops gets > trickier, because that overhead is multiplied by the outer loop. Who said anything about "inside nested loops"?? I said that the compiler can't determine this information because it is in other units, possibly not even compiled yet. 
> Given that a good chunk number can be determined based only on number > of iterations, determining this at runtime should be pretty easy to > do. This is your unsupported assertion, one that no one else appears to believe. > > I know Brad tried to provide a function in System.Multiprocessing for this > > purpose, but that is missing the very important data about the cost of an > > iteration and variance of iteration executions. > > I actually dont think that is very important. It might be important in > determining whether or not parallelism should be applied, but with > explicit parallelism, there is no need to determine that. The > programmer decides. I'm assuming that the overhead of executing a parallel chunk is rather high compared to "typical" code. As such, running too many chunks increases the overhead with no appreciable benefit. If the iteration code is very large, this doesn't apply, and that seems to be the sort of cases that you tried with Parafin. That makes sense, because using a library like Parafin isn't that easy to set up, and you'll only do it if you're certain that there will be a benefit. But parallel loops will be relatively easy to apply (just add "parallel" and assuming Tucker's explicit chunk mechanism, change any accumlators into arrays); the code in the loop will often be a lot smaller. (Even easier if we're talking about Parallel_Reduce or the raw parallel, but in that case there is no need to talk about the user determining the number of chunks.) > > The variance matters because if each iteration takes about the same amount > > of CPU time to execute, then one chunk per executor [CPU] is optimal -- that > > minimizes overhead. If the variance between iteration times is high (as if > > most of the execution is conditional), then there is value to more chunks, > > as that guards against one chunk taking a lot longer than the others > > -- after all, the loop isn't done until all of the chunks finish. > > A simple calculation like the one I have above is good because regardless of > the amount of code per iteration, the number of chunks is small enough to not > be a noticeable amount of overhead to iterate through, yet generates small enough > chunks to allow enough granularity to account for variances between > iteration times. Your calculation seems to yield 312 chunks for 10000 iterations on a 4 core machine. That seems like way too many, I wouldn't use more than 16 in any circumstance (the variance of real code isn't that wide). You're getting way too much overhead; you probably can't tell that if the iterations are rather expensive, but more typical code you'll be eaten up by overhead. ... > > Most users would simply end up choosing something as a WAG > > (Wild-A**ed-Guess) or because they read it in a book -- neither seems very > > good for performance or portability. > > Which is why I was suggesting a library routine to calculate this. I agree, but I want the routine to take more information so it can make a much more useful guess. Just taking the number of iterations only will work if the iteration is quite expensive and the variance is high -- probably a reasonably common situation, and perhaps the best one can do if nothing else is available. (Tucker also points out that the number of iterations isn't always determinable, although it usually is.) But there user knows if their code contains conditionals or not (I'd just make that a Boolean, as the user cannot make any useful estimate beyond that -- I don't even know a good way to characterize it). 
And some estimate for the code size for an iteration, perhaps in terms of MOPs (memory operations). [Just Small-Medium-Large amounts of code probably is enough, but that might be too weird.] The implementation will have an estimate for overhead, so that doesn't need to be provided. **************************************************************** From: Randy Brukardt Sent: Tuesday, January 23, 2018 4:16 PM > You seem to only be considering cases where you know the number of > iterations in advance, and the chunks each end up with an identical > number of iterations. I am not sure either of those assumptions are > going to be true for container iterators, or other more general > iterators, I'm sort of with Brad on this one. Containers have the Length function, so one can tell the maximum number of iterations rather easily. Exits are non-deterministic in any case, so it's pretty much impossible to trust anything about them; best to forget that. And the containers have to return roughly equal-sized chunks. (Exactly equal doesn't even mean anything unless the loop body has no variance [conditionals] -- normal iterations will take different amounts of time to run.) Otherwise, no estimates of any sort can be usefully made. We're trying to determine portable PGP (Pretty Good Performance) here, anyway; optimal performance clearly requires tuning that isn't relevant to my concern (which is how to determine the number of chunks to use in portable code -- picking random numbers is not the answer!). **************************************************************** From: Randy Brukardt Sent: Tuesday, January 23, 2018 4:22 PM ... > From what I can gather from the literature on existing multicore > programming tools, portability and performance are essentially > incompatible; if you want performance you need to tailor the program > to the target (no. of threads, processors, > etc.) and none of this can be done without good profiling tools. We're not looking for ultimate performance from these, but rather "Pretty Good Performance". And note that the Ada runtime has the information about be processors and the implementation to at least make a starting point. That should be possible portably. At some future point, we're going to talk about further tuning parameters (although I think Tucker's explicit chunk stuff provides the needed hooks for users to tweak). ... > At the programming level therefore you need to be able to specify > explicitly the chunk value, snd/or whatever other parameters are > target-specific. Otherwise the parallel constructs being proposed will > not provide the performance improvement a user may expect. The > compiler will certainly not be able to choose the right values any > better. > So there must be some syntactic way of describing those parameters. I > guess pragmas are the easiest language tools at hand... Tuning parameters are a separate, future AI. The current discussion (at least, what *I* think the current discussion is about) is the parameters to use for portable code, in particular using explicit chunk indexes. Users shouldn't have to guess a starting point for portable code, nor for tuning. I suspect that people will try to parallelize a lot of stuff that doesn't execute long enough to really benefit (especially on desktop targets). I don't think there is anything we can do at the language level (and not much at the implementer level, either, other than education) to give them the performance they want as opposed to what makes sense. 
****************************************************************

From: Edmond Schonberg
Sent: Tuesday, January 23, 2018 7:42 PM

It depends who you are talking to. We have customers that want to use Ada on
GPUs, that have compute-intensive applications (fluid dynamics, financial
computations) that are currently using other languages for this, and for whom
PGP will be a total disappointment. And they don't expect their code to run on
laptops! If, as was said repeatedly, parallelism is all about performance,
then PGP is beside the point: if you don't provide tuning mechanisms you might
as well run sequentially (which is certainly portable :-)!

****************************************************************

From: Tucker Taft
Sent: Wednesday, January 24, 2018 2:45 AM

>> ...
>>
> It depends who you are talking to. We have customers that want to use
> Ada on GPUs, that have compute-intensive applications (fluid
> dynamics, financial
> computations) that are currently using other languages for this, and
> for whom PGP will be a total disappointment. And they don't expect
> their code to run on laptops! If, as was said repeatedly, parallelism
> is all about performance, then PGP is beside the point: if you don't
> provide tuning mechanisms you might as well run sequentially (which
> is certainly portable :-)!

It is all about performance, but there is a long-term and a short-term. My
favorite example is the C programming language, which allowed users to specify
which variables they wanted the compiler to keep in a register. I believe you
can still say "register int i;" but the only effect now is that a "register"
declaration means that you can no longer use the "&" operator with the
variable. The compiler is much better at register allocation than are
programmers. Today, it is likely that some programmers are better than the
compiler at doing "CPU" allocation, but that will continue to shift over time.
This is one reason I am against producing parallel versions of self-contained
things like quantified expressions or reduction expressions.

On the other hand, at a high-level loop level, the compiler is probably not as
smart as some programmers, but probably is already smarter than many
programmers, as far as CPU allocation. So we should have reasonable defaults
for the "average" programmer, but still accommodate the more sophisticated
programmers who are still "smarter" than the compiler w.r.t. CPU allocation.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 24, 2018 2:47 PM

...

> We have customers that
> want to use Ada on GPUs, that have compute-intensive applications
> (fluid dynamics, financial
> computations) that are currently using other languages for this, and
> for whom PGP will be a total disappointment. And they don't expect
> their code to run on laptops!

Of course there will be people that expect 8x performance gains on a quad
core. No help there. And surely there are people that will need to tune their
code to absolute performance -- but that's not the point of these features.
The point is safe, portable parallelism; by definition, that isn't going to be
all things to all people. (There's a lot of things I want that I'm never going
to get, after all.)

And quite likely they are going to want to run some version of their code on a
laptop -- almost all developers do initial testing on such machines before
running it in the final hardware environment. The smaller the change is
between environments, the better.
In any case, this particular thread was about Tucker's proposed way of writing
traditional loops that have some sort of reduction (which is essentially all
of them). Converting a sequential loop with multiple outputs into reduction
form requires turning the loop inside out and either introducing additional
(tuple) types and operations, or converting into multiple reductions which may
have to repeat expensive calculations. Neither seems ideal. It appears that
Tucker's solution also provides at least some tuning parameters, which is an
added bonus (but not the primary goal).

We do need to discuss tuning features at some point, but if we try to do
everything at once it is guaranteed that we will finish nothing on time. (Of
course, if Tucker keeps updating AI12-0240-1 rather than his critical path AIs
like AI12-0079-1, nothing will get done anyway. ;-)

****************************************************************

From: Edmond Schonberg
Sent: Wednesday, January 24, 2018 3:28 PM

> Of course there will be people that expect 8x performance gains on a
> quad core. No help there. And surely there are people that will need
> to tune their code to absolute performance -- but that's not the point
> of these features. The point is safe, portable parallelism; by
> definition, that isn't going to be all things to all people. (There's a
> lot of things I want that I'm never going to get, after all.)

My suspicion is that "safe, portable parallelism" will not be of interest to
anyone because it will not deliver on the implicit promise that it will
improve performance.

> And quite likely they are going to want to run some version of their
> code on a laptop -- almost all developers do initial testing on such
> machines before running it in the final hardware environment. The
> smaller the change is between environments, the better.

In that case what about zero changes, no new keywords, and leave the tuning to
TBD pragmas that can be target-specific?

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 24, 2018 4:26 PM

> > Of course there will be people that expect 8x performance gains on a
> > quad core. No help there. And surely there are people that will need
> > to tune their code to absolute performance -- but that's not the
> > point of these features. The point is safe, portable parallelism; by
> > definition, that isn't going to be all things to all people. (There's
> > a lot of things I want that I'm never going to get, after all.)
>
> My suspicion is that "safe, portable parallelism" will not be of
> interest to anyone because it will not deliver on the implicit promise
> that it will improve performance.

Well, I don't expect parallelism to improve performance (tuning or not) for
most code on most targets. I can imagine an environment where it would (a bare
dedicated machine in an environment where power consumption was not an issue),
but that is not a practical target.

But it's clear that there are cases, not that common, where parallelization
will help massively: extreme numbers of iterations, very expensive iterations,
and certain patterns that can be mapped to special hardware. The first two
cases can be dealt with portably, and that's the point here.

> > And quite likely they are going to want to run some version of their
> > code on a laptop -- almost all developers do initial testing on such
> > machines before running it in the final hardware environment. The
> > smaller the change is between environments, the better.
>
> In that case what about zero changes, no new keywords, and leave the
> tuning to TBD pragmas that can be target-specific?

If the postulate is that Ada 2012 is good enough (using something like
Paraffin), then I see no point in tuning pragmas, because one can tune
everything sensible via the parameters to the library (the number of
executors, number of chunks, and so on). About the only thing that would make
sense is selecting the processor in a heterogeneous environment (but Ada has
never made any effort to support such environments, so anything done would
necessarily be implementation-defined).

But one needs keywords and the like to get safety. Ada has always been about
safety more than extreme performance, but you are completely on your own when
writing tasks. The compiler gives absolutely no help in preventing deadlocks
or race conditions; as a consequence, every program I've ever written using
Ada tasks is a deadlock waiting to happen.

If writing safe parallel code is impossible in Ada, then Ada literally has no
future. Because clearly there will be new languages and techniques that do
provide such safety, and there will be no reason to use languages that don't
do that. Moreover, one will want to add parallel execution to existing code
that wasn't written with parallel execution in mind. If that isn't possible
safely, and a full rewrite is needed, why wouldn't one find a newer language
that does promise safety?

I, for instance, would like to have a loop something like:

   parallel for S of Current_Unit_Subprograms loop
      Optimize_Subprogram (S.Start, S.Finish);
   end loop;

Eliminating all race conditions from Optimize_Subprogram would certainly
require compiler help to point out issues; there is no way to look at 20,000
lines of code and find everything. And I surely would want to know if there is
some lingering memory usage that needs protection.

Concluding a long-winded message: these features are 100% about adding safety
to parallel programming in Ada. There is no need to add features to add
performance to existing code (those necessarily would be target-specific, so
those are implementation-defined if they exist at all). There might be a need
for performance tuning for these new features, but I at least want to finish
the basic definition before worrying too much about tuning. (If almost no
existing code can be converted to safe parallel loops, tuning is irrelevant.)

****************************************************************

From: Tucker Taft
Sent: Wednesday, January 24, 2018 3:04 PM

> ... (Of course, if Tucker keeps updating AI12-0240-1 rather than his
> critical path AIs like AI12-0079-1, nothing will get done anyway. ;-)

My own feeling was that phone meetings were not particularly good for debating
the fine points of detailed wording, but rather would be most useful to
evaluate new proposals. So that is why I have focused on new proposals. I'm
not worried about us getting the last bits of wording right for AIs where the
intent is already agreed upon. I am more worried about AIs where we might not
agree yet on the intent.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, January 24, 2018 3:48 PM

> My own feeling was that phone meetings were not particularly good for
> debating the fine points of detailed wording, but rather would be most
> useful to evaluate new proposals. So that is why I have focused on
> new proposals.
> I'm not worried about us getting the last bits of
> wording right for AIs where the intent is already agreed upon. I am
> more worried about AIs where we might not agree yet on the intent.

I rather disagree; phone meetings are best for wordsmithing, since it isn't
necessary to diagram on a whiteboard like it is for syntax arguments. For Ada
2012, we used phone meetings to finalize almost all of the wording, so it's
amazing to me to think that it suddenly has gotten difficult to do what we've
done many times before.

In any case, AI12-0079-1 is nowhere near complete enough for me to be able to
annotate the containers with them, a job I need to start ASAP. For instance,
we have no solution in the AI yet to the problem of implementation-defined
helper packages. We haven't discussed how to handle atomic writes in
"synchronized" mode (these are easily race conditions which shouldn't be
allowed in general, lest our goal of preventing unsynchronized access be
completely undermined). There are probably other issues that aren't mentioned
in the AI. Moreover, you haven't really done much of an update for the last
couple of meetings. It seems like it isn't progressing, and it is the most
core thing we are working on.

I was expecting that we would spend a bit of time on the 4 BIs (none of which
should require much discussion, but maybe some wordsmithing), a bit of time on
some of the new proposals (to determine whether or not we want to pursue them,
and find a victim, er, ARG author for the AI), and the bulk of the time on
AI12-0079-1 and AI12-0119-1 in the hopes of getting some of our core work
done. [Remember that AI12-0119-1 depends on AI12-0079-1.] (We can't finish all
of the AIs for Ada 2020 in the last week, as I simply couldn't get them all
into the Standard in the two weeks or so of editorial time the schedule
allows.) And then we would look at anything else as time permits.

****************************************************************

From: Brad Moore
Sent: Saturday, January 27, 2018 10:20 AM

> From what I can gather from the literature on existing multicore
> programming tools, portability and performance are essentially
> incompatible; if you want performance you need to tailor the program
> to the target (no. of threads, processors, etc.) and none of this can
> be done without good profiling tools.

In general, I think profiling tools can be helpful for pointing out places in
the program where one should look for Parallelism Opportunities (POPs). In
many cases, this will be obvious, for instance if a lengthy number-crunching
routine is being called. The best place to look for POPs is to start at the
top (the top-level code of the routine doing the number crunching, for
example). The higher up in the code the parallelism is applied, the better the
chance that the parallelism overhead will be overcome and be worthwhile.

...

> At the programming level therefore you need to be able to specify
> explicitly the chunk value, and/or whatever other parameters are
> target-specific. Otherwise the parallel constructs being proposed will
> not provide the performance improvement a user may expect.

I think the situation where the programmer is disappointed with the
performance result is mostly the case where the amount of work to be performed
was not enough to overcome the parallelism overhead in the first place.
Attempting to apply tuning to that situation is likely not going to change the
outcome.

It is relatively easy for the programmer to determine if the parallelism
helped or not.
I typically just calculate the amount of elapsed time from before and after
the POP, print it out in a debugging statement, and compare against when I
turn off the parallelism. Otherwise, I use no special profiling tools.

Given a situation where the parallelism is actually worthwhile, using a
reasonable default for chunk size will probably give better than pretty good
results -- good or better. I gave a formula for this in a previous email, but
that is actually the formula I use for minimum grain size, which applies to
some of my libraries that do load balancing. All my library routines start off
with the same chunk size, using a different calculation based on the number of
available executors, but the load-balancing ones will attempt to steal/seek
work from other executors, of a size no smaller than the grain size in
iterations, when they complete the work that has been assigned to the
executor.

I always set the chunk size based on the number of executors (threads), and
divide the iterations evenly amongst the executors. A good default for the
number of executors is the number of available cores, but I find that using a
number 3 to 4 times the number of cores gives optimal performance, though the
difference between optimal and 1 * number of cores is in the land of
diminishing returns.

...

> The compiler will certainly not be able to choose the right values any better.
> So there must be some syntactic way of describing those parameters. I
> guess pragmas are the easiest language tools at hand...

Setting the number of executors probably is a good candidate for a
configuration pragma, but also allowing it to be overridden in a unit or
subprogram. In most cases, setting it once as a configuration parameter is
enough, or relying on a system default (number of available CPUs * some low
multiple) if the pragma is not specified.

****************************************************************

From: Brad Moore
Sent: Saturday, January 27, 2018 10:46 AM

> It is all about performance, but there is a long-term and a
> short-term. My favorite example is the C programming language, which
> allowed users to specify which variables they wanted the compiler to
> keep in a register. I believe you can still say "register int i;" but the
> only effect now is that a "register" declaration means that you can no
> longer use the "&" operator with the variable. The compiler is much better
> at register allocation than are programmers.

I believe there is still value in C's register keyword though. Since you
cannot use the "&" operator, the compiler knows that the variable cannot be
aliased, which could allow the compiler to apply further optimisations that it
might not otherwise be able to apply, in theory. It also documents the
programmer's intent that the code that utilises that variable needs to be
optimised for performance.

We are just now talking about adding some related features to Ada, such as in
AI12-0240, where we are attempting to prevent unknown aliasing of names, as
well as optimization features such as parallelism.

****************************************************************

From: Brad Moore
Sent: Saturday, January 27, 2018 12:24 PM

...

>> The calculation I came up with is;
>>
>>    Natural'Max
>>      (1,
>>       Natural'Min
>>         (Iterations / Number_Of_Executors / 8,
>>          10_000_000));
>
> I couldn't figure out whether this meant (Iterations / Number_Of_Executors)
> / 8 or Iterations / (Number_Of_Executors / 8) -- division is not
> associative.
> I suspect that it means (Iterations / (Number_Of_Executors * 8)). But it
> seems far too large either way -- you're going to be doing an awful lot of
> stopping and starting of chunks (which ought to be rather expensive). I'm
> very surprised that this would even provide PGP, unless of course the amount
> of code in each iteration is already very large. [In that case, loop
> overhead would not be significant.] But that's not the common case for
> loops -- at least not my loops.

Sorry, but this is the calculation I use for determining the minimum grain
size, where a grain is the smallest number of iterations that can be stolen
from another executor, for my library routines that apply load balancing. A
chunk might get dynamically broken down into smaller chunks, but no smaller
than a grain.

For determining chunk size, I use the following calculation, which is also
based on the number of available executors and the number of iterations.

   Number_Of_Workers := Positive'Min
     (Iterations,
      Positive (Available_Executors) +
        (Iterations mod Positive (Available_Executors)));

   Chunk_Size := Iterations / Number_Of_Workers;

Note: The number of available executors does not mean the number of available
CPUs. It means the number of available worker threads, which typically is a
small multiple of the number of available CPUs. (The multiple I use is 1, 2,
or 3, typically.) Using 1 is a good default; using 3 typically gives optimal
performance, but not significantly better than using the value 1.

> In any event, I suspect that you are looking at typical code that you write,
> which hardly is representative of the entire universe of code (which
> includes things that you or I would never write). And perhaps the behavior
> of various compilers differs a lot for Ada tasking.

I've tried this with many different problems: those with even work loads,
uneven work loads, large iteration counts, and small iteration counts; 3
different Ada compiler vendors; different OS's (Windows, Linux, Raspbian (a
Linux variant), Android (a Linux variant), Mac OS); different targets: Intel,
AMD, quad core, dual core, 8 core, Raspberry Pi, my Android cell phone; and
compared against different library strategies: my own dynamic load balancing
(work seeking, work stealing) and static non-load-balancing. I've compared
against libraries written with OpenMP and Cilk, using my own bindings to these
platforms, as well as writing programs in C and C++ calling OpenMP and Cilk
directly. I always get good results for problems that are worthwhile
parallelizing, using the same calculation for chunk size. So I feel pretty
confident these results can be extended to other environments I haven't tried
yet.

> It's clear, for instance, that if the iteration variance is essentially zero
> (no conditionals in the loop), that the number of chunks should equal the
> number of executors (that minimizes overhead). That is a common enough case
> to include it in the calculations.

The number of chunks is always equal to the number of executors, in my
libraries and in my use of the other platforms. The real question is how to
determine the number of executors, which is typically a small multiple of the
number of available cores. Even for the case where the number of executors =
the number of available cores, it is typical that all worker threads do not
progress at exactly the same rate, and some executors will finish early, not
fully utilizing all the cores. Using a higher number of executors allows
better granularity to help ensure that the cores have better utilization.
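[For illustration only: the chunk-size calculation above, written out as a
compilable fragment with assumed values for the iteration and executor counts;
Available_Executors is declared directly as Positive here, so the type
conversions in the original are unnecessary.

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Chunk_Size_Example is
      Iterations          : constant Positive := 10_000;
      Available_Executors : constant Positive := 4;

      Number_Of_Workers : constant Positive :=
        Positive'Min
          (Iterations,
           Available_Executors + (Iterations mod Available_Executors));

      Chunk_Size : constant Positive := Iterations / Number_Of_Workers;
   begin
      --  With these inputs: 4 workers, each given a chunk of 2_500 iterations.
      Put_Line ("Workers   :" & Positive'Image (Number_Of_Workers));
      Put_Line ("Chunk size:" & Positive'Image (Chunk_Size));
   end Chunk_Size_Example;
]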
> It's also clear that if the amount of code in an iteration is relatively > small that the cost of looping overhead is significant (which is true of > many of the loops I write), then the number of chunks needs to be minimized. > > But maybe you're right, in that it isn't worth parallelizing code unless it > is lengthy enough that it must have a significant variance and substantial > size. In that case, your formula here makes as much sense anything else. For about the smallest amount of code per iteration you can imagine, say adding integers, I find roughly you need about a million iterations to hit the break even point for overcoming the parallelism overhead. As the amount of code per iteration increases, the break even point drops, as you'd expect. > >> Given that a good chunk number can be determined based only >> on number of iterations, determining this at runtime should >> be pretty easy to do. > > This is your unsupported assertion, one that no one else appears to believe. Perhaps the assertion can be supported with a couple of examples. First consider first a loop to simply add integers from 1 to N. Using the default calculation, where the number of executors = number of cores, and chunks = number of executors / number of cores For 1_000_000_000 iterations, on 64 bit integers and 8 byte floating point, I get the following results using various libraries with GNAT on Linux AMD Quadcore. ************* Parallel Framework Test ************* Physical Processors= 4 Workers = 4 Effective_Workers = 4 Load_Sensitive=FALSE Grain_Size= 10000 *** Integer Sum Reduction Tests *** Sequential Sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:02.28 Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.69 OpenMP Work Static sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.64 OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.66 Cilk Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:01.07 Ordered Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.70 Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.70 Ordered Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.62 Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.66 Ordered Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.65 Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.65 OpenMP Static Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.66 OpenMP Dynamic Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.67 OpenMP Guided Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.67 Pooled Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.84 Pooled Work Sharing Ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.72 Unbounded Pooled Work Sharing sum 
of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 Unbounded Pooled Work Sharing ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 Pooled Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 Pooled Work Seeking ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.64 Pooled Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.73 Pooled Work Stealing ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.78 Pooled Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.86 Pooled Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 Work Sharing Recursive sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.72 Work Seeking Recursive sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.98 Non Generic Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.77 Non Generic OpenMP Static sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.79 Non Generic OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.77 Non Generic OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 Non Generic Cilk Static sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.82 Non Generic Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 Non Generic Work Seeking sum of 1000000000 integers using Task Attributes in parallel is 500000000500000000, Elapsed = 00:00:00.77 Non Generic Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.92 Non Generic Pooled Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.81 Non Generic Pooled Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.78 Non Generic Pooled Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.78 *** Floating Point Addition Reduction Tests *** Sequential Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:09.47 Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57 Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.76 OpenMP Static Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.48 OpenMP Dynamic Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.56 OpenMP Guided Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.58 Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.46 Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.51 Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.64 Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.72 Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 
500000000500000000.000, Elapsed = 00:00:02.70 Pooled Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57 Pooled Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.92 Pooled_Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.49 Pooled_Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.50 Pooled Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.52 Pooled Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.49 Pooled Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.52 Pooled Work Seeking Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57 *** Non Reducing Array Manipulation Tests *** Sequential Manipulation bits of bit array is Done, Elapsed = 00:00:00.25 Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.10 Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.08 Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06 OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.08 Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.09 Pooled Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08 Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06 Pooled Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.06 Work Seeking Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08 Non Generic Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06 Non Generic Cilk static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.02 Non Generic Pooled Work Sharing Manipulation of bit array is Done, Elapsed = 00:00:00.07 Non Generic Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Pooled Work Stealing Manipulation of bit array is Done, Elapsed = 00:00:00.07 This sped up the processing about 3.5-3.6 x faster for 4 cores, for 64 bit integer addition pretty consistently. and about 3.7 x faster for 8 byte floating point. After playing around with tuning, Using my suggested number of executors for optimal performance, I found I got pretty much the same result with about 12 executors. (3 x number of cores). 
When I rerun this example with 12 executors, I get; ************* Parallel Framework Test ************* Physical Processors= 4 Workers = 12 Effective_Workers = 12 Load_Sensitive=FALSE Grain_Size= 10000 *** Integer Sum Reduction Tests *** Sequential Sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:02.25 Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.65 OpenMP Work Static sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 OpenMP Dynamic Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.66 OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.65 Cilk Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:01.11 Ordered Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.67 Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.62 Ordered Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.69 Ordered Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.73 Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 OpenMP Static Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 OpenMP Dynamic Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 OpenMP Guided Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.74 Pooled Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.70 Pooled Work Sharing Ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.68 Unbounded Pooled Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.74 Unbounded Pooled Work Sharing ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.75 Pooled Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.69 Pooled Work Seeking ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.64 Pooled Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.69 Pooled Work Stealing ordered sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.67 Pooled Work Sharing Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.74 Pooled Work Seeking Parallel Blocks sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.71 Work Sharing Recursive sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.75 Work Seeking Recursive sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.96 Non Generic Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.80 Non Generic OpenMP Static sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.83 Non Generic OpenMP Dynamic Work Sharing sum of 
1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 Non Generic OpenMP Guided Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 Non Generic Cilk Static sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.95 Non Generic Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.79 Non Generic Work Seeking sum of 1000000000 integers using Task Attributes in parallel is 500000000500000000, Elapsed = 00:00:00.81 Non Generic Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.77 Non Generic Pooled Work Sharing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 Non Generic Pooled Work Seeking sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 Non Generic Pooled Work Stealing sum of 1000000000 integers in parallel is 500000000500000000, Elapsed = 00:00:00.76 *** Floating Point Addition Reduction Tests *** Sequential Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:09.47 Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.58 Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.57 OpenMP Static Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.53 OpenMP Dynamic Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.40 OpenMP Guided Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.62 Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.84 Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.59 Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.68 Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.70 Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.78 Pooled Work Sharing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.74 Pooled Work Sharing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.53 Pooled_Work Seeking Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.60 Pooled_Work Seeking Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.50 Pooled Work Stealing Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.53 Pooled Work Stealing Ordered Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.64 Pooled Work Sharing Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.68 Pooled Work Seeking Parallel Blocks Addition of 1000000000 floats in parallel is 500000000500000000.000, Elapsed = 00:00:02.81 *** Non Reducing Array Manipulation Tests *** Sequential Manipulation bits of bit array is Done, Elapsed = 00:00:00.23 Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08 Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06 Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06 OpenMP static 
Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.08 Pooled Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.07 Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.06 Pooled Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Work Sharing Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.07 Work Seeking Parallel Blocks Manipulation bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Work Sharing Manipulation bits of bit array is Done, Elapsed = 00:00:00.08 Non Generic Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Work Stealing Manipulation of bits of bit array is Done, Elapsed = 00:00:00.10 Non Generic OpenMP static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Cilk static Manipulation of bits of bit array is Done, Elapsed = 00:00:00.02 Non Generic Pooled Work Sharing Manipulation of bit array is Done, Elapsed = 00:00:00.08 Non Generic Pooled Work Seeking Manipulation of bits of bit array is Done, Elapsed = 00:00:00.07 Non Generic Pooled Work Stealing Manipulation of bit array is Done, Elapsed = 00:00:00.07 Which is pretty much the same result, even though the chunk size is about 3 times smaller. Now consider another example. For this one, the sum of the sum of integers from 1 to N In other words, the sequential loop is; for I in 1 .. N loop for J in 1 .. I loop Sum := Sum + J; end loop; end loop; Here the amount of code per iteration is much higher, and very lop-sided. The higher the number of iteration for the outside loop, the more iterations performed in the inner loop. One would expect the executors doing the lower number iterations to complete much earlier that the higher ones. If I run the program with the number of executors = number of cores, and number of chunks = number of executors, and using a much smaller number of iterations, (50_000 Iterations) I get; ************* Parallel Framework Test ************* Physical Processors= 4 Workers = 4 Effective_Workers = 4 Load_Sensitive=FALSE Grain_Size= 3125 *** Integer Sum Reduction Tests *** Sequential Sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:02.99 Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.50 OpenMP Work Static sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.98 OpenMP Dynamic Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.97 OpenMP Guided Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.91 Cilk Work Stealing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.88 Ordered Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.47 Work Seeking sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.09 Ordered Work Seeking sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.10 Work Stealing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.43 Ordered Work Stealing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.36 Here the load balancing strategies and 3rd party strategies do quite a bit better than my simple work sharing library, that does no load balancing. But notice, all the results are "PGP" (Pretty Good Parallelism). 
Now lets try the optimum number of executors, which I found to be about 12. (Anything higher doesn't improve results noticeably) ************* Parallel Framework Test ************* Physical Processors= 4 Workers = 12 Effective_Workers = 12 Load_Sensitive=FALSE Grain_Size= 1041 *** Integer Sum Reduction Tests *** Sequential Sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:03.00 Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.95 OpenMP Work Static sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.97 OpenMP Dynamic Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.92 OpenMP Guided Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.91 Cilk Work Stealing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.93 Ordered Work Sharing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.94 Work Seeking sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.92 Ordered Work Seeking sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.94 Work Stealing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:00.82 Ordered Work Stealing sum of 50000 integers in parallel is 20834583350000, Elapsed = 00:00:01.02 All the results are noticeably better, but not significantly better. Even the non-load-balancing one had a good comparable time compared to the ones that did load balancing, and also note that the best time for this run was my Paraffin Work Stealing library, which beat both Cilk and OpenMP versions here. But depending on the problem one library might have a slight edge over another, the important point is that they all come in around the same ball park, and all are quite a lot better than the sequential version. > >> > I know Brad tried to provide a function in System.Multiprocessing for this >> > purpose, but that is missing the very important data about the cost of an >> > iteration and variance of iteration executions. >> >> I actually dont think that is very important. It might be >> important in determining >> whether or not parallelism should be applied, but with >> explicit parallelism, >> there is no need to determine that. The programmer decides. > > I'm assuming that the overhead of executing a parallel chunk is rather high > compared to "typical" code. As such, running too many chunks increases the > overhead with no appreciable benefit. But also no noticeable detriment. (Assuming 3-4 times the number of chunks), the overhead of iterating chunks is not where time is spent. It is the work that is done during each chunk that matters. Also, any extra overhead in iterating over a higher number of chunks is likely offset by better granularity and processor utilization. **************************************************************** From: Tucker Taft Sent: Wednesday, March 28, 2018 3:17 PM [The first part of this mail is relevant to AI12-0251-2, find it there. - ED] And for AI12-0251-1, I would suggest that rather than only being allowed to specify an explicit chunk index, you could also just specify a single expression in the parentheses, which would give an upper bound on the number of tasklets that may be created to execute the parallel loop. Hence the syntax would be either: parallel (expression) for ... loop ... -- specify upper bound on number of tasklets or: parallel (identifier IN discrete_subtype_definition) for ... loop ... 
-- give index to each tasklet **************************************************************** From: Tucker Taft Sent: Tuesday, October 16, 2018 4:12 PM Here is an update to AI12-0251 on chunking of parallel loops. It includes the wording and a summary, and a bit more discussion. [This is version /02 of the AI - Editor.] Perhaps the biggest change relative to the model in ai12-0119 is that we have one logical thread of control per chunk, rather than one per iteration. ai-119 mentioned that the implementation might "chunk" the iterations behind the scenes, but that was intended to be invisible. With a visible chunk parameter, we want to clarify that there is a single logical thread of control in a single chunk, so there are no data races if the iterations of a single chunk all update the same accumulator (which is presumably an element of an array indexed by the chunk parameter). **************************************************************** From: Randy Brukardt Sent: Tuesday, October 16, 2018 8:58 PM Editorials: "[editor's note:" Last I heard, I am the editor, and I didn't put a note there. Hope I haven't been replaced. :-) Anyway I changed this to "[Author's note:". Also, you didn't really need the date (the version ought to be unique without it). And it might have been better to write this against Draft 14 of the RM (for instance, 5.5(3/5) has both changes from AI12-0119-1 and AI12-0189-1; it's annoying to look at one but not the other). I don't have the energy to figure that out right now. --- "... declares a /chunk_parameter/ objecgt whose subtype ..." I don't think I need to tell anyone what needs changing here. ;-) --- I removed a number of extra spaces after periods. **************************************************************** From: Tucker Taft Sent: Tuesday, October 16, 2018 9:01 PM Thanks for the corrections. **************************************************************** From: Randy Brukardt Sent: Friday, November 16, 2018 5:27 PM Consider this a "for-the-record" note. The body of AI12-0251-1 as approved includes a change to the wording of AI12-0267-1: Modify "Parallel_Conflict_Checks" paragraph from AI12-0267-1 that mentions chunks: * Parallel_Conflict_Checks This policy disallows a parallel construct from reading or updating a variable that is global to the construct, unless it is a synchronized object, or unless the construct is a parallel loop, and the global variable is a part of a component of an array denoted by an indexed component with at least one index expression that statically denotes the loop parameter of the loop_parameter_specification or the chunk [index] parameter of the parallel loop. But this is nonsense. First of all, the AI12-0267-1 wording clearly depends on the AI12-0251-1 definition of chunk ('cause there isn't any definition of chunk parameter or chunk index in AI12-0119-1, so the original wording had to be referring to the definition in AI12-0251-1). It's not acceptable to have two AIs depend on each other. Second, AI12-0267-1 wasn't approved when AI12-0251-1 was considered, so any fixes needed should have just been made there. (Yes, it was approved an hour later, but that's still afterwards.) Therefore, I eliminated this change from AI12-0251-1, and modified AI12-0267-1 appropriately (along with mentioning that it depends on AI12-0251-1). 
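[A sketch, for illustration, of the per-chunk model described in Tucker's note
above, applied to a loop with more than one result; Arr is an assumed array of
Integer, Min and Max are assumed to be declared outside the block and
initialized to Integer'Last and Integer'First respectively, and Num_Chunks is
an assumed chunk count. The final sequential loop is just one way to combine
the per-chunk values.

   declare
      Partial_Min : array (1 .. Num_Chunks) of Integer :=
        (others => Integer'Last);
      Partial_Max : array (1 .. Num_Chunks) of Integer :=
        (others => Integer'First);
   begin
      parallel (Chunk in Partial_Min'Range)
      for I in Arr'Range loop
         --  All iterations of one chunk run on a single logical thread
         --  of control, so these updates are free of data races.
         Partial_Min (Chunk) := Integer'Min (@, Arr (I));
         Partial_Max (Chunk) := Integer'Max (@, Arr (I));
      end loop;

      for C in Partial_Min'Range loop
         Min := Integer'Min (Min, Partial_Min (C));
         Max := Integer'Max (Max, Partial_Max (C));
      end loop;
   end;
]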
***************************************************************

From: Randy Brukardt
Sent: Friday, November 16, 2018 7:16 PM

AI12-0251-1 defines the chunk syntax as:

   parallel [(chunk_specification)] for loop_parameter_specification

but the Dynamic Semantics are:

   The chunk_specification is elaborated after the
   loop_parameter_specification is elaborated, and before any iterations are
   performed.

Why is the elaboration in the opposite order of the textual appearance? I
can't think of any reason that elaboration of one should have anything to do
with elaboration of the other, so this order seems arbitrary. In such a case,
I'd expect the order to be textual (that's certainly the order that a reader
would expect).

Moreover, you can't actually do anything until both are elaborated (you have
to know how many chunks there are before you can do any iterations), so I
don't see any implementation reason for this choice. The only reason that I
can see for this choice is that it caused less change in the wording, but that
shouldn't be the reason for any choice which has a run-time consequence.

Is there something that I missed? I'd either leave the order unspecified (but
that would be unusual for elaboration) or do the elaboration in textual order.
I can't think of any other case where elaboration is done out-of-order within
a single construct.

***************************************************************

From: Randy Brukardt
Sent: Friday, November 16, 2018 7:59 PM

...

> Why is the elaboration in the opposite order of the textual
> appearance?

Note that arguably this order is outright wrong, since the existing wording
says:

   For the execution of a loop_statement with the iteration_scheme including a
   loop_parameter_specification, the loop_parameter_specification is first
   elaborated. This elaboration creates the loop parameter and elaborates the
   discrete_subtype_definition.

But how can you create "the loop parameter" when you don't know how many
chunks there are? Since this creation doesn't have any actual effect, it's
probably safe to ignore that wording, but still it doesn't make a ton of
sense.

***************************************************************

From: Tucker Taft
Sent: Friday, November 16, 2018 9:22 PM

I agree it makes more sense to elaborate in the textual order.

[This will be handled in the clean-up AI12-0294-1. Editor.]

***************************************************************

From: Randy Brukardt
Sent: Friday, November 16, 2018 7:41 PM

This AI has:

   When the chunk_specification is elaborated, a check is made that the
   determined maximum number of logical threads of control is greater than
   zero. If this check fails, Program_Error is raised.

Is this check Suppressible? If so, what check name is it associated with?? And
if not, that's pretty weird. (I noticed this because I have to index all
checks, and it isn't obvious as to what to index it under.) The options for
Program_Error are Accessibility_Check, Allocation_Check, and
Elaboration_Check.

It seems like there ought to be some catch-all check (like "Bug_Check"), as
there seem to be other cases that raise Program_Error but don't have a check
name. Probably a separate AI if the group thinks this is a good idea (or even
a neutral idea). Note that it seems that All_Checks covers everything, even
those things that don't have a name.

Aside: All_Checks is introduced by the following text:

   The following check corresponds to all situations in which any predefined
   exception is raised.
But this is wrong, since (even if we decide that IO_Exceptions aren't covered)
we don't want to suggest that there is any suppression of exceptions raised by
language-defined libraries (even if that is Program_Error). And we certainly
don't want to imply that one can suppress things that aren't checks
(specifically the case completeness exception for case statements/expressions
and variants; I'd guess that Detect_Blocking also falls into this category).
Nor do we want to imply that this might apply to an explicit raise statement
for a predefined exception (which would be complete madness).

It should at least say "is raised by a check". Or maybe "... all situations
where a check raises any predefined exception". The other lead-ins have
similar issues.

Note that the formal definition seems to avoid this issue, and these are
marked as redundant, so arguably there's no real problem. But I have to wonder
about the value of misleading text in the RM.

[Discussion following this is found in AI12-0295-1. Editor.]

---

A for-the-record point: The AI defines "chunk_parameter" (WITH an underscore),
but this seems to be an English term. There's a second use that is in italics
for no obvious reason (there should only be one definition of a term). It
rather looks like this was originally a syntax term and then the syntax was
removed but the underscore left behind. So I removed the underscore from both
uses and the italics from the second use.

***************************************************************

From: Randy Brukardt
Sent: Wednesday, November 21, 2018 6:59 PM

In the wording as proposed, the first sentence of the next paragraph after
5.5(6) [which is in a Static Semantics section] is:

   In a chunk_specification that is an /integer_/simple_expression, the
   /integer_/simple_expression is expected to be of any integer type.

This sounds suspiciously like a Name Resolution Rule. (The other sentence is
clearly Static Semantics; it talks about declaring a chunk parameter.)

After discussion with Tucker, Bob, Steve, and Jeff, all concur that this
should be a Name Resolution Rule, so I am moving this sentence into a new
paragraph after 5.5(5.1/5) under the heading Name Resolution Rules. I'm
treating this as an editorial fix as no wording will be changed (just moved).

***************************************************************
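[For illustration of the resolution rule just discussed: the chunk count
expression need not be of type Integer; any integer type resolves. Buffer and
Transform are assumed declarations.

   declare
      type Worker_Count is range 1 .. 64;
      Max_Workers : constant Worker_Count := 8;
   begin
      parallel (Max_Workers)  -- an /integer_/simple_expression of a
                              -- user-defined integer type
      for I in Buffer'Range loop
         Buffer (I) := Transform (Buffer (I));
      end loop;
   end;
]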