!standard 5.5.2(2/3) 18-03-29 AI12-0119-1/08 !class Amendment 14-06-20 !status work item 14-06-20 !status received 14-06-17 !priority Medium !difficulty Hard !subject Parallel operations !summary New syntax and semantics to facilitate parallelism via Parallel Loops and Parallel Blocks. !problem The increased presence of parallel computing platforms brings concerns to the general purpose domain that were previously prevalent only in the specific niche of high-performance computing. As parallel programming technologies become more prevalent in the form of new emerging programming languages and extensions of existing languages, safety concerns need to consider the paradigm shift from sequential to parallel behavior. Ada needs mechanisms to allow the programmer to specify explicity in the code where parallelism should be applied. In particular, loops are often a good candidate for parallelisation, but there also needs to be a way to specify parallelism more generally to allow potentially different groups of calls to be executed in parallel with other groups of calls. While Ada tasks can be used to generate parallel execution, they are better suited for separating concerns of independent processing where the need for parallelisation is mostly a perforamance concern, where it should be easy to convert an existing sequential algorithm into a parallel one, or vice versa. Declaring tasks for this purpose is too error prone, requires significant rework, more difficult to mainain, and not as easily transferable to different hardware platforms. It is better to rely more on the compiler to map the parallelism to the target platform. !proposal This proposal depends on the facilities for aspect Global (AI12-0079-1) and for aspect Nonblocking (AI12-0064-2). Those proposals allow the compiler to statically determine where parallelism may be introduced without introducing data races or deadlocking. This proposal informally introduces the semantic notion of a Parallel OPportunity (POP) and a tasklet to the language. POPs are places in program text where work can be distributed to parallel executing workers that work in concert to correctly execute the algorithm. Tasklets are the notional (logical) units within a task that are executed in parallel with each other. The goals of this proposal are: - To permit existing programs to benefit from parallelism through minimal restructuring of sequential programs into ones that permit more parallel execution; - To support the development of complete programs that maximize parallelism; - To efficiently guide the compiler in the creation of effective parallelism (without oversubscription of the parallelism resources) with minimal input from the programmer; - To avoid syntax that would make a POP erroneous or produce noticeably different results if executed sequentially vs in parallel. One issue to consider is the underlying runtime. In this model, tasklets are associated with tasks. Regardless of implementation, tasklets are considered to execute in the semantic context of the task where they have been spawned, which means that any operation that identifies a task, such as those in Task_Identification, will identify the task in which the tasklet is spawned. On the other hand, calls by different tasklets of the same task into the same protected object are treated as different calls resulting in distinct protected actions; therefore synchronization between tasklets can be performed using protected operations. Note that this is consistent with the current standard which already supports multiple concurrent calls by a single task in the presence of the asynchronous transfer of control capability (9.7.4 (23)). There are two areas of this proposal; one that introduce capabilities for parallel/concurrent blocks, and another that introduces capabilities for parallel loops. A third proposal that introduces capabilities for parallel reduction is covered in a separate AI, AI12-0242-1. Parallel Blocks --------------- Parallel blocks may be used to specify that two or more parts of an algorithm may be executed in parallel with each other. Semantics: A parallel block statement encloses two or more sequences of statements (two or more "parallel sequences") separated by the reserved word "and". Each parallel sequence represents a separate tasklet, but all within a single Ada task. Task identity remains that of the enclosing Ada task, and a single set of task attributes is shared between the tasklets. Each sequence of statements is assigned a unique and independent executor to execute the tasklet. With respect to the rules for shared variables (see 9.10(11-14), and specifically 9.10(13)), two actions occurring within two different parallel sequences of the same parallel block are not automatically sequential, so execution can be erroneous if one such action assigns to an object, and the other reads or updates the same object or a neighboring object that is not independently addressable from the first object. Potentially blocking operations are not allowed in a parallel sequence of a parallel block. Otherwise, the appropriate use of atomic, protected, or task objects (which as a group we will call synchronized objects) can be used to avoid erroneous execution. In addition, the new Global and Nonblocking aspects may be specified to enable the static detection of such problems at compile time (see AI12-0079-1 and AI12-0064-2). Any transfer of control out of one parallel sequence will initiate the aborting of the other parallel sequences not yet completed. Once all other parallel sequences complete normally or abort, the transfer of control takes place. If multiple parallel sequences attempt a transfer of control before completing, the first occurrence of transfer of control is selected and the others are aborted. The parallel block completes when all of the parallel sequences complete, either normally or by being aborted. Note that aborting a tasklet need not be preemptive, but should prevent the initiation of further nested parallel blocks or parallel loops. Parallel Loops -------------- A parallel loop is a loop where any iteration of the loop can execute in parallel with any other. There is no implied execution order for the iterations. Iterations that execute in parallel with other iterations are associated with different tasklets. To indicate that a loop is a candidate for parallelization, the reserved word "parallel" may be inserted immediately before the word "for" in a "for" loop. Note that the same rules presented for parallel blocks above also apply to the update of shared variables and the transfer of control to a point outside of the loop, and for this purpose each iteration can be considered as being equivalent to a separate sequence of a concurrent block. !wording Append to Introduction (28) "A parallel block statement requests that two or more sequences of statements should execute in parallel with each other." Modify 2.9(2/3) Add "parallel" to the list of reserved words. Modify 5.1(5/2) add "parallel_block_statement" to the list of compound statements Replace 5.5(3/3) Iteration_scheme ::= while condition | [parallel] for loop_parameter_specification | for iterator_specification Add after 5.5(6/5) {Legality Rules} A loop_statement shall not have both reserved words reverse and parallel present. Add after 5.5(7) When the reserved word parallel is present and a transfer of control out of the loop occurs, an attempt is made to cancel further parallel execution of the sequence_of_statements that has not yet started. The completion of a loop_statement for a transfer of control out of the loop is delayed until all parallel execution of the sequence_of_statements is complete. If a transfer of control out of the loop occurs in multiple parallel executions of the sequence_of_statements then only the run time action of the first encountered transfer of control occurs. Modify 5.5(9/4) For the execution of a loop_statement with the iteration_scheme being for loop_parameter_specification, the loop_parameter_specification is first elaborated. This elaboration creates {objects of} the loop parameter and elaborates the discrete_subtype_definition. {If the keyword parallel is present, multiple objects of the loop parameter are created where each iteration is associated with a specific loop parameter object, otherwise a single loop parameter object is created. Each loop parameter object is associated with a thread of control where each thread proceeds independently and concurrently between the points where they interact with other tasks and with each other.} If the discrete_subtype_definition defines a subtype with a null range, the execution of the loop_statement is complete. Otherwise, the sequence_of_statements is executed once for each value of the discrete subtype defined by the discrete_subtype_definition that satisfies the predicates of the subtype (or until the loop is left as a consequence of a transfer of control). Prior to each such iteration, the corresponding value of the discrete subtype is assigned to [the]{a} loop parameter {object}. These values are assigned in increasing order unless the reserved word reverse is present, in which case the values are assigned in decreasing order {, or unless the reserved word parallel is present, in which case the order is arbitrary}. AARM - An implementation should statically treat the sequence_of_statements as being executed by separate threads of control, but whether they actually execute in parallel or sequentially should be a determination that is made dynamically at run time, dependent on factors such as the available computing resources. Add after 5.5(21) Example of a parallel loop -- See 3.6(30/2) parallel for I in Grid(1)'Range loop Grid(I, 1) := (for all J in Grid(2)'Range => Grid(I,J) = True); end loop; Add a new subclause 5.6.1 Parallel Block Statements A parallel_block_statement encloses two or more handled_sequence_of_statements where all the handled_sequence_of_statements can execute in parallel with each other. Syntax parallel_block_statement ::= parallel do handled_sequence_of_statements and handled_sequence_of_statements {and handled_sequence_of_statements} end do; Legality Rules A parallel_block_statement shall not update variables global to the statement unless the action is sequential (see 9.10). A handled_sequence_of_statements of a parallel_block_statement shall not consist of a statement that can invoke a potentially blocking operation (see 9.5). Static Semantics Each handled_sequence_of_statements represents a separate thread of control that proceeds independently and in parallel between the points where they interact with other tasks and with each other. For the execution of a parallel_block_statement, each handled_sequence_of_statements is executed once, until the parallel_block_statement is complete. The parallel_block_statement is complete when all handled_sequence_of_statements have completed execution or when transfer of control occurs that transfers control out of the parallel block. When a transfer of control out of the parallel block occurs, an attempt is made to cancel further parallel execution of a handled_sequence_of_statements that have not yet started. The completion of a parallel_block_statement for a transfer of control out of the parallel block is delayed until all parallel execution of the handled_sequence_of_statements is complete. If a transfer of control out of the loop occurs in multiple parallel executions of handled_sequence_of_statements then only the run time action of the first encountered transfer of control out of the parallel block occurs. AARM - An implementation should statically treat each handled_sequence_of_statements as a separate thread of control, but whether they actually execute in parallel or sequentially should be a determination that is made dynamically at run time, dependent on factors such as the available computing resources. Examples Example of a parallel block statement: do Foo(Z); and Bar(X, Y); and Put_Line ("Executing Foo and Bar in parallel with Other_Work"); Other_Work; exception when Constraint_Error => Put_Line ("Constraint Error raised doing Other_Work"); end do; function Fibonacci (N : Natural) return Natural is X, Y : Natural; begin if N < 2 then return N; end if; parallel do X := Fibonacci (N - 2); and Y := Fibonacci (N - 1); end do; return X + Y; end Fibonacci; Modify 9.10(13) Both actions occur as part of the execution of the same task {unless either are part of a; - different handled_sequence_of_statements of a parallel block statement, - sequence_of_statements of a parallel loop statement.} Add a new subclause 9.12 9.12 Executors and Tasklets A task may distribute execution across different physical processors in parallel, where each execution is a separate thread of control that proceeds independently and concurrently between the points where they interact with other tasks and with each other. Each separate thread of control of a task is an Executor, and the execution that each executor performs between synchronization is a tasklet. When a task distributes its execution to a set of executors, it cannot proceed with its own execution until all the executors have completed their respective executions. A parallel block statement or parallel loop statement may assign a set of executors to execute the construct, if extra computing resources are available. Add after C.7.1(5) The Task_Id value associated with each handled_sequence_of_statements of a parallel_block_statement, or of each sequence_of_statements of a parallel loop statement is the same as that of the enclosing statement. AARM - Each handled_sequence_of_statements of a parallel block or of each sequence_of_statements of a parallel loop are treated as though they are all executing as the task that encountered the parallel construct. !discussion There is a continuing trend of exponential growth of computational elements embedded on a single chip. This has led to significant challenges for software designers and implementers. Prior to 2005, the increase in transistor count corresponded to a similar increase in CPU clock speed, which boosted performance of sequential algorithms. Since then, CPU clock speeds have leveled off for reasons of practicality, and chip manufacturers have instead moved towards multicore technologies as a means of achieving performance increases. As parallel technologies become more prevalent in the form of new emerging hardware platforms, safety concerns need to consider the paradigm shift from sequential to parallel behaviour. It is a paradigm that has been for a long time contained within a specialized domain of high-performance computing, but that is now required in all domains of computing systems. It is important for the programmer to indicate places in the code where parallelism is desired, which allows the reader to better understand the algorithm, and allows the programmer to receive confirmation from the compiler whether the desired parallelism is safe, or even worthwhile. This proposed model does not define syntax for the explicit parallelization of individual subprogram calls, since such parallelization can be performed implicitly by the compiler, when it knows that the calls are free of side-effects. This is facilitated by annotations identifying global variable usage on subprogram specifications (see AI12-0079-1). Parallel Blocks --------------- example: declare X, Y : Integer; Z : Float; begin parallel do X := Foo(100); and Z := Sqrt(3.14) / 2.0; Y := Bar(Z); end do; Put_Line("X + Y=" & Integer'Image(X + Y)); end; In this example, the calculation of Z and Y occur sequentially with respect to each other, but in parallel with the calculation of X. Note that the compiler, using the rules specified in AI12-0079-1, may complain if the parallel sequences might have conflicting global side-effects, or if the parallel sequences might have Potentially Blocking operations. The parallel block construct is flexible enough to support recursive usage as well, such as: function Fibonacci (N : Natural) return Natural is X, Y : Natural; begin if N < 2 then return N; end if; parallel do X := Fibonacci (N - 2); and Y := Fibonacci (N - 1); end do; return X + Y; exception when others => Log ("Unexpected Error"); end Fibonacci; We considered allowing the parallel block construct to be preceded with an optional declare part, but it was observed that it was more likely to be useful to have objects that are shared across multiple parallel sequences to outlive the parallel block. Therefore we reverted to the simpler syntax proposed above. Because there are no local declarations, there was also no point in having a statement_identifier (block label) for a parallel block. Allowing an exit statement to replace a goto that targets the location following the end of the parallel block statement seems useful, and to allow such a statement, a block label might be needed to identify the parallel block that is being exited. However, allowing exit statements is problematic for the same reason that exit statements are not currently allowed in block statements. It could be confusing if an exit occurred within a loop enclosed by the parallel block. Some might see it as syntax to exit the loop, others might see it as syntax to exit the parallel block. Rather than deal with this now, it was felt that if there was strong enough of a need for allowing an exit statement in a parallel block, it could be the subject of a separate AI. We considered what the semantics might be for a parallel block if the parallel keyword were absent. This might be a good syntactic construct to use for supporting coroutines, for example. Rather than deal with that question in this AI, we leave that for consideration in separate AI's. We considered whether blocking calls should be allowed in a parallel block statement. We felt that allowing that could add significant complexity for the implementor, as well as introduce safety concerns about potential deadlocking. While such a capability is possible, it was felt that it should be disallowed for now. If the demand and need is felt in the future, it could be added then, but it is better to not standardize that capability until we know it is needed. Parallel Loops -------------- In most compute-intensive applications, a significant proportion of the computation time is spent in loops, either iterating over arrays/container data structures, or systematically searching a large solution space. To benefit from parallel hardware, the computation associated with a loop should be spread across the available processors. Note: Allowing the parallel keyword on while loops was considered, but was discarded since while loops cannot be easily parallelized, because the control variables are inevitably global to the loop. For example, here is a simple use of a parallelized loop, to add two arrays together to produce a result array: parallel for I in Loop_Iterations'Range loop Result (I) := A(I) + B(I); end loop; Note that the compiler, using the rules specified in AI12-0079-1, may complain if the parallel sequences might have conflicting global side-effects. We considered extending the syntax to allow reductions to be performed with parallel loops, but we discovered that parallel reduction expressions can provide the same capability more concisely, so for now we allow parallel loops to work if there are no global conflicts. A separate AI could be created to add reduction loop capability if it is decided that this is needed. In the meantime, Reduction Expressions appear to better fill this need. The main purpose of a loop is to apply processing iteratively. The main purpose of a reduction however, is to generate a result value. Iteration is a means to achieve that end, but a reduction expression seems better suited since it is designed to produce a result value. !ASIS ** TBD. !ACATS test ACATS B-Test and C-Test are needed for these features. !appendix From: Tucker Taft Sent: Tuesday, June 17, 2014 10:18 AM Here is the paper the "gang of 4" is submitting to HILT 2014 with a relatively detailed proposal for parallel programming in Ada 202x: http://bit.ly/parallada-gangof4 If there is time at the upcoming ARG meeting, some discussion and feedback from ARG members would be appreciated. **************************************************************** From: Randy Brukardt Sent: Tuesday, June 17, 2014 12:52 PM Are we going to get formal proposals for some/all of these features? For instance, we have an open AI on global in/out (AI12-0079-1) which is certainly more broadly applicable than just parallelism. There is also an AI for Potentially_Blocking, AI12-0026-1, misnamed "Task_Safe", which is useful for any program including tasking. In both of these cases, an important part of the proposals will be how they interact with predefined and language-defined subprograms. (If they're not specified for predefined "+", nothing useful could be done, for example.) As with all of these sorts of things, the devil is in the details. The basic ideas look fine, but we'll never really know without detailed proposals that can be tested against. **************************************************************** From: Tucker Taft Sent: Tuesday, June 17, 2014 1:54 PM True, but it is not worth making a more detailed proposal if the overall reaction is "yuck"! ;-) **************************************************************** From: Randy Brukardt Sent: Tuesday, June 17, 2014 3:10 PM Well, it looks great to me, but that's mainly because all of the details are missing. A lot of stuff looks great without the details (it's how modeling works, after all). The "yuck" only comes when the details are fleshed out... **************************************************************** From: Bob Duff Sent: Tuesday, June 17, 2014 5:52 PM Sounds like there are some good ideas in the paper, but I am very strongly opposed to ARG getting "detailed proposals" about how to do parallelism, or even wasting time discussing the matter. ARG should be moving more to the "standardize existing practise" way of doing. And that goes double for features whose ONLY purpose is efficiency -- such as parallelism. We (ARG) don't need detailed proposals in that area; we need an experimental implementation, and MEASUREMENTS of the efficiency of realistic programs. (And don't try to show me speedups of the recursive Fibonacci -- that's downright dishonest!) Until then, ARG should avoid wasting any time on the matter. Once we have proven practise, then ARG should take charge of standardizing it. My comments above don't apply so much to things like Globals, which are more about correctness/safety than efficiency. ARG could reasonably discuss Globals, although even there, an experimental implementation would be helpful. Regarding those: You can put Globals and Potentially_Blocking on a package. That's good. I think it should apply to child and nested packages of that package. I think you also want a way to give a default for Overlaps_Storage: "no overlaps unless specified, in this package and children". **************************************************************** From: Robert Dewar Sent: Wednesday, June 18, 2014 2:52 AM > Sounds like there are some good ideas in the paper, but I am very > strongly opposed to ARG getting "detailed proposals" about how to do > parallelism, or even wasting time discussing the matter. ARG should > be moving more to the "standardize existing practise" way of doing. > And that goes double for features whose ONLY purpose is efficiency -- > such as parallelism. I am in full agreement with this. At the last RTW, which I attended, I was struck by the fact that the RTW was debating removing some features in Annex D which no one has ever implemented, yet alone found useful in actual practice. The RM is not a research proposal :-) **************************************************************** From: Alan Burns Sent: Wednesday, June 18, 2014 5:59 AM The policy currently in Ada has been implemented by the Ada run-time from Santandar, and is supported in EDF operating systems (not Ada), though I agree these are mainly research and experimental platforms, not mainstream industrial platforms. **************************************************************** From: Robert Dewar Sent: Wednesday, June 18, 2014 10:39 AM But I am talking about the specific Ada features, which for sure have not been used by anyone in an Ada program :-) **************************************************************** From: Tucker Taft Sent: Wednesday, June 18, 2014 8:38 AM > The RM is not a research proposal :-) I agree that many language standards come after the fact, and are used to try to tame a wildly varying set of ad hoc extensions. That has never been the story with Ada. There are relatively few extensions (other than as packages, pragmas, or attributes) that Ada vendors or users try in any serious way, before the Ada Rapporteur Group begins to talk seriously about them. The ideal is that the ARG gets the ball rolling, indicates a direction we are serious about, and then vendors try prototyping the ideas, and then we come around to picking among the "serious" ideas based on prototyping results. I really don't see anyone beside the ARG being taken seriously at this point as a fountain for "ideas worth prototyping" (apologies to TED ;-). **************************************************************** From: Stephen Michell Sent: Sunday, June 22, 2014 10:24 PM Attached is a report that ARG requested from Brad, Tucker and myself in investigating how OpenMP could be used to implement Ada parallelism. For discussion at the ARG meeting this week. ----- Report to WG 9/ARG on the use of OpenMP for Parallelism in Ada Stephen Michell with contributions from Luis Miguel Pinho Brad Moore Tucker Taft Foreward At the ARG meeting November 2014, ideas were presented about ways to implement parallelism in Ada, and ways to interoperate with other language systems while using the multicore capabilities of modern computers. We were requested to investigate OpenMP from a technical perspective as well as a business perspective as a potential solution to implementing multicore parallelism in Ada. OpenMP Characteristics OpenMP is a set of libraries and language extensions that permit a user to access the capabilities of multicore systems from within certain languages - namely Fortran, C and C++. OpenMP's parallelism is "explicit parallelism" - i.e. all aspects of the parallelism are specified by compiler directives in the code that invoke library routines at the place of the directive through OpenMP API's. OpenMP lets the user specify code segments (usually "for loops", blocks and functions) that can be executed by threads (possibly operating on multiple cores) that are co-ordinated to implement an algorithm. The language extensions are converted into library calls to interact with a runtime to implement the algorithm. In trying to implement OpenMP on an Ada system, one would note that the concurrency model is quite different than Ada's. The "threads" that are created interact with other threads that use OpenMP primitives. This means that they will not interact with Ada tasks using Ada tasking primitives. For the interoperability with Fortran, C and C++ systems that are using OpenMP, an Ada OpenMP capability would be useful, whetehr or not it interacted with Ada tasks. This brings us to the business model. Business Model To implement An OpenMP for Ada, one must participate in the OpenMP community. One could develop compiler directives, either as aspects or pragmas and convert calls to meet the API, but to have it "standardized", one must be a full participant in the OpenMP group. One can participate in OpenMP as a university, as an individual or as a corporate participant. AdaCore was approached to see if they were interested in joining OpenMP (and paying fees), but declined. There is no point joining as individuals, as individuals cannot propose new language specifications for standardization. This leaves participation through a university. Barcelona Supercomputing Group is a participant, and Miguel Pinho has approached them about participating through Barcelona. Barcelona is receptive, but participation requires that Miguel find a PhD student willing to participate. To date there has been no final resolution. There is a GNU OpenMP binding (GOMP) that supports the GNU Fortran, C and C++ GNU compilers. Ada could define a thin binding to that interface. This would still leave the challenge of defining Ada syntax for the Open MP compiler directives and determining how such a library would interact with the Ada runtime (Tasking is not the only issue - elaboration issues are also a potential issue). Conclusion There are business reasons and technical reasons for creating an OpenMP Ada implementation. If an interface with OpenMP through Barcelona Supercomputing Group becomes possible, further exploration should happen. Otherwise, no further action in this area is anticipated. **************************************************************** From: Brad Moore Sent: Monday, October 3, 2016 10:09 AM [This is version /02 of the AI - Editor.] Here is my attempt to refine ideas from the gang of 4 working on parallelism, so we might have something more to talk about on this topic at the upcoming ARG meeting. There are many ideas to choose from, so I chose what I thought were the most promising, I'm not sure if all 4 of us would agree, though it is possible that we might be in agreement, as we haven't yet gotten around to deciding on the best way forward. Most of the writing related to parallel blocks is unchanged from the previous version of this AI, and was mostly written by Tucker. The updates I have are mostly to do with parallel loops and reduction. I have attempted to put in wording for these changes, but the wording is not polished. The wording hopefully does capture and describe the intent at least. The main hope is to gain some feedback, to see if this direction makes sense, or if other directions should be pursued. It would probably also be good to discuss whether this parallel loop proposal is an improvement over what was written in the previous version of the AI, which discussed the use of parallel arrays. **************************************************************** From: Brad Moore Sent: Saturday, June 10, 2017 1:18 AM [This is version /03 of the AI - Editor.] Here is a bit of a redesign in the area of parallel loops, and a new approach, parallel reduction expressions. The parallel block part is mostly unchanged. The parallel loop design in much simpler. It now just does parallel iteration, leaving parallel reductions to reduction expressions. We can add reduction back to loops later, possibly in a different AI if we decide we need it, but for the time being, it seems like reduction expressions are a better fit for the problem. Parallel reduction expressions are very much similar to quantified expressions, which is where I started thinking about this idea. It turns out that Tucker was also thinking along these lines in Parasail, as Parasail apparently already has this feature. After some email exchange with the gang of 4, the more this idea evolved, the more similar it became to the Parasail approach. The basic idea is that a Reduction expression applies an operation using iteration, where the previous result of the last iteration is fed back as an input into the next iteration. Special syntax, "" is used to indicate where this feedback occurs, which also encloses the initial value to be used for the first iteration. Here are a few examples that are in the AI, to give a basic idea. -- Sum of the elements of an array Sum : Integer := (for parallel Element of Arr => <0> + Element); -- Minimum value of an array Min : Integer := (for parallel X of Arr => Integer'Min(, X)); -- Construct the alphabet Alphabet : Unbounded_String := (for Letter in 'A' .. 'Z' => & Letter); -- Number of people who are 30 or older ThirtySomething : contant Natural := (for P of parallel database => <0> + (if Age(P) > 30 then 1 else 0)); -- Factorial function function Fact(N : Natural) return Natural is (for J in 1..N => <1> * J); -- Taxable_Income : constant Float := Total_Income + (for Deduction of Deductions => <0.0> - Deduction); -- Compute Taylor expansion for Sin of X function Sin(X : Float; Num_Terms : Positive := 5) return Float is (for I in 1..Num_Terms => <0.0> + (-1.0)**I * X**(2*I-1)/Float(Fact(2*I-1))); -- Compute Pi Number_Of_Steps : constant := 100_000; Step : constant := 1.0 / Number_Of_Steps; Pi : constant Long_Float := Step * (for I in parallel 1 .. Number_Of_Steps => <0.0> + (4.0 / (1.0 + ((Long_Float (I) - 0.5) * Step)**2))); -- Print the sum of squares Put_Line ("Sum of Squares is" & Integer'Image(for I in 1 .. 10 => <0> + I**2)); To see the similarity between Quantified expressions and the more general form, Reduction expressions, consider: -- Quantified expressions about the graduating class All_Graduated : constant Boolean := (for all Student of parallel Class => Passed_Exam (Student)); Someone_Graduated : constant Boolean := (for some Student of parallel Class => Passed_Exam (Student)); -- The same result can be obtained by writing these as -- reduction expressions All_Graduated : constant Boolean := (for Student of parallel Class => and Passed_Exam (Student)); Someone_Graduated : constant Boolean := (for Student of parallel Class => or Passed_Exam (Student)); I've taken a first crack at writing this up, but didn't spend a lot of time wordsmithing. I'd hope to first get some feedback in Vienna to see this idea has merit. **************************************************************** From: Randy Brukardt Sent: Monday, June 12, 2017 5:35 PM > Here is a bit of a redesign in the area of parallel loops, and a new > approach, parallel reduction expressions. The parallel block part is > mostly unchanged. Here's a couple barely informed comments on this. First, a bit of a rant. It's disconcerting that we get a wildly different proposal at each meeting. There's no opportunity to get the details right if we change the basic proposal completely between every meeting. In Pittsburgh, we spent a significant amount of time figuring out the right model for reducers. I see here that all of the conclusions of that meeting have completely been discarded in favor of what appears on first glance the reducer-de-jour. Why should we even spend time trying to figure out this wildly different proposal when it seems obvious that whatever we decide will be discarded all over again next time? I want to at least feel like we are making progress toward a solution, rather than starting from nearly ground zero with each proposal. End of mini-rant. ... > Special syntax, "" is used to indicate where this feedback > occurs, which also encloses the initial value to be used for the first > iteration. Regardless of the merits of the proposal itself (which I haven't thought about), I don't think angle brackets is going to fly as syntax in Ada. There would be substantial confusion with relational operators, which would likely make such text unparsable. (We discussed this extensively with some of the assignment target proposals; it didn't fly there and I don't see what could have changed in 8 months.) You could use square or curly brackets for this, but in any case, such syntax doesn't seem very Ada-like. I'd suggest using an attribute or "magic" function call as the appropriate syntax. > Here are a few examples that are in the AI, to give a basic idea. > -- Minimum value of an array > Min : Integer := > (for parallel X of Arr => Integer'Min(, X)); Min : Integer := (for parallel X of Arr => Integer'Min(Integer'Last'Initial, X)); or Min : Integer := (for parallel X of Arr => Integer'Min(Initial(Integer'Last), X)); Maybe there is a better idea out there (neither of these look very promising), but whatever we do needs to look and feel like Ada. **************************************************************** From: Brad Moore Sent: Tuesday, June 13, 2017 9:21 AM > First, a bit of a rant. It's disconcerting that we get a wildly > different proposal at each meeting. There's no opportunity to get the > details right if we change the basic proposal completely between every > meeting. > > In Pittsburgh, we spent a significant amount of time figuring out the > right model for reducers. I see here that all of the conclusions of > that meeting have completely been discarded in favor of what appears > on first glance the reducer-de-jour. Why should we even spend time > trying to figure out this wildly different proposal when it seems > obvious that whatever we decide will be discarded all over again next time? I started with that proposal, but the more I tried to apply it to real problems, the more problems I saw. To describe the problem in a nutshell, for parallel reduction problems there are two mostly independent algorithms to be expressed, which I call the "what" and the "how" The "what" answers the question: What is the application doing? - This includes the application logic that mostly remains the same regardless whether parallelism is being applied or not. It also includes an expression of the desire that the application code should be optimized to run in parallel. The "how" answers the question: How is the application code implemented? e.g. How does the optimization such as parallelism happen? How is chunking performed, how are the results being reduced and combined? With parallelism, this is complex enough that it can be viewed as a separate algorithm being overlaid on top of the application's algorithm. I found that syntax that tries to express both "how" and "what" ends up being messy, and too constraining. One problem I could see, is that by using array syntax to hold chunk results, implies that there is an array of chunk results that exists in time that can be iterated over and examined. In the parallelism libraries I've been working with, that isn't the case, the chunk results come and go and get combined during the computation. There typically never is a point in time where all the chunk results exist. Having an array of chunk results implies that the reduction does not occur until after the looping has completed for all the chunks. This is eliminating a parallelism possibility where the reduction occurs in parallel during the computation of the loop, rather than sequentially after the looping of all the chunks is complete. There are some algorithms where reducers can be written that depend on the reduction occurring in parallel during the loop processing. If we look at the example from the minutes, to sum the elements of an array we have; My_Sum := 0; parallel [(Num_CPUs*2)] Sum : array (<>) of Integer := 0; for I in 1 .. 1_000_000 loop Sum (<>) := @ + A(I); then for S of Sum loop My_Sum := @ + S; end loop; end parallel loop; The middle loop expands into two loops: for Chunk_Index in Sum’Range loop for I in Split(1..1000000, Chunk_Index, Num_CPUs*2) loop Sum(Chunk_Index) := @ + A(I); end loop; end loop; The parallel array of Sum, the "then" keyword, and the explanation for the middle loop demonstrates the intent that the reduction occurs after the main loop processing. The point of parallelism is to make the applications logic run as fast as possible. Why are we attempting to go with syntax that is injecting sequential processing when it isn't needed? It would be better more parallelism can be applied, if it is possible, and it is. Having an array of chunk results also implies that the chunking is more static in nature, suggesting that the computation of chunks occurs before the loop is executed, which rules out other implementation possibilities where the chunking is more dynamic in nature, such as some of the work stealing algorithms where chunks can be dynamically resized as workers become free, they steal work from other chunks being processed by other workers. Why are we choosing syntax that forces implementors to apply a limited range of possible approaches? If the chunking is computed before the loop begins executing, then it implies there can be quite a large number of chunk results, if there are a lot of loop iterations to be chunked, as the chunks should be smaller to allow for better load balancing between the cores. The number of chunks would be a calculation similar to; Number_of_Cores * x, where x is some factor based on the number of iterations. This could require larger storage needs, for the case where the reduction result is a larger data type such as a large matrix. With the parallelism libraries I have been working with, using dynamic chunking, the only needs to be a single chunk result for each core. These are just some of the problems I see with the design we came up with last meeting. I also find that there is a lot we expecting the reader to understand. - - The concept of the parallel arrays - That chunking is being applied - That each chunk is going to a different element in the array - That a second loop that looks like a nested loop is not actually nested, but is executed after the outer loop, etc Being able to express the above problem as an expression such as; (for I of Arr => <0> + I) Seems more useful, a lot simpler to parse for the reader, allows implementors more freedom, allows for better parallelism, etc It also seems like an expression is a better tool for the job, when it comes to reduction problems. The point of looping is to apply some processing multiple times. The point of an expression is to produce a result. I would argue that the point of reduction problems is mainly to produce a result. Looping is needed as part of the implementation to produce the result, but looping is not the main goal here. It is secondary. Further, this reduction expression syntax can be compatible with whatever parallel loop syntax we might decide to go with, including what we came up with at the last ARG meeting. This syntax raises the question about whether we even need reduction syntax to be combined with loop syntax however. > I want to at least feel like we are making progress toward a solution, > rather than starting from nearly ground zero with each proposal. End > of mini-rant. I feel like is considerable progress toward a solution. It's not that this is just another alternative that more or less accomplishes the same thing as what was proposed during the ARG meeting. To me, this feels like a new idea that is a significant improvement over what we had before. > ... >> Special syntax, "" is used to indicate where this feedback >> occurs, which also encloses the initial value to be used for the >> first iteration. > > Regardless of the merits of the proposal itself (which I haven't > thought about), I don't think angle brackets is going to fly as syntax > in Ada. There would be substantial confusion with relational > operators, which would likely make such text unparsable. (We discussed > this extensively with some of the assignment target proposals; it > didn't fly there and I don't see what could have changed in 8 months.) But we already have angle brackets in Ada, <> box notation appears in several places in the syntax. This is just another case of box, except that we stick a value in between. Tucker has already indicated several other possible syntax replacements here, so if we dont like the box notation, it shouldn't be too hard to come up with something else. I think we could potentially just use <> also, without providing an intial value, and just say that there has to be at least one iteration, and the intial value is the value of the first iteration. > You could use square or curly brackets for this, but in any case, such > syntax doesn't seem very Ada-like. Its hard to see how <> isnt Ada-like, since its already in Ada. **************************************************************** From: Randy Brukardt Sent: Tuesday, June 13, 2017 12:59 PM ... > One problem I could see, is that by using array syntax to hold chunk > results, implies that there is an array of chunk results that exists > in time that can be iterated over and examined. In the parallelism > libraries I've been working with, that isn't the case, the chunk > results come and go and get combined during the computation. There > typically never is a point in time where all the chunk results exist. Arrays in this context is just a syntax, there's no reason that it has to be materialized. So I find this concern mainly one about reading too much into the syntax. One hopes that the array *isn't* materialized, and the actual semantic rules allow that. However, we want the syntax to be as easy as possible to understand by a normal sequential programmer, and they certainly will understand the use of arrays in this context. ... > Being able to express the above problem as an expression such as; > > (for I of Arr => <0> + I) > > Seems more useful, a lot simpler to parse for the reader, allows > implementors more freedom, allows for better parallelism, etc This seems very limited to me, as it limits what can be parallelized with reductions to things that can be written as a single expression. It seems as thought people will end up writing helper functions (which necessarily reduce parallelism as the compiler can't see the contents) or extremely complex single expressions (hard to read, hard to debug). It also seems to indicate (along with the thought to add anonymous subprograms, lets, and the like) that some people want to force Ada into being a functional language. Since there is a vocal part of the Ada community that finds functional languages to be evil, that could be a very controversial direction. (I'm undecided on this point.) > > ... > >> Special syntax, "" is used to indicate where this feedback > >> occurs, which also encloses the initial value to be used for the > >> first iteration. > > > > Regardless of the merits of the proposal itself (which I haven't > > thought about), I don't think angle brackets is going to fly as > > syntax in Ada. There would be substantial confusion with relational > > operators, which would likely make such text unparsable. (We > > discussed this extensively with some of the assignment target > > proposals; it didn't fly there and I don't see what could have > > changed in > 8 months.) > > But we already have angle brackets in Ada, <> box notation appears in > several places in the syntax. This is just another case of box, except > that we stick a value in between. Umm, no. <> is a lexical symbol. If you try to put "something in between", you have (at least) three lexical symbols: "<" expression ">", and that has to be parsed to be understood. > Tucker has already indicated several other possible syntax > replacements here, so if we dont like the box notation, it shouldn't > be too hard to come up with something else. > > I think we could potentially just use <> also, without providing an > intial value, and just say that there has to be at least one > iteration, and the intial value is the value of the first iteration. Almost anything is better than "". Maybe not "+expr+". :-) **************************************************************** From: Brad Moore Sent: Thursday, June 15, 2017 2:04 AM >> One problem I could see, is that by using array syntax to hold chunk >> results, implies that there is an array of chunk results that exists >> in time that can be iterated over and examined. In the parallelism >> libraries I've been working with, that isn't the case, the chunk >> results come and go and get combined during the computation. There >> typically never is a point in time where all the chunk results exist. > > Arrays in this context is just a syntax, there's no reason that it has > to be materialized. So I find this concern mainly one about reading > too much into the syntax. One hopes that the array *isn't* > materialized, and the actual semantic rules allow that. Its not just about the storage layout though. If it looks like an array, people will tend to use it that way. This includes doing slices, copying, bit operations (if its an array of bits), etc I can imagine cases where one will want to iterate through the array initially after loop processing to perhaps do intermediate processing and adjustments to the values before applying the reduction. Nested functions may be needed to do this. For example, one might want to apply data smoothing the values. The more one tries to use this as an array, I think the danger is the more that ends up having to be the implementation. > However, we want the syntax to be as easy as possible to understand by > a normal sequential programmer, and they certainly will understand the > use of arrays in this context. That part is pretty easy to understand, since it looks like existing Ada. However, the chunking, the meaning of (<>), and the overall construct, I would think will be quite foreign to sequential programmers, and existing Ada programmers. > ... >> Being able to express the above problem as an expression such as; >> >> (for I of Arr => <0> + I) >> >> Seems more useful, a lot simpler to parse for the reader, allows >> implementors more freedom, allows for better parallelism, etc > > This seems very limited to me, as it limits what can be parallelized > with reductions to things that can be written as a single expression. > It seems as thought people will end up writing helper functions (which > necessarily reduce parallelism as the compiler can't see the contents) > or extremely complex single expressions (hard to read, hard to debug). Similar to the rewriting of array syntax, I would think the compiler could also do some rearranging of at least certain helper functions, such as inline functions, to further optimize what actually gets generated by the compiler, if it is needed. > It also seems to indicate (along with the thought to add anonymous > subprograms, lets, and the like) that some people want to force Ada into > being a functional language. Since there is a vocal part of the Ada > community that finds functional languages to be evil, that could be a very > controversial direction. (I'm undecided on this point.) I think an important point to consider is whether this syntax is appealing enough to add to Ada. It seems to be useful for other things than parallelism, for example it is convenient to be able to express factorial as an expression function, where parallelism likely isn't needed or wanted. It doesn't necessarily need to be the only way to write reductions, just as quantified expressions can also be expressed in Ada in other ways, such as via looping and function calls. We can also have a parallel loop solution, such as the Pittsburgh proposal as well. I think this important to discuss the parallel reduction syntax however, because a) If we do want to support this, this likely will impact the syntax proposal for parallel loop reductions b) Even if we decide to not support this, (at least not now), it offers some ideas that still could impact the parallel loop proposal. In particular, I'm referring to the idea for the Reducer aspect, and Identity aspect that can be applied to function declarations. This is useful to separate the "what" from the "how" as I described in my previous email, where the "what" syntax shows what application code to be parallelized, and the "how" syntax which is separate, provides details on how this is to happen. The reducer for "+" is always going to be "+" itself, so why force the programmer to write this out every time one want to add numbers in parallel, this can be baked into the standard in the form of aspects applied to the function declarations? This could eliminate the need for both the "then" clause as well as the need for the parallel array, which could result in a much simpler syntax construct involving only adding the work parallel to a loop. This also makes sense because adding parallelism I think in a lot of cases will be a later step in the development process of an application - The optimisation phase. In that case you have existing sequential code to be parallelized, and it is desirable to have to make minimal changes to the code to get the parallelism. Further, one might find that adding the parallelism doesn't help performance, or may even make things worse, in which case the programmer will want to back out those changes. It would be desirable to be able to undo changes with minimal effort. For example, just adding or removing the keyword "parallel" would be appealing. If we take the sequential code: declare Sum : Integer := 0; begin for I in 1 .. 1_000_000 loop Sum := @ + I; end loop; Put_Line ("Sum:" & Integer'Image (Sum)); end; And convert it to parallel just by adding the parallel keyword declare Sum : Integer := 0; begin for I in parallel 1 .. 1_000_000 loop Sum := @ + I; end loop; Put_Line ("Sum:" & Integer'Image (Sum)); end; I think that is closer to what programmers would want to be able to do. We could say that as written, the compiler would reject this, because the global aspect could indicate this is a data race for parallel operations. We could then have a Boolean aspect, "Reduction" that can be applied to a variable declaration that indicates that variable involves parallel reduction. This would then allow the compiler to accept the loop, and provides better feedback to the user that parallel reduction is occurring. i.e. declare Sum : Integer with Reduction := 0; begin for I in parallel 1 .. 1_000_000 loop Sum := @ + I; end loop; Put_Line ("Sum:" & Integer'Image (Sum)); end; The compiler and the reader can see that the operation being applied to the global result is function "+" (L, R : Integer) with Identity => 0; Both can also see that this is also a reducer function because it accepts two parameters of the same type, and returns a result of that same type, and because it has the Identity aspect. So this shows how the results of the chunks are combined. The compiler now has the necessary inputs to be able to implement the parallelism, and the reader who is mostly only concerned with the application logic can see that the code is to run in parallel, and in many cases, the programmer will not care how this happens, but if he does he can look up the reducer function for the loop, as well as the identity value for that function. I would think the sequential programmer which prefer to see the above solution for parallel loops, over the syntax proposed in Pittsburgh. This also is more harmonious with the parallel reduction expression syntax as currently written in the AI. Note that writing as a reduction expression, isn't a data race because the result is computed before it is assigned to the global value, so in that case, the Reduction aspect would not be needed on the Sum declaration, though it could be optionally added as a confirming aspect. i.e. the above code could be replaced with; declare Sum : constant Integer := (for I in parallel 1 .. 1_000_000 => <> + I); begin Put_Line ("Sum:" & Integer'Image (Sum)); end; or even just; Put_Line ("Sum:" & Integer'Image (for I in parallel 1 .. 1_000_000 => <> + I)); **************************************************************** From: Brad Moore Sent: Saturday, June 17, 2017 11:53 PM I think we can further simplify and clarify the Reduction Expression syntax, and at the same time eliminate some problems. Specifically, we can eliminate the Reducer aspect and Identity aspect by saying that the top level operation of the reduction expression must be a reducer function. (i.e. a function that takes two arguments of the same type, and returns a result of that same type) In that case, we would not need to know of an identity value. Rather, the result of the first iteration of any chunk effectively becomes the initial value for that chunk. The accumulation back into the same result only occurs if there are at least two iterations for a chunk. The expression inside the box is then not the initial value, but instead is the default value, whose only purpose is to provide the result of the expression if the reduction expression iterates over a null range, i.e. 0 iterations. A <> with no initial value would signify that there is no default value for the expression, which implies that an exception would be raised if a null range were specified. This actually is needed for operations such as set intersection, where the intersection of 0 sets is not defined mathematically, or at least difficult to produce a value for, if we wanted it to mean the universe. consider: We want to find the intersection of animals that can be found in a set of zoos. Set_Array(Vienna_Zoo).Include("Gorilla"); Set_Array(Vieena_Zoo).Include("Giraffe") ... Set_Array(Calgary_Zoo).Include("Giraffe"); Set_Array(Calgary_Zoo).Include("Buffalo"); Common_Animals : constant Set := (for Set of Set_Array => Intersection(<>, Set)); It would be pretty much impossible to create an initial value for a zoo that contains all known animals. Since that case is only needed when there are zero zoos being examined, we do not need to supply that value. Also in this case, the programmer knows that there are at 1 or more zoos in the array, so the default value will not be needed, and and exception would not be raised. Probably it would make sense that the exception raised would be Constraint_Error, if one did try to iterate with a Reduction expression that did not provide a default value. This solves another problem. Yesterday Erhard brought up the example of matrix to the power of N. Now that the Identity aspect is not needed, the Identity Matrix is only needed as a default value for the reduction expression. This can now be trivially supplied by the programmer by a function call. eg. Matrix_To_The_Nth_Power : constant Matrix := (for I in 1 .. N => * A; Where, Identity (A) is a function of a Matrix abstraction that returns the Identity matrix for the matrix, A. If we know that N is of subtype Positive, we can further simplify this to, Matrix_To_The_Nth_Power : constant Matrix := (parallel for I in 1 .. N => <> * A; Since, the Identity matrix would not be needed to evaluate the expression. You may ask, what about the case in the previous examples as currently written in the AI where the top level operation is not a reducer function. The answer is that it should be trivial to rewrite the expression so that a reducer function is used. Instead of; Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => Append(<>, Letter)); One could write; Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => <> & To_Unbounded_String ("" & Letter)); Does this seem like a way forward? **************************************************************** From: Brad Moore Sent: Sunday, June 18, 2017 12:26 AM Another problem that would benefit from this simplication is bitwise "and" for modular integer types. It otherwise would be difficult to produce an initial value for modular integer types where the modulus is not a power of two. Attempting to set all bits to 1 could result in a value that is outside the range of values for the type. With this simplification, it should not be a problem anymore, since we are only "and"ing together supplied values that are all valid, and the result will also be valid and belong to the subtype. eg. (for Value of Arr => <> & Value) **************************************************************** From: Tucker Taft Sent: Sunday, June 18, 2017 2:02 AM I would rather leave it the way we discussed yesterday, as these changes in my view are making the construct more complex from the user’s point of view. It is easy enough for the compiler to recognize special cases. It is a pain for the user to have to obey additional restrictions. **************************************************************** From: Brad Moore Sent: Sunday, July 9, 2017 1:23 AM I have been thinking about iteration schemes, for for loops, and I think there are 5 that we'd want to support in Ada. By iteration scheme, I am referring to the combination of chunking and iteration direction. The 5 are; - Forward Sequential - Reverse Sequential - Forward Parallel - Reverse Parallel - All Parallel - Forward Sequential Already exists in Ada. (A for loop without the reverse keyword) Sequential and Forward are both implicit. Semantically, this is a single chunk, executed by a single tasklet, with iteration from lowest to highest. - Reverse Sequential Already exists in Ada. (A for loop with the reverse keyword) Sequential is implicit, Reverse is explicit. Semantically, also a single chunk, executed by a single tasklet, with iteration from highest to lowest. - Forward Parallel Forward is implicit, parallel is explicit. The for loops with the parallel keyword we were discussing in Vienna Semantically, multiple chunks, each corresponding to a tasklet where iteration within each chunk is from lowest to highest. - Reverse Parallel Both Reverse and Parallel are explicit. A for loop with both the parallel keyword, and the reverse keyword. Semantically, multiple chunks, each corresponding to a tasklet where iteration within each chunk is from highest to lowest. - All Parallel Both all and Parallel are explicit. A for loop with the parallel keyword, and the "all" keyword. Semantically, each iteration is a separate chunk, or alternatively viewed as there not being any chunks. There is no order to the iteration, any iteration can occur before any other. The case for reverse parallel would admittedly be quite rare, just as Reverse Sequential loops are quite rare today in Ada, however it is not hard to come up with an example. Consider finding the highest index of an array whose element satisfies some property of interest. If chunking is involved, it makes sense to want to search in reverse order, because once you've found a candidate, you know that there is no need to search further within the chunk, and processing can be abandoned. With a forward sequential loop, one cannot be sure that they've found a candidate without searching the entire chunk. Eg. -- Note: We do not want exit to pro-actively terminate processing in -- other chunks here. We just want to exit processing for the -- current chunk if any of the exit conditions hold true. -- Updater is a protected object in this example. -- Find maximum array index whose element satisfies some property for I in reverse parallel Arr'Range loop -- This first check causes other chunks to stop processing -- if a higher chunk has already found a result if I < Updater.Value then exit; -- This second check abandons the current chunk if a -- result is found, since any other finds in the same -- chunk will have a lower index elsif Satisfies (Arr (I)) then Updater.Update (I); exit; end if; end loop; Put_Line ("The answer is" & Integer'Image (Updater.Value)); The all parallel case is useful when the programmer wants to do manual chunking. This is useful for instance in problems such as generating a cumulative array sum of another array, as two loops are needed and generally both loops should be chunked the same way. Another useful case for all parallel loops is when one wants to mix sequential processing within the parallel processing, for instance if one wants to output some results at the end of each iteration through the chunks. Consider the following example to compute a 7 day weather forecast for various cities in the world, as well as an average of all these values for each day. declare type Cities is (London, Tokyo, New_York, Beijing, Calgary, Paris, Los_Angeles); Today : constant Calendar.Time := Calendar.Clock; Forecast : array (Cities) of Float := (others => 0.0); Done_Parallelism, Done_Sequential : Synchronous_Barrier (Release_Threshold => Forecast'Length); function Max_Temp (City : Cities; Day : Calendar.Time) return Float is begin ... -- Presumably some complex computer intensive calculation.... return Result; end Max_Temp; Notified : Boolean := False; use type Calendar.Arithmetic.Day_Count; begin Day_Loop : for Day in 1 .. 7 loop -- Compute 7 day forecast City_Loop : for City in all parallel Cities'Range loop -- Calculated in parallel for each city Forecast (City) := Max_Temp (City, Today + Calendar.Arithmetic.Day_Count (Day)); Synchronous_Barriers.Wait_For_Release (Done_Parallelism, Notified); -- Transitioning from parallel to sequential if Notified then Output_Loop : for City in Cities'Range loop Put (Cities'Image (City) & Float'Image (Forecast (City)) & ' '); -- Note use of parallel reduction expression Put ("World Avg" & Float'Image ((for Temp of parallel Forecast => <0.0> + Temp) / Float (Forecast'Length))); end loop Output_Loop; end if; -- Transitioning from Sequential back to Parallel Synchronous_Barriers.Wait_For_Release (Done_Sequential, Notified); end loop City_Loop; end loop Day_Loop; end; The synchronous barriers are used to transition between parallel and sequential processing, and then back to to parallel processing for each day of the forecast. For this to work, each iteration in City_Loop must iterate in parallel with each other since the Release_Threshold of the barriers needs to be known in advance. The calculation of each forecast for each city for each day is done in parallel, but presumably, the weather calculations are such that one cannot proceed to the next day until all the weather forecasting is complete for the previous day for all the cities. I know in Vienna, we had taken a straw poll, which favored putting the keyword parallel before the word "for" in the for loop. While this is a useful data point, I think if we want to support "all parallel" loops, I find it makes a lot more sense to have all parallel together after the word "in" or "of". eg for I in all parallel 1 .. 10 loop or for Element of all parallel List loop rather than; all parallel for i in 1 .. 10 loop or parallel all for I in 1 .. 10 loop as it reads more like an English sentence, similar to reverse parallel loops. for I in reverse parallel 1 .. 10 loop instead of; parallel for I in reverse 1 .. 10 loop The other point that was not made in Vienna, is that there is a case for saying that the syntax for specifying iteration scheme should be grouped together in the same place in the syntax. Having parallel at the beginning of the loop statement, and reverse in the middle of the loop statement means you have to look in two places for the reader to determine the iteration scheme. To recap some of my other arguments that were made in Vienna, which might resonate better with the above considerations 1. for ... loop identifies the control construct which is the most important part of the syntax. That the iterations occur in parallel, reverse, forward, etc is secondary, which suggests that parallel is not the most important part of the syntax and therefore should not appear first. 2. This point was not specifically made in Vienna, but the parallel block is also a control structure, so in that case, it makes sense that parallel is the first word. 3. In Vienna, we decided that since parallel for loops with parallel at the front of the syntax would have some ambiguity with parallel blocks, we felt we had to add the Begin keyword to the parallel blocks. I did not like that we were being forced to alter our other syntax on account of the choice of parallel keyword placement in for loops. In the examples I have used I have found that any declarations typically need to have greater scope than the parallel block, so it is not clear to me that having a begin declaration area as part of the parallel block is particularly useful. Comments? **************************************************************** From: Tucker Taft Sent: Sunday, July 9, 2017 8:08 AM ... > Both Reverse and Parallel are explicit. > A for loop with both the parallel keyword, and the reverse keyword. > Semantically, multiple chunks, each corresponding to a tasklet where > iteration within each chunk is from highest to lowest. > Given where we are placing the “parallel” word in the latest proposal discussed by the ARG (at the beginning), there seems no particular reason to disallow “reverse.” But I don’t find your example very compelling: ... > Consider finding the highest index of an array whose element > satisfies some property of interest. If chunking is involved, it makes > sense to want to search in reverse order, because once you've found a > candidate, you know that there is no need to search further within the > chunk, and processing can be abandoned. … > > Eg. > > -- Note: We do not want exit to pro-actively terminate processing in > -- other chunks here. We just want to exit processing for the > -- current chunk if any of the exit conditions hold true. > -- Updater is a protected object in this example. > > -- Find maximum array index whose element satisfies some property > > for I in reverse parallel Arr'Range loop > > -- This first check causes other chunks to stop processing > -- if a higher chunk has already found a result > if I < Updater.Value then > exit; > > -- This second check abandons the current chunk if a > -- result is found, since any other finds in the same > -- chunk will have a lower index > elsif Satisfies (Arr (I)) then > Updater.Update (I); > exit; > end if; > end loop; > > Put_Line ("The answer is" & Integer'Image (Updater.Value)); The semantics for exiting a loop early as currently proposed is that if any iteration invokes an “exit,” all other iterations are abandoned. So you wouldn’t end up with the highest index overall. You would end up with the one that is the highest within its chunk, but of which chunk would be somewhat random. And along the way you have added a synchronized operation in each iteration (if I < Updater.Value) which seems highly likely to slow things down significantly. One place where "reverse parallel" definitely makes sense is in the proposed map/reduce construct, where the “parallel” is really about doing the “map” part in parallel, and then doing the “reduction” part in some kind of parallel tree, preserving order. So “reverse” has a well-defined meaning. I don’t see “reverse" having much meaning in the more conventional parallel iteration schemes. It just produces a bit of a bias within each chunk, but has no particularly well-defined overall semantic effect. ... > - All Parallel > Both all and Parallel are explicit. > A for loop with the parallel keyword, and the "all" keyword. > Semantically, each iteration is a separate chunk, or alternatively > viewed as there not being any chunks. There is no order to the > iteration, any iteration can occur before any other. I would say using “all” to signify a universal quantified expression as well as this new idea is asking for confusion. ... > The all parallel case is useful when the programmer wants to do manual > chunking. This is useful for instance in problems such as generating a > cumulative array sum of another array, as two loops are needed and > generally both loops should be chunked the same way. ... > The synchronous barriers are used to transition between parallel and > sequential processing, and then back to to parallel processing for > each day of the forecast. For this to work, each iteration in > City_Loop must iterate in parallel with each other since the > Release_Threshold of the barriers needs to be known in advance. The > calculation of each forecast for each city for each day is done in > parallel, but presumably, the weather calculations are such that one > cannot proceed to the next day until all the weather forecasting is complete > for the previous day for all the cities. I guess I understand the semantics, but it just feels too complicated to me. For something like this, two separate parallel loops, or an explicit array of tasks might be more appropriate. Trying to synchronize across tasklets is going to be complex, and the semantics will be subject to a lot of caveats, I suspect. Tasklets can generally use run-to-completion semantics, and that clearly won’t work here with barriers in the middle. If you want a barrier, you can end the parallel loop, and then start a new one, which creates a natural, highly visible, easy-to-understand barrier. > I know in Vienna, we had taken a straw poll, which favored putting the > keyword parallel before the word "for" in the for loop. While this is > a useful data point, I think if we want to support "all parallel" > loops, I find it makes a lot more sense to have all parallel together > after the word "in" or "of”. … Personally I would rather not re-open this decision. We have already gone round and round on a number of these issues. The usual rule is that only someone who voted *for* the chosen approach can re-open the discussion. **************************************************************** From: Brad Moore Sent: Monday, July 10, 2017 2:24 PM >> … - Reverse Parallel >> Both Reverse and Parallel are explicit. >> A for loop with both the parallel keyword, and the reverse keyword. >> Semantically, multiple chunks, each corresponding to a tasklet where >> iteration within each chunk is from highest to lowest. >> > > Given where we are placing the “parallel” word in the latest proposal > discussed by the ARG (at the beginning), there seems no particular reason to > disallow “reverse.” But I don’t find your example very compelling: I will try to see if I can come up with some more compelling examples. But I agree that reverse parallel loops are not likely to be used much. It might be that the best use case is finding the max occurrence of something, similar to the example I provided, which is not a common problem, but might be common enough to allow the syntax to cover that case. ... > The semantics for exiting a loop early as currently proposed is that if any > iteration invokes an “exit,” all other iterations are abandoned. Not quite, the AI adds this note: "Note that aborting a tasklet need not be preemptive, but should prevent the initiation of further nested parallel blocks or parallel loops." So if the loop is running its chunks non-preemptively, then presumably the code would work as I suggested. I think there are two cases to consider. 1) A high number of iterations with small processing per iteration eg. (incrementing integer elements of a large array) 2) A small number of iterations with large processing per iteration. eg. (Calculating weather for a handful of cities) In case 1, the majority of time is being spent doing the iteration. I suspect that preemptive tasklet abortion in that case will introduce significant overhead and interfere too much with the parallelism. This is better suited for the non-preemptive case. In case 2, the actual iteration is insignificant, and there is much to be gained by preemptively aborting other tasklets. I can imagine we might want to give the programmer control of this decision via an aspect or pragma, such as Preemptive_Aborts but otherwise let the compiler decide how to deal with such transfer of control. > One place where "reverse parallel" definitely makes sense is in the proposed > map/reduce construct, where the “parallel” is really about doing the “map” part > in parallel, and then doing the “reduction” part in some kind of parallel tree, > preserving order. Agreed. ... >> - All Parallel >> Both all and Parallel are explicit. >> A for loop with the parallel keyword, and the "all" keyword. >> Semantically, each iteration is a separate chunk, or alternatively >> viewed as there not being any chunks. There is no order to the >> iteration, any iteration can occur before any other. > > I would say using “all” to signify a universal quantified expression as well > as this new idea is asking for confusion. I dont think this is a real problem, just as it isn't for using "all" for dereferencing access objects, because the surrounding context is different. As in english, "all" typically describes what follows, for all X ... -- meaning for all values of X for X in all parallel .... -- meaning for all parallel iterations Also Quantified expressions and Loops statements appear in very different contexts in Ada code. > ... >> The synchronous barriers are used to transition between parallel and >> sequential processing, and then back to to parallel processing for >> each day of the forecast. For this to work, each iteration in >> City_Loop must iterate in parallel with each other since the >> Release_Threshold of the barriers needs to be known in advance. The >> calculation of each forecast for each city for each day is done in >> parallel, but presumably, the weather calculations are such that one >> cannot proceed to the next day until all the weather forecasting is complete >> for the previous day for all the cities. > > I guess I understand the semantics, but it just feels too complicated to me. > For something like this, two separate parallel loops, or an explicit array > of tasks might be more appropriate. Trying to synchronize across tasklets > is going to be complex, and the semantics will be subject to a lot of > caveats, I suspect. Tasklets can generally use run-to-completion semantics, > and that clearly won’t work here with barriers in the middle. If you want a > barrier, you can end the parallel loop, and then start a new one, which > creates a natural, highly visible, easy-to-understand barrier. To do this as two separate loops, you need to store much more state. In this case, you'd need a two dimensional array and store all the values for all days of the forecast. Using barriers, you only need to track the current day forecast, so a single dimensional array works here. There are other similar problems that benefit from the use of barriers. One is the n-body problem, where the masses of heavenly bodies and direction of movement are tracked so that the orbital paths can be predicted. The technique is to use small increments of time and then update the location of all the heavenly bodies taking gravitational pulls into account, and the position of the heavenly bodies can then be output for each successive time increment. Using barriers, only the current location of each body needs to be tracked. Using your suggestion of two separate loops, one would have to store all the updated movements in memory before outputting the results. This could potentially and literally require an astronomical amount of storage. Using barriers really is the only reasonable solution for this, that I can see. Yet another example of this sort of problem is of linear algebra solving solving matrices using gaussian elimination. As each column is processed, barriers allow one to find the pivot point for the next column. I dont think I can imagine how to solve this by breaking into separate loops, not that I have tried, but it seems like it would be very difficult. I think using barriers is the best solution for that problem when solving in parallel. I also think in the AI we already talk about calling protected objects from tasklets. " calls by different tasklets of the same task into the same protected object are treated as different calls resulting in distinct protected actions; therefore synchronization between tasklets can be performed using non-blocking protected operations. Note that this is consistent with the current standard which already supports multiple concurrent calls by a single task in the presence of the asynchronous transfer of control capability (RM 9.7.4 (23))." So I think the current AI design suggests that barriers should work as I describe in the example. >> I know in Vienna, we had taken a straw poll, which favored putting >> the keyword parallel before the word "for" in the for loop. While >> this is a useful data point, I think if we want to support "all >> parallel" loops, I find it makes a lot more sense to have all >> parallel together after the word "in" or "of”. … > > Personally I would rather not re-open this decision. We have already gone > round and round on a number of these issues. The usual rule is that only > someone who voted *for* the chosen approach can re-open the discussion. I wouldn't ordinarily have suggested reexamining this issue, but I would hope that it should be oK for anyone to open a discussion if new ideas and arguments are brought to light that were not considered as inputs in the original decision. **************************************************************** From: Tucker Taft Sent: Monday, July 10, 2017 2:43 PM >>> ... >> The semantics for exiting a loop early as currently proposed is that if any >> iteration invokes an “exit,” all other iterations are abandoned. > > Not quite, the AI adds this note: > > "Note that aborting a tasklet need not be preemptive, but should > prevent the initiation of further nested parallel blocks or parallel loops." > > So if the loop is running its chunks non-preemptively, then presumably > the code would work as I suggested. That is certainly not something to be relied upon. And even if things run non-preemptively, there might be tasklets that have never been started, so certainly those would not start after an “exit.” > I think there are two cases to consider. > 1) A high number of iterations with small processing per iteration > eg. (incrementing integer elements of a large array) > 2) A small number of iterations with large processing per iteration. > eg. (Calculating weather for a handful of cities) > > > In case 1, the majority of time is being spent doing the iteration. I > suspect that preemptive tasklet abortion in that case will introduce > significant overhead and interfere too much with the parallelism. This > is better suited for the non-preemptive case. > > In case 2, the actual iteration is insignificant, and there is much to > be gained by preemptively aborting other tasklets. > > I can imagine we might want to give the programmer control of this > decision via an aspect or pragma, such as Preemptive_Aborts but > otherwise let the compiler decide how to deal with such transfer of > control. This is adding yet more complexity to the model. And as pointed out above, if you have more tasklets than cores, some of them might not be started at all before the “exit." >>> ... >>> >> I guess I understand the semantics, but it just feels too complicated to >> me. For something like this, two separate parallel loops, or an explicit >> array of tasks might be more appropriate. Trying to synchronize across >> tasklets is going to be complex, and the semantics will be subject to a lot >> of caveats, I suspect. Tasklets can generally use run-to-completion >> semantics, and that clearly won’t work here with barriers in the middle. >> If you want a barrier, you can end the parallel loop, and then start a new >> one, which creates a natural, highly visible, easy-to-understand barrier. > > To do this as two separate loops, you need to store much more state. You should factor in the amount of space needed to store the stacks of all of the temporarily suspended tasklets. You are requiring that all but one tasklet be suspendible until they all reach the barrier. **************************************************************** From: Randy Brukardt Sent: Monday, July 10, 2017 3:16 PM Just a few words on this, because I don't think it pays to get bogged down in detail... ... > So if the loop is running its chunks non-preemptively, then presumably > the code would work as I suggested. > > I think there are two cases to consider. > 1) A high number of iterations with small processing per iteration > eg. (incrementing integer elements of a large array) > 2) A small number of iterations with large processing per iteration. > eg. (Calculating weather for a handful of cities) This is the crux of the issue. There clearly are multiple termination approaches that make sense for a parallel loop, and those are mostly independent of the form of the loop itself. (And there's more than two!) Preemptive vs. not preemptive Order significant vs. not significant Specifically, one possibility is the proposed rule that an exit of any tasklet terminated the others. Another possibility is that the exit effectively works the same as a sequential loop -- meaning that an exit terminates any tasklets with indicies above the exiting one, and the tasklets with indicies below continue (this would work well for the search application among others). Since the latter is likely to have a higher overhead, one probably would want the option to select the termination mechanism. There's other details that might be worth controlling. My overall point is that it is unlikely that the details could be controlled solely with keywords, there are just too many. We need something like an aspect specification for these loops so that the details can be controlled without introducing loads of keywords (which will be hard to remember). And the default should be that semantics are as close as reasonable to sequential loops; many cases can use simpler termination but I think the default should be fewest surprises. ... > >> I know in Vienna, we had taken a straw poll, which favored putting > >> the keyword parallel before the word "for" in the for loop. While > >> this is a useful data point, I think if we want to support "all > >> parallel" loops, I find it makes a lot more sense to have all > >> parallel together after the word "in" or "of". . > > > > Personally I would rather not re-open this decision. We > have already gone round and round on a number of these issues. The > usual rule is that only someone who voted *for* the chosen approach > can re-open the discussion. > > I wouldn't ordinarily have suggested reexamining this issue, but I > would hope that it should be oK for anyone to open a discussion if new > ideas and arguments are brought to light that were not considered as > inputs in the original decision. There are always going to be "new ideas" for an AI like this one that it at the bleeding edge. Allowing unlimited reopening only ensures that nothing ever will be decided and no progress ever will be made. That's certainly the case for this AI in the last year or so. Keep in mind that to stay on schedule this thing has to be mostly frozen in little over a year. Going back to square one for every decision ensures that there is almost no chance of that happening. (The only reason IMHO that we didn't completely change parallel blocks at this last meeting is that we barely talked about them.) **************************************************************** From: Brad Moore Sent: Monday, July 11, 2017 9:18 AM >>>> ... >>> The semantics for exiting a loop early as currently proposed is that if any >>> iteration invokes an “exit,” all other iterations are abandoned. >> >> Not quite, the AI adds this note: >> >> "Note that aborting a tasklet need not be preemptive, but should >> prevent the initiation of further nested parallel blocks or parallel loops." >> >> So if the loop is running its chunks non-preemptively, then >> presumably the code would work as I suggested. > > That is certainly not something to be relied upon. And even if things run > non-preemptively, there might be tasklets that have never been started, so > certainly those would not start after an “exit.” I think if we did support some sort of non-preemptive mode, probably the most sensible behaviour for that mode is to just let the tasklets complete on their own, without any attempt to shut them down midstream. Any tasklets that hit the exit, or other transfer of control would terminate early, but otherwise tasklets generated at a POP would run to completion. ... >> I can imagine we might want to give the programmer control of this >> decision via an aspect or pragma, such as Preemptive_Aborts but >> otherwise let the compiler decide how to deal with such transfer of >> control. > > This is adding yet more complexity to the model. And as pointed out above, if > you have more tasklets than cores, some of them might not be started at all > before the "exit." I think there are a number of controls that the real time community might want, for example. I think all of these should be the subject of a different AI though, as they can be added later on their own merit, including an explicit non-preemptive mode for tasklet termination, which I think should probably allow starting after an "exit", (or continuing after an exit if tasklets get initially created to cover the full iterations of the for loop, which tends to be how I look at things.) ... >>>> >>> I guess I understand the semantics, but it just feels too complicated to me. >>> For something like this, two separate parallel loops, or an explicit array >>> of tasks might be more appropriate. Trying to synchronize across tasklets >>> is going to be complex, and the semantics will be subject to a lot of >>> caveats, I suspect. Tasklets can generally use run-to-completion >>> semantics, and that clearly won’t work here with barriers in the middle. >>> If you want a barrier, you can end the parallel loop, and then start a new >>> one, which creates a natural, highly visible, easy-to-understand barrier. >> >> To do this as two separate loops, you need to store much more state. > > You should factor in the amount of space needed to store the stacks of all > of the temporarily suspended tasklets. You are requiring that all but one > tasklet be suspendible until they all reach the barrier. Yes, but that storage is bounded, and the programmer can always do manual chunking if necessary to to bring the number of chunks down to the number of available cores, or a small multiple of that. The simulation can run indefinitely in theory without having to worry about running out of storage. This tasklet stack storage becomes insignificant for the n-body problem, compared to the real bounds of N * state, where N is the number of bodies, and state is the amount of state variables for each body. With separate loops however, the storage needed for the simulation however would be unbounded, in the sense that the longer the simulation runs, the more storage is needed, and you need much more of it. The needs become N * state * time-increments, where time-increments are the number of simulation steps that are being computed. The notion of running a simulation for an unspecified amount of time becomes problematic, because of the risk of running out of storage. I think this barrier pattern could be commonly applied to a lot of simulation/modeling scenarios. **************************************************************** From: Brad Moore Sent: Tuesday, July 11, 2017 9:45 AM > This is the crux of the issue. There clearly are multiple termination > approaches that make sense for a parallel loop, and those are mostly > independent of the form of the loop itself. (And there's more than > two!) > > Preemptive vs. not preemptive > Order significant vs. not significant In the gang of four, we had discussed the idea of declaring a local subtype for a loop iterator, and then applying aspects to that declaration to provide specific controls. eg. declare subtype Loop_Iterator is Integer range 1 .. 100000 with Executors => 10, Chunk_Size => 100, Load_Sensitivity, Preemptive_Termination => False; begin parallel for I in Loop_Iterator'range loop .... end loop; end; Adding these sort of specific controls should be a separate AI, and can be added later independent of this AI. We could say that the "all parallel" notion could be a similar aspect that is specified this way, but to me at least it seems like this would be an inconsistency, since all other iteration schemes are built right into the loop syntax. I think the "all parallel" scheme is useful enough that it ought to be visible at the same level as the parallel keyword, and the reverse keyword, regardless whether these two keywords appear adjacent to each other, or separated as we have coming out of Vienna. The question is how best to do this in the syntax. It seemed to fit nicely in with the version of the AI going into Vienna. It doesn't seem to fit very well with the version of the AI we'll have coming out of Vienna, but maybe we'll either live with that, or come up with a better idea. ... > Specifically, one possibility is the proposed rule that an exit of any > tasklet terminated the others. Another possibility is that the exit > effectively works the same as a sequential loop -- meaning that an > exit terminates any tasklets with indicies above the exiting one, and > the tasklets with indicies below continue (this would work well for > the search application among others). Since the latter is likely to > have a higher overhead, one probably would want the option to select > the termination mechanism. > > There's other details that might be worth controlling. > > My overall point is that it is unlikely that the details could be > controlled solely with keywords, there are just too many. We need > something like an aspect specification for these loops so that the > details can be controlled without introducing loads of keywords (which will be > hard to remember). I think the aspect specification like I described above would cover most of these, but "all parallel" feels like it should be more integrated with the loop syntax, for consistency, and importance for semantic behaviour. I think there are enough problems that would need to rely on these semantics. Most other controls such as number of executors, or chunk size do not really affect the semantics. They just impact the speed of the processing. > And the default should be that semantics are as close as reasonable to > sequential loops; many cases can use simpler termination but I think > the default should be fewest surprises. Agreed. **************************************************************** From: Tucker Taft Sent: Tuesday, July 11, 2017 9:55 PM >>> So if the loop is running its chunks non-preemptively, then >>> presumably the code would work as I suggested. >> >> That is certainly not something to be relied upon. And even if things run >> non-preemptively, there might be tasklets that have never been started, so >> certainly those would not start after an “exit.” > > I think if we did support some sort of non-preemptive mode, probably the most > sensible behaviour for that mode is to just let the tasklets complete on their > own, without any attempt to shut them down midstream. Any tasklets that hit > the exit, or other transfer of control would terminate early, but otherwise > tasklets generated at a POP would run to completion. ... I believe this is creating two significantly different semantics for a single syntactic construct, which is bad news in my view. You are heading toward “programming by pragma” (or "by aspect"), I fear. That is largely what OpenMP does, and I believe we should avoid it. **************************************************************** From: Randy Brukardt Sent: Wednesday, July 19, 2017 8:36 PM ... > > And the default should be that semantics are as close as reasonable > > to sequential loops; many cases can use simpler termination but I > > think the default should be fewest surprises. > > Agreed. (I think this is the only thing that I wrote that Brad agreed with, but don't look behind that curtain! :-) I had the thought that any loop exits have to be explicitly in the loop body (they can't be in some subprogram that is called). Thus, we could make the termination semantics depend upon that. (The same applies to AI12-0189-1, the loop body as anomymous procedure, BTW). The basic idea is that one could depend on the loop index value determined at the point of an exit; if the loop finishes normally or exits abnormally, you couldn't have such a dependence. Thus if there is no exit, the termination is a bit simpler (not much, really); if there is, we require the loop to get the same loop index as the similar sequential loop. You could use an exception for a non-local exit, but that is so ugly I think it would be OK to ignore it for the purposes of termination semantics. Thus a loop like the following would work in parallel: parallel for I in 1 .. Limit loop if DB(I).Name = Name_to_Find then My_PO.Save_Found_Index (I); -- Save the lowest index found. exit; -- Done looking. -- else continue end if; end loop; P.S. Unrelated aside: I have to wonder about the wisdom of considering atomic objects task-safe, since Ada doesn't have any form of atomic update. (The only safe atomic writing operation is initialization.) Almost any use of an atomic object would have a race condition. I originally wrote the above My_PO use as: if Result > I then Result := I; end if; with Result an atomic object initialized to Integer'Last (something I do sequentially all the time) -- but the above is a race condition and definitely is NOT task-safe. The same would be true for any other test of an atomic object followed by a write of that object -- but that is the natural sequential code. It seems better (given the current facilities) to insist on a PO here (since the natural code in a PO would do the right thing). **************************************************************** From: Randy Brukardt Sent: Wednesday, July 19, 2017 9:06 PM ... > >>> So if the loop is running its chunks non-preemptively, then > >>> presumably the code would work as I suggested. > >> That is certainly not something to be relied upon. And > even if things run non-preemptively, there might be tasklets that have > never been started, so certainly those would not start after an > "exit." > > > > I think if we did support some sort of non-preemptive mode, > probably the most sensible behaviour for that mode is to just let the > tasklets complete on their own, without any attempt to shut them down > midstream. Any tasklets that hit the exit, or other transfer of > control would terminate early, but otherwise tasklets generated at a > POP would run to completion. ... > > I believe this is creating two significantly different semantics for a > single syntactic construct, which is bad news in my view. You are > heading toward "programming by pragma" > (or "by aspect"), I fear. That is largely what OpenMP does, and I > believe we should avoid it. This is a real concern, but I have to wonder if we really have any choice. If our primary goal is (as I hope it is) to make it as easy as possible to convert sequential code into parallel code, the semantics have to be a close as possible to the equivalent sequential loop. That means that there can't be any erroneous execution from aborts, the loop index has be meaningful at a normal exit, and so on. That's going to be more expensive than the straight abort-everything-on-any-sort-of-completion (I'm skeptical that it would be much more expensive in usual cases, but I also know that I know just enough to be dangerous in this area). So people who need absolute performance may need some other semantics (including abort, for instance). I suspect that Brad's non-premptive semantics make the most sense for the default semantics for parallel loops (probably not for parallel blocks; the problems are very different). That means that anyone that wants preemptive semantics is going to have to use a pragma or aspect. But I don't think that is a huge problem semantically; the difference only matters if the loop has a bug already, or depends on the loop index at exit (in which case the preemptive semantics is plain wrong and we could even make it illegal to use it in that case as noted in my other message). Anyway, back to real work... **************************************************************** From: Brad Moore Sent: Thursday, July 20, 2017 12:16 AM > ... >> > And the default should be that semantics are as close as reasonable >> > to sequential loops; many cases can use simpler termination but I >> > think the default should be fewest surprises. >> >> Agreed. > > (I think this is the only thing that I wrote that Brad agreed with, > but don't look behind that curtain! :-) I disagree. :-) I'd have to check, but my overall sense was that I was agreeing with more of what you said. > I had the thought that any loop exits have to be explicitly in the > loop body (they can't be in some subprogram that is called). Thus, we > could make the termination semantics depend upon that. (The same > applies to AI12-0189-1, the loop body as anomymous procedure, BTW). > > The basic idea is that one could depend on the loop index value > determined at the point of an exit; if the loop finishes normally or > exits abnormally, you couldn't have such a dependence. Thus if there > is no exit, the termination is a bit simpler (not much, really); if > there is, we require the loop to get the same loop index as the similar sequential loop. > > You could use an exception for a non-local exit, but that is so ugly I > think it would be OK to ignore it for the purposes of termination semantics. > > Thus a loop like the following would work in parallel: > > parallel for I in 1 .. Limit loop > if DB(I).Name = Name_to_Find then > My_PO.Save_Found_Index (I); -- Save the lowest index found. > exit; -- Done looking. > -- else continue > end if; > end loop; Maybe I am not fully understanding this idea, or maybe these semantics are the ones I had in my mind all the time, and have not considered the alternate semantics that you might otherwise have assumed? > P.S. Unrelated aside: I have to wonder about the wisdom of considering > atomic objects task-safe, since Ada doesn't have any form of atomic update. > (The only safe atomic writing operation is initialization.) Almost any > use of an atomic object would have a race condition. I originally > wrote the above My_PO use as: > if Result > I then > Result := I; > end if; > with Result an atomic object initialized to Integer'Last (something I > do sequentially all the time) -- but the above is a race condition and > definitely is NOT task-safe. The same would be true for any other test > of an atomic object followed by a write of that object -- but that is > the natural sequential code. > > It seems better (given the current facilities) to insist on a PO here > (since the natural code in a PO would do the right thing). Actually, if I were to write this, I was thinking that a PO would be used to update a global volatile/atomic variable, but otherwise the global variable could be read/tested directly. Something like; parallel for I in 1 .. Limit loop if DB(I).Name = Name_to_Find and then I < Result then My_PO.Save_Found_Index (I); -- Save the lowest index found in Result. exit; -- Done looking. -- else continue end if; end loop; The extra check "and then I < Result" would potentially provide an improvement/short cut on performance, assuming it is cheaper to test an atomic/volatile integer, than to always have to access the PO. On a further aside, I recognise that one would ordinarily consider atomic as the safe choice for such a variable declaration, but I found that declaring the global variable to be volatile rather than atomic provided significant performance improvements. I'm not sure if that is just an artefact of the combination of compilers and target platforms I was using, or Ada semantics somehow demands this, but if the type of the object is such that it can be read atomically without putting memory fences or guards around the access, eg. such as via a single machine instruction, then I found I'd get consistently correct results, with improved performance, over declaring the global as atomic. Ada doesn't distinguish, but I wonder if there are certain data types that when declared to be volatile, are also atomic. We already know that variable declared to be atomic are also volatile in Ada. Or maybe there are certain volatile types that have some useful properties of atomic, but are somehow otherwise lighter-weight than full atomic? **************************************************************** From: Brad Moore Sent: Thursday, July 20, 2017 12:37 AM > I suspect that Brad's non-premptive semantics make the most sense for > the default semantics for parallel loops (probably not for parallel > blocks; the problems are very different). That means that anyone that > wants preemptive semantics is going to have to use a pragma or aspect. Not necessarily a pragma or aspect. See below. > But I don't think that > is a huge problem semantically; the difference only matters if the > loop has a bug already, or depends on the loop index at exit (in which > case the preemptive semantics is plain wrong and we could even make it > illegal to use it in that case as noted in my other message). I think the non-preemptive semantics probably also are less likely to introduce erroneous or unexpected results more generally. But that is just my overall sense. I agree here, that I think the non-preemptive behaviour might be the best default. Also, I am not advocating a programming by pragma view. The goal should be that for most cases (> 90-99%), no pragmas would be necessary. Note, we have currently have other cases in Ada where two very different semantics can be invoked with respect to preemptive and non-preemtive behaviour, and I think we should try to be consistent with those already existing syntax if possible. In particular, we have; requeue and requeue with abort I am now thinking that we could use a "with abort" on a parallel loop, to get the preemptive semantics, but otherwise the semantics would be non-preemptive. eg. parallel with abort for I in 1 .. limit loop Long_Calculation (I); end loop; Here, for example, an exception raised in Long_Calculation would be expected to preempt other parallel iterations involving the same call. This might also be a good way to tie in the "all" for the "all parallel" loops that I was describing in recent emails, with the current parallel keyword placement. a "with all", would imply that any iteration can be expected to potentially execute in parallel with any other iteration. (ie. no chunking) parallel with all for I in 1 .. limit loop Long_Calculation (I); end loop; **************************************************************** From: Randy Brukardt Sent: Thursday, July 20, 2017 1:32 AM ... > > Thus a loop like the following would work in parallel: > > > > parallel for I in 1 .. Limit loop > > if DB(I).Name = Name_to_Find then > > My_PO.Save_Found_Index (I); -- Save the lowest index found. > > exit; -- Done looking. > > -- else continue > > end if; > > end loop; > > Maybe I am not fully understanding this idea, or maybe these semantics > are the ones I had in my mind all the time, and have not considered > the alternate semantics that you might otherwise have assumed? I think you and I agree on this point, but Tucker seems to be using strict parallel block semantics for this. And that wouldn't work for this loop, since at the first exit, all of the other tasklets would be aborted (by the parallel block semantics). So it would be non-deterministic which exit was ultimately executed (assuming there is more than one). It's pretty clear to me that "exit" is not like an exception, and should be treated differently. (Not as certain about gotos, since ones clear out of the loop may or may not be like an exit.) A block doesn't have anything like exit (although it could be emulated with a goto). ... > Actually, if I were to write this, I was thinking that a PO would be > used to update a global volatile/atomic variable, but otherwise the > global variable could be read/tested directly. > > Something like; > > parallel for I in 1 .. Limit loop > if DB(I).Name = Name_to_Find and then I < Result then > My_PO.Save_Found_Index (I); -- Save the lowest index found in Result. > exit; -- Done looking. > -- else continue > end if; > end loop; > > The extra check "and then I < Result" would potentially provide an > improvement/short cut on performance, assuming it is cheaper to test > an atomic/volatile integer, than to always have to access the PO. A PO is supposed to be a monitor, and its rather abstraction breaking to put the monitored object outside of it. (Yes, you sometimes have to do that, but I don't recommend it if you don't have to do so.) > On a further aside, I recognise that one would ordinarily consider > atomic as the safe choice for such a variable declaration, but I found > that declaring the global variable to be volatile rather than atomic > provided significant performance improvements. I'm not sure if that is > just an artefact of the combination of compilers and target platforms > I was using, or Ada semantics somehow demands this, but if the type of > the object is such that it can be read atomically without putting > memory fences or guards around the access, eg. such as via a single > machine instruction, then I found I'd get consistently correct > results, with improved performance, over declaring the global as > atomic. Ada doesn't distinguish, but I wonder if there are certain > data types that when declared to be volatile, are also atomic. We > already know that variable declared to be atomic are also volatile in > Ada. > Or maybe there are certain volatile types that have some useful > properties of atomic, but are somehow otherwise lighter-weight than > full atomic? I think you just got lucky. My understanding is that fences are needed to ensure memory consistency across cores. So you can only see problems if you have two tasks running simultaneously on two different cores -- and the OS might try to avoid this particular situation. I don't think there are any memory reads or writes that work in that situation without a fence; besides, ALL atomic objects use a single instruction on most implementations. Volatile is definitely not enough for any data that needs to be synchronized between multiple tasks (or tasklets); certainly not from a language perspective. Only atomic or a PO synchronizes data (you can use one of those to synchronize volatile data). And I don't think experimental results mean much here, because some of these situations are pretty rare -- but Murphy would say that they'll only strike at the worst time. And Atomic isn't really enough, either, since real problems always require some sort of update. **************************************************************** From: Jeffrey Cousins Sent: Thursday, July 20, 2017 11:44 AM >I know in Vienna, we had taken a straw poll, which favored putting the >keyword parallel before the word "for" in the for loop. While this is a >useful data point, I think if we want to support "all parallel" loops, I >find it makes a lot more sense to have all parallel together after the word >"in" or "of". > eg > for I in all parallel 1 .. 10 loop > > or > > for Element of all parallel List loop Much as I sympathise with the logic of Brad’s argument, I think we should stick with what was agreed at Vienna for the time being, whilst we sort out what behaviour we want. Terminating loops early seems to be a particular sticking point. Only yesterday we had a case where one of our apps has M tasks checksumming N files, where N >> M. Currently when the abort button is clicked, no new files are checksummed, but checksumming the files already in progress is allowed to complete. (Non-pre-emptive). The user would like the current checksumming to be aborted immediately (pre-emptive; maybe he wants to checksum a different set of files a.s.a.p.). But I think it equally likely that there will be a case where finishing the current processing cleanly is the imperative. So I think we need to provide both options (somehow). Brad also wrote: >Note, we have currently have other cases in Ada where two very different >semantics can be invoked with respect to preemptive and non-preemtive behaviour, >and I think we should try to be consistent with those already existing syntax >if possible. > >In particular, we have; > requeue >and > requeue with abort > >I am now thinking that we could use a "with abort" on a parallel loop, to get >the preemptive semantics, but otherwise the semantics would be non-preemptive. > >eg. > >parallel with abort > for I in 1 .. limit loop > Long_Calculation (I); > end loop; No doubt someone will point out holes in the above, but I quite like the above suggestion. **************************************************************** From: Erhard Ploedereder Sent: Thursday, July 20, 2017 4:31 PM > If our primary goal is (as I hope it is) to make it as easy as > possible to convert sequential code into parallel code, the semantics > have to be a close as possible to the equivalent sequential loop. No. Particularly, if the syntax distinguishes the "new" semantics (or else, why introduce new language features?) Canonical example: > That means that .... the loop index has be meaningful at a normal exit, and so on. > parallel for I in 1 .. Limit loop > if DB(I).Name = Name_to_Find then > My_PO.Save_Found_Index (I); -- Save the lowest index found. > exit; -- Done looking. > -- else continue > end if; > end loop; NO, again, to the comment in the example. Consider an array with duplicates. Canonical semantics gives you the lowest index indeed. If you want that, do not write "parallel"! If you do write "parallel": A parallel loop surely should be allowed to give you some found index but whether it is the lowest is completely non-deterministic (I hope). (Execution-wise, I am open to both models of "last one wins" or "exit terminates the parallel execution".) **************************************************************** From: Randy Brukardt Sent: Thursday, July 20, 2017 6:26 PM > > If our primary goal is (as I hope it is) to make it as easy as > > possible to convert sequential code into parallel code, the > > semantics have to be a close as possible to the equivalent sequential loop. > > No. Particularly, if the syntax distinguishes the "new" > semantics (or else, why introduce new language features?) It's much easier to convert from one to the other if the semantics doesn't change much, if at all. Indeed, I'm not sure there is any point to having a parallel loop construct at all if this isn't true, because the existing Ada facilities are just fine (as demonstrated by Brad's libraries) for people who are already comfortable with writing parallel code. It's the majority of programmers who don't understand concepts like race conditions that need help. > Canonical example: > > That means that .... the loop index has be meaningful at a normal > > exit, and so on. > > > parallel for I in 1 .. Limit loop > > if DB(I).Name = Name_to_Find then > > My_PO.Save_Found_Index (I); -- Save the lowest index found. > > exit; -- Done looking. > > -- else continue > > end if; > > end loop; > > NO, again, to the comment in the example. Consider an array with > duplicates. Canonical semantics gives you the lowest index indeed. If > you want that, do not write "parallel"! So you think it is OK to make it impossible to write a significant proportion of real problems in parallel? You're essentially saying that this search has to run 4 (or 8 or whatever) times slower, or the programmer has to create their own solution using Ada tasks. That's insane. > If you do write "parallel": > A parallel loop surely should be allowed to give you some found index > but whether it is the lowest is completely non-deterministic (I hope). > > (Execution-wise, I am open to both models of "last one wins" > or "exit terminates the parallel execution".) I don't think that the cost of the model I'm suggesting is significantly different than either of these. (I again note that I know just enough about such things to be dangerous, so there might be some magic solution that I cannot imagine.) For each of the models, you have between 2 and 4 possible outcomes when a tasklet finishes a chunk (normal completion, exit, goto, exception; some of these might be treated the same in some models, but surely the first has to be treated differently in all models). So the only difference between a model that is index-aware and one that is not is that the indexes have to be known when the chunk finishes. They're obviously known when the chunk starts, so in the worst case, the starting index has to be saved with the other tasklet information. That's hardly a hardship, and the loop termination code is essentially the same in any of these models. Ergo, I don't see much reason to make the semantics non-deterministic. Non-determinism is evil; sometimes you can't avoid it, but it should never be employed if there is an option. And in this case, there is an option. **************************************************************** From: Brad Moore Sent: Friday, July 21, 2017 12:30 AM > I suspect that Brad's non-premptive semantics make the most sense for > the default semantics for parallel loops (probably not for parallel > blocks; the problems are very different). A parallel block is effectively just a parallel loop with a small number of iterations that has been unrolled. One should be able to convert source code from either construct to the other (at least for simpler loops with a small number of iterations). If anything, I see an even greater need for non-preemptive semantics for parallel blocks. A parallel loop is typically applying similar processing iteratively. A parallel block can also be that, (for example calculating Fibonacci recursively), but it can also be suitable for completely independent processing. In one branch you might want to exit the processing of that branch early, but (I think more typically) without affecting the processing in the other branches. eg. Note: I am not using the exact syntax that we arrived at in Vienna, only because I cannot remember exactly what it was, until I see the minutes. parallel exit when Do_A_Step1 = Failure; exit when Do_A_Step2 = Failure; exit when Do_A_Step3 = Failure; exit when Do_A_Step4 = Failure; Result := Do_A_Step5; and Do_B; end parallel -- Activity B has completed here, regardless whether Activity finished -- successfully. If one wants the processing for activity A to kill the processing for activity B, then I think the programmer ought to explicitly ask for those semantics. The default should be the more deterministic semantics. Again, the "with abort" clause seems like a consistent and natural way to do this. parallel with abort exit when Do_A_Step1 = Failure; exit when Do_A_Step2 = Failure; exit when Do_A_Step3 = Failure; exit when Do_A_Step4 = Failure; Result := Do_A_Step5; and Do_B; end parallel -- A failure in activity A, may have caused activity B to terminate early. **************************************************************** From: Brad Moore Sent: Thursday, July 20, 2017 12:40 AM > ? I know in Vienna, we had taken a straw poll, which favored putting > the ? keyword parallel before the word "for" in the for loop. While > this is a ? useful data point, I think if we want to support "all > parallel" loops, I ? find it makes a lot more sense to have all > parallel together after the word ? "in" or "of". >> >> eg >> for I in all parallel 1 .. 10 loop >> >> or >> >> for Element of all parallel List loop > > Much as I sympathise with the logic of Brad’s argument, I think we > should stick with what was agreed at Vienna for the time being, whilst > we sort out what behaviour we want. > Terminating loops early seems to be a particular sticking point. I have actually changed my view here. I am now in favour of having the parallel keyword for loops at the beginning, as we discussed in Vienna. The reason is that I find that if we do want to specify controls on the parallelism, such as "with abort", or "with all", or anything else we come up with, applying these to the keyword parallel is the best way I've seen so far to do this, in my opinion. If "parallel" is located in the middle of the loop syntax, I don't think this would work (too ugly and messy), but when parallel is at the beginning, this provides a "header" that describes the parallelism without interfering with the loop logic. **************************************************************** From: Randy Brukardt Sent: Thursday, July 20, 2017 1:22 PM > If anything, I see an even greater need for non-preemptive semantics > for parallel blocks. > > A parallel loop is typically applying similar processing iteratively. > A parallel block can also be that, (for example calculating Fibonacci > recursively), but it can also be suitable for completely independent > processing. In one branch you might want to exit the processing of > that branch early, but (I think more typically) without affecting the > processing in the other branches. Well, we don't have "exit" for blocks (it would be very incompatible); I suppose we could have it only for parallel blocks but that seems like asking for trouble with the forces of consistency. One could use a goto for that effect, but that seems less likely than an exit from a loop. So, in the block case I was thinking mainly about exceptions; and exceptions usually (not always) represent bugs. I don't think it makes much sense to continue buggy code -- indeed I thought that it made sense to use some sort of "terminate as soon as practical" semantics for exceptions in all cases. It's only exit (and similar gotos) that it makes sense to treat non-preemptively. (And I was thinking that "non-preemptive" simply meant that the tasklet ran to the next task dispatching point, not that it necessarily ran to completion. A queued and blocked tasklet might never finish otherwise, especially in the exception case.) ... > Again, the "with abort" clause seems like a consistent and natural way > to do this. > > parallel with abort > Yes, that syntax seems OK. Better than some your other recent ideas. ;-) **************************************************************** From: Tucker Taft Sent: Tuesday, July 25, 2017 4:36 PM I am quite strongly against having multiple interpretations for "exit" from a parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I have some convincing to do. The interpretation I recommend is that an “exit” or “return” (or “raise”) within a parallel loop causes all other iterations to be terminated as soon as possible. For a “return,” there is a “race” to see which iteration reaches a return first, and that is resolved using some kind of synchronization, and then only the “return” that “wins” is allowed to proceed to evaluate its return expression (presuming it is returning from a function). Similarly, if an iteration would propagate an exception, then again, a synchronization is performed against other operations that can terminate the loop, and the winner gets to determine the net effect, be it a simple exit, an exception, a return, or (if we permit it) a goto. Essentially any operation that would normally terminate a loop goes through a “two-phase” commit, where the first phase requests the “right” to decide the fate of the loop, and then the winner determines what happens next. I have not seen any convincing (to me!) argument to allow unstarted iterations to continue processing, as this seems to defeat one of the whole points of a parallel “divide-and-conquer” approach. I can see some arguments to not terminating iterations that are already executing, if there is danger of them leaving some object visible outside the iteration in an abnormal state. In ParaSail, if a loop includes an operation that could result in an early exit, it cannot manipulate a variable that is declared outside the loop unless it is protected. This doesn’t quite work in Ada since any parallel loop could have an iteration die due to the failure of a run-time check. But to some extent that is already a problem because we recognize that the failure of a run-time check can always leave certain variables in an abnormal state. For a parallel loop, a run-time check failure could leave any variable manipulated by any iteration in an abnormal state. But if you look at 11.6(6/3), even a “regular” loop that dies due to a failed run-time check can have the same effect, since the Ada RM has always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9 — aka RM 9(11)) and instruction re-ordering. So we can effectively ignore failed run-time checks, and only worry about explicit “raise” as relevant to this discussion. So what it effectively means is that the action of propagating a “raise” out of an iteration is just like a “return” or an “exit” in terms of having to “compete" to decide which terminating construct gets to determine the “fate” of the parallel loop. The bottom line to me is that we should terminate parallel loops as soon as possible, be it from an exit, return, raise (or goto or requeue for that matter), using a two-phase approach that first determines which one will actually decide the fate of the loop as a whole, and then a second step that actually does the deed. **************************************************************** From: Tuillo Vardanega Sent: Wednesday, July 26, 2017 7:43 AM I think I understand the semantics as Tuck proposes and I find it to make sense in the specific context of parallel computation. To me, wanting exactly the same source code to operate identically within and without a parallel construct requires writing it to that end, which means understanding the implications of the most demanding side. In a sense, this is no different (only more complex) than it was for concurrency: the code that you embed in concurrent threads must be amenable to concurrent semantics else problems occur. I suppose that what matters most here is that the programmer has the means (for syntax and semantics) to ensure that the externally visible effect of a parallel construct "stabilizes" before it can be used outside of it. To my reading, Tuck's "two-phase commit" notion does that quite fine, including the "as-soon-as-possible" part of it. **************************************************************** From: Brad Moore Sent: Thursday, July 27, 2017 12:26 PM > I am quite strongly against having multiple interpretations for "exit" from a > parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I > have some convincing to do. Unfortunately, I am not yet convinced. I think if we want to have only one interpretation for Exit, then that interpretation should be non-preemptive. Lets see if I can convince you. First I have some findings to report first however that should be of interest. I have added preemptive termination to most of my Paraffin libraries (except the Ravenscar ones) so that if any exception occurs, it preemptively terminates other tasklets. I found that adding this, (using asynchronous transfer of control), did not add any significant performance hit to any of the libraries with GNAT at least. This applies even to problems such as adding the sum of a large range of integers, which I had mentioned previously might be a case where preemptive abort might introduce significant overhead. So, for exceptions, I think it makes sense to have preemptive termination, generally. However, I think exit, return, and goto should be handled differently. Here are some arguments 1. I think it is quite a common problem to be searching for the first or last occurrence of something in an array or container. For example, consider an array or container of events stored in chronological order. I should think answering question such as; When is the first time that conditions X occurred?, or when was the last time that conditions Y occurred? I find it quite objectionable if we have to say, sorry, but you cannot do such searches efficiently in native Ada using parallelism. You should either use libraries such as Brad Moore's Paraffin libraries, or call out to another language such as C or C++ and write that code in OpenMP. If Exit is preemptive, then we cannot continue searching in other tasklets to find better or correct answers. We effectively have to not use Exit at all, and search the full range of iterations, which can be a waste of time if we already know the answer. We can use exceptions as the mechanism for preemptive abort. If a programmer wants preemptive termination, they could raise their own exception in the loop, and write an exception handler at the appropriate level to handle that. We could even define a language defined exception, such as Parallelism_Termination_Exception, which could be defined as an exception that is implicitly handled prior to the completion of the loop, and can be explicitly raised by the user if preemptive termination is desired. We still need a mechanism for non-preemptive abort to solve problems such as the above however, and exit, return, and gotos could provide that mechanism. 2. Having Exit and Return treated effectively as an exception and causing preemptive is very different from sequential Ada, and likely to introduce a lot of unexpected errors to users programs. For instance, I found adding the preemptive termination to Paraffin introduced at least one bug where one task might be waiting on an entry while the aborted code that would have ultimately serviced that code would have left another task waiting on the entry indefinitely. Another example could be where a loop iteration creates dependent tasks, and aborting the loop iteration might shortcut interactions with those dependent tasks, leaving those tasks in a state where they cannot terminate, and where the aborted iteration is left in an abort-deferred operation, effectively deadlocking. I think when the user's code explicitly uses a select then abort statement, the user is fully aware of the asynchronous transfer of control, and can take care to ensure that the preemptive behaviour makes sense. I think it would be undesirable however to have the compiler implicitly introducing such asynchronous transfers of control, where the user is likely in many cases to not even be aware of this, which could lead to difficult to reproduce failures in deployed code. 3. I think we should also be supporting parallelism in Ravenscar compliant code. I have Paraffin libraries that do this, which shows that it can be done. However, in Ravenscar, I believe all termination needs to be non-preemptive, including exceptions, since Ravenscar has the restrictions: No_Abort_Statements, No_Dependence => Ada.Asynchronous_Task_Control, No_Select_Statements. It seems pretty clear that preemptive aborting behaviour is not supported in a simpler Ravenscar runtime. I think this is OK, as you still get the benefits of divide and conquer with non-preemptive termination, just maybe not as responsive for early exits, but that is traded off with better deterministic behaviour which is generally desired in Ravenscar. The main point here is that if we have Exit, goto, and return as non-preemptive, then we have consistent, single semantic interpretations for these constructs for all of Ada. If we make these preemptive, then we are actually forcing two interpretations if we want these to work in Ravenscar also. Another suggestion, if we really do want preemptive exit, return, or goto, then perhaps we need a different flavour of syntax for those. I had previously suggested that a "with abort" clause could be added to the parallel keyword of the loop. Rather than do that, I think it would be better to attach that clause to the actual Exit, return, or goto statement, as in; parallel for I in 1 .. N loop if Arr (I) > X then exit with abort; end if; end loop; It is clearer to tie the "with abort" closer to where the premptive termination is desired. The interpretation I recommend is that an “exit” or “return” (or “raise”) within a parallel loop causes all other iterations to be terminated as soon as possible. For a “return,” there is a “race” to see which iteration reaches a return first, and that is resolved using some kind of synchronization, and then only the “return” that “wins” is allowed to proceed to evaluate its return expression (presuming it is returning from a function). Similarly, if an iteration would propagate an exception, then again, a synchronization is performed against other operations that can terminate the loop, and the winner gets to determine the net effect, be it a simple exit, an exception, a return, or (if we permit it) a goto. Essentially any operation that would normally terminate a loop goes through a “two-phase” commit, where the first phase requests the “right” to decide the fate of the loop, and then the winner determines what happens next. A two phase commit is pretty much the only possibility that I see. Paraffin has always had such a two phase commit for handling exceptions, except until just recently, the exception handling was non-preemptive. If multiple exceptions were raised in different tasklets, the implementation decides which one wins, prior to re-raising that exception and propagating it out into the user's code. The two-phase commit idea is orthogonal to this discussion about how Exit statements are treated I think. > I have not seen any convincing (to me!) argument to allow unstarted iterations > to continue processing, as this seems to defeat one of the whole points of a > parallel "divide-and-conquer" approach. You do get benefits of divide and conquer regardless whether termination is preemptive or not. The preemptive termination can be viewed as an additional optimization that can be added on top of the divide and conquer optimization, but only if it does not introduce errors! I can see some arguments to not terminating iterations that are already executing, if there is danger of them leaving some object visible outside the iteration in an abnormal state. In ParaSail, if a loop includes an operation that could result in an early exit, it cannot manipulate a variable that is declared outside the loop unless it is protected. This doesn’t quite work in Ada since any parallel loop could have an iteration die due to the failure of a run-time check. But to some extent that is already a problem because we recognize that the failure of a run-time check can always leave certain variables in an abnormal state. For a parallel loop, a run-time check failure could leave any variable manipulated by any iteration in an abnormal state. But if you look at 11.6(6/3), even a “regular” loop that dies due to a failed run-time check can have the same effect, since the Ada RM has always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9 — aka RM 9(11)) and instruction re-ordering. So we can effectively ignore failed run-time checks, and only worry about explicit “raise” as relevant to this discussion. So what it effectively means is that the action of propagating a “raise” out of an iteration is just like a “return” or an “exit” in terms of having to “compete" to decide which terminating construct gets to determine the “fate” of the parallel loop. I dont think this last sentence follows from the previous, particularly in the case of Exit. In Paraffin, an Exit statement is simply treated as an Exit statement in the user's code. It causes that tasklet to exit, without affecting any others, no special treatment than other sequential statements in Ada. There is no need to decide which exit wins prior to leaving the loop, as there no result or exception being propagated out of the loop. The implementation requires no special treatment of this. Exceptions do require special treatment, since a specific exception should be propagated out of the loop. For a return statement, there is a stronger case for preemptive behaviour, but I think considering the other arguments, it should be left non-preemptive to side with Exit statements, rather than side with the preemptive nature of exceptions. Goto's should probably not be too relevant to this discussion if we even support them at all. I doubt this feature would be very rarely used or useful in a parallel loop, but they are similar to return statements. > > The bottom line to me is that we should terminate parallel loops as soon as > possible, be it from an exit, return, raise (or goto or requeue for that > matter), using a two-phase approach that first determines which one will > actually decide the fate of the loop as a whole, and then a second step that > actually does the deed. The bottom line to me is that the user should be able to write programs that execute as fast as possible. Sometimes terminating loops as soon as possible comes at a greater expense than being able to use a more optimal algorithm. The user should be able to choose either option. Below I have execution times showing a case where the non-preemptive algorithm beats the preemptive one. A two phase approach is recommended for sure, but Exit's do not need to be involved in the second stage of the commit. They do not contribute to any result associated with the loop. Return results do need to be considered in the second phase, but I would think that any exception raised should trump any return result. Here are some more results comparing performances using preemptive vs non-preemptive termination. I have a simple program that scans an array of 600_000_000 integers looking for the value in the highest index of the array that satisifies some property. In this case, the element in Arr'Last is the one that satisfies the query. Using a Reverse parallel loop with non-premptive exit, the elapsed time for the loop is 0.13 seconds. The parallel loop looks like; Maximum_Index : Index := Index'First with Volatile; parallel for I in reverse Start .. Finish loop if I < Maximum_Index then exit; elsif Satisfies (Arr (I)) and then I > Maximum_Index then Maximum_Index := I; end if; end loop; If I take out the reverse keyword, I still get the correct result, but the elapsed time becomes 1.4 seconds, since with forward iteration, the Maximum_Index has to iterate through the full chunk of the highest chunk, since each iteration improves upon the previous value. If we assume that Exit must be preemptive, then we cannot use Exit, because it gives us an incorrect result, so I have to rewrite the loop as the following if we assume that reverse parallel loops are still allowed; parallel for I in reverse Start .. Finish loop if Satisfies (Arr (I)) and then I > Maximum_Index then Maximum_Index := I; end if; end loop; This produces an elapsed time of 1.3 seconds; If we say that reverse parallel loops are not allowed, then I have to remove the reverse keyword, and the elapsed time becomes 3.8 seconds. So in this particular example, it appears that the decision to not support non-preemptive Exit results in a program that runs 9.8 times slower on my computer than the version that does allow non-premptive Exit, for reverse parallel loops. For the non-reverse loops, the preemptive version runs 2.7 slower. The preemptive forward loop runs 29.2 times slower than the non-preemptive reverse parallel loop. **************************************************************** From: Tucker Taft Sent: Thursday, July 27, 2017 4:29 PM >> I am quite strongly against having multiple interpretations for "exit" from a >> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I >> have some convincing to do. > > Unfortunately, I am not yet convinced. I think if we want to have only > one interpretation for Exit, then that interpretation should be > non-preemptive. Lets see if I can convince you. > ... > > Here are some arguments > > 1. I think it is quite a common problem to be searching for the first > or last occurrence of something in an array or container. For example, > consider an array or container of events stored in chronological order. > > I should think answering question such as; When is the first time > that conditions X occurred?, or when was the last time that conditions > Y occurred? > > I find it quite objectionable if we have to say, sorry, but you cannot > do such searches efficiently in native Ada using parallelism. You > should either use libraries such as Brad Moore's Paraffin libraries, > or call out to another language such as C or C++ and write that code > in OpenMP. There are many ways to search efficiently, including in parallel. If you want the lowest-numbered element that satisfies a particular criteria, then that should be explicit in the algorithm, rather than something that is hidden in an assumption about how non-preemptive chunked parallel iteration works. It seems unwise to make requirements about the order in which things are done, e.g. whether lower-numbered or higher-numbered iterations are processed first, whether "exit" ends one chunk but not all chunks, whether exit stops new chunks from starting, etc. This could be seen as defeating the purpose of parallel iteration, and forcing the semantics to be restricted in what seem somewhat unnatural ways just for this particular case. > If Exit is preemptive, then we cannot continue searching in other > tasklets to find better or correct answers. We effectively have to not > use Exit at all, and search the full range of iterations, which can be > a waste of time if we already know the answer. ... In my view, "exit" means the entire loop is done, and trying to imply that under certain circumstances it means "well not really if there is a better answer yet to be found" seems like it would require a compiler that can guess your intentions. I suspect what would be most efficient for such a problem is a more explicit divide and conquer, not something where you are implicitly relying on chunking to accomplish what you want. For example, if you want the item with the lowest index which satisfies some predicate, you would want all of the cores looking at the first few items in the array, and then if nothing is found there, you would have them all start looking at the next few items in the array. It would be less efficient to break the array into big chunks, and have half of the cores looking at the last half of the array. Here is an example algorithm that would devote all of the cores to search the beginning of the array first. This is just one example, but I suspect it would be faster than one that made various assumptions about chunking and assumed that "exit" didn't really mean “exit” when there were better answers waiting. My point is that having multiple meanings for "exit" adds a lot of complexity without providing sufficient general benefit. Assume we have "N" cores, and an array A where we are trying to find the lowest-indexed element of A that satisfies the predicate P. function Find_First(A : Elem_Arr; P : access function(E : Elem_Type) return Boolean) return Integer is N : constant Positive := Num_CPUs; Lowest_Index : array(1..N) of Integer := (others => A’Last+1) Found : Integer := A'Last+1 with Atomic; -- Atomic flag used to stop other threads early begin parallel for I in 0 .. N-1 loop -- find A(J) with lowest valued J satisfying P(A(J)) by looking -- at A(A'First+I), A(A'FIrst + I + N), A(A'First + I + 2*N), declare J : Integer := A'First + I; begin while J <= A'Last and then J < Found loop if P(A(J)) then -- This is the lowest-numbered element -- satisfying P with index of form A’First + I + K*N Lowest_Index(I) := J; Found := J; -- Stop others as soon as possible. -- Note that there is a benign race condition here, since -- we are merely trying to get other threads to stop if further -- searching would be a waste of time. "Found" is not presumed -- to be the overall lowest value. exit; end if; J := J + N; end loop; end; end loop; -- Use proposed "map/reduce" construct to find overall minimum index return (for I in 1 .. N => Integer'Min(, Lowest_Index(I)); end; **************************************************************** From: Brad Moore Sent: Tuesday, August 1, 2017 2:08 AM > There are many ways to search efficiently, including in parallel. If you want > the lowest-numbered element that satisfies a particular criteria, then that > should be explicit in the algorithm, rather than something that is hidden in > an assumption about how non-preemptive chunked parallel iteration works. It > seems unwise to make requirements about the order in which things are done, > e.g. whether lower-numbered or higher-numbered iterations are processed first, > whether “exit” ends one chunk but not all chunks, whether exit stops new chunks > from starting, etc. This could be seen as defeating the purpose of parallel > iteration, and forcing the semantics to be restricted in what seem somewhat > unnatural ways just for this particular case. I see your point, but I also think treating an exit like an exception is also asking for trouble. Your arguments have moved me somewhat though. I now feel that If we cannot have both preemptive and non-premptive exits, returns, and gotos, then I think we shouldn't allow them to transfer control out of a parallel loop at all. I think they introduce too much in the way of confusion, unexpected behaviour, and semantic problems. If one really wants to preemptively terminate a parallel loop, they can always raise an exception and handle it at the appropriate level. Otherwise I think gotos, returns, and exits should only be allowed to exit loops that do not have the parallel keyword. This also simplifies the parallelism proposal, I should think. > For example, if you want the item with the lowest index which satisfies some > predicate, you would want all of the cores looking at the first few items in > the array, and then if nothing is found there, you would have them all start > looking at the next few items in the array. It would be less efficient to > break the array into big chunks, and have half of the cores looking at the > last half of the array. That can be tricky to program though without introducing too much overhead of having the cores needing to interact with each other to get the next search region, and it depends on the probability of finding things in the bottom half of the array. If what you are looking for is rare enough, that could raise the chances that what you are looking for can be found in the upper half of the array. But yes, it could be an alternate strategy that might work well enough in certain situations. > Here is an example algorithm that would devote all of the cores to search the > beginning of the array first. This is just one example, but I suspect it > would be faster than one that made various assumptions about chunking and > assumed that "exit" didn't really mean "exit" when there were better answers > waiting. My point is that having multiple meanings for "exit" adds a lot of > complexity without providing sufficient general benefit. This approach can be summarized as manual chunking, where the programmer writes an outer parallel loop that executes a number of chunks in any order, and an inner sequential loop that calculates the chunk boundaries, where you can then use sequential semantics including reverse loops, with non-preemptive exits. I think given the syntax we are proposing, manual chunking is likely the only solution that that works for this problem, if we do not have non-preemptive exits and want to write all the code in line. A point to mention is that with manual chunking perhaps that can eliminate the need for reverse parallel loops, since one can use the inner sequential loop to provide the reverse searching. I think I have a better solution for this problem though, because there are a number of criticisms I see here. For starters, I dislike that the programmer is having to make decisions about how many chunks are needed, and the values of the chunk boundaries. The reasons are; 1) Choosing the number of cores as the number of chunks is likely too course a chunk size. If one core finishes its chunk early, there are no opportunities to steal or seek work from the other cores. It is better to choose a smaller chunk size to provide better load balancing, but having the programmer choose this, is likely not going to be as good a decision that a compiler or library writer might choose. 2) Manual chunking in this manner is a form of static parallelism, where the decisions on chunk boundaries is made before loop execution begins. One ideally shouldn't necessarily have to commit to static chunking decisions, as dynamic chunking algorithms are possible that might be beneficial in cases where the amount of time to evaluate a single iteration can vary wildly between iterations. 3) Having to write the code to convert back and forth from chunk index to actual index is messy and error prone. It is more difficult to both read and write. I think this particular problem is best solved using a library approach, in combination with the anonymous loop body AI. A library approach can provide the outer parallel loop that corresponds to the outer loop of the manual chunking solution, without the programmer having to deal with decisions about chunking, or converting from chunk numbers to real index values. A library also can provide more sophisticated parallelism strategies that might be too difficult for a programmer to write from scratch. If we take a library approach to find the maximum index that satisfies some condition, using anonymous loop bodies from AI we could write; declare package Search is new Parallel.Loops.Static_Reduce (Loop_Index => Index, Result_Type => Index, Reducer => Index'Max, Identity => Index'First, Result => Maximum_Index); begin for (Start, Finish, Maximum_Index) of Search.Parallel_Loop for I in reverse Start .. Finish loop if I < Maximum_Index then exit; elsif Satisfies (Arr (I)) and then I > Maximum_Index then Maximum_Index := I; end if; end loop; end loop; end; Note, the Maximum_Index of the Parallel_Loop anonymous procedure is an in out parameter. The value coming in, is the best value so far for that executor. Otherwise to do this with explicit manual chunking, I would write; declare Number_Of_Chunks : constant Positive := 50; Iterations : constant Positive := Positive (Arr'Last - Arr'First + 1); Chunk_Size : constant Positive := Iterations / Number_Of_Chunks; Results : array (1 .. Number_Of_Chunks) of Index := (others => Index'First); begin parallel for I in 1 .. Number_Of_Chunks loop for J in reverse Arr'First + (Index (I) - 1) * Index (Chunk_Size) .. (if I = Number_Of_Chunks then Arr'Last else Arr'First + Index (I * Chunk_Size) - 1) loop if Satisfies (Arr (J)) then Results (I) := I; exit; end if; end loop; end loop; Maximum_Index := (for Element of Results => <0> + Element); end; I can tell you that the amount of time it took me to write the second version involved quite a bit more time and mental power than the previous version. I think the previous one is also a lot more readable, because it provides a higher level of abstraction. The code only focuses on the actual algorithm, whereas the second version mixes the algorithm with a simplistic chunking algorithm. Note: The exit in the last example is non-preemptive, since the inner loop is not a parallel loop. If you look at that exit statement, one might worry for a moment whether that is a preemptive exit or a non-preemptive exit. If we disallow exits to transfer control out of a parallel loop, then suddenly that exit feels "safer" to me, since all exits would not have preemptive abortive behavior to worry about. Otherwise, I can imagine one might want to scan all the source code in a system looking for preemptive exits when doing code reviews, and feeling the need to provide extra analysis to show that any found are safe. **************************************************************** From: Tucker Taft Sent: Tuesday, August 1, 2017 12:45 PM I think you have confused the issue of "exit" meaning "exit" with the issue of whether "abort" semantics or "as soon as possible" are used to terminate the other iterations. lI can understand the concern about using "abort" semantics, but after an exit by one iteration, when any other iteration hits a “natural” stopping point, or hasn't started at all, it should terminate. So let's separate these two issues, if possible. What I mostly object to is the notion of giving meaning to multiple exits from the same parallel loop. I don’t mind waiting a bit for iterations that are already running, for the loop to actually terminate. But trying to define what it means to have multiple exits is a mistake in my view. I believe the two-phase approach is implementable, and I believe should be used for any kind of "early" termination of a parallel loop, be it exception propagation, exit, return, goto, or requeue. That is, if an iteration wants to terminate the loop, it "requests" a termination, and then if it “wins” the race, it determines the fate. How quickly the loop actually finishes may vary, but it should be as soon as possible, and certainly no additional iterations should be started. So no need to bring up the "boogie" man of "abort." I could imagine some aspect that might affect how quickly termination happens, but it shouldn’t affect the basic semantics of the loop, which are “terminate as soon as possible.” **************************************************************** From: Brad Moore Sent: Wednesday, August 2, 2017 9:03 AM > So let's separate these two issues, if possible. OK, I can buy this. We really now have 3 potential ways to exit a loop though. What I would call hard preemption, soft preemption, and non-preemption, where hard preemption is the abortive exit likely involving asynchronous transfer of control, soft preemption which is a polled exit that only occurs on iteration boundaries (grain or chunk boundaries most likely), and non-preemption which wouldn't be directly supported in syntax, but if one does manual chunking either inline, or via a library call, an exit is exiting a sequential inner loop, which then is non-preemptive. The searching example of the previous email represents a loop that must involve manual chunking that results in a non-preemptive exit. I think hard preemption should probably be reserved for exceptions, which would allow breaking out of a loop like the following on a quad-core machine, which with only soft preemption would likely not result in any savings in execution time; parallel for I in 1 .. 4 loop if Flip_Coin then raise Get_Me_Out_Of_Here; else Do_12_Hour_Calculation; end if; end loop; All parallel loops involving chunking are actually nested loops at the implementation level where the outer loop is the parallel loop that is provided by the implementation, and the inner sequential loop is a callout to user code. So in this sense, an exit from the inner loop should be treated as a non-preemptive exit. But we want to create the illusion that there is only a single loop at the syntax level, so an exit would involve soft preemption to more closely mirror sequential behaviour that when an exit statement is executed, there are no further iterations. Parallel execution is not exactly like that, as execution likely continues after an exit, and further iterations can be started, particularly if a tasklet involves multiple iterations and soft preemption polling occurs at tasklet boundaries. But we want to at least try to approximate the sequential behaviour as best we safely can. I actually have these three exit strategies implemented in Paraffin now. I had forgotten that I had implemented soft preemption some time ago, where a special exception can be raised from a user callback and is handled by the library which sets a flag that is polled at grain/tasklet boundaries to effect early termination. Any other exception raised in a Paraffin library call results in a hard preemptive abort involving asynchronous transfer of control, which is a feature I just added. Any exit statement in a user provided callback is treated as a regular exit statement in Ada, which is just a non-preemptive exit, which would also work well for manual chunking examples involving the anonymous loop body procedure AI. So I am happy now with this treatment of exit, return, and gotos if we say it has the soft-preemption model. I do think we probably want to distinguish between exceptions and the other forms of transfer of control out of the loop (exit, gotos, returns, requeues) to consider treating exceptions as the hard-abortive case, particularly for exceptions such as Constraint_Error at least, but probably all exceptions for consistency sake. **************************************************************** From: Jeff Cousins Sent: Thursday, August 3, 2017 11:16 AM It's good to see this progressing even if it's taking a lot of to and fro. Tucker's ideas sound desirable, but are the others from the implementors' side happy that the "some kind of synchronization" won't be an undue implementation burden? **************************************************************** From: Randy Brukardt Sent: Thursday, August 3, 2017 8:54 PM > It's good to see this progressing even if it's taking a lot of to and fro. > Tucker's ideas sound desirable, but are the others from the > implementors' side happy that the "some kind of synchronization" won't > be an undue implementation burden? To answer your question directly, I don't think there is any alternative. What Tucker describes is a bit more complex than the model I had in mind, but they're pretty similar. That is, there is a loop "manager" (for the lack of a better term) that determines the chunks, creates (or more likely, requests existing) tasklets for execution, gives the tasklets chunks, and then waits for the tasklets to finish their chunk (or complete some other way, such as exit or propagate an exception). That waiting implies some sort of synchronization, but it seems to me to be the cheapest kind available. The manager would generally be asleep waiting for tasklets, and would only wake up when one completes - it would then either assign another chunk (if there aren't any, just sleeping waiting for the others), or if the loop needs to complete (exit or exception), wait for the other tasklets to reach a termination point (or abort them - in Janus/Ada, these are effectively the same but probably not in most implementations). I definitely would allow iteration termination at any task dispatching point (especially for tasklets that are blocked), lest a blocked tasklet prevent loop termination indefinitely. With this sort of structure, I don't really see any issue with having exit work deterministically. The only requirement are to ensure that any iterations on the left of the iteration of the exit (that is, with lower indexes) run to completion. There certainly isn't any requirement to start new tasklets after an exit; you just have to assign chunks to tasklets in a left-to-right order (and it would be a lot of extra work to do anything else). Thus, if an exit occurs, any chunk not ever assigned to a tasklet can be immediately discarded (it necessarily comes after the exit). Only tasklets logically before the exit would need to continue. Exceptions are different, of course, they terminate everything as soon as possible as the loop is not going to provide an answer. Those are rather different use-cases. ---- I want to step back a minute and think a bit about the uses for these constructs and our actual goals for them. I'm getting concerned that there can be no real uses of the parallel loop construct. We've already eliminated reduction loops from using the parallel loop (they have their own construct now). That's a substantial number of the loops that I've written that would be sensibly parallel. Almost all of the other loops that I've written or maintained are searches of some sort. Depending upon the rules adopted, it's quite likely that many of those could not use the parallel loop construct. First of all, most of the search loops I write are intentionally not going to execute long, and many are walking through linked lists; probably the parallel overhead would be far more than any possible savings from parallel execution. (The cost to walk the list to find chunks would be more than the comparison costs.) More of the loops wouldn't work if a non-deterministic solution for exit is adopted; you wouldn't be able to find the *first* match or the *next* match as opposed to some random match. Coding this using a non-deterministic loop is very expensive - you want to avoid any further iterations above any match but continue to run the ones below. The only way I could imagine doing it would require pre-checking on every iteration if something is found below: but that would require synchronization, so a lock or fence would be needed -- most likely costing more than the iteration itself. It probably wouldn't be worth doing, meaning that one would have to search the entire space regardless of where or when any matches are found. And, depending on the likelihood of matches in the first third of the data, probably means that a parallel solution would perform worse than a sequential one. There must be some searches that would work with non-deterministic exit, but I can't think of any at the moment. So, if a loop is neither a reduction or a search (that is, no early exits), what else could it do? Something like a matrix multiply comes to mind, but that is very likely to violate the static safety check on the loop contents. (The compiler probably can't prove that the array elements being written are independent for each iteration, and certainly any fully static check based on Global alone cannot do that - you need information about how the array is indexed.) So, what can such a loop do? Is there enough to justify the extensive work on this feature (as opposed to parallel blocks and parallel reductions)?? ---- Going back to first principles. The most important thing about these constructs is that they tell the compiler that the user doesn't care about the details of which iteration runs first, or last, or in between. Because of exceptions (and finalization), an Ada compiler cannot parallelize any code itself (unless of course it can prove the absence of such things, but that's hard and unusual). Most of the time, the programmer doesn't really care about the state of the loop when an exception is raised; the loop has a bug and we're not going to use the results of the loop at all. OTOH, a "normal" exit (whether via an exit statement, return statement, or a goto) usually just means that we have the needed results and don't need to continue. We're also looking to make checks to ensure that these constructs are "safe", that is free of race conditions and conflicts. The vast majority of programmers are not going to be able to successfully avoid these things (even if they know to look for them) -- I know I can't. Ideally, I want to have a programming construct that gives the benefit of parallel execution without any of the pain. For all of these reasons, I much prefer a model where exceptions and exits are treated differently, and exits give the same answer as a sequential loop. In such a model, almost all searching loops can be written in parallel. In the "first exit wins" scenario, only a few searching loops could be written in parallel. Since most other loops are reductions or would be prohibited by the parallel safety checks, that model seems necessary for parallel loops to be useful at all. **************************************************************** From: Tucker Taft Sent: Monday, August 14, 2017 8:46 AM > ... > With this sort of structure, I don't really see any issue with having > exit work deterministically. The only requirement are to ensure that > any iterations on the left of the iteration of the exit (that is, with lower > indexes) run to completion. There certainly isn't any requirement to > start new tasklets after an exit; you just have to assign chunks to > tasklets in a left-to-right order (and it would be a lot of extra work > to do anything else). Thus, if an exit occurs, any chunk not ever > assigned to a tasklet can be immediately discarded (it necessarily > comes after the exit). Only tasklets logically before the exit would > need to continue. ... I am not convinced we should imply there is any ordering of iterations in a parallel loop. This just adds complexity to the implementation with little benefit in my mind. If you have a “many-core” machine, then it is quite possible that all of the iterations (or at least all of the chunks) get going simultaneously, so what value is there to promising ordering in some particular cases? As I suggested in an earlier e-mail, if you really want the lowest-indexed element that satisfies your predicate, then you wouldn’t want to use “normal” chunking, but rather you would want to bias things to have multiple cores looking at the early part of the range first. In any case, are there really that many cases where you want the lowest-indexed item? More likely in my view is that you would either want any item, or you would want all items that satisfy the predicate. If you want all items, then of course “exit” would not be used, and you would build up a list of all desired items. If y want any item, then the first you find should “win”. **************************************************************** From: Randy Brukardt Sent: Monday, August 14, 2017 9:23 PM At the risk of continuing to beat this same horse... > ... If you > have a "many-core" machine, then it is quite possible that all of the > iterations (or at least all of the chunks) get going simultaneously, > so what value is there to promising ordering in some particular cases? The benefit is obvious: loop algorithms work as they are written. No mental gymnastics needed. > ... As I suggested in an > earlier e-mail, if you really want the lowest-indexed element that > satisfies your predicate, then you wouldn't want to use "normal" > chunking, but rather you would want to bias things to have multiple > cores looking at the early part of the range first. That was a terrible idea: (1) It inverts the looping algorithm, obscuring the actual operative code; (2) It forces the use of complex code to get parallel operation, defeating the purpose of making getting better performance easy; (3) It assumes a particular chunking mechanism, which makes the code much less portable (at least if performance matters). To elaborate on (3): A good compiler will tailor the chunking to the iteration being executed (specifically whether it is balanced so the iterations probably take the same amount of time, or if they are obviously unbalanced, and whether or not there are any explicit exits), the hardware (the number of cores and how they are mapped to Ada tasks), and possibly do that at runtime (especially as the tasks that are running aren't known until the moment the loop is launched). An algorithm using what Brad called "manual chunking" is going to end up fighting that compiler-generated chunking, and that is going to scrub off performance. You'd be better off using a library like Parafin to implement such a loop -- but in that case, there is no longer enough value to the syntactic parallel loop to bother with defining it. But an algorithm like the one you propose makes a lot of sense as the implementation generated by the *compiler* when the code contains explicit exits. It can do that with all of the knowledge needed and adding no portability issues; there's surely no requirement that a chunk run consecutive iterations! > ... In any case, are there really that many cases where you want the > lowest-indexed item? Yes, this is pretty much the only thing I search for. Usually, I need to be able to restart the search if the found item eventually turns out to not work for some complex reason. That would be difficult if some random result was returned. > More likely in my view is > that you would either want any item, or you would want all items that > satisfy the predicate. If you want all items, then of course "exit" > would not be used, and you would build up a list of all desired items. Which you can't do with a parallel loop; you have to use a reduction expression for that. > If you want any item, then the first you find should "win". The only case where it makes sense to me to want "any item" is when you are looking for the absence of something (in such a case there shouldn't be any match). That's pretty rare in my view. I still think that "exit" should produce a deterministic answer; there's no problem if you don't need a deterministic answer -- don't use "exit" in that case. The goal here is to make it as easy as possible to write a parallel loop. If you have to know lots of unusual patterns in order to get anything to work, then only experts will be able to write parallel code. But that's already true; it's not the least bit hard for an Ada real-time expert to write parallel algorithms (as Brad has proved). The whole point of these features is to bring this programming capabilities to the rest of us. If we fail to do that, then these features become essentially useless (and a massive amount of work for that). **************************************************************** From: Brad Moore Sent: Tuesday, August 15, 2017 10:12 AM > I want to step back a minute and think a bit about the uses for these > constructs and our actual goals for them. I'm getting concerned that > there can be no real uses of the parallel loop construct. I have been thinking about this also, and I am thinking we probably should consider dropping parallel loop syntax from the proposal. The discussion about find first, find last and preemptive exits has led to show that the anonymous loop body AI coupled with a parallelism library call is a better solution to these problems than the parallel loop syntax solutions we have been discussing. Once you come to that realization, it opens a bit of a slippery slope which suggests that once you need parallelism libraries to solve certain problems, probably they are also the best way to solve other manual chunking problems as well. This includes what I was calling the "parallel all" loops where an outer sequential loop encloses an inner nested loop that is desired to be executed in parallel. The best performance for these sorts of problems, which I have encountered several, is to use barriers with manual chunking to push the parallelism out to an outer loop, that encloses the sequential loop, and use barriers to transition between parts of the algorithm that need to run sequential and other parts that need to run in parallel. It felt admittedly a bit like trying to force a square peg into a round hole to graft the "parallel all" syntax onto the parallel loop syntax. If you take out all the manual chunking problems and reductions problems out of the picture, whats left? Not much, as I would say easily more than half of the parallel loop problems I have encountered that are not reductions are of the manual chunking variety. I have only two representative problems left that would fit under parallel loop syntax. 1. Find any 2. Combining arrays/containers The Find any problems are the ones that Tucker was talking about where a preemptive exit would be needed to break out of the loop. I think these sorts problems are actually better served by quantified expressions, which as you may recall are actually a form of reduction expression. A quantified expression is a boolean expression that a compiler should break out of evaluating once it has found a case that determines the value of the expression. This also applies for parallel quantified expressions. In this case, the programmer doesn't even need to write the exit statement. It is expected that the compiler will implicitly do this. Consider determining if a number is a prime number. Is_Prime : constant Boolean := (parallel for all I in 1 .. Integer(Sqrt(N)) => N mod I != 0); Whether the parallel keyword is present or not, a compiler should stop further evaluation once if finds a factor of N. Sometimes you want a non boolean value as the result of what you are looking for. A quantified expressions works here also, except a side affect of the function is needed to store the result. As an output, you still need a boolean result also, to indicate if you found what you were looking for. Consider finding any prime value within a range of integers. Prime_Number : Integer := 0 with Atomic; function Test_And_Store_Prime (N : Integer) return Boolean is begin if Is_Prime (N) then Prime_Number := N; return True; else return False; end if; end Test_And_Store_Prime; Prime_Found : constant Boolean := (parallel for all I in Start .. Finish => Test_And_Store_Prime (I)); if Prime_Found then Put_Line (Integer'Image (Prime_Number) & " is a prime number"); end if; There will be some naysayers who dislike seeing a side-effect in a quantified expression, but I think for this particular problem it feels good enough to use here. That leaves combining arrays/containers, however those can be written as a reduction expression also. Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (for I in ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I))); Where concatenation is he reducing operator. Perhaps this looks inefficient, though whose is to say how the compiler can optimize this. It is otherwise a simple enough expression to read or write, I think. A more visually appealing way to do this would be to allow the keyword parallel with the new array aggregate syntax, which eliminates the concatenation semantics. Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (parallel for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I))); See RM 4.3.3 5.1/5 for the array initialization syntax. I have not been able to yet think of any other cases where parallel loop syntax would be needed, so it seems to me that we should not introduce parallel loop syntax given that there are no use cases for parallelism problems that cant be better solved with reduction expressions or anonymous loop body syntax combined with parallelism library calls. **************************************************************** From: Randy Brukardt Sent: Wednesday, August 16, 2017 12:22 AM ... > The discussion about find first, find last and preemptive exits has > led to show that the anonymous loop body AI coupled with a parallelism > library call is a better solution to these problems than the parallel > loop syntax solutions we have been discussing. There's a couple of problems using the "anonymous loop body" AI in this way: (1) Without some additional mechanism, we'd lose the safety checks of the parallel loop. They don't make any sense in the context of the "anonymous loop body" proposal per-se. So I think we'd still need a "parallel" keyword or something like that to trigger the safety checks. (Without them, parallel operations are just a giant race condition; no human programmer could avoid them.) (2) There's no certainty that the "anonymous loop body" proposal even makes it into the Standard. Several people have expressed strong reservations about that proposal. As an aside, I'd like to note that the "exit" problem with the "anonymous loop body" proposal is in fact very similar to the problems that arise with the parallel loop proposal. In both case, implementing exit/return/goto is very painful, because you have to get control, somehow save whatever is supposed to happen after the loop, do some actions, and then proceed to doing the after the loop actions. The only place where the "anonymous loop body" is a bit worse is that a user-defined routine can see some of what happens there; the parallel loop is all-compiler generated. ... > If you take out all the manual chunking problems and reductions > problems out of the picture, whats left? > > Not much, as I would say easily more than half of the parallel loop > problems I have encountered that are not reductions are of the manual > chunking variety. Exactly my point. > I have only two representative problems left that would fit under > parallel loop syntax. > > 1. Find any > 2. Combining arrays/containers > > The Find any problems are the ones that Tucker was talking about where > a preemptive exit would be needed to break out of the loop. I think "Find any" per-se is rather unlikely, although they make sense in the case of an "ensure none" predicate. The array combination cases are most likely going to trigger the safety checks (unless we build in some special holes, and even then only a few special forms would work). The compiler is probably not going to be able to tell that different iterations write different components unless the array index is of a special, recognized form. (It definitely will not work with the raw static Global check, since the entire array is considered a single object, which necessarily overlaps with itself.) > I think these sorts problems are actually better served by quantified > expressions, which as you may recall are actually a form of reduction > expression. Makes sense. Certainly "ensure none" is the sort of case that quantified expressions were designed for. ... > That leaves combining arrays/containers, however those can be written > as a reduction expression also. > > Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (for I in > ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I))); Where > concatenation is the reducing operator. They have to be written this way (or some other reduction expression), as otherwise the array objects would be overlapping (making a parallel loop illegal). > Perhaps this looks inefficient, though whose is to say how the > compiler can optimize this. It is otherwise a simple enough expression > to read or write, I think. > > A more visually appealing way to do this would be to allow the keyword > parallel with the new array aggregate syntax, which eliminates the > concatenation semantics. > > Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (parallel > for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I))); > > See RM 4.3.3 5.1/5 for the array initialization syntax. Right, and this probably is a better way to write this expression rather than writing a loop in the first place. > I have not been able to yet think of any other cases where parallel > loop syntax would be needed, so it seems to me that we should not > introduce parallel loop syntax given that there are no use cases for > parallelism problems that cant be better solved with reduction > expressions or anonymous loop body syntax combined with parallelism > library calls. Agreed, so long as we find a way to make the safety checks that we have in parallel loops. **************************************************************** From: Tucker Taft Sent: Thursday, August 17, 2017 7:43 PM > ... > >> I have only two representative problems left that would fit under >> parallel loop syntax. >> >> 1. Find any >> 2. Combining arrays/containers >> >> The Find any problems are the ones that Tucker was talking about >> where a preemptive exit would be needed to break out of the loop. > > I think "Find any" per-se is rather unlikely, although they make sense > in the case of an "ensure none" predicate. I don’t agree, based on my experience with ParaSail. YMMV. > The array combination cases are most likely going to trigger the > safety checks (unless we build in some special holes, and even then > only a few special forms would work). The compiler is probably not > going to be able to tell that different iterations write different > components unless the array index is of a special, recognized form. > (It definitely will not work with the raw static Global check, since > the entire array is considered a single object, which necessarily overlaps > with itself.) ... This is very well traveled area. Compilers have been recognizing the special case of independent loop iterations for decades. Given: parallel for I in A’Range loop C(I) := A(I) + B(I); end loop; Many, many compilers are smart enough to understand that this is safe. I agree the loop-body procedure construct is very nice, but you still end up wanting to do the safety checks, and something as simple as the above, whether the user is driving the “chunking” or the compiler does it by itself, has exactly the same safety-check requirements. We have been talking with existing Ada customers recently about planned Ada 202X enhancements, and this is the one that gets them drooling every time. They *really* want parallel loops. Of course if they are too restrictive, they won’t solve the problem, but I believe with the proposed reduction expression, the above simple parallel loop, and the nice loop-body procedure, we will give them what they need to be quite happy. I really believe we need all three. **************************************************************** From: Randy Brukardt Sent: Thursday, August 17, 2017 9:32 PM ... > > The array combination cases are most likely going to trigger the > > safety checks (unless we build in some special holes, and even then > > only a few special forms would work). The compiler is probably not > > going to be able to tell that different iterations write different > > components unless the array index is of a special, recognized form. > > (It definitely will not work with the raw static Global > check, since > > the entire array is considered a single object, which > necessarily overlaps with itself.) ... > > This is very well traveled area. Compilers have been recognizing the > special case of independent loop iterations for decades. Given: > > parallel for I in A'Range loop > C(I) := A(I) + B(I); > end loop; > > Many, many compilers are smart enough to understand that this is safe. Surely. But that's not the point. The point is that we have to have Ada Legality Rules which makes this sort of thing legal and portable. The proposed rules using the Global aspect certainly would not allow the above. Unless you have been secretly planning some much more complex rules to allow such things, loops like the above couldn't be legal Ada. Note that I'm not against allowing loops like the above, but that has to be portably defined. (The above case: a single dimensioned target that doesn't overlap anything else, and is indexed directly by the loop parameter, is clearly safe and could be defined that way, but hardly anything else could be. Is this case alone enough for the extra complication???) ... > I agree the loop-body procedure construct is very nice, but you still > end up wanting to do the safety checks, and something as simple as the > above, whether the user is driving the "chunking" or the compiler does > it by itself, has exactly the same safety-check requirements. Not exactly, as noted above. But I definitely agree that whatever construct is used needs safety checks. > We have been talking with existing Ada customers recently about > planned Ada 202X enhancements, and this is the one that gets them > drooling every time. They *really* want parallel loops. Of course if > they are too restrictive, they won't solve the problem, but I believe > with the proposed reduction expression, the above simple parallel > loop, and the nice loop-body procedure, we will give them what they > need to be quite happy. I really believe we need all three. Probably because no one will be able to grok reduction expressions. :-) It took me a three hour ARG presentation to get any idea... People naturally understand parallel loops. At least they think they do, until they find out that exit doesn't work sensibly. :-) **************************************************************** From: Brad Moore Sent: Friday, August 18, 2017 12:01 AM I have some more thoughts towards answering the question on whether we should eliminate parallel loop syntax from our proposal. Firstly, I had said in the previous email that find-first and find-last problems should be written using parallel libraries and the anonymous loop proposal. I think these can also be written well enough using reduction expressions. Consider finding the highest prime number in a range of integers; One could express this using the following reduction expression. Highest_Prime_Found : constant Integer := (parallel for I in reverse Start .. Finish => Integer'Max(, (Is_Prime(I) then I else Integer'First)); A simpler compiler might only apply divide and conquer here, which means each chunk fully executes. This should still cut the execution time down from the sequential version by a factor of the number of cores. A smarter compiler might recognize that when the reduction is 'Max or 'Min, and the expression value result is of the loop iterator in such a pattern, that once a higher index is found to be prime, there is no longer a need to keep searching the lower indices. Indeed a smarter compiler might recognize this as a special case and insert a completely custom algorithm optimized for find first/find last problems. So there appears to be no need to rely on library packages nor parallel loop syntax for these problems. Now consider another one of the manual chunking problems where I was suggesting using barriers would be the best approach when the outer loop must be a sequential loop that happens to enclose a parallel loop. Consider a Newton package that simulates the movement of heavenly bodies in a solar system. A procedure Update_Position can be called that considers the mass, gravity, direction, and movement of neighbouring bodies to calculate the position of a body for the next time increment. An Advance function might be needed to swap internal buffers for the next and current values for the next time increment. One could express this using a parallel loop inside a sequential loop that advances time. for Current_Time in Now .. Time + 100_000_000 loop for parallel all Heavenly_Body in Solar_System'Range loop Newton.Update_Position (Heavenly_Body); end loop; Newton.Advance; end loop; This doesn't work well with reduction expressions if the data arrays/containers are hidden behind an abstraction in a package body. However, if one were to modify the Update_Position subprogram to be a function that returns a boolean status to indicate the success of the calculation, then one can use a reduction expression here. The Newton.Advance function might raise an exception if the boolean input parameter is false, indicating some failure occurred, such as two planets colliding, that isn't handled by the simulation. for Current_Time in Now .. Time + 100_000_000 loop Newton.Advance (for parallel all Heavenly_Body in Solar_System'Range => Newton.Update_Position (Heavenly_Body)); end loop; So, if we did not have parallel loop syntax, it might influence people more to write functions for parallelism problems, rather than procedures, but I wouldn't say thats a bad thing necessarily. A smarter compiler might recognize the pattern of a reduction expression inside a parallel loop and treat that as a special case using barriers to get optimal performance, or it may be that the default implementation of the compiler just works well here. If the programmer finds the performance is poor, he could always write the problem himself using barriers and use the anonymous loop procedure with a parallelism library. The point again though is that I don't think we need to rely on the availability of parallelism libraries for this problem, nor do we need parallel loop syntax, unless we feel strongly that one should be able to use a procedure instead of a function for the Update_Position call. The one other manual chunking problem I have that is quite different from the others is to generate a cumulative sum of an array where the result array has the sum of the corresponding element from the input array plus all values preceding that value in the input array. This solution involves 3 loops. The first loop is a parallel loop that gets the sum of each chunk and stores that value in a partial sum intermediate array. The second loop is a sequential loop that performs a cumulative sum operation on the partial sum intermediate array. The third loop is a parallel loop that takes the corresponding partial result values as a starting point for each chunk, and then applies the cumulative sum to each chunk. Using manual chunking I can write the first loop using a nested reduction expression where the outer parallel reduction expression concatenates sum values calculated from the inner reduction expression for each chunk. However, the third loop seems to pretty much need to be a parallel loop. declare Chunk_Size : constant Natural := Natural'Min(Arr'Length, 10_000); Chunks : constant Natural := Arr'Length / Chunk_Size; function Chunk_Start (Chunk : Natural) return Index_Type is (Arr'First + (Chunk - 1) * Chunk_Size); function Chunk_Finish (Chunk : Natural) return Index_Type is (Arr'First + (if Chunk = Chunks then Arr'Last else Chunk * Chunk_Size - 1)); -- Create the Partial sums in parallel Partial : array (1 .. Chunks) of Element_Type := (parallel for I in 1 .. Chunks => <""> & (for J in Chunk_Start (I) .. Chunk_Finish(I) => <0> + Arr(J))); begin -- Sequential carry partial sums through array for I in 2 .. Partial'Length loop Partial (I) := @ + Partial (I - 1); end loop; for parallel I in 1 .. Chunks loop declare Sum : Element_Type := (if I = 1 then 0 else Partial (I - 1)); begin for J in Chunk_Start (I) .. Chunk_Finish (I) loop Sum := @ + Arr (J); Cum (J) := Sum; end loop; end; end loop; end; However, as this is a manual chunking problem, I find the code written using anonymous loop procedures and a library call is quite a bit simpler and easier to read, and requires less knowhow on the programmer to decide how the loops should be chunked. The chunking logic is hidden inside the library call. declare package Scan is new Parallel.Loops.Static (Loop_Index => Positive); Partial : array (1 .. Scan.Worker_Count) of Element_Type := (others => 0); begin -- Get the sum of each chunk for (Start, Finish) in Scan.Parallel loop declare Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id; begin for I in Start .. Finish loop Partial(Worker) := @ + Histogram (I); end loop; end; end loop; -- Sequential exclusive scan phase for I in 2 .. Worker_Id (Effective_Workers) loop Partial (I) := @ + Partial (I - 1); end loop; for (Start, Finish) in Scan.Parallel loop declare Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id; Sum : Integer := (if Worker = 1 then 0 else Partial (Worker - 1)); begin for I in Start .. Finish loop Sum := @ + Arr (I); Cum (I) := Sum; end loop; end; end loop; So I would say this is the one case where a parallel library call with anonymous loop procedure appears to be the best solution, particularly if it means that we can simplify the parallelism proposal and not have parallel loop syntax. Tucker did have a proposal involving parallel array syntax that was designed specifically for this problem. At the time we were considering using that as the reduction solution, but since reduction expressions appears to be a better solution, it seems unlikely that we would want to bring back parallel array syntax if it only is useful for this one problem. So in summary, I still think most parallel looping problems can be handled with reduction expressions. There may be a few odd cases involving manual chunking where one might want to use a parallelism library with the anonymous loop procedure. My sense is that there are not enough of these problems needing parallelism libraries to warrant standardizing a set of libraries for this, given that libraries are available, and one can also write Ada wrapper libraries around OpenMP calls, for example, although standardizing a set of parallelism libraries could be another thing to consider. While there are some special cases where parallel loop syntax could be useful for manual chunking problems, the alternative of using a reduction expression seems to be a good enough alternative, otherwise using anonymous loop body with a library call is another viable alternative. That leaves me still thinking that parallel loop syntax is not worth adding at this point, which would simplify the AI to just cover reduction expressions and parallel blocks. But it would be good to get others to weigh in as well, of course. **************************************************************** From: Brad Moore Sent: Tuesday, August 29, 2017 10:01 AM > ... >> The discussion about find first, find last and preemptive exits has >> led to show that the anonymous loop body AI coupled with a >> parallelism library call is a better solution to these problems than >> the parallel loop syntax solutions we have been discussing. > There's a couple of problems using the "anonymous loop body" AI in this way: > (1) Without some additional mechanism, we'd lose the safety checks of > the parallel loop. They don't make any sense in the context of the > "anonymous loop body" proposal per-se. So I think we'd still need a > "parallel" keyword or something like that to trigger the safety > checks. (Without them, parallel operations are just a giant race > condition; no human programmer could avoid > them.) I have been thinking about this. I think the safety checks could be provided if we could allow the global aspect on formal subprograms and formal access to subprograms, with an addition to the global syntax to indicate a subrange of an array. with global => array(x .. y) Which would indicate that there is global usage of an array from the indices x .. y. The array might be a name of an array, or just the "array" keyword which would mean any array. This might be used with a parallel loop library such as the following. with System.Storage_Elements; generic type Loop_Index is range <>; package Parallel is procedure Parallel_Loop (Process : not null access procedure (Start, Finish : Loop_Index) with global => (Input => array (Start .. Finish), In_Out => array (Start .. Finish), Out => array (Start .. Finish)); From : Loop_Index := Loop_Index'First; To : Loop_Index := Loop_Index'Last) with global => null; end Parallel; The idea would be that the compiler would reject any call to Parallel_Loop that involved a Process callback that had any globals other than array references of a slice using the parameters of the call. The compiler would also reject the code if there was an attempt to reference the array using indices that are outside the range of the specified limits, or not derived from the specified parameters. Note the Parallel_Loop library call itself does not involve any globals. The compiler would not reject calls involving a Process callback that didn't involve any globals. i.e. It doesn't necessarily expect the actual to involve globals, but if it does, it cannot involve globals outside of the specification. This should eliminate the race conditions that we are worried about I think. example usage of the above: -- Anonymous loop body call for (Start, Finish) of Parallel_Loop loop -- Sequential inner loop for I in Start .. Finish loop C(I) := A(I) + B(I); -- OK end loop; C(1) := 3; -- Illegal, Index is not derived from Start or Finish end loop; > (2) There's no certainty that the "anonymous loop body" proposal even > makes it into the Standard. Several people have expressed strong > reservations about that proposal. > > As an aside, I'd like to note that the "exit" problem with the > "anonymous loop body" proposal is in fact very similar to the problems > that arise with the parallel loop proposal. In both case, implementing > exit/return/goto is very painful, because you have to get control, > somehow save whatever is supposed to happen after the loop, do some > actions, and then proceed to doing the after the loop actions. The > only place where the "anonymous loop body" is a bit worse is that a > user-defined routine can see some of what happens there; the parallel loop > is all-compiler generated. Actually, this is not a problem for exit because with the anonymous loop body, the users code appears as a sequential loop nested inside the outer anonynmous loop. Since the inner loop is a sequential loop, exit works just as it does today, it only exits the inner sequential loop. I suspect we'd probably want to make returns and gotos illegal from inside an anonymous loop body. So the anonymous loop body needs to have no knowledge that there is parallelism involved. Its just a library call which happens to hide the parallelism inside the library call. **************************************************************** From: Tucker Taft Sent: Tuesday, August 29, 2017 2:05 PM Although SPARK only allows names of stand-alone objects in Global annotations, for Ada we allow any object name, presumably including an indexed component or a slice. We also have proposed in AI12-0079 to allow them on formal subprograms and access-to-subprogram, so I think the cases you mentioned are consistent with the current AI. **************************************************************** From: Randy Brukardt Sent: Wednesday, August 30, 2017 10:16 PM How do the static checks work in such a case, as the bounds in question would typically be non-static? For instance, in Brad's with global => array(x .. y) X and Y probably would be outer loop indexes, or parameters to the current subprogram. **************************************************************** From: Tucker Taft Sent: Wednesday, August 30, 2017 10:26 PM You just need to prove two slices don’t overlap, which is equivalent to proving the high bound of one is less than the low bound of the other. This is the kind of thing that static analysis and more advanced proof tools are pretty good at doing! This is not significantly harder than eliminating run-time checks for array indexing when the bounds are not known at compile-time, which is something that many Ada compilers can do based on other information available at the point of usage. **************************************************************** From: Randy Brukardt Sent: Wednesday, August 30, 2017 10:40 PM OK, but again these are Legality Rules, not something that compilers are doing on their own. The entire set of rules have to be defined formally and required of all Ada compilers. What "static analysis" and "advanced proof tools" can or can't do is irrelevant (unless of course we allowed what is legal or not to be implementation-defined -- when I previously proposed that for exception contracts, the idea seemed to be as effective as a lead balloon). **************************************************************** From: Randy Brukardt Sent: Wednesday, August 30, 2017 11:02 PM > I have some more thoughts towards answering the question on whether we > should eliminate parallel loop ... Let me throw another classic problem (one that I recently wrote for a hobby program), the problem of determining correlations and other statistics between two parallel arrays of data. The core of the problem is calculating 6 pieces of data: the sum of the elements of each array, the sum of the squares of the elements of each array, the product of the matching elements of the arrays, and the square of the product of the arrays. The sequential solution is a single loop creating all 6 items: for I in 1 .. MAX loop A_Sum := @ + A(I); A_Sum_Sqr := @ + A(I)*A(I); B_Sum := @ + B(I); B_Sum_Sqr := @ + B(I)*B(I); Prod_Sum := @ + A(I)*B(I); Prod_Sum_Sqr := @ := (A(I)*B(I))**2; end loop; MAX is over 3000 for this particular problem, so it would seem that a parallel solution might add some performance. These are clearly reduction expressions. However, we have to write 6 separate reduction expressions, which means 6 times the loop overhead. Also, the above is full of common subexpressions; it's likely that a compiler would calculate and load A(I) and B(I) once each, and possibly eliminate some of the other operations as well (in particular, the A*B product). None of these optimizations can be done with separate reduction expressions, so we'll end up with 5 times the memory reading as well (each A(I) and B(I) is read 5 times in the sequential loop). Thus, a parallel rewrite into 6 reduction expressions is likely to be disappointing. With the extra parallel execution overhead, it would likely take 8 cores to improve the performance over the sequential loop, and the performance would be worse for 4 cores or less. This is not a theoretical concern: Janus/Ada doesn't do many floating point optimizations; recompiling the program with GNAT (which presumably does do those optimizations) improved the performance by a large factor (more than 10, as I recall). How would you parallelize this loop?? **************************************************************** From: Tucker Taft Sent: Thursday, August 31, 2017 11:02 PM I do not believe we should eliminate parallel loop syntax. The question that relates more directly to this example is whether we need special reduction syntax beyond the “map/reduce” construct. This example argues that in cases where you have multiple accumulators, you might want some way of declaring them so that you get one per chunk. Alternatively, we encourage the use of an accumulator “record” for this sort of computation. E.g.: type Accums is record A_Sum, A_Sum_Sqr : Float; B_Sum, B_Sum_Sqr : Float; Prod_Sum, Prod_Sum_Sqr : Float; end record; function Update_Accums(Orig : Accums; A_Elem, B_Elem : Float) return Accums is begin return (A_Sum => Orig.A_Sum + A_Elem, A_Sum_Sqr => Orig.A_Sum_Sqr + A_Elem * A_Elem, B_Sum => Orig.B_Sum + B_Elem, B_Sum_Sqr => Orig.B_Sum_Sqr + B_Elem * B_Elem, Prod_Sum => Orig.Prod_Sum + A_Elem * B_Elem; Prod_Sum_Sqr => Orig.Prod_Sum_Sqr + (A_Elem * B_Elem) **2); end Update_Accums; Result : constant Accums := (for I in 1..MAX => Update_Accums(<(others => 0.0)>, A(I), B(I)); **************************************************************** From: Tucker Taft Sent: Thursday, August 31, 2017 3:19 PM >> You just need to prove two slices don't overlap, which is equivalent >> to proving the high bound of one is less than the low bound of the >> other. This is the kind of thing that static analysis and more >> advanced proof tools are pretty good at doing! This is not >> significantly harder than eliminating run-time checks for array >> indexing when the bounds are not known at compile-time, which is >> something that many Ada compilers can do based on other information >> available at the point of usage. > > OK, but again these are Legality Rules, not something that compilers > are doing on their own. The entire set of rules have to be defined > formally and required of all Ada compilers. What "static analysis" and > "advanced proof tools" can or can't do is irrelevant (unless of course > we allowed what is legal or not to be implementation-defined -- when I > previously proposed that for exception contracts, the idea seemed to > be as effective as a lead balloon). This is a general problem with data race detection, I would say. My suggestion is to define what we consider to be a "potential data race" in a way that run-time detection can be used, but make it clear that compilers are permitted to give compile-time errors if they can show that the "potential data race" exists in a given instance. This is somewhat analogous to the "potentially blocking operation," but in this case, we require the equivalent of "detect-blocking." And of course a compiler can omit the run-time check if it can prove there is no potential data race. Anything in between presumably many compilers would provide warnings. **************************************************************** From: Randy Brukardt Sent: Thursday, August 31, 2017 3:38 PM That's effectively making the checks implementation-defined (whether a particular piece of code will compile is impl-def). Not sure how well that will fly. Ada usually requires a run-time check in such instances and does not allow a compile-time check (the latter because of the mess that comes up in conditional code). Anyway, food for thought. **************************************************************** From: Tucker Taft Sent: Thursday, August 31, 2017 8:17 PM Perhaps you are right. No big reason to treat this differently than any other run-time check. Most compilers already produce a warning when a run-time check is certain to fail, and often provide a mode where certain classes of warnings are treated as errors, so that would effectively have the same net effect. In generic bodies, we often define things to raise Program_Error under certain circumstances, but for compilers that do macro-expansion, almost all of these end up as compile-time warnings, which are often treated as errors by the user. **************************************************************** From: Brad Moore Sent: Tuesday, October 10, 2017 12:15 AM I am trying to write up my homework for AI12-0119-1 (parallel operations), with an eye to possibly writing up AI12-0197-4 (Coroutines and Channels), and have come across some observations and ideas that I think improves from where we left off at Vienna but is modified to account for some new issues. The basic change is that instead of parallel "begin ... end" for parallel blocks, I am planning to substitute "do ... end do". Eg. parallel do Foo; and Bar; end do; My reasoning for this is; 1) The sequence of statements in each branch should just be a sequence-of-statements, rather than a handled-sequence-of-statements because attempting to put an exception handler in this construct I find is very confusing; I recall now that this is one of the main reasons why the gang-of-four settled on syntax modeled after the select statement rather than a begin statement for parallel blocks. For example, if one writes; parallel begin Foo; and Bar; exception when others => Put_Line ("Exception Caught!"); end; I think it is very confusing. Is the exception handler only catching exceptions for the call to Bar, or is it also catching exceptions for the call to Foo? If we say that the exception handlers are not allowed, then it forces the programmer to either enclose the construct in a block statement or equivalent, or use block statements within each arm of the construct. Either way, I find to be much clearer. eg. begin parallel do Foo; and Bar; end do; exception when others => Put_Line("Exception Caught!"); end; or parallel do begin Foo; exception when others => Put_Line ("Bad Foo!"); end; and begin Bar; exception when others => Put_Line ("Bad Bar!"); end; end do; The use of "do" has the benefit that "begin" from Vienna, had in that you can remove the keyword parallel and still have a working construct. The parallel keyword is a modifier. In the previous parallel block proposal this would not work as parallel was the name of the construct. If you remove "parallel", you have a bunch of code that needs to be cleaned up before it will compile. Secondly, I don't think there is much use in having a declarative region in this construct. Generally, you need the results to outlive the construct, so that you can do something useful with the results. This suggests again that the construct should be enclosed in another construct such as a block statement. eg. declare X : Integer; -- Do Foo & Bar get their own copies of these? Y : Float; parallel begin X := Foo; and Y := Bar; exception -- Is this handler for both Foo & Bar, or just Bar? when others => Put_Line("Exception Caught!); end; Put_Line (????); -- !!! I Cant print out the results !!! In this example, two results have been calculated, but if you want to examine both these results at the same time, you cannot, because the scope of the results have ended before you can examine both results. Also, having a declarative region on a parallel begin is confusing. I think some users would be confused by this, wondering if each arm of the parallel begin gets its own local instances of those declarations, or whether they are shared between all the arms. By eliminating the declarative region, it eliminates this confusion, and also not likely to be missed because it is not useful to begin with. If one instead writes; declare X : Integer; Y : Float; begin parallel do X := Foo; and Y := Bar; end do; Put_Line ("X=" & Integer'Image(X) & ", Y=" & FLoat'Image(Y)); exception when others => Put_Line("Exception Caught!"); end; I find this to be non-ambiguous, and provides the scope needed to examine both results of the arms. So it becomes clearer to me that this construct is quite different than a block statement, and therefore it probably should have its own distinct syntax, rather than try to make the block statement accommodate this usage, which feels like a hack to me, and also feels like it would be making a bit of a mess in the syntax. Why "do"? The reserved keyword do is relatively unused in Ada, appearing only in the selective accept statement. It seems to fit better than begin in terms of English semantics. The construct really is a list of things to be done. e.g. Do X and Y and Z. reads better than a list of things to be begun I think. Begin X and Y and Z. Begin really is more tied to a declarative region, where you declare a bunch of things, then you need to specify where to "begin" executing. Since this construct doesn't seem to need a declarative region, there no need to indicate where the execution begins. Also many syntax constructs in Ada have the form name end name; e.g. loop end loop; if end if; select end select; case end case; record end record; If one sees "end" without looking at the above, the assumtion is that it corresponds to a normal block statement. If one sees "end do" it provides better feedback that the preceding code has different semantics. I suspect that the reason "begin" is not paired with "end begin" mostly because end begin looks like a weird oxymoron that would also be confusing to readers. "end do" does not seem to have this problem. Finally, I think the coroutine concept of ai-0197 pretty much fall out of this for free. If the keyword "parallel" is not present, then the semantics could be that each arm gets its own executor, similar to when the parallel keyword is present, but each executor is tied to the same core, thus each arm of the construct executes concurrently, but not in parallel, as only one arm can be executing at one time. If you want more parallelism, simply add the parallel keyword. I dont think channels are needed or anything else, one can use existing capabilities in Ada to provide the channels. For example, Ada.Containers.Synchronous_Queues could be used to provide a channel. Here is an example of coroutines involving two producers and one consumer. The first arm is an endless loop that produces integer values, the second arm is a bounded loop that produces higher integer values, and the third arm is the consumer which will end up pulling values from both the other arms. The Exit statement of the consumer is used here to terminate the construct, which will abort the endless loop of the 1st arm. There may be reasons why adding "Exit" to block statements would not fit very well with syntax. I suspect there would be less reason to disallow Exit in a do construct. Alternatively, this could be a goto, return, or exception, which is treated as a transfer of control out of the construct, which we've already discussed. declare Queue : Integer_Queues.Queue; begin do declare X : Integer := 0; begin loop -- endless loop Integer_Queue.Enqueue (X); X := X + 1; end loop; end; and for I in 1 .. 3 loop Integer_Queue.Enqueue (1_000_000 + I); end loop; and declare Value : Integer; begin for I in 1 .. 5 loop Integer_Queue.Dequeue (Value); Put_Line ("Value=" & Integer'Image (Value)); end loop; Exit; -- or raise exception??? end; end do; I've tried this code in Paraffin, using Ada's asynchronous transfer of control mechanism, and I see the expected results; Value= 1000001 Value= 1000002 Value= 1000003 Value= 0 Value= 1 The last two values might be output before the first three depending on which arm gets to execute first. I think we could go for better determistic behaviour and say that the initial order of execution is top down. Once all have had their initial start, they proceed in a concurrent fashion dependent on the behaviour of the "channel" and the blocking of the arm branches. Anyway, I think these are compelling reasons for using "do" rather than "begin", so I will write my homework up that way, unless someone can convince me otherwise before then. If we decide to go back to "begin", I dont think it will be a big change to go that way. I just wanted to present these ideas earlier so that it wont be as much a surprise when we meet in Boston, and to possibly receive comment earlier. **************************************************************** From: Randy Brukardt Sent: Tuesday, October 10, 2017 12:42 AM ... > I just wanted to present these ideas earlier so that it wont be as > much a surprise when we meet in Boston, and to possibly receive > comment earlier. With the homework deadline about 40 hours from now, there isn't much time for comment and AI update. And last minute flurries of mail is an issue for me, too as it takes time to file e-mail threads. Ergo, at this point, just do it. (Also, do as I say, not as I do... ;-) >Finally, I think the coroutine concept of ai-0197 pretty much fall out >of this for free. If the keyword "parallel" is not present, then the >semantics could be that each arm gets its own executor, similar to when >the parallel keyword is present, but each executor is tied to the same >core, thus each arm of the construct executes concurrently, but not in >parallel, as only one arm can be executing at one time. OK, but what is providing the "yield" semantics? Without that, it's just stupid tasking. And it can't require a lot of work to write, lest it hardly solve anything that can't already be solved with an existing task. > I dont think channels are needed or anything else, one can use > existing capabilities in Ada to provide the channels. > For example, Ada.Containers.Synchronous_Queues could be used to > provide a channel. I don't think that is what Tucker had in mind -- if normal task communication would work, there'd be no need to propose such an idea in the first place. (Nor any reason for coroutines, but I digress.) In any case, I want to find out exactly what he had in mind, and trying to preempt him is not helpful. > There may be reasons why adding "Exit" to block statements would not > fit very well with syntax. I suspect there would be less reason to > disallow Exit in a do construct. Exit fits fine with the syntax of a block statement. But allowing it is wildly incompatible, consider the Ada 95 code: loop begin exit when ...; end; end loop; If block statements had exit, this exit would exit the block, rather than the loop. That's clearly not the intention of the writer of the Ada 95 code. As you note, there's no compatibility problem with "do", but I don't think it is a particularly good idea to allow that in one construct but not in a similar one. Is this really necessary? It seems like a rare need, and a goto works fine (even if clunky). **************************************************************** From: John Barnes Sent: Tuesday, October 10, 2017 1:53 AM i rather like do. **************************************************************** From: Tullio Vardanega Sent: Tuesday, October 10, 2017 2:25 AM So do I. **************************************************************** From: Brad Moore Sent: Tuesday, October 10, 2017 9:24 AM > Ergo, at this point, just > do it. Interesting choice of words. If you had said, just begin it, it would probably lower the chances of me getting it done. ;) > Finally, I think the coroutine concept of ai-0197 pretty much fall >> out of this for free. If the keyword "parallel" is not present, then >> the semantics could be that each arm gets its own executor, similar >> to when the parallel keyword is present, but each executor is tied to >> the same core, thus each arm of the construct executes concurrently, >> but not in parallel, as only one arm can be executing at one time. > OK, but what is providing the "yield" semantics? Without that, it's > just stupid tasking. And it can't require a lot of work to write, lest > it hardly solve anything that can't already be solved with an existing task. We could have two subtype attributes, 'Consume and 'Yield allowed only in "do" statements that semantically read and write to an implicit buffer or queue of that subtype. The scope of the implicit queue would be tied to the scope of the do statement. This would be expected to work regardless whether the do statement has the parallel keyword or not. The default length of the queue could be 1, but perhaps could be controlled by an aspect on the subtype declaration. e.g. do for I in 1 .. 1_000_000 loop Integer'Yield(I); end loop; and for I in 1 .. 1_000_000 loop Float'Yield(Sqrt(I)); end loop; and for I in 1 .. 10 loop Put_Line ("The Square Root of" & Integer'Image(Integer'Consume) & " is" & Float'Image(Float'Consume)); end loop; goto Done; end do <> **************************************************************** From: Brad Moore Sent: Wednesday, October 11, 2017 4:01 PM Here is my homework for this AI. [This is version /04 of the AI - Editor.] I have applied major rework to the reduction expression construct, since it was very sketchy to begin will. I have also renamed parallel blocks to concurrent blocks, using the do ... and .. end do syntax I described in a recent email. If the parallel keyword is not present, then the do statement executes concurrently within the same task, which is safer and more useful than just applying sequential execution. This will also provide a good starting point for AI12-0197, coroutines, since most of the construct is already in place. For parallel loops, the biggest change was to move the parallel keyword from in the middle of the syntax, to before the "for" keyword. **************************************************************** From: Randy Brukardt Sent: Wednesday, October 11, 2017 4:01 PM > Here is my homework for this AI. I fixed some minor issues in this AI (without making a separate version): The reference to AI12-0064-1 is stale (that alternative is abandoned), and the aspect is named Nonblocking. The paragraph starting "Any transfer of control..." had two left-over uses of "parallel sequences", which I replaced by "concurrent sequences" to match the other changes (including earlier in the same sentence). The paragraph starting "Note that the same rules..." talks about "parallel blocks", which was otherwise changed to "concurrent blocks". There was a missing ) in the first paragraph of the Legality Rules for 4.5.9. There was a stray "Modify 5.5.2", which was removed. Some of the formatting was unusually narrow (while it was unusually wide last time - can't win, I guess). === Comment: Examples in the RM need to be complete, typically by depending on declarations from previous examples. None of the examples in 4.5.9 or 5.6.1 seem to do that. (A couple might be stand-alone, which is OK, but I didn't check carefully.) That needs to be fixed. Comment: You added "reducer" aspects as needed to Ada.Strings and the like. But don't the containers need something similar? I could do that in AI12-0112-1, but I'd have to know what is needed (and I'm not the right person to figure that out). **************************************************************** From: Brad Moore Sent: Wednesday, October 11, 2017 11:26 PM ... > I fixed some minor issues in this AI (without making a separate version): ... > There was a missing ) in the first paragraph of the Legality Rules for > 4.5.9. Thanks for catching these, Randy. > There was a stray "Modify 5.5.2", which was removed. > > Some of the formatting was unusually narrow (while it was unusually > wide last time - can't win, I guess). I dont know how wide is good. I had my text editor set at a 72 character margin by default and tried to stick to that. Would 80 characters be better? === > > Comment: Examples in the RM need to be complete, typically by > depending on declarations from previous examples. None of the examples > in 4.5.9 or 5.6.1 seem to do that. (A couple might be stand-alone, > which is OK, but I didn't check carefully.) That needs to be fixed. > > Comment: You added "reducer" aspects as needed to Ada.Strings and the like. > But don't the containers need something similar? I could do that in > AI12-0112-1, but I'd have to know what is needed (and I'm not the > right person to figure that out). I thought I had gone through the containers in this AI. It should be there. For instance, see  Modify A.18.2 (15/2-18/.2) and so on.... **************************************************************** From: Randy Brukardt Sent: Wednesday, October 11, 2017 11:41 PM ... > I dont know how wide is good. I had my text editor set at a > 72 character margin by default and tried to stick to that. > Would 80 characters be better? The usual is either 79 or 80 (depends on whether a program or I am doing it). Some of the old text seemed to be about 100 characters, and the new was 72. Tough to compare. **************************************************************** From: Brad Moore Sent: Thursday, December 14, 2017 9:45 AM I have been thinking more about parallelization problems that need what I have been calling, manual chunking, where the user needs to have more control of how loops are broken down into chunks. Previously I had identified a number of such problems, including loops that involve synchronous barriers, such as gaussian elimination for solving matrices, or where multiple loops need to operate on an array using the same chunking, such as for computing a cumulative sum of an array of elements. Since then, I have identified more generally the need, which basically amounts to cases where user-provided initialization and/or finalization needs to be applied to the chunks. I think a good example for this need is the case of parallel file IO. Consider a large file of fixed-size records where all records in the file need to be processed, and where it is desired to to this in parallel. It would be too inefficient to open and close a file handle for each record, and then seek to the position of the record to be modified. What is needed here is for each chunk to open its own file handle, process the records of that chunk, then close the file handle. It is not reasonable to expect that the compiler will be able to automatically provide such chunk initialization and finalization, and I think trying to accomplish this with syntax using aspects would be too messy. A library approach seems like a better alternative. Another example is memory allocation, where when processing the elements of an array or container, a temporary data structure is needed for the processing. Again, rather than allocate and deallocate this data structure for every element of the array or container, it makes better sense to allocate the structure once for each chunk (whether it be allocated from the stack or the heap), and then free the data structure once the chunk processing is complete. It seems that there ought to be a standard library/facility that can be used to generate the chunking. This makes sense in particular because; - Such a library can be non-trivial to write - Such a library if provided by the vendor can be better tuned to how the vendor implements parallelism for parallel loops, blocks, etc. I have been prototyping what such a library might look like, and have a working implementation that also ties in to how parallel iteration can be added to all the standard containers. The basic idea is to provide a special container library that can generate a chunk-vector. The chunk-vector is an abstraction that essentially is a vector of chunk boundaries, where each chunk boundary identifies the start and end cursors of a larger set of iteration. I have two generic libraries for this. One is intended for use when iterating over discrete subtypes, and the other is intended for use when iterating over containers. The basic skeleton of the discrete chunker is; generic type Loop_Cursor is (<>); package Ada202x.Containers.Loop_Chunking.Discrete_Iteration is type Chunk_Bounds is tagged private; function Start (Chunk : Chunk_Bounds) return Loop_Cursor; function Finish (Chunk : Chunk_Bounds) return Loop_Cursor; type Chunk_Vector (<>) is tagged limited private with Constant_Indexing => Constant_Reference, Default_Iterator => Iterate, Iterator_Element => Chunk_Bounds; function Chunks (From : Loop_Cursor := Loop_Cursor'First; To : Loop_Cursor := Loop_Cursor'Last) return Chunk_Vector; type Cursor is private; package Chunk_Vector_Iterator_Interfaces is new Ada202x.Iterator_Interfaces (Cursor, Has_Element); function Iterate (Container : Chunk_Vector) return Chunk_Vector_Iterator_Interfaces.Parallel_Iterator'Class; private ... end Ada202x.Containers.Loop_Chunking.Discrete_Iteration; The basic skeleton of the container chunker is very similar, differing only in the generic formal parameters; generic type Loop_Cursor is private; with package Container_Iterator_Interfaces is new Ada202x.Iterator_Interfaces(Cursor => Loop_Cursor, Has_Element => <>); package Ada202x.Containers.Loop_Chunking.Container_Iteration is type Chunk_Bounds is tagged private; function Start (Chunk : Chunk_Bounds) return Loop_Cursor; function Finish (Chunk : Chunk_Bounds) return Loop_Cursor; type Chunk_Vector (<>) is tagged limited private with Constant_Indexing => Constant_Reference, Default_Iterator => Iterate, Iterator_Element => Chunk_Bounds; function Chunks (Iterator : Container_Iterator_Interfaces.Parallel_Iterator'Class) return Chunk_Vector; type Cursor is private; package Chunk_Vector_Iterator_Interfaces is new Ada202x.Iterator_Interfaces (Cursor, Has_Element); function Iterate (Container : Chunk_Vector) return Chunk_Vector_Iterator_Interfaces.Parallel_Iterator'Class; private ... end Ada202x.Containers.Loop_Chunking.Container_Iteration; I've also added the following parallel iterator interface to Ada.Iterator_Interfaces; use type Ada202x.Containers.Count_Type; subtype Count_Type is Ada202x.Containers.Count_Type; type Cursor_Offset is range -(Count_Type'Last - 1) .. Count_Type'Last - 1; type Parallel_Iterator is limited interface and Forward_Iterator; function Length (Object : Parallel_Iterator) return Count_Type is abstract; -- A count of the number of elements in the container to be iterated over function Jump (Object : Parallel_Iterator; Position : Cursor; Offset : Cursor_Offset) return Cursor is abstract; -- A means to generate a cursor based on an offset number of elements -- from another cursor of the container. type Reversible_Parallel_Iterator is limited interface and Parallel_Iterator and Reversible_Iterator; A chunk-vector is a container that can be iterated over in parallel using constructs such as parallel loops and reduction expressions. To see how this might be used, consider the parallel IO example I described above, where we have a file of employee records, and we want to update all the records to give everyone a 2% raise of their current salary. One could write; with Ada.Direct_IO; with Ada202x.Containers.Loop_Chunking.Discrete_Iteration; procedure Test_Chunking is subtype Last_Name is String (1 .. 12); subtype Address is String (1 .. 20); type Annual_Salary is delta 0.01 digits 15; type Employee is record Employee_Name : Last_Name; Home_Address : Address; Salary : Annual_Salary; end record; package Employee_IO is new Ada.Direct_IO (Element_Type => Employee); Data_Filename : constant String := "Employees.dat"; Record_Count : constant Employee_IO.Positive_Count := Get_Record_Count; subtype Loop_Index is Employee_IO.Positive_Count range 1 .. Record_Count; package Manual_Chunking is new Ada202x.Containers.Loop_Chunking.Discrete_Iteration (Loop_Index); procedure Give_Raise (From, To : Loop_Index) is Employee_File : Employee_IO.File_Type; Current_Employee : Employee; begin Employee_IO.Open (File => Employee_File, Mode => Employee_IO.Inout_File, Name => Data_Filename, Form => "shared=no"); Employee_IO.Set_Index (File => Employee_File, To => From); for Position in From .. To loop Employee_IO.Read (File => Employee_File, Item => Current_Employee); -- 2% Raise to everyone! Current_Employee.Salary := @ + @ * 0.02; Employee_IO.Write (File => Employee_File, Item => Current_Employee); end loop; Employee_IO.Close (Employee_File); end Give_Raise; begin -- Test_Chunking parallel for Chunk of Manual_Chunking.Chunks loop Give_Raise (From => Chunk.Start, To => Chunk.Finish); end loop; end Test_Chunking; One might further enhance this example to use the anonymous loop body syntax, if that AI is approved. A similar example could be concocted for manually chunking parallel iteration of a container. Further to this, once we've defined the parallel iterator interface, we could update all the standard containers to use this interface. For example, I have prototyped the changes to Ada.Containers.Doubly_Linked_Lists; The relevant bits are; with Ada202x.Iterator_Interfaces; generic ... package Ada202x.Containers.Doubly_Linked_Lists is ... package List_Iterator_Interfaces is new Ada202x.Iterator_Interfaces (Cursor, Has_Element); ... function Iterate (Container : List) return List_Iterator_Interfaces.Reversible_Parallel_Iterator'Class; function Iterate (Container : List; Start : Cursor) return List_Iterator_Interfaces.Reversible_Parallel_Iterator'Class; ... private end Ada202x.Containers.Doubly_Linked_Lists; If a loop is written without the parallel keyword or reverse keyword, you get a forward iterator. If you have the reverse keyword, you get the reverse iterator, or if you have the parallel keyword, you get the parallel iterator semantics. Note also that when user initialization/finalization of chunks is not needed, the implementation could fall back on its own chunking strategies, but a vendor may find that the chunking libraries I've described above are also useful for implementing its own parallelism for these cases. Does this seem like ideas worth pursuing? If so, should this be added to the parallel operations AI (AI12-0119-1), or should this be a separate AI? *************************************************************** From: Tucker Taft Sent: Thursday, December 14, 2017 4:56 PM The paper about Generalized Parallel Iterators from Ada 2016 seems relevant: http://www.ada-europe.org/archive/auj/auj-37-2.pdf (scan to physical page 31, logical page 95) The accompanying presentation is here: http://www.cister.isep.ipp.pt/ae2016/presentations/taft.pdf It proposes a syntax that would allow the following: declare package String_Hyp_Objs is new Indefinite_Hyper_Objects (String, Identity => “”, Reducer => “&”); String_Accum : String_Hyp_Objs.Accumulator (Count => Num_Chunks); begin for Elem of parallel(C in 1 .. String_Accum.Count) My_Str_Vector loop String_Accum.Update (Index => C, Element => Elem); -- Concatenate Elem onto end of growing accumulated result end loop; Put_Line (String_Accum.Reduce); -- Put concatenation of all strings from My_Str_Vector end; Since we have moved the location of the "parallel" reserved word in newer proposals, we would have to move the part that declares the "chunk index" C, namely "C in 1..String_Accum.Count" (using syntax reminiscent of an entry-family index). But the basic idea of providing an explicit chunk index still seems viable. Providing a chunk index to the body of the loop might enable many of the capabilities you are suggesting. Would such a chunk index be enough? **************************************************************** From: Randy Brukardt Sent: Thursday, December 14, 2017 5:17 PM ... > Does this seem like ideas worth pursuing? If so, should this be added > to the parallel operations AI (AI12-0119-1), or should this be a > separate AI? I can't really answer the first part (again, it makes me wonder if there is sufficient value to the "usual" parallel loop), but the second part is pretty clear: it has to be a separate AI. If we don't buckle down and finish some version of the parallel loop stuff ASAP, there is no chance that any of it will be ready the summer after next (when our work on Ada 2020 completes). If this really isn't sufficiently mature to standardize (and the fact that you have a radically different proposal every 2-3 months strongly suggests that), then we should completely forget it for this cycle and spend our time on other proposals that actually have a chance to converge. We've only got 18 months to finish this!!! **************************************************************** From: Tucker Taft Sent: Thursday, December 14, 2017 5:21 PM > Does this seem like ideas worth pursuing? If so, should this be added > to the parallel operations AI (AI12-0119-1), or should this be a separate AI? I would recommend a separate AI. I believe we have also recommended splitting AI12-0119 into two AIs, one about reduction expressions and one about parallel loop/bock. Perhaps you could do that sooner rather than later, so we will have a new AI for reduction expressions that we can reference. Ed Schonberg at AdaCore has already implemented a version of reduction expressions, and it is pretty nice. **************************************************************** From: Randy Brukardt Sent: Thursday, December 14, 2017 5:29 PM Yes, all of the above is true. And with a January 26th homework deadline, all homework needs to be done sooner rather than later. Hint, Hint!! :-) BTW, the "new idea" deadline for ARG members is June 15th, so keep that in mind. After June, we're supposed to ignore new feature ideas from everyone, including each other. (For the general public, it is January 15th.) **************************************************************** From: Tucker Taft Sent: Thursday, December 14, 2017 5:41 PM > If this really isn't sufficiently mature to standardize (and the fact > that you have a radically different proposal every 2-3 months strongly > suggests that), then we should completely forget it for this cycle and > spend our time on other proposals that actually have a chance to > converge. We've only got 18 months to finish this!!! This is not a good outcome from my perspective. Our prime directive should be to get parallel blocks and parallel loops done. So I agree with Randy that we should finish the basic capabilities that we started, before inventing anything new. Splitting the AI into two parts would be a good start, and then refining the wording of each as needed. **************************************************************** From: Jeff Cousins Sent: Tuesday, December 19, 2017 4:48 AM How active is the “Gang of Four” concept? Probably naively, I’ve been assuming that they would have been having e-mail discussions, and possibly even a meeting, outside of the full ARG. Or is the correspondence that we’ve been seeing on the ARG list everything? **************************************************************** From: Tucker Taft Sent: Tuesday, December 19, 2017 8:08 AM No regular "gang-of-four" meetings these days. I think we are largely past that stage, and moved into the "AI" stage, where the ARG is the center of action. **************************************************************** From: Brad Moore Sent: Wednesday, December 27, 2017 11:00 PM Here is a first chunk of my homework. Here we have what remains of AI12-0119-1. [This is version /05 - ED.] As you may recall, reduction expression syntax is being separated from this AI, to be placed in a new so-far unknown AI. Coroutine capabilities are also similarly deferred to a different AI. I have hopefully addressed the comments from the last meeting. Some notable new bits are; - I have added parallel iterator interfaces to Ada.Iterator_Interfaces to facilitate container iteration, and updated all the containers to have parallel iterators (in addition to the iterator schemes currently supported). - I have also attempted to add legality rules to eliminate data races in parallel loops and parallel blocks, and a new suppressible check called Loop_Parameter_Check, which dynamically checks that the loop parameters are not used non-sequentially between tasklets to eliminate that as a source of erroneousness. - I am hoping that the global AI means that most data races can be eliminated as a legality check, rather than a dynamic runtime check, or static warning. - I think the only ones that need dynamic checking are those that update global variables that also mention the loop parameter variable. Cheers, Merry Christmas, and Happy New Year! *************************************************************** From: Jean-Pierre Rosen Sent: Thursday, December 28, 2017 12:07 AM > On the other hand, calls by different > tasklets of the same task into the same protected object are treated > as different calls resulting in distinct protected actions; therefore > synchronization between tasklets can be performed using protected > operations. What if a tasklet calls an entry of the enclosing task, and another one accepts it? It would work with real parallelism, but it would be a sure deadlock if executed sequentially. Maybe forbid accepts in parallel statements? *************************************************************** From: Jeff Cousins Sent: Thursday, December 28, 2017 2:25 AM Thanks for doing that Brad, and Happy Christmas and New Year everyone. *************************************************************** From: Brad Moore Sent: Thursday, December 28, 2017 11:03 AM > Maybe forbid accepts in parallel statements? Good point, I agree that probably a legality rule should be added to forbid accept statements in parallel blocks and loops. Note, a similar problem can occur for protected objects, for example if one side of a parallel block is a producer and another is a consumer. If executing sequentially, the consumer might block waiting for the producer, which could deadlock, whereas this would work fine for parallel execution. I think what we'd want is to fall back to coroutine-like behaviour if executing sequentially where if a tasklet blocks, it implicitly yields allowing other tasklets to execute. Just prior to last meeting, the AI dealt more with sequential behaviour for the case of a parallel block without the parallel keyword. It was considered to be a coroutine. We decided at last meeting to move the coroutine part of it to a separate AI, but these AIs are still closely intertwined because even if the parallel keyword is present we need to consider the case where the implementation decides to ignore that hint from the programmer (perhaps due to oversubscription of cores from other parallel statements) and execute the parallel block sequentially. I think I need to add some wording to this AI that talks about implicit yields when tasklets are blocked. The other AI can probably focus more on explicit yield statements. *************************************************************** From: Randy Brukardt Sent: Friday, December 29, 2017 12:35 AM > > Maybe forbid accepts in parallel statements? > > Good point, I agree that probably a legality rule should be added to > forbid accept statements in parallel blocks and loops. I thought the intent was to not allow blocking in bodies of parallel blocks and loops. Otherwise, a tasklet could hardly be a lightweight entity, as it would require the entire mechanisms of an Ada task (including queuing, priority inheritance, and the like). While that certainly would be more capable, it also would be much more dangerous. We have a mechanism for compile-time enforcement of nonblocking (just finished), and I had thought part of the reason for defining that mechanism was to use it in parallel blocks and loops. Otherwise, it seems like it was a lot of work for minimal gain. > Note, a similar problem can occur for protected objects, for example > if one side of a parallel block is a producer and another is a > consumer. If executing sequentially, the consumer might block waiting > for the producer, which could deadlock, whereas this would work fine > for parallel execution. I think what we'd want is to fall back to > coroutine-like behaviour if executing sequentially where if a tasklet > blocks, it implicitly yields allowing other tasklets to execute. Again, I'd expect this sort of thing to be banned. Protected operations still could be used to get mutual exclusion, but entry calls would not be allowed. I don't see any reasonable way to reason about the behavior of tasklets if scenarios like the above are allowed, since neither the number of tasklets nor how they "chunk" iterations is known to the author of the parallel loop. The only portable way to write such a loop is to avoid blocking altogether (since there can be no reliable way to ensure that deadlock can't happen). Communication between tasklets has to be prevented (outside of reduction expressions, of course). > Just prior to last meeting, the AI dealt more with sequential > behaviour for the case of a parallel block without the parallel > keyword. It was considered to be a coroutine. > We decided at last meeting to move the coroutine part of it to a > separate AI, but these AIs are still closely intertwined because even > if the parallel keyword is present we need to consider the case where > the implementation decides to ignore that hint from the programmer > (perhaps due to oversubscription of cores from other parallel > statements) and execute the parallel block sequentially. > > I think I need to add some wording to this AI that talks about > implicit yields when tasklets are blocked. The other AI can probably > focus more on explicit yield statements. You can't be serious. This is way worse than my original fear; not only do the supposedly lightweight tasklets have to carry all of the machinery of Ada tasks, but they also have to act like coroutines, with an entire additional level of complication. Assuming that the coroutine proposal actually get fleshed out and into the Standard, a user could write such a mechanism if they wanted it. But paying for it all the time seems mad to me, and you still wouldn't be able to reason in any useful way about how tasklets execute (since you don't know the mapping). Communication between parallel loop iterations (and between parallel block "limbs") should be banned. Otherwise, it means that the supposedly "parallel" operations aren't the independent entities that can actually be run in parallel in the first place. I could imagine relaxing such restrictions in the case when the number of tasklets and their mapping is specified via some future mechanism. (In that case, one could actually portably reason about the behavior.) But it seems to me that we have Ada tasks for such needs; parallel blocks and loops are supposed to be easy-to-use, safe-by-construction facilities to that make it simple to introduce parallelism into a program. If things are complex enough, a library like your (Brad) Parafin makes more sense. It can't have safety-by-construction but can allow more complex structures. *************************************************************** From: Brad Moore Sent: Friday, December 29, 2017 12:33 AM > We have a mechanism for compile-time enforcement of nonblocking (just > finished), and I had thought part of the reason for defining that > mechanism was to use it in parallel blocks and loops. Otherwise, it > seems like it was a lot of work for minimal gain. One of the later developments of the gang of four was that we determined that it would be OK to support protected actions in tasklets. That doesn't necessarily mean we should support that, but that it is an option, so we included it in our proposal. I'd have to look back, but it might have been that we were more keen on supporting protected function calls and protected procedure calls, but not so much blocking calls. I agree that disallowing blocking calls would simplify the proposal, and would be OK with doing that if that is the general consensus. We could always add that in later if we felt it was needed, but it is much harder to disallow something later once it gets in the standard, as you know. *************************************************************** From: Randy Brukardt Sent: Friday, December 29, 2017 6:44 PM ... > One of the later developments of the gang of four was that we > determined that it would be OK to support protected actions in > tasklets. That doesn't necessarily mean we should support that, but > that it is an option, so we included it in our proposal. I'd have to > look back, but it might have been that we were more keen on supporting > protected function calls and protected procedure calls, but not so > much blocking calls. Protected subprogram calls can be considered nonblocking assuming the protected type is declared that way (meaning that static nonblocking checking is applied to the operations). These are important for mutual exclusion, and so they would have to be allowed (and they are with suitable declarations, which applies to everything). > I agree that disallowing blocking calls would simplify the proposal, > and would be OK with doing that if that is the general consensus. We > could always add that in later if we felt it was needed, but it is > much harder to disallow something later once it gets in the standard, > as you know. Right, that's part of my thinking here. If we find some compelling examples where some limited blocking would work and make sense even if the mapping of iterations to tasklets is unknown, then we can revisit. But the goal is safety-by-construction: if the loop compiles, it will execute safely without deadlocks or races. That's a lot easier if we don't have to worry about blocking (it's not clear to me that it is even possible if blocking is involved). And I want tasklets to be a light-weight as possible. Indeed, I would like to view tasks as being constructed on top of tasklets by adding blocking capabilities and the like. *************************************************************** From: Brad Moore Sent: Saturday, December 30, 2018 1:34 PM I have been looking more closely at Tuckers paper from the Ada Users Journal on Container Iteration, and I think it would be worth incorporating some of those ideas. http://www.ada-europe.org/archive/auj/auj-37-2.pdf (scan to physical page 31, logical page 95) In particular, providing a Split primitive on the Parallel_Iterator_Interface instead of a Jump function to Jump to an arbitrary cursor relative to another cursor, gives the container writer more flexibility in determining which cursors of the container should be used for the chunks, as well as possibly the number of chunks to be applied. This also provides a common, better defined interface that can be used by both the implementation, and the user. The compiler would use the interface under normal circumstances, but the user could also use the interface when manual chunking is needed, which would eliminate the need to provide a separate interface for that purpose. I think this can be accomplished with some fairly minor tweaks to the existing AI. As it stands, I would replace the addition to Ada.Iterator_Interfaces of; use type Ada.Containers.Count_Type; subtype Count_Type is Ada.Containers.Count_Type; type Cursor_Offset is range -(Count_Type'Last - 1) .. Count_Type'Last - 1; type Parallel_Iterator is limited interface and Forward_Iterator; function Length (Object : Parallel_Iterator) return Count_Type is abstract; function Jump (Object : Parallel_Iterator; Position : Cursor; Offset : Cursor_Offset) return Cursor is abstract; with; type Chunk_Array is limited interface; function Length (Object : Chunk_Array) return Natural is abstract; function Start (Object : Chunk_Array; Index : Natural) return Cursor is abstract; function Finish (Object : Chunk_Array; Index : Natural) return Cursor is abstract; type Parallel_Iterator is limited interface and Forward_Iterator; function Split (Object : Parallel_Iterator) return Chunk_Array'Class is abstract; The Chunk_Array interface would provide the means to extract the cursors out of the Split result. The Length function indicates how many chunks were returned by the Split call, and Start, Finish return the start and end cursors for a particular chunk. Then for each of the standard containers, I would add a type definition for the Chunk_Array. For example, for Ada.Containers.Doubly_Linked_Lists, I would add; type Chunk_Array (<>) is new List_Iterator_Interfaces.Chunk_Array with private; I believe the Chunk_Array definition would be the only additional line needed in each container. Then, if a user wanted to manually chunk a list container, one could write something like; Chunks : constant My_Lists.List_Iterator_Interfaces.Chunk_Array'Class := List.Iterate.Split; begin parallel for Chunk in 1 .. Chunks.Length loop Inner_Loop : for Position in List.Iterate (Start => Chunks.Start(Chunk)) loop ... exit Inner_Loop when Position = Chunks.Finish (Chunk); end loop Inner_Loop; end loop; end; Note that no additional syntax support is needed than is already proposed in the AI. Does this seem like an improvement? If so, I will update the AI to include these changes. *************************************************************** From: Randy Brukardt Sent: Saturday, December 30, 2018 3:55 PM Tucker's idea certainly makes more sense than your old idea, which requires adding indexed access to containers that don't naturally have such access. But I would have expected the Split interface to at least take a requested number of chunks. Otherwise, the container has no way of knowing how many chunks would be optimal (since that depends on the number of tasklets available and other implementation details). It shouldn't be guessing. And it would seem that a sensible implementation of the interface you have would be to simply return a single chunk for every query, since that is easy to implement and makes as much sense as any other result. I think you'd want to prevent that somehow, and the best I can think of is to have a suggested number of chunks. Of course, it might make sense for the container to return some other number of chunks (if the container doesn't have enough elements for the requested number of chunks, for instance). So that part of the interface makes sense. But we definitely don't want the container interface to be tightly tied to the parallel loop implementation, which it would have to be if there is no number of chunks in the Split request. Hopefully both things can be implemented separately (quite possibly by different groups for larger compiler vendors). P.S. You might want to consider whether the routines of this interface ought to be declared nonblocking. One can't do that with the existing iterators because of compatibility, but you have no such concern. *************************************************************** From: Brad Moore Sent: Sunday, December 31, 2018 3:05 PM > Tucker's idea certainly makes more sense than your old idea, which > requires adding indexed access to containers that don't naturally have such > access. I have actually modified Tuck's ideas from his paper in several ways, which I think are improvements. In particular, he had Parallel_Interfaces in a generic child package of Iterator_Interfaces, using a Cursor_Array formal as an incomplete type. I started off with that but couldn't get it to work, as the Cursor_Array had to be a private type, which then nobody, (Implementations or Users) could do anything with the private type. By defining the Chunk_Array interface in the same place as the iterator, I was able to eliminate the generic formal, allowing the interface definition to be moved into iterator_interfaces, which means only one instantiation is needed to get all iterator interfaces, which I think is more convenient. He also had the Cursor array being passed into Split as an out parameter. I think returning the Chunk_Array as a function result will be more convenient to users. Lastly, his Split operation was returning an array of cursors, whereas my Split returns an array of chunks where each chunk has a start and end cursor. I think this is safer, particularly for user usage, since with an array of cursors, the user has to be sure to not dereference the cursor representing the end of the chunk, because that is being used as the start of another chunk. Also, it eliminates the user having to deal with the end case for the last chunk where the last cursor would have been a No_Element cursor. With the current scheme, all cursors from first to last of a chunk are valid and non overlapping with other chunks. I think it also fits nicer for user usage, using the Iterate calls that accept a start parameter, where one can exit the inner loop by comparing to the last cursor of the chunk, as in my example from the previous email. > But I would have expected the Split interface to at least take a > requested number of chunks. Otherwise, the container has no way of > knowing how many chunks would be optimal (since that depends on the > number of tasklets available and other implementation details). It > shouldn't be guessing. And it would seem that a sensible > implementation of the interface you have would be to simply return a > single chunk for every query, since that is easy to implement and > makes as much sense as any other result. I think you'd want to prevent > that somehow, and the best I can think of is to have a suggested number of > chunks. I agree, I have prototyped a new version of Split that looks like; function Split (Object : Parallel_Iterator; Advised_Split : Natural) return Chunk_Array'Class is abstract; Where Advised_Split is the recommended number of chunks passed in from the implementation, but can be ignored by the container writer if he/she so chooses. I also added another function to the Parallel_Iterator interface function Iterations (Object : Parallel_Iterator) return Ada.Containers.Count_Type is abstract; Because I think the implementation will need to query the iterator to determine the number of iterations associated with the iterator so that the implementation can make a better informed decision about determining the number of chunks to suggest. I think for user chunking, the user often would prefer to not have to decide the number of chunks, so I was thinking of adding the following definitions to System.Multiprocessors. type Iteration_Count is range -(2 **63) .. +(2 **63 - 1); for Iteration_Count'Size use 64; type Chunk_Count is range 0 .. 2 **32 - 1; function Advised_Chunk_Count (Iterations : Iteration_Count) return Chunk_Count witn Non_Blocking => False; Which provides a routine that the user can call to query the implementation for a recommended chunk count to use, given a number of iterations. I decided to make this routine Non_Blocking => False because the routine might want to query the processor load to see how many processors are idle. On Linux, I have found this can be done by reading system files from the /proc filesystem, but presumably that would be blocking behaviour. Lastly, I know we last thought another AI would be recommended, but now that container iterators can be defined this way, it makes it a lot simpler to define a package that can similarly support chunking for discrete type iteration. with Ada202x.Iterator_Interfaces; with Ada202x.Containers; generic type Loop_Index is (<>); package Ada.Discrete_Chunking is pragma Preelaborate; pragma Remote_Types; function Has_Element (Position : Loop_Index) return Boolean is (True); package Chunk_Array_Iterator_Interfaces is new Ada202x.Iterator_Interfaces (Loop_Index, Has_Element); type Chunk_Array (<>) is new Chunk_Array_Iterator_Interfaces.Chunk_Array with private; function Iterate (From : Loop_Index := Loop_Index'First; To : Loop_Index := Loop_Index'Last) return Chunk_Array_Iterator_Interfaces.Parallel_Reversible_Iterator'Class; private ... end Ada.Discrete_Chunking; This allows the user to write something like; declare package Manual_Chunking is new Ada.Discrete_Chunking (Natural); Chunks : constant Manual_Chunking.Chunk_Array_Iterator_Interfaces.Chunk_Array'Class := Manual_Chunking.Iterate.Split (Advised_Split => 10); begin parallel for Chunk in 1 .. Chunks.Length loop for I in Chunks.Start (Chunk) .. Chunks.Finish (Chunk) loop ... end loop; end loop; end; Which is very similar to the Container loop example. The only thing that would be nice to have with the Chunk_Array type returned by Split, is if it could be treated as a container so one could replace the loop above with; parallel for Chunk of Chunks loop for I in Chunks.Start .. Chunks.Finish loop ... end loop; end loop; But I suspect this is not currently possible to do, at least with something that returns Interface'Class because I think you'd need to apply the Iterator_Element aspect to the interface which cannot be done, since the interface doesn't know the Iterator_Element type. Maybe there is a way to work around this, but I think its probably good enough to require using "in" iterator loop syntax as above for these cases instead of "of" syntax. Other than manual chunking, it will be the implementation using the interface, so the "prettyness" of the syntax is not needed. *************************************************************** From: Jeff Cousins Sent: Monday, January 1, 2018 6:15 PM Thanks Brad for all that. Some comments, mostly minor/typos: !proposal 2nd para notational -> notional ? !proposal 5th para 9.10 ([23]{11???})), -> simply 9.10(11) ? !proposal 6th para I don’t think that you can say that tasklets are orthogonal to tasks, particularly as the very next sentence covers how they relate; tasklets are a sort of tasks--. Parallel Blocks 3rd para last sentence Following on from Jean-Pierre’s and Randy’s comments, this would reduce to something like “The appropriate use of atomic or protected subprogram calls can be used to avoid erroneous execution.” Parallel Blocks 4th para Potentially_Blocking has been superseded by Nonblocking. Parallel Blocks 5th para “contgrol” -> “control” !wording New paragraph after 5.5 (9/4) This is out of order as 5.5 (7) follows. “an global” -> “a global” Modify to 5.5 (9/4) “decreasing order {or unless” -> “decreasing order{, or unless” “Examples after 5.5(20)” -> “Examples after 5.5(21)” Modify 5.5.1 (6/3) “decended” -> “descended” Separate sentences for parallel iterator object and parallel reversible iterator object would be more consistent wording. Modify 5.5.2 (4/3) “iterator_specificaiton” -> “iterator_specification” Modify 5.5.2 (7/5) “the nominal subtype of the loop parameter{s are} [is]” -> “the nominal subtype of the loop parameter{s} is” – parameters may be plural but their subtype is singular. Modify 5.5.2 (10/3) “the denoted iterator object{s} becomes” -> “the denoted iterator object{s} become[s]” “5.6.1 Parallel Block Statements -> Insert new paragraph 5.6.1 Parallel Block Statements ? Why the opening quotes, and isn’t all of this part an insertion? Why the [] around the first paragraph? Add new paragraph after 11.5 (20) “elememnt” -> “element” Modify A.18.2 (230.2/3) Modify A.18.3 (144.2/3) Modify A.18.6 (94.2/3) Modify A.18.9 (113.2/3) “notes” -> “nodes” Modify A.18.2 (230.4/3) Modify A.18.3 (144.4/3) Modify A.18.6 (94.4/3) Modify A.18.9 (113.4/3) Modify A.18.10 (159/3) “simulataneously” -> “simultaneously” Modify A.18.10 (159/3) “or or” -> “or” ************************************************************** From: Brad Moore Sent: Monday, January 1, 2018 12:33 AM Thanks Jeff for your comments. Since there are now quite a lot of changes being discussed compared to the version of the AI that was sent out, I have attached a new version of the AI. [This is version /06 of the AI - ED.] This version includes the following changes; 1) Addressing Jeff's Comments. 2) Potentially Blocking operations are now disallowed in parallel loop and parallel block statements 3) Added Nonblocking aspect to the new routines where appropriate 4) Modified the Parallel Iterator interface to have Tuck's Split function idea, instead of the Jump function. 5) Added an Advised_Split parameter to the Split function so that the implementation can recommend the number of splits to be applied to the iterator, allowing container implementors to override the recommendation if desired. 6) Added function in System.Multiprocessors to query the implementation for a recommended split count, that can be called by User code if desired. 7) Added a package Ada.Discrete_Chunking, which basically contains a single Split function that returns a Chunk_Array that can be used for parallel iteration involving manual chunking of discrete types, similar to what is also possible using containers. Happy New Year everyone! ************************************************************** From: Tucker Taft Sent: Tuesday, January 2, 2018 9:35 AM > ... > > He also had the Cursor array being passed into Split as an out > parameter. I think returning the Chunk_Array as a function result will be more > convenient to users. ... Convenience to users seems somewhat irrelevant in a routine that is intended to be called by compiler-generated code. Furthermore, the OUT parameter avoided the need for doing dynamic allocation and deallocation on a secondary stack, something which is painful in some environments, and not something you want to require every time you do something with a container. Finally, the size of the passed-in array provided the specification of the number of chunks desired. *************************************************************** From: Brad Moore Sent: Tuesday, January 2, 2018 10:36 AM > Convenience to users seems somewhat irrelevant in a routine that is > intended to be called by compiler-generated code. I agree that your original design made perfect sense for the original intent outlined in the paper, which was as something called only by compiler-generated code. I say the newer design has improvements in the context of now also allowing users to call the same routines, since there seems to be a number of cases where that would be needed. Since the user can now be involved, more thought is needed to making the interface more user-friendly. > Furthermore, the OUT parameter avoided > the need for doing dynamic allocation and deallocation on a secondary > stack, something which is painful in some environments, and not > something you want to require every time you do something with a container. Keep in mind that this occurs only for parallel usage, and for parallel usage for the parallelism to be worthwhile in the first place, I would think the amount of processing needed to do the allocation and deallocation of the chunk array would be insignificant compared to the amount of time needed to do the processing on the elements in the container. How much pain is there to do such allocations for such environments? I would think regardless how painful, it is still something that generally needs to be supported in such environments for other use cases. > Finally, the size of > the passed-in array provided the specification of the number of chunks desired. Yes, except I thought the idea was also that the container implementor might want to override the number of desired chunks. As an out parameter, the container writer is more tied to the number passed in by the implementor. For a binary tree container for instance, maybe some power of two representing the top nodes in the tree, makes for a better division of work, and that might be less or more than the number of elements in the out parameter array. So long as the number of chunks is less than or equal to the number of elements in the container, it should work for parallel usage. *************************************************************** From: Tucker Taft Sent: Tuesday, January 2, 2018 12:22 PM > I agree that your original design made perfect sense for the original > intent outlined in the paper, which was as something called only by > compiler-generated code. > > I say the newer design has improvements in the context of now also > allowing users to call the same routines, since there seems to be a number > of cases where that would be needed. Since the user can now be involved, more > thought is needed to making the interface more user-friendly. If that is really true, then I think we should consider making improvements on the syntactic sugar, rather than spending energy trying to make the compiler-oriented operations friendlier. > Furthermore, the OUT parameter avoided >> the need for doing dynamic allocation and deallocation on a secondary >> stack, something which is painful in some environments, and not >> something you want to require every time you do something with a container. > > ... > How much pain is there to do such allocations for such environments? I > would think regardless how painful, it is still something that generally > needs to be supported in such environments for other use cases. I just don't think we want to insert this kind of unnecessary dynamic allocation inside of compiler-generated code. We could provide a "user-friendly" version as well, though functions that return unconstrained arrays are not terribly friendly, in that you often have to call them as part of a declaration, which may require the insertion of yet another declare block in the middle of things. > Finally, the size of >> the passed-in array provided the specification of the number of chunks desired. > > Yes, except I thought the idea was also that the container implementor > might want to override the number of desired chunks. As an out > parameter, the container writer is more tied to the number passed in > by the implementor. For a binary tree container for instance, maybe > some power of two representing the top nodes in the tree, makes for a better > division of work, and that might be less or more than the number of elements > in the out parameter array. I agree that some chunks at the end might end up empty, but producing more chunks than what the caller has specified as a "max" number of chunks seems undesirable. *************************************************************** From: Randy Brukardt Sent: Wednesday, January 10, 2018 10:43 PM ... > Since there are now quite a lot of changes being discussed compared to > the version of the AI that was sent out, > > I have attached a new version of the AI. Here's a list of editorial changes that I made to this version (there were enough so I posted this as a separate version, version /07): RM section/paragraph numbers don't include a space, and should not be prefixed by anything (in particular, by "RM"). AARM section/paragraph numbers should always be prefixed by "AARM". (This is part of the "Kiyoshi consistency rules", and as such is not negotiable). Only one space after a period ending a sentence. Text that Tuck wrote is almost always wrong about this, but don't copy that style! The rules for concurrent access includes a header and three bullets, and thus would best be described as "9.10(11-14)". The AI number of reduction expressions will be AI12-0242-1, so I plugged that in. The "rules for shared variables" are at a minimum 9.10(11-14); there is a specific change to 9.10(13) being recommended, so it seems best to mention both. "(see 9.10(11-14), and specifically 9.10(13))". The headers for wording are inconsistent or missing in various places, and the wording needs to be in paragraph order. Use either "Replace xx(xx)", "Modify xx(xx)", "Add after xx(xx)", or "Add new subclause xx". Normally, I'd end all of these with a colon (:) but there is some flexibility here as they're not mentioned in the "Kiyoshi rules". But we don't want them to vary for different paragraphs. One thing the "Kiyoshi" rules" does mention is that one never talks about "paragraphs", one just says "add after". Some specific issues: The grammar change at 5.5(3/3) has to be a "Replace" since there are no modification marks in the following text (and using those in grammar changes is always questionable, since [] and {} have a syntax meaning). There is no location at all specified for the Legality Rules. I added a header to insert them after 5.5(6/5). (That seems like the best place, since we want the basic definition of a "loop parameter" to occur before we start talking about the rules for such things.) The Legality Rules refer to specific paragraph numbers; that's not allowed in wording. In those rules: ... *parallel* reserve{d} word. A change at 5.5(7) has to go before a change at 5.5(9/4). (Jeff noted this.) "Modify [to] 5.5(9/4)" This section also has quote marks around the text for no known reason. The "with Ada.Containers" at 5.5.1(1/3) is not needed with the new specification, since it doesn't reference anything from that package. I deleted it. I added a separate header for the added paragraph after 5.5.2(10/3), and deleted the insertion marks (they're way too easy to miss). And there were no insertion marks for the new AARM note. Changed the header of 5.5.2(13/3) to: "Modify 5.5.2(13/3) [into three paragraphs]" because that's not obvious to a causal reader (me). "Section" is what we once might have called a chapter. These added things are subclauses. Some of the wording text has rather long lines. 80 character lines is generally the maximum. The "Iterate" paragraphs are especially bad. P.S. Brad, if you are going to make changes to this AI, please make sure you have version /07 first. If it isn't posted yet, please ask me for a copy. *************************************************************** From: Randy Brukardt Sent: Wednesday, January 10, 2018 10:42 PM ... > I have attached a new version of the AI. Here are my substantive comments on this AI. (See the other message for the editorial comments, all of which I've already applied.) > Legality Rules, added after 5.5(6/5): > A loop_statement with the *parallel* reserve word shall not update > variables global to the loop unless either the action is sequential > (see 9.10), or the expression for the update mentions the loop parameter. An action being "sequential" is a dynamic concept (actually, it is defined for "erroneous execution", but that is obviously a dynamic concept). We can't base Legality Rules on dynamic concepts, because they cut through privacy and the like (thus violating all of the good Ada principles). I thought the idea was to use Global to define this properly for a Legality Rule, essentially not allowing any call or other use of a global variable unless it is "synchronized" (using the meaning of that from Global). Exactly how to word this isn't obvious to me; perhaps there is some wording defining Global that could be reused. (The wording would also need to allow the exception you noted your wording above.) > In that same wording: > The sequence_of_statements of a loop_statement with the *parallel* > reserve word shall not consist of a statement that can invoke a > potentially blocking operation (see 9.5). Wrong. "Potentially blocking" is a dynamic concept (there shortly will be an AI on this mistake). You want to use the Legality Rules defined for Nonblocking (specifically, 9.5(58-65/5). Unfortunately, there is no term specifically for this (maybe there should have been). And Standard wording cannot use paragraph numbers. So perhaps say something like: The sequence_of_statements of a loop_statement with the *parallel* reserved word are considered to be contained in a nonblocking program unit for the purposes of Legality Rules (see 9.5). AARM Ramification: This meaning that calling subprograms that allow blocking, or directly blocking statements like an entry call, accept statement, or delay statement, are not allowed in the body of a parallel loop. This is true even if the surrounding unit allows blocking. > Add after 5.5(7): > When the reserved word parallel is present and a transfer of control > out of the loop occurs, an attempt is made to cancel further parallel > execution of the sequence_of_statements that has not yet started. I thought the intent was that we'd try to cancel further parallel execution of the sequence_of_statements that HAS started, and to prevent any further iterations from starting. In any case, we need to be more specific about what "cancel" means. (Is it "abort", with all of the baggage of that? I'd hope not, but then precisely what is required to be canceled, and what *can* be canceled?) > Add after 5.5.2(10/3): > For a parallel generalized iterator, the operation Iterations of the > iterator type is called first to determine the number of iterations > associated with the iterator. I note in passing that this doesn't make any sense if we really add "when" filters, as the only way to know that actual number of iterations is to partially execute them (assuming the filter is an arbitrary expression). One probably would have to do chunking based on the number of iterations expected ignoring any filter, but that of course makes the filter essentially a lie (the iteration will have to happen regardless of the filter value; only the executed statements would change. Moreover, the chunking would often be suboptimal in such a case). > Modify 5.5.2(13/3) [into three paragraphs] The new third paragraph is: > If the loop parameter is a constant (see above), then the indexing > uses the default constant indexing function for the type of the > iterable container object for the loop; otherwise it uses the default variable > indexing function. The intent is that this applies to both forward/reverse container element iterators and parallel container element iterators (and no others), but as a separate paragraph following the parallel container element iterator this is not obvious. Probably it needs a lead-in like: For any kind of container element iterator, if the loop parameter is ... >New subclause 5.5.3 has: > generic > type Loop_Index is (<>); > package Ada.Discrete_Chunking is > > pragma Preelaborate; > pragma Remote_Types; Use aspects, not pragmas, in new packages. (Otherwise, I'd have to complain that the unit name was not repeated as it is in all of the other language-defined packages.) So: generic type Loop_Index is (<>); package Ada.Discrete_Chunking with Preelaborate, Remote_Types is > New subclause 5.6.1. The Legality Rules here need reworking just like the ones in the loop cases. For some reason, the execution of a parallel block is described in a "Static Semantics" section. Put a "Dynamic Semantics" header in there somewhere! > Add after 11.5(20) > > Loop_Parameter_Check > For a parallel loop, check that updates to a component of an array or > to an element of a container via the loop parameter are sequential. See > 9.10(11). This implies that there is a dynamic check somewhere. But there is no such check defined in 5.5 or anywhere else. (This is the pragma Suppress definition, but the actual check has to be defined, too, in particular as to what exception is raised. Look up any check in the RM index to see the appropriate form of the check wording.) > Add after D.16(5/3) > > type Iteration_Count is range 0 .. implementation-defined; > for Iteration_Count'Size use implementation-defined; > > type Split_Count is range 0 .. implementation-defined; > > function Advised_Split_Count > (Iterations : Iteration_Count) return Split_Count > witn Non_Blocking => False; I don't see the point of the Iterations parameter here. An appropriate split would take an estimate as to how expensive the loop body (or block body, for that matter) is as well as the number of iterations. As such, it doesn't seem sensible to have such a function here (in package Multiprocessing). What seems appropriate here is a recommended number of executors; I'd suggest something like: function Advised_Executors return CPU_Range witn Non_Blocking => False; I used CPU_Range because it doesn't make much sense to have more executors than execution resources (processors). (Getting rid of the types simplifies this a lot.) Additionally, it might make sense to specify the task and/or CPU that is involved (as I can imagine the answer being different for different tasks). One hopes that you could call this for some other task as part of setup (and not just be restricted to querying this for yourself). However, since System.Multiprocessors seems to be designed to sit below the level of tasks (it has no task-based queries). So perhaps the more specific version belongs in Dispatching_Domains (along with Get_CPU). function Advised_Executors (T : Ada.Task_Identification.Task_Id := Ada.Task_Identification.Current_Task) return CPU_Range witn Non_Blocking => False; That's enough for tonight. :-) *************************************************************** From: Tucker Taft Sent: Saturday, January 6, 2018 11:16 AM Here is some unsolicited "rah, rah" from the sidelines. I think this shows that the parallel and reduction features are worth all the effort, and back-and-forths, we are investing in them. Begin forwarded message: From: "Mcbeth, FletcherX" Subject: Would love to see tasklets, parallel loops and parallel arrays/reductions thereof within Ada 2020 Date: January 5, 2018 at 8:41:06 PM EST Hello Tucker, My name is Fletcher McBeth. I am working with the HPC group here at Intel Hudson and would very much enjoy receiving any and all updates you may be aware of towards getting your proposed constructs describing tasklets, parallel loops and parallel arrays/reductions thereof into the next version of Ada (hopefully 2020). Thank you in advance. *************************************************************** From: Brad Moore Sent: Wednesday, March 28, 2018 12:03 AM Here is an update to AI12-0119. [This is version /08 - ED]. It is now much smaller, with manual chunking, container iterators, and data-race/deadlocking specific items moved into new separate AI's. Otherwise, the text hasn't changed much except addressed a few typos that were mentioned at the last teleconference call. *************************************************************** From: Tucker Taft Sent: Wednesday, March 28, 2018 9:10 PM Thanks, Brad. As I mentioned in my comments on one of the other AIs, probably should remove the no-data-race, no blocking from this AI, since it is covered by a separate one. Separating out these pieces has really helped, and I think the wording is starting to converge. However, I was struck by this bit of wording: "A parallel block statement requests that two or more sequences of statements should execute in parallel with each other." I find it a bit odd to describe a construct as "requesting" something. We generally just try to describe what a construct does, as opposes to what it "requests." I would suggest you take a look at the introduction to chapter 9. I suspect we will need to introduce the notion of "tasklet" somewhere. I am thinking that perhaps all of these things should be described somewhere in chapter 9, even if some of the syntax appears earlier. If we don't coordinate the definition of parallel block and tasklet with the introduction to tasks in chapter 9, the result is probably going to be very confusing, since chapter 9 is where we talk about threads of control. So I think that argues for having a separate section somewhere in chapter 9 on tasklets, and in that section introduce parallel blocks and loops. And the introduction to chapter 9 will need to set the stage for the discussion on tasklets, even if we don't go into the details until a bit later in a section devoted to tasklets. *************************************************************** From: Randy Brukardt Sent: Thursday, March 29, 2018 6:15 PM I'm not sure I agree. He does define the tasklet semantics in chapter 9 (in 9.12, specifically). It probably wouldn't hurt to mention that in the introduction to chapter 9, too. [Oops, "Clause 9" if we're following current ISO-speak. :-)] OTOH, I don't think we want to hide the actual parallel blocks and loops in Clause 9. Everyone "knows" that writing programs with tasks is hard, and putting the parallel stuff there says that the same is true of them. Putting them with the basic statements signals that these are "easy" (there's nothing easier in Ada than Chapter 5!). Yes, that's marketing more than any major semantic consideration -- but we all know that marketing and appearances matter. That does require deferring some of the rules until the tasklet subclause, and/or forward references. But it would seem very odd to have basic blocking and looping constructs away from the rest of those constructs. (And we hope/expect that parallel programming will get to be very basic during the lifetime of Ada 2020.) In some sense, this is very similar to your argument that eventually compilers will handle a lot of parallelism under the covers. A parallel block or loop itself is just an assertion that the block or loop can be executed in parallel (without regard to whether that "works" in an as-if optimization, but possibly with checking that it is OK). Otherwise, they are just another control structure. So I think the current organization is approximately right (details to be determined). This obviously is a "squishy" argument and this organization is not a big deal -- but we need to consider how this will look in 30 years. We don't want to hide basic control flow in an otherwise little-used clause (explicit tasks would be rare in a world where a lot of parallelism is applied automatically or by simple declarations). *************************************************************** From: Jeff Cousins Sent: Tuesday, April 3, 2018 6:19 AM We agreed to send these offline. !problem 2nd para “independent processing where” -> “independent processing, whereas” would seem better. Typos – “perforamance” -> “performance”, “mainain” -> “maintain”. Add after 5.5(7), 5.6.1 Static Semantics 2nd para It seems odd too talk of cancelling something that hasn’t yet started, maybe “inhibit” instead? 5.6.1 Static Semantics It seems odd to describe transfer of control under Static Semantics rather than Dynamic Semantics – after all, the equivalent wording for parallel loops is under Dynamic Semantics. Modify 9.10(13) Typo – “part of a;” -> “part of a:”. *************************************************************** From: Tucker Taft Sent: Monday, April 2, 2018 3:06 PM Here is an attempt to revise the introduction to "Tasks and Synchronization" to incorporate mention of "parallel constructs" without being more specific than talking about "multiple logical threads of control." Comments welcome. The vocabulary of this introduction will probably form the basis for the descriptions of the semantics of parallel constructs, so it seems important to agree on it first. I purposely did not show the actual edits, so you can just read it to see if it makes sense. The number of edits is actually relatively small. ---------- Modify section 9, Tasks and Synchronization: The execution of an Ada program consists of the execution of one or more tasks. Each task represents a separable activity that proceeds independently and concurrently between the points where it interacts with other tasks. A single task, when in the context of a parallel construct, can represent multiple logical threads of control which can proceed in parallel. The various forms of task interaction are described in this clause, and include: 1.a To be honest: The execution of an Ada program consists of the execution of one or more partitions (see 10.2), each of which in turn consists of the execution of an environment task and zero or more subtasks. 2 the activation and termination of a task; 3 a call on a protected subprogram of a protected object, providing exclusive read-write access, or concurrent read-only access to shared data; 4 a call on an entry, either of another task, allowing for synchronous communication with that task, or of a protected object, allowing for asynchronous communication with one or more other tasks using that same protected object; 5 a timed operation, including a simple delay statement, a timed entry call or accept, or a timed asynchronous select statement (see next item); 6 an asynchronous transfer of control as part of an asynchronous select statement, where a task stops what it is doing and begins execution at a different point in response to the completion of an entry call or the expiration of a delay; 7 an abort statement, allowing one task to cause the termination of another task. 8 In addition, tasks can communicate indirectly by reading and updating (unprotected) shared variables, presuming the access is properly synchronized through some other kind of task interaction. Static Semantics 9 The properties of a task are defined by a corresponding task declaration and task_body, which together define a program unit called a task unit. Dynamic Semantics 10 Over time, tasks proceed through various states. A task is initially inactive; upon activation, and prior to its termination it is either blocked (as part of some task interaction) or ready to run. While ready, a task competes for the available execution resources that it requires to run. In the context of a parallel construct, a single task can utilize multiple processing resources simultaneously. 10.a/3 Discussion: {AI05-0229-1} The means for selecting which of the ready tasks to run, given the currently available execution resources, is determined by the task dispatching policy in effect, which is generally implementation defined, but may be controlled by aspects, pragmas, and operations defined in the Real-Time Annex (see D.2 and D.5). NOTES 11 1 Concurrent execution may be implemented on multicomputers, multiprocessors, or with interleaved execution on a single physical processor. On the other hand, whenever an implementation can determine that the required semantic effects can be achieved when parts of the execution of a single logical thread of control are performed by multiple physical processors acting in parallel, it may choose to perform them in this way. *************************************************************** From: Randy Brukardt Sent: Monday, April 2, 2018 11:01 PM > I purposely did not show the actual edits, so you can just read it to > see if it makes sense. The number of edits is actually relatively > small. As far as I can tell, you only added one sentence to the end of the first paragraph. I couldn't find anything else changed (but I didn't try a word-for-word comparison). Based on your earlier discussion, I was expecting more than that. > Modify section 9, Tasks and Synchronization: > > The execution of an Ada program consists of the execution of one or > more tasks. Each task represents a separable activity that proceeds > independently and concurrently between the points where it interacts > with other tasks. A single task, when in the context of a parallel > construct, can represent multiple logical threads of control which can > proceed in parallel. This one new sentence is fine. But it seems strange that "threads of control" are never mentioned again. I realize that so long as we don't allow blocking in parallel constructs, there isn't much overlap -- but that makes me wonder why you were so concerned about rewriting this. Ergo, I have to think I'm missing something (or you are). *************************************************************** From: Tucker Taft Sent: Tuesday, April 3, 2018 8:15 AM The changes were subtle, and it is interesting that you didn't notice them. The first change was to describe a task as a "separable activity" rather than as a "logical thread of control." Now that it can represent multiple threads of control, we needed a different way to describe a task -- hence, "separable activity." Furthermore, the term "task" is omitted completely from the note about using multiple physical processors, and the term "logical thread of control" appears there as the thing that can be implemented using multiple physical processors. So the key shift here is that parallel constructs result in additional logical threads of control, and that a single task can represent multiple logical threads of control. ... > As far as I can tell, you only added one sentence to the end of the > first paragraph. I couldn't find anything else changed (but I didn't > try a word-for-word comparison). Based on your earlier discussion, I > was expecting more than that. Here is one additional sentence, at the end of paragraph (10): "... In the context of a parallel construct, a single task can utilize multiple processing resources simultaneously. " ... > This one new sentence is fine. But it seems strange that "threads of > control" are never mentioned again. Threads of control are mentioned in the Note, and the term will presumably appear elsewhere when we talk about the dynamic semantics of parallel constructs. > I realize that so long as we don't allow blocking in parallel > constructs, there isn't much overlap -- but that makes me wonder why > you were so concerned about rewriting this. Ergo, I have to think I'm > missing something (or you are). There is a shift in vocabulary, which is subtle but important in my view. A task is no longer considered a single logical thread of control. Instead, parallel constructs can result in multiple logical threads of control in a single task, while a task is considered, as mentioned above, a "separable activity." *************************************************************** From: Tullio Vardanega Sent: Tuesday, April 3, 2018 11:42 AM I can see the angle in the new opening paragraph, and the taxonomy that it implies. A task is a program entity that represents an activity whose execution can progress independently of (hence, separable from) its outside environment. That progress is realized by logical threads of control that can be assigned to physical processors. In the context of a parallel construct, a single separable activity may be assigned as many independent logical threads of control as the available physical processors. *************************************************************** From: John Barnes Sent: Tuesday, April 3, 2018 11:58 AM Hmm. I haven't been following this in detail but I can see that (deo volente) I might have to rewrite my wretched book some day. I am not sure about In the context of a parallel construct, a single task can utilize multiple processing resources simultaneously. I could read this to mean that the task is inside some parallel construct rather than the other way around. *************************************************************** From: Tucker Taft Sent: Tuesday, April 3, 2018 12:10 PM How about: While the body of a task is executing a parallel construct, the task can utilize multiple processing resources simultaneously. *************************************************************** From: John Barnes Sent: Tuesday, April 3, 2018 4:07 PM That would be crystal clear. *************************************************************** From: Randy Brukardt Sent: Tuesday, April 3, 2018 6:13 PM > The changes were subtle, and it is interesting that you didn't notice > them. The first change was to describe a task as a "separable > activity" rather than as a "logical thread of control." Now that it > can represent multiple threads of control, we needed a different way > to describe a task -- hence, "separable activity." Furthermore, the > term "task" is omitted completely from the note about using multiple > physical processors, and the term "logical thread of control" > appears there as the thing that can be implemented using multiple > physical processors. I see. So the big deal is to redefine tasks (I didn't pay much attention to them as they aren't changed, at least in how we think of them). I wouldn't have expected that is necessary, but it's clear if you want to do that, you need to do that immediately and not as some sort of bolt on. In any case, what you had read well and made sense. Indeed, I think it makes more sense even for tasks themselves than the original wording (which probably was overspecified) did. *************************************************************** From: Tucker Taft Sent: Tuesday, April 3, 2018 9:12 PM > That would be crystal clear. Probably should adjust the last sentence in paragraph (1) similarly: "... A single task, when in the context of a parallel construct, can represent multiple logical threads of control which can proceed in parallel." Might want to be instead: "... A single task, when executing a parallel construct, can represent multiple logical threads of control that can proceed in parallel." *************************************************************** From: Tucker Taft Sent: Wednesday, April 4, 2018 2:11 PM Indeed - something like that. ***************************************************************