Version 1.10 of ai12s/ai12-0119-1.txt


!standard 5.5.2 (2/3)          17-10-11 AI12-0119-1/04
!class Amendment 14-06-20
!status work item 14-06-20
!status received 14-06-17
!priority Medium
!difficulty Hard
!subject Parallel operations
!summary
New syntax and semantics to facilitate parallelism via Parallel Loops, Concurrent Blocks, and Parallel Reduction Expressions.
!problem
The increased presence of parallel computing platforms brings concerns to the general purpose domain that were previously prevalent only in the specific niche of high-performance computing. As parallel programming technologies become more prevalent in the form of new emerging programming languages and extensions of existing languages, safety concerns need to consider the paradigm shift from sequential to parallel behavior.
Ada needs mechanisms whereby the compiler is given the necessary semantic information to enable the implicit and explicit parallelization of code. After all, the current Ada language only allows parallel execution when it is semantically neutral (see 1.1.4(18) and 11.6(3/3)) or explicitly in the code as a task.
!proposal
This proposal depends on the facilities for aspect Global (AI12-0079-1) and for aspect Nonblocking (AI12-0064-2). Those proposals allow the compiler to statically determine where parallelism may be introduced without introducing data races.
This proposal informally introduces the semantic notion of a Parallel OPportunity (POP) and a tasklet to the language. POPs are places in program text where work can be distributed to parallel executing workers that work in concert to correctly execute the algorithm. Tasklets are the notational (logical) units within a task that are executed in parallel with each other.
The goals of this proposal are:
- To permit existing programs to benefit from parallelism through minimal restructuring of sequential programs into ones that permit more parallel execution;
- To support the development of complete programs that maximize parallelism;
- To efficiently guide the compiler in the creation of effective parallelism (without oversubscription) with minimal input from the programmer;
- To support the parallel reduction of results calculated by parallel computation;
- To avoid syntax that would make a POP erroneous or produce noticeably different results if executed sequentially vs. in parallel.
An important part of this model is that if the compiler is not able to verify that the parallel computations are independent, then a warning will be issued at compile time (see AI12-0079-1 and AI12-0064-1 for the ways in which this can be determined).
Note that in this model the compiler will identify any code where a potential data race occurs (following the rules for concurrent access to objects as specified in the Language Reference Manual RM 9.10(23)), and point out where objects cannot be guaranteed to be independently addressable. If not determinable at compile-time, the compiler may insert run-time checks to detect data overlap.
Another issue is the underlying runtime. In this model, tasklets are orthogonal to tasks. Regardless of implementation, tasklets are considered to execute in the semantic context of the task where they have been spawned, which means that any operation that identifies a task, such as those in Task_Identification, will identify the task in which the tasklet is spawned. On the other hand, calls by different tasklets of the same task into the same protected object are treated as different calls resulting in distinct protected actions; therefore synchronization between tasklets can be performed using protected operations. Note that this is consistent with the current standard which already supports multiple concurrent calls by a single task in the presence of the asynchronous transfer of control capability (RM 9.7.4 (23)).
There are three areas in this proposal: one that introduces capabilities for parallel/concurrent blocks, another that introduces capabilities for parallel loops, and a third that introduces capabilities for parallel reduction.
Concurrent Blocks
-----------------
Concurrent blocks may be used to specify that two or more parts of an algorithm may be executed concurrently, and possibly in parallel with each other.
Semantics: A concurrent block statement encloses two or more sequences of statements (two or more "concurrent sequences") separated by the reserved word "and". Each concurrent sequence represents a separate tasklet, but all within a single Ada task. Task identity remains that of the enclosing Ada task, and a single set of task attributes is shared between the tasklets. Each sequence of statements is assigned a unique and independent executor to execute the tasklet. If the parallel keyword is present, then each executor can execute on any core or migrate independently to other cores, and the tasklets can execute in parallel with each other. If the parallel keyword is absent, then each executor associated with the concurrent block is assigned to the core of the enclosing Ada task, unless the compiler can determine implicitly that the tasklets can safely be executed on different cores in parallel.
With respect to the rules for shared variables (see RM 9.10(13)), two actions occurring within two different concurrent sequences of the same concurrent block are not automatically sequential, so execution can be erroneous if one such action assigns to an object, and the other reads or updates the same object or a neighboring object that is not independently addressable from the first object. The appropriate use of atomic, protected, or task objects (which as a group we will call synchronized objects) can be used to avoid erroneous execution.
In addition, the new Global and Potentially_Blocking aspects may be specified to enable the static detection of such problems at compile time (see AI12-0079-1 and AI12-0064-1).
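As an illustrative sketch only (the protected object Counter and the procedures Process_Left_Half and Process_Right_Half are hypothetical names, not part of this proposal), tasklets of the same concurrent block can coordinate through a protected object, since each call is a distinct protected action:

   protected Counter is
      procedure Increment;
      function Value return Natural;
   private
      Count : Natural := 0;
   end Counter;

   protected body Counter is
      procedure Increment is
      begin
         Count := Count + 1;
      end Increment;

      function Value return Natural is
      begin
         return Count;
      end Value;
   end Counter;

   ...

   parallel do
      Process_Left_Half;
      Counter.Increment;   -- a protected action; safe from either tasklet
   and
      Process_Right_Half;
      Counter.Increment;
   end do;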
Any transfer of control out of one concurrent sequence will initiate the aborting of the other concurrent sequences not yet completed. Once all other concurrent sequences complete normally or abort, the transfer of control takes place. If multiple concurrent sequences attempt a transfer of control before completing, one is chosen arbitrarily and the others are aborted.
If an exception is raised by any of the concurrent sequences, it is treated similarly to a transfer of control, with the exception being propagated only after all the other sequences complete normally or due to abortion. If multiple concurrent sequences executing in parallel raise an exception before completing, one is chosen arbitrarily and the others are aborted. The concurrent block completes when all of the concurrent sequences complete, either normally or by being aborted.
Note that aborting a tasklet need not be preemptive, but should prevent the initiation of further nested concurrent blocks or parallel loops.
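For example (a sketch only; the type Int_Array and the function Contains are assumed here, not defined by this proposal), a return statement in one concurrent sequence initiates the abort of the other sequence, and the transfer of control takes place once that sequence has completed or been aborted:

   type Int_Array is array (Positive range <>) of Integer;

   function Contains (Data : Int_Array; Key : Integer) return Boolean is
      Mid : constant Natural := Data'First + Data'Length / 2 - 1;
   begin
      parallel do
         for I in Data'First .. Mid loop
            if Data (I) = Key then
               return True;   -- initiates the abort of the other sequence
            end if;
         end loop;
      and
         for I in Mid + 1 .. Data'Last loop
            if Data (I) = Key then
               return True;
            end if;
         end loop;
      end do;
      return False;
   end Contains;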
Parallel Loops
--------------
A common approach to parallelizing loops is to break the loop iterations into chunks where each chunk consists of a subset of the entire iteration set, and may be as small as a single iteration, up to as large as the entire number of iterations, or somewhere in-between. In Ada, each chunk becomes a separate tasklet.
The implementation is expected to determine appropriate values for the number of chunks and the size of each chunk for each parallel loop, without these having to be specified by the programmer.
To indicate that a loop is a candidate for parallelization, the reserved word "parallel" may be inserted immediately before the word "for" in a "for" loop. Such a loop is broken into chunks of iterations, where each chunk is processed sequentially, but potentially in parallel with the other chunks of iterations.
Note that the same rules presented for concurrent blocks above apply to the update of shared variables and the transfer of control to a point outside of the loop, and for this purpose each iteration (or chunk) is treated as equivalent to a separate sequence of a concurrent block.
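For example (an illustrative sketch, assuming Arr is an array of Integer), a loop in which every chunk updates the same variable would be flagged under these rules, since the assignments to Sum in different chunks are not sequential with each other:

   Sum : Integer := 0;

   parallel for I in Arr'Range loop
      Sum := Sum + Arr (I);   -- potential data race between chunks
   end loop;

A Reduction Expression (next section) is the intended way to express this kind of computation.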
Reduction Expressions
---------------------
A Reduction Expression is a syntactic construct, similar to a quantified expression, that can be used to combine a set of values into a single result. This mechanism is called reduction.
Indeed, a quantified expression can be viewed as being a special purpose Reduction Expression that applies an operation, the Predicate, to a set of values to reduce the predicate results to a single Boolean result. Similarly, an array aggregate in the form of an iterated_component_association can also be viewed as being a special purpose Reduction Expression that applies an operation, in place concatenation, to a set of values to reduce into an array result.
A Reduction expression is a more general syntactic form where there is less constraint on the kind of operation to be applied to the set of values, and the operation can produce results of types other than Boolean or array types.
Semantics: A Reduction Expression looks like a quantified expression, except that the quantifier keyword ("all" or "some") is not present, and the predicate expression is replaced with an expression that evaluates to a function call returning an object of a nonlimited type, that is called the combiner_function_call.
The combiner function call must have at least one parameter of the same type as the type of the Reduction expression. The combiner function call is called iteratively to accumulate a result, which ultimately becomes the result of the Reduction Expression. The accumulation of the result occurs as the result of each iteration is implicitly fed back as an input parameter to the combiner function call. The implicit parameter is identified syntactically based on the "<>" box notation syntax, except that the box notation can optionally enclose the initial value to be used for the first call to the function, since at that point there would otherwise be no existing value to pass into the function. The initial value also serves as the result of the reduction expression for the case when there are no iterations performed (e.g., when a loop_parameter_specification for a Reduction Expression specifies a null range).
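As a minimal illustration of this feedback (a sequential case):

   Sum : constant Integer := (for I in 1 .. 3 => <0> + I);
   --  evaluates as ((0 + 1) + 2) + 3 = 6; the result of each call to "+"
   --  is fed back as the box parameter of the next call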
If the initial value is not specified, then the initial value is assumed to be a special value, called the identity value, that is associated with the combiner function call. The identity value is specified either by applying an Identity aspect to the declaration of the function denoted by the combiner function call, or indirectly via a Reducer aspect applied to that declaration, which names another function (a reducer function) whose declaration has the Identity aspect. The Identity aspect of a function identifies a value of the same type as the combiner function call result. If the combiner_function_call box parameter does not specify an initial value and the combiner_function_call is not associated with an identity value, then the reduction expression is illegal.
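A sketch of how these aspects might appear on user-declared functions (the names Add, Add_Squared, and Sum_Of_Squares are illustrative only):

   --  Reducer function; it also carries the identity value
   function Add (L, R : Integer) return Integer is (L + R)
     with Identity => 0;

   --  Combiner function; not itself a reducer, so it names one
   function Add_Squared (Accum : Integer; X : Integer) return Integer is
     (Accum + X * X)
     with Reducer => Add;

   --  The box gives no initial value, so the identity value 0
   --  (taken from Add via the Reducer aspect of Add_Squared) is used
   Sum_Of_Squares : constant Integer :=
     (for I in 1 .. 10 => Add_Squared (<>, I));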
To indicate that a Reduction Expression is a candidate for parallelization, the reserved word "parallel" may be inserted immediately before the reserved word "for". Similar to the way Parallel Loops are broken into chunks, a Parallel Reduction expression will also have chunking applied, where each chunk is processed sequentially, but potentially in parallel with other iteration chunks of the expression.
For each chunk, an accumulator result is generated that is local to the tasklet. The local accumulator result for each chunk is initialized to the identity value, except that the first chunk is initialized to the initial value if specified; otherwise the first chunk is also initialized to the identity value. As the chunk results are calculated in parallel, the result of the Reduction Expression is generated by combining/reducing the final results of each chunk by applying a Reducer function for the Reduction Expression. A parallel reduction expression is illegal if the combiner function call is not associated with an identity value, regardless of whether an initial value is specified in the box parameter.
The Reducer function for a parallel Reduction expression is a function that accepts two parameters where both parameters are of the same type, and returns a result of that same type, which is also the type of the Reduction expression.
The Combiner function must have a Reducer aspect specified for its declaration if the function itself is not a Reducer function. The Reducer aspect identifies another function that is to be used to combine multiple chunk results into the final result of the Reduction expression. The multiple results are combined two at a time.
The Reducer function is expected to generate an associative result based on the input parameters. A Reducer function does not need to be commutative (e.g. vector concatenation), as it is expected that the implementation will ensure that the results are combined in a manner that is consistent with sequentially applying the Reducer function to the chunk results in iteration order. If the parallel keyword is not present in the Reduction Expression, then sequential computation is assumed and the Reducer aspect does not need to be specified for the function declaration denoted by the combiner function call.
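To illustrate the ordering requirement, suppose the implementation happens to split the unbounded-string alphabet example (see the examples in 4.5.9 below), written here with the parallel keyword, into two chunks; the split point is arbitrary and chosen only for illustration:

   (parallel for Letter in 'A' .. 'Z' => <Null_Unbounded_String> & Letter)

   --  Chunk 1 ('A' .. 'M') starts from the initial value and accumulates
   --  "ABCDEFGHIJKLM"; chunk 2 ('N' .. 'Z') starts from the identity value
   --  (Null_Unbounded_String) and accumulates "NOPQRSTUVWXYZ". The Reducer
   --  "&" then combines the two partial results in iteration order:
   --  "ABCDEFGHIJKLM" & "NOPQRSTUVWXYZ", giving the full alphabet.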
For parallel Reduction Expressions, it is important that the value of the Identity aspect associated with a function does not affect the result of the computation. For example, if the reduction result type is Integer and the reducer is addition, then the Identity value should be zero, since adding zero to any value does not affect the value. Similarly, if the reducer is multiplication then the Identity value should be one, since multiplying any Integer value by 1 does not affect the value. If the result type is a vector and the reducer is concatenation, then the Identity value should be an empty vector, since concatenating an empty vector to another vector does not affect the value of the other vector.
Note that the same rules presented for concurrent blocks above apply to the update of shared variables, and for this purpose each iteration (or chunk) is treated as a separate tasklet, equivalent to a separate sequence of a concurrent block.
!wording
Append to Introduction (28)

"A concurrent block statement requests that two or more sequences of statements should execute concurrently with each other, and in parallel with each other if the parallel keyword is specified."
Modify 2.9(2/3) Add "parallel" to the list of reserved words.
Modify 4.3.3 (5.1/5)
iterated_component_association ::= [parallel] for defining_identifier in discrete_choice_list => expression
Modify 4.4(1/3)
"In this International Standard, the term "expression" refers to a construct of the syntactic category expression or of any of the following categories: choice_expression, choice_relation, relation, simple_expression, term, factor, primary, conditional_expression, quantified_expression {, reduction_expression}."
Modify 4.4(7/3)

"primary ::=
     numeric_literal | null | string_literal | aggregate
   | name | allocator | (expression)
   | (conditional_expression) | (quantified_expression)
   {| (reduction_expression)}"
Add 4.4(15.h)
"Wording Changes from Ada 2012
Added Reduction_expression to primary."
Replace 4.5.1(2-3) with

"The following logical operators are predefined for every boolean type T:

   function "and"(Left, Right : T) return T with Identity => True
   function "or" (Left, Right : T) return T with Identity => False
   function "xor"(Left, Right : T) return T with Identity => False

The following logical operators are predefined for every modular type T:

   function "and"(Left, Right : T) return T
   function "or" (Left, Right : T) return T with Identity => 0
   function "xor"(Left, Right : T) return T with Identity => 0

The following logical operators are predefined for every one-dimensional array type T whose component type is a boolean type:

   function "and"(Left, Right : T) return T
   function "or" (Left, Right : T) return T
   function "xor"(Left, Right : T) return T"
Add AARM note

"The Identity aspect for the "and" function for modular types is not specified because it is difficult to statically specify a value that would work for all types (e.g., type X is mod 97;). We could say that it is T'Last for types where the modulus is a power of 2, for example, but that is not general enough. Similarly, we could say that the Identity for the "and" function for a one-dimensional array is T'(others => True), but that only works if T is a constrained array type, so we don't attempt to specify the Identity aspect for these operations."
Modify 4.5.3(2)
"function "+"(Left, Right : T) return T{ with Identity => T(0)}
function "-"(Left, Right : T) return T{ with Reducer => "+"}
"
Modify 4.5.5(2)

"function "*" (Left, Right : T) return T{ with Identity => 1}
 function "/" (Left, Right : T) return T{ with Reducer => "*"}
 function "mod"(Left, Right : T) return T
 function "rem"(Left, Right : T) return T"
Modify 4.5.5(12)

"function "*"(Left, Right : T) return T{ with Identity => 1.0}
 function "/"(Left, Right : T) return T{ with Reducer => "*"}"
Modify 4.5.5(14)

"function "*"(Left : T; Right : Integer) return T{ with Reducer => "*"}
 function "*"(Left : Integer; Right : T) return T{ with Reducer => "*"}
 function "/"(Left : T; Right : Integer) return T{ with Reducer => "*"}"
Modify 4.5.5(19)

"function "*"(Left, Right : universal_fixed) return universal_fixed{ with Identity => 1.0}
 function "/"(Left, Right : universal_fixed) return universal_fixed{ with Reducer => "*"}"
Modify 4.5.8(1/3)
quantified_expression ::= [parallel] for quantifier loop_parameter_specification => predicate | [parallel] for quantifier iterator_specification => predicate
Modify 4.5.8 (6/4)
For the evaluation of a quantified_expression, the loop_parameter_specification or iterator_specification is first elaborated. The evaluation of a quantified_expression then evaluates the predicate for each value of the loop parameter. {If the parallel keyword is not specified, these}[These] values are examined in the order specified by the loop_parameter_specification (see 5.5) or iterator_specification (see 5.5.2). {Otherwise these values are examined in an arbitrary order consistent with parallel execution.}
Add Section 4.5.9
"Reduction Expressions"
Reduction expressions provide a way to write a reduction that combines a set of values into a single result.
Syntax
reducer_function ::= function_specification
reduction_expression ::= [parallel] for loop_parameter_specification => combiner_function_call | [parallel] for iterator_specification => combiner_function_call
combiner_function_call ::= function_call
Wherever the Syntax Rules allow an expression, a reduction_expression may be used in place of the expression, so long as it is immediately surrounded by parentheses.
Discussion: The syntactic category reduction_expression appears only as a primary that is parenthesized. The above rule allows it to additionally be used in other contexts where it would be directly surrounded by parentheses. This is the same rule that is used for conditional_expressions; see 4.5.7 for a detailed discussion of the meaning and effects of this rule.
Name Resolution Rules
The expected type of a reduction_expression is any nonlimited type. The combiner_function_call in a reduction_expression is expected to return a result of that same type.
A reducer_function is a function that has exactly two parameters where both formal parameters and the result are of the same type. A reducer_function is called implicitly to combine results from multiple executions of the combiner_function_call of a reduction expression when the reduction expression has parallel execution. The function denoted by a combiner_function_call can be associated with a reducer_function. If such an association exists, either the function is itself a reducer_function or its declaration has the Reducer aspect specified which indicates the associated reducer_function.
An identity_value is a value that can be used to initialize implicit declarations of variables that accumulate the results of a reduction_expression. These accumulator variables are passed as the actual parameter associated with the reduction_expression_parameter of a combiner_function_call. The identity_value for the function denoted by a combiner_function_call is determined either from the Identity aspect specified on the declaration of the denoted function, or from the Identity aspect specified on the declaration of another function named by the Reducer aspect of the denoted function.
Legality Rules
The combiner_function_call of a reduction_expression shall have exactly one explicit_actual_parameter that is a reduction_expression_parameter (See 6.4(6)).
If the parallel keyword is specified on the reduction_expression then the combiner_function_call shall either denote a function that is a reducer_function, or the denoted function shall have the Reducer aspect specified on its declaration.
If the parallel keyword is specified on the reduction_expression, or the reduction_expression_parameter of the combiner_function_call does not specify an initial_reduction_value, then the combiner_function_call shall either denote a function that has the Identity aspect specified on its declaration, or the denoted function shall have the Reducer aspect specified on its declaration and the function named by the Reducer aspect shall have the Identity aspect specified on its declaration.
The type of an initial_reduction_value or associated identity_value specified for a reduction_expression_parameter shall be of the same type as the reduction_expression.
The result type and the type of both formal parameters of an associated reducer_function shall be of the same type as the reduction_expression.
Static Semantics
For a function_specification, the following language-defined operational aspects may be specified with an aspect_specification (see 13.1.1):
Identity
The aspect Identity denotes a value that is specified as the identity_value associated with a Reduction_Expression. The aspect shall be specified by a static expression, and that expression shall be explicit, even if the aspect has a boolean type.
Identity shall be specified only on a function_specification declaration.
Reason: The part about requiring an explicit expression is to disallow omitting the value for this aspect, which would otherwise be allowed by the rules of 13.1.1.
Aspect Description for Identity: Identity value for a function.
The expected type for the expression specified for the Identity aspect is the result type of the function_specification declaration on which it appears.
Only one Identity aspect may be applied to a single function declaration.
Reducer
The aspect Reducer shall denote a reducer_function that is to be associated with a declared function.
The aspect Reducer denotes a function with the following specification:
function reducer(L, R : in result_type) return result_type
where result_type statically matches the subtype of the result of the declared function.
Only one Reducer aspect may be applied to a single function declaration.
Aspect Description for Reducer: Reducer function associated with the combiner_function_call of a reduction expression.
Dynamic Semantics
For the evaluation of a reduction_expression, the loop_parameter_specification or iterator_specification is first elaborated, and accumulator variables of the type of the reduction_expression are implicitly declared. Each accumulator variable corresponds to a unique non-overlapping subset of the iterations to be performed, where all accumulators together cover the full set of iterations to be performed. The accumulator dedicated to the first iterations is initialized to the initial_reduction_value, if specified; otherwise it is initialized to the associated identity_value of the combiner_function_call. Any other accumulator variables are initialized to the associated identity_value of the combiner_function_call.
The evaluation of a reduction_expression then evaluates the combiner_function_call for the values of the loop parameter in the order specified by the loop_parameter_specification (see 5.5) or iterator_specification (see 5.5.2) within each iteration subset. For each subsequent evaluation of the combiner_function_call within an iteration subset, the result from the previous evaluation is passed as an actual parameter to the reduction_expression_parameter of the current evaluation.
The value of the reduction_expression is determined as follows: as accumulator results are determined, they are reduced into a single result two at a time by implicit calls to the reducer_function associated with the combiner_function_call. The accumulator results passed to the reducer_function are always for two adjacent iteration subsets, where the result for the lower iteration subset is passed as the first actual parameter to the reducer_function and the result for the higher iteration subset is passed as the second actual parameter to the reducer_function. If the Reduction_Expression does not evaluate any iterations, then the value of the Reduction_Expression is the initial_reduction_value, if specified; otherwise it is the identity_value associated with the combiner_function_call.
Notes: The accumulator results must represent adjacent iteration subsets as described above to ensure that non-commutative reductions will produce consistent results for parallel execution.
Examples
A reduction expression to calculate the sum of elements of an array
(parallel for Element of Arr => <0> + Element)
A reduction expression to calculate the minimum value of an array
(parallel for X of Arr => Integer'Min(<Integer'Last>, X))
A reduction expression to create an unbounded string containing the alphabet
(for Letter in 'A' .. 'Z' => <Null_Unbounded_String> & Letter)
A reduction expression to create a string containing the alphabet
(for Letter in 'A' .. 'Z' => <""> & Letter)
A reduction expression to determine how many people in a database are 30 or older
ThirtySomething : constant Natural :=
  (parallel for P of Personnel => <0> + (if Age(P) >= 30 then 1 else 0));
An expression function that returns its result as a Reduction Expression
function Factorial(N : Natural) return Natural is (for J in 1..N => <1> * J);
An expression function that computes the Sin of X using Taylor expansion
function Sin(X : Float; Num_Terms : Positive := 5) return Float is
  (for I in 1..Num_Terms => <0.0> + (-1.0)**(I-1) * X**(2*I-1)/Float(Fact(2*I-1)));
A reduction expression that computes taxable income
Taxable_Income : constant Float := Total_Income +
(for Deduction of Deductions => <0.0> - Deduction);
A reduction expression that outputs the sum of squares
Put_Line ("Sum of Squares is" & Integer'Image(for I in 1 .. 10 => <0> + I**2));
A reduction expression to compute the value of Pi in parallel
Number_Of_Steps : constant := 100_000;
Step : constant := 1.0 / Number_Of_Steps;
Pi : constant Long_Float := Step *
(parallel for I in 1 .. Number_Of_Steps =>
<0.0> + (4.0 / (1.0 + ((Long_Float (I) - 0.5) * Step)**2)));
Wording Changes from Ada 2012

Reduction expressions are new.
Modify 5.5(3/3)
iteration_scheme ::= while condition
   | [parallel] for loop_parameter_specification
   | [parallel] for iterator_specification
Modify 5.5 (6/5)
A loop_parameter_specification declares a {set of }loop parameter{s}, {where each loop parameter corresponds to a unique non-overlapping subset of iterations that together cover the full set of iterations to be performed.} [which is an ]{These} object{s'} [whose ]subtype{s} (and nominal subtype{s}) [is]{are} that defined by the discrete_subtype_definition.
Modify 5.5(9/4)
"For the execution of a loop_statement with the iteration_scheme being for
loop_parameter_specification, the loop_parameter_specification is first elaborated. This elaboration creates the loop parameter{s} and elaborates the discrete_subtype_definition. {Multiple loop parameters may be created where each loop parameter is associated with a unique non-overlapping range of the iterations if the keyword parallel is present, otherwise a single loop parameter is assumed. Each loop parameter can execute concurrently with other loop parameters of the same loop. Each thread of control proceeds independently and concurrently between the points where they interact with other tasks and with each other.} If the discrete_subtype_definition defines a subtype with a null range, the execution of the loop_statement is complete. Otherwise, the sequence_of_statements is executed once for each value of the discrete subtype defined by the discrete_subtype_definition that satisfies the predicates of the subtype (or until the loop is left as a consequence of a transfer of control). Prior to each such iteration, the corresponding value of the discrete subtype is assigned to the loop parameter. These values are assigned in increasing order unless the reserved word reverse is present, in which case the values are assigned in decreasing order. "
AARM - An implementation should statically treat the sequence_of_statements as being executed by separate threads of control, but whether they actually execute in parallel or sequentially should be a determination that is made dynamically at run time, dependent on factors such as the available computing resources.
Examples after 5.5(20)
Example of a parallel loop
parallel for I in Buffer'range loop
   Buffer(I) := Arr1(I) + Arr2(I);
end loop;
"5.6.1 Concurrent Block Statements
[A concurrent_block_statement encloses two or more sequence_of_statements where all the sequence_of_statements can execute concurrently or possibly in parallel with each other.]
Syntax
concurrent_block_statement ::=
    [parallel] do
       sequence_of_statements
    and
       sequence_of_statements
   {and
       sequence_of_statements}
    end do;
Static Semantics
Each sequence_of_statements represents a separate thread of control that proceeds independently and concurrently between the points where they interact with other tasks and with each other. If the parallel keyword is present, then each thread of control can execute on any available processor in parallel with the other threads. Otherwise, the threads of control execute concurrently on the same processor as the enclosing task.
AARM - An implementation should statically treat each sequence_of_statements as a separate thread of control, but whether they actually execute concurrently or sequentially should be a determination that is made dynamically at run time, dependent on factors such as the available computing resources.
Examples
Example of a concurrent block statement:
do
Foo(Z);
and
Bar(X, Y);
and
Put_Line ("Executing Foo and Bar in parallel with Other_Work");
Other_Work;
end do;
Modify 6.4(6) "explicit_actual_parameter ::= expression | variable_name {| reduction_expression_parameter}"
reduction_expression_parameter ::= <[initial_reduction_value]>
initial_reduction_value ::= simple_expression
Add 6.4(7.1) Legality Rules
A reduction_expression_parameter shall only be supplied as an actual parameter to a combiner_function_call of a reduction expression.
Change 9.10 (13)

"Both actions occur as part of the execution of the same task {unless either is part of a different sequence_of_statements of a:
   - concurrent block statement,
   - parallel loop statement,
   - parallel quantified expression,
   - parallel array aggregate, or
   - parallel reduction expression.}"
New section 9.12 Executors and Tasklets
A task may distribute execution across different physical processors in parallel, where each execution is a separate thread of control that proceeds independently and concurrently between the points where they interact with other tasks and with each other. Each separate thread of control of a task is an Executor, and the execution that each executor performs between synchronization is a tasklet. When a task distributes its execution to a set of executors, it cannot proceed with its own execution until all the executors have completed their respective executions.
A concurrent block statement, parallel loop statement, parallel quantified expression, parallel aggregate, or parallel reduction expression may assign a set of executors to execute the construct, if extra computing resources are available.
Modify A.1 (9.1)
-- function "and" (Left, Right : Boolean'Base) return Boolean'Base {with Identity => True};
-- function "or"  (Left, Right : Boolean'Base) return Boolean'Base {with Identity => False};
-- function "xor" (Left, Right : Boolean'Base) return Boolean'Base {with Identity => False};
Modify A.1 (17)
-- function "+" (Left, Right : Integer'Base) return Integer'Base {with Identity => 0};
-- function "-" (Left, Right : Integer'Base) return Integer'Base {with Reducer => "+"};
-- function "*" (Left, Right : Integer'Base) return Integer'Base {with Identity => 1};
-- function "/" (Left, Right : Integer'Base) return Integer'Base {with Reducer => "*"};
Modify A.1 (25)
-- function "+" (Left, Right : Float) return Float {with Identity => 0.0};
-- function "-" (Left, Right : Float) return Float {with Reducer => "+"};
-- function "*" (Left, Right : Float) return Float {with Identity => 1.0};
-- function "/" (Left, Right : Float) return Float {with Reducer => "*"};
Modify A.1 (29-34)
function "*" (Left : root_integer; Right : root_real)
return root_real {with Identity => 1.0};
function "*" (Left : root_real; Right : root_integer)
return root_real {with Identity => 1.0};
function "/" (Left : root_real; Right : root_integer)
return root_real {with Reducer => "*"};
-- The type universal_fixed is predefined.
-- The only multiplying operators defined between
-- fixed point types are
function "*" (Left : universal_fixed; Right : universal_fixed)
return universal_fixed {with Identity => 1.0};
function "/" (Left : universal_fixed; Right : universal_fixed)
return universal_fixed {with Reducer => "*"};
Modify A.4.4 (13-17)
function Append (Left, Right : in Bounded_String; Drop : in Truncation := Error) return Bounded_String{ with Reducer => "&"};
function Append (Left : in Bounded_String; Right : in String; Drop : in Truncation := Error) return Bounded_String{ with Reducer => "&"};
function Append (Left : in String; Right : in Bounded_String; Drop : in Truncation := Error) return Bounded_String{ with Reducer => "&"};
function Append (Left : in Bounded_String; Right : in Character; Drop : in Truncation := Error) return Bounded_String{ with Reducer => "&"};
function Append (Left : in Character; Right : in Bounded_String; Drop : in Truncation := Error) return Bounded_String{ with Reducer => "&"};
Modify A.4.4 (21-26)
function "&" (Left, Right : in Bounded_String)
return Bounded_String{ with Identity => Null_Bounded_String};
function "&" (Left : in Bounded_String; Right : in String)
return Bounded_String{ with Reducer => "&"};
function "&" (Left : in String; Right : in Bounded_String)
return Bounded_String{ with Reducer => "&"};
function "&" (Left : in Bounded_String; Right : in Character)
return Bounded_String{ with Reducer => "&"};
function "&" (Left : in Character; Right : in Bounded_String)
return Bounded_String{ with Reducer => "&"};
Modify A.4.5 (15-19)
function "&" (Left, Right : in Unbounded_String)
return Unbounded_String{ with Identity => Null_Unbounded_String};
function "&" (Left : in Unbounded_String; Right : in String)
return Unbounded_String{ with Reducer => "&"};
function "&" (Left : in String; Right : in Unbounded_String)
return Unbounded_String{ with Reducer => "&"};
function "&" (Left : in Unbounded_String; Right : in Character)
return Unbounded_String{ with Reducer => "&"};
function "&" (Left : in Character; Right : in Unbounded_String)
return Unbounded_String{ with Reducer => "&"};
Modify A.18.2 (15/2-18/2)
function "&" (Left, Right : Vector) return Vector
{ with Identity => Empty_Vector};
function "&" (Left : Vector;
Right : Element_Type) return Vector
{ with Reducer => "&"};
function "&" (Left : Element_Type;
Right : Vector) return Vector
{ with Reducer => "&"};
function "&" (Left, Right : Element_Type) return Vector
{ with Reducer => "&"};
Add after A.18.3 (32/2)
function "&" (Left, Right : List) return List
{ with Identity => Empty_List};
function "&" (Left : List;
Right : Element_Type) return List
{ with Reducer => "&"};
function "&" (Left : Element_Type;
Right : List) return List
{ with Reducer => "&"};
function "&" (Left, Right : Element_Type) return List
{ with Reducer => "&"};
Modify A.18.7(55/2)
function Union (Left, Right : Set) return Set { with Identity => Empty_Set};
Modify A.18.7(67/2)
function Symmetric_Difference (Left, Right : Set) return Set {with Identity => Empty_Set};
Modify A.18.8(27/2)
function Union (Left, Right : Set) return Set { with Identity => Empty_Set};
Modify A.18.8(35/2)
function Symmetric_Difference (Left, Right : Set) return Set { with Identity => Empty_Set};
Modify A.18.9(28/2)
function Union (Left, Right : Set) return Set { with Identity => Empty_Set};
Modify A.18.9(37/2)
function Symmetric_Difference (Left, Right : Set) return Set { with Identity => Empty_Set};
Add C.7.1 (5.1)
The Task_Id value associated with each sequence_of_statements of a concurrent_block_statement, parallel quantified expression, parallel reduction expression, parallel array aggregate, or of a parallel loop statement is the same as that of the enclosing statement.
AARM - Each sequence_of_statements of a concurrent block, parallel quantified expression, parallel reduction expression, parallel array aggregate, or parallel loop are treated as though they are all executing as the task that encountered the parallel construct.
!discussion
There is a continuing trend of exponential growth of computational elements embedded on a single chip. This has led to significant challenges for software designers and implementers. Prior to 2005, the increase in transistor count corresponded to a similar increase in CPU clock speed, which boosted performance of sequential algorithms. Since then, CPU clock speeds have leveled off for reasons of practicality, and chip manufacturers have instead moved towards multicore technologies as a means of achieving performance increases.
As parallel technologies become more prevalent in the form of new emerging hardware platforms, safety concerns need to consider the paradigm shift from sequential to parallel behaviour. It is a paradigm that has been for a long time contained within a specialized domain of high-performance computing, but that is now required in all domains of computing systems.
It is important for the programmer to indicate places in the code where parallelism is desired, which allows the reader to better understand the algorithm, and allows the programmer to receive confirmation from the compiler whether the desired parallelism is safe, or even worthwhile.
This proposed model does not define syntax for the explicit parallelization of individual subprogram calls, since such parallelization can be performed implicitly by the compiler, when it knows that the calls are free of side-effects. This is facilitated by annotations identifying global variable usage on subprogram specifications (see AI12-0079-1).
Concurrent Blocks
-----------------
Example:

   declare
      X, Y : Integer;
      Z : Float;
   begin
      parallel do
         X := Foo(100);
      and
         Z := Sqrt(3.14) / 2.0;
         Y := Bar(Z);
      end do;

      Put_Line("X + Y=" & Integer'Image(X + Y));
   end;
In this example, the calculation of Z and Y occur sequentially with respect to each other, but in parallel with the calculation of X. Note that the compiler, using the rules specified in AI12-0079-1, may complain if the parallel sequences might have conflicting global side-effects.
The concurrent block construct is flexible enough to support recursive usage as well, such as:
   function Fibonacci (N : Natural) return Natural is
      X, Y : Natural;
   begin
      if N < 2 then
         return N;
      end if;

      parallel do
         X := Fibonacci (N - 2);
      and
         Y := Fibonacci (N - 1);
      end do;

      return X + Y;
   exception
      when others =>
         Log ("Unexpected Error");
   end Fibonacci;
We considered allowing the concurrent block to be preceded with an optional declare part, and followed with optional exception handlers, but it was observed that it was more likely to be useful to have objects that are shared across multiple parallel sequences to outlive the parallel block, and that having exception handlers after the last parallel sequence could easily be misconstrued as applying only to the last sequence. Therefore we reverted to the simpler syntax proposed above. This simpler syntax is also more congruous with the syntax for select statements. Because there are no local declarations, there was also no point in having a statement_identifier (block label) for a concurrent block. This is actually not entirely true. Allowing an exit statement to replace a goto that targets the location following the end of the select statement seems useful. To allow such a statement, a block label might be needed to identify the concurrent block that is being exited. It was felt that the need for allowing an exit statement in a concurrent block could be the subject of a separate AI.
We considered whether a concurrent block statement without the parallel keyword should be sequential instead of concurrent. A concurrent construct seems more useful as it allows for abstractions such as coroutines where producers and consumers produce and consume data that is shared between tasklets. Such usage would work whether the parallel keyword is present or not in our model, but if instead the construct was sequential, then removing the parallel keyword from a working program could cause the application to deadlock, which is a safety concern. Also, there is not much point to a sequential block abstraction, since one could simply remove the construct and execute all the sequences sequentially.
Parallel Loops
--------------
In most compute-intensive applications, a significant proportion of the computation time is spent in loops, either iterating over arrays/container data structures, or systematically searching a large solution space. To benefit from parallel hardware, the computation associated with a loop should be spread across the available processors.
One approach, presuming the iterations of the loop have no data dependences between them, is to treat each iteration of the loop as a separate tasklet, and then have the processors work away on the set of tasklets in parallel. However, this introduces overhead from the queuing and de-queuing of work items, and the communication of results from each work item. Furthermore, there often are data dependences between iterations, and creating a separate work item for each iteration can introduce excessive synchronization overhead to deal safely with these interdependences. Therefore, it is common to break large arrays, and/or the loops that iterate over them, into chunks (or slices or tiles), where each chunk is processed sequentially, but multiple chunks can be processed in parallel with one another.
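For comparison, here is a minimal hand-written sketch of the two-chunk case using explicit tasks (assuming A, B, and Result are arrays of Integer over the same index range); the parallel loop syntax shown further below is intended to subsume this kind of boilerplate and leave the chunking decisions to the implementation:

   declare
      Mid : constant Integer := A'First + A'Length / 2 - 1;

      task Lower_Chunk;
      task Upper_Chunk;

      task body Lower_Chunk is
      begin
         for I in A'First .. Mid loop
            Result (I) := A (I) + B (I);
         end loop;
      end Lower_Chunk;

      task body Upper_Chunk is
      begin
         for I in Mid + 1 .. A'Last loop
            Result (I) := A (I) + B (I);
         end loop;
      end Upper_Chunk;
   begin
      null;   -- the block waits here until both tasks have terminated
   end;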
Note: Allowing the parallel keyword on while loops was considered, but was discarded since while loops cannot be easily parallelized, because the control variables are inevitably global to the loop.
For example, here is a simple use of a parallelized loop, to add two arrays together to produce a result array:
   parallel for I in Loop_Iterations'range loop
      Result (I) := A(I) + B(I);
   end loop;
Note that the compiler, using the rules specified in AI12-0079-1, may complain if the parallel sequences might have conflicting global side-effects. We considered extending the syntax to allow reductions to be performed with parallel loops, but we discovered that parallel reduction expressions can provide the same capability more concisely, so for now we allow parallel loops to work if there are no global conflicts. A separate AI could be created to add reduction loop capability if it is decided that this is needed. In the meantime, Reduction Expressions appear to better fill this need. The main purpose of a loop is to apply processing iteratively. The main purpose of a reduction however, is to generate a result value. Iteration is a means to achieve that end, but a reduction expression seems better suited since it is designed to produce a result value.
Reduction Expressions
---------------------
A reduction expression provides a concise way to use iteration to combine a set of values into a single result.
We already have some special purpose reduction expressions in Ada in the form of quantified expressions and array aggregates that use an iterated_component_association.
Consider:
All_Graduated : constant Boolean := (parallel for all Student of Class => Passed_Exam (Student));
Someone_Graduated : constant Boolean := (parallel for some Student of Class => Passed_Exam (Student));
Note that the keyword parallel would now be allowed in quantified expressions.
Here we are in both cases effectively reducing a set of Boolean values into a single result using quantified expressions.
A similar effect can be written using the more generalized Reduction expression syntax.
All_Graduated : constant Boolean := (parallel for Student of Class => <True> and Passed_Exam (Student));
Someone_Graduated : constant Boolean := (parallel for Student of Class => <False> or Passed_Exam (Student));
Here we use the "and" and "or" operations as the combiner_function_call of the Reduction expression. The initial value plays an important role, since the Identity for "and" must be True, and the Identity for "or" must be False.
Another concern might be whether parallel loops can generally be written as expressions.
Consider the calculation of Pi:
   Number_Of_Steps : constant := 100_000;
   Step : constant := 1.0 / Number_Of_Steps;

   Pi : constant Long_Float := Step *
     (parallel for I in 1 .. Number_Of_Steps =>
        <0.0> + (4.0 / (1.0 + ((Long_Float (I) - 0.5) * Step)**2)));
One might feel that the readability of this example might be improved if temporary variable declarations were to be added for intermediate results. For instance, it might be nice to replace the sub-expression

   (1.0 + ((Long_Float (I) - 0.5) * Step)**2)
with
(1.0 + X**2)
One ordinarily cannot declare variables in an expression, but one can always write a function which can be called from an expression.
We might instead decide to write:
   function Pi_Slicer (R : Long_Float; Index : Natural) return Long_Float
     with Reducer => "+"
   is
      X : constant Long_Float := (Long_Float (Index) - 0.5) * Step;
   begin
      return R + 4.0 / (1.0 + X ** 2);
   end Pi_Slicer;

   Pi : constant Long_Float := Step *
     (parallel for I in 1 .. Number_Of_Steps => Pi_Slicer (<0.0>, I));

This leaves a simpler-looking Reduction Expression to do the computation.
Another concern might be about the need to perform multiple reductions at the same time.
Consider:
   Sum : Integer := (parallel for X of Arr => <0> + X);
   Min : Integer := (parallel for X of Arr => Integer'Min(<Integer'Last>, X));
   Max : Integer := (parallel for X of Arr => Integer'Max(<Integer'First>, X));
Here we have three calculations that occur in parallel, but sequentially with respect to each other. The performance benefits of parallelism should be noticeable for larger arrays, but one might want to calculate these three results iterating only once through the array.
This can be accomplished by creating a composite result type, and writing a user defined reducer function.
   type Summary is record
      Sum : Integer;
      Min : Integer;
      Max : Integer;
   end record;

   -- Identity value for the Reduce function
   Identity : constant Summary := Summary'(Sum => 0,
                                           Min => Integer'Last,
                                           Max => Integer'First);

   -- Reducer function for the Reduction expression
   function Reduce (L, R : Summary) return Summary is
     (Summary'(Sum => L.Sum + R.Sum,
               Min => Integer'Min (L.Min, R.Min),
               Max => Integer'Max (L.Max, R.Max)))
     with Identity => Identity;

   -- Combiner function for the Reduction expression
   function Process (L : Summary; X : Integer) return Summary is
     (Summary'(Sum => L.Sum + X,
               Min => Integer'Min (L.Min, X),
               Max => Integer'Max (L.Max, X)))
     with Reducer => Reduce;

   -- Reduction expression to compute all 3 results at once
   Result : constant Summary := (parallel for X of Arr => Process (<Identity>, X));
!ASIS
** TBD.
!ACATS test
ACATS B-Test and C-Test are needed for these features.
!appendix

From: Tucker Taft
Sent: Tuesday, June 17, 2014  10:18 AM

Here is the paper the "gang of 4" is submitting to HILT 2014 with a relatively
detailed proposal for parallel programming in Ada 202x:

     http://bit.ly/parallada-gangof4

If there is time at the upcoming ARG meeting, some discussion and feedback from
ARG members would be appreciated.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, June 17, 2014  12:52 PM

Are we going to get formal proposals for some/all of these features? For
instance, we have an open AI on global in/out (AI12-0079-1) which is certainly
more broadly applicable than just parallelism. There is also an AI for
Potentially_Blocking, AI12-0026-1, misnamed "Task_Safe", which is useful for any
program including tasking. In both of these cases, an important part of the
proposals will be how they interact with predefined and language-defined
subprograms. (If they're not specified for predefined "+", nothing useful could
be done, for example.)

As with all of these sorts of things, the devil is in the details. The basic
ideas look fine, but we'll never really know without detailed proposals that can
be tested against.

****************************************************************

From: Tucker Taft
Sent: Tuesday, June 17, 2014   1:54 PM

True, but it is not worth making a more detailed proposal if the overall
reaction is "yuck"!  ;-)

****************************************************************

From: Randy Brukardt
Sent: Tuesday, June 17, 2014   3:10 PM

Well, it looks great to me, but that's mainly because all of the details are
missing. A lot of stuff looks great without the details (it's how modeling
works, after all). The "yuck" only comes when the details are fleshed out...

****************************************************************

From: Bob Duff
Sent: Tuesday, June 17, 2014   5:52 PM

Sounds like there are some good ideas in the paper, but I am very strongly
opposed to ARG getting "detailed proposals" about how to do parallelism, or even
wasting time discussing the matter.  ARG should be moving more to the
"standardize existing practise" way of doing.  And that goes double for features
whose ONLY purpose is efficiency -- such as parallelism.

We (ARG) don't need detailed proposals in that area; we need an experimental
implementation, and MEASUREMENTS of the efficiency of realistic programs.  (And
don't try to show me speedups of the recursive Fibonacci -- that's downright
dishonest!)  Until then, ARG should avoid wasting any time on the matter.  Once
we have proven practise, then ARG should take charge of standardizing it.

My comments above don't apply so much to things like Globals, which are more
about correctness/safety than efficiency.  ARG could reasonably discuss Globals,
although even there, an experimental implementation would be helpful.

Regarding those: You can put Globals and Potentially_Blocking on a package.
That's good.  I think it should apply to child and nested packages of that
package. I think you also want a way to give a default for Overlaps_Storage: "no
overlaps unless specified, in this package and children".

****************************************************************

From: Robert Dewar
Sent: Wednesday, June 18, 2014  2:52 AM

> Sounds like there are some good ideas in the paper, but I am very
> strongly opposed to ARG getting "detailed proposals" about how to do
> parallelism, or even wasting time discussing the matter.  ARG should
> be moving more to the "standardize existing practise" way of doing.
> And that goes double for features whose ONLY purpose is efficiency --
> such as parallelism.

I am in full agreement with this.

At the last RTW, which I attended, I was struck by the fact that the RTW was
debating removing some features in Annex D which no one has ever implemented,
yet alone found useful in actual practice.

The RM is not a research proposal :-)

****************************************************************

From: Alan Burns
Sent: Wednesday, June 18, 2014  5:59 AM

The policy currently in Ada has been implemented by the Ada run-time from
Santandar, and is supported in EDF operating systems (not Ada), though I agree
these are mainly research and experimental platforms, not mainstream industrial
platforms.

****************************************************************

From: Robert Dewar
Sent: Wednesday, June 18, 2014  10:39 AM

But I am talking about the specific Ada features, which for sure have not been
used by anyone in an Ada program :-)

****************************************************************

From: Tucker Taft
Sent: Wednesday, June 18, 2014  8:38 AM

> The RM is not a research proposal :-)

I agree that many language standards come after the fact, and are used to try to
tame a wildly varying set of ad hoc extensions.  That has never been the story
with Ada.  There are relatively few extensions (other than as packages, pragmas,
or attributes) that Ada vendors or users try in any serious way, before the Ada
Rapporteur Group begins to talk seriously about them.  The ideal is that the ARG
gets the ball rolling, indicates a direction we are serious about, and then
vendors try prototyping the ideas, and then we come around to picking among the
"serious" ideas based on prototyping results.  I really don't see anyone beside
the ARG being taken seriously at this point as a fountain for "ideas worth
prototyping" (apologies to TED ;-).

****************************************************************

From: Stephen Michell
Sent: Sunday, June 22, 2014  10:24 PM

Attached is a report that ARG requested from Brad, Tucker  and myself in
investigating how OpenMP could be used to implement Ada parallelism. For
discussion at the ARG meeting this week.

-----

				Report to WG 9/ARG
 		  on the use of OpenMP for Parallelism in Ada

Stephen Michell
with contributions from
Luis Miguel Pinho
Brad Moore
Tucker Taft


Foreword

At the ARG meeting November 2014, ideas were presented about ways to
implement parallelism in Ada, and ways to interoperate with other
language systems while using the multicore capabilities of modern
computers. We were requested to investigate OpenMP from a technical
perspective as well as a business perspective as a potential solution
to implementing multicore parallelism in Ada.


OpenMP Characteristics

OpenMP is a set of libraries and language extensions that permit a
user to access the capabilities of multicore systems from within
certain languages - namely Fortran, C and C++.

OpenMP's parallelism is "explicit parallelism" - i.e. all aspects of
the parallelism are specified by compiler directives in the code that
invoke library routines at the place of the directive through OpenMP API's.

OpenMP lets the user specify code segments (usually "for loops",
blocks and functions) that can be executed by threads (possibly
operating on multiple cores) that are co-ordinated to implement an
algorithm. The language extensions are converted into library calls to
interact with a runtime to implement the algorithm.

In trying to implement OpenMP on an Ada system, one would note that
the concurrency model is quite different than Ada's. The
"threads"  that are created interact with other threads that use OpenMP
primitives. This means that they will not interact with Ada tasks
using Ada tasking primitives.

For the interoperability with Fortran, C and C++ systems that are
using OpenMP, an Ada OpenMP capability would be useful, whether or not
it interacted with Ada tasks. This brings us to the business model.

Business Model

To implement An OpenMP for Ada, one must participate in the OpenMP
community. One could develop compiler directives, either as aspects or
pragmas and convert calls to meet the API, but to have it
"standardized", one must be a full participant in the OpenMP group.

One can participate in OpenMP as a university, as an individual or as
a corporate participant. AdaCore was approached to see if they were
interested in joining OpenMP (and paying fees), but declined. There is
no point joining as individuals, as individuals cannot propose new
language specifications for standardization.

This leaves participation through a university. Barcelona
Supercomputing Group is a participant, and Miguel Pinho has approached them about
participating through Barcelona. Barcelona is receptive, but
participation requires that Miguel find a PhD student willing to
participate. To date there has been no final resolution.

There is a GNU OpenMP binding (GOMP) that supports the GNU Fortran, C
and C++ compilers. Ada could define a thin binding to that
interface. This would still leave the challenge of defining Ada syntax
for the OpenMP compiler directives and determining how such a library
would interact with the Ada runtime (tasking is not the only concern -
elaboration is also a potential issue).

Conclusion

There are business reasons and technical reasons for creating an
OpenMP Ada implementation. If an interface with OpenMP through
Barcelona Supercomputing Group becomes possible, further exploration should
happen. Otherwise, no further action in this area is anticipated.

****************************************************************

From: Brad Moore
Sent: Monday, October 3, 2016  10:09 AM

[This is version /02 of the AI - Editor.]

Here is my attempt to refine ideas from the gang of 4 working on parallelism,
so we might have something more to talk about on this topic at the upcoming
ARG meeting. There are many ideas to choose from, so I chose what I thought
were the most promising. I'm not sure if all 4 of us would agree, though it is
possible that we might be in agreement, as we haven't yet gotten around to
deciding on the best way forward. Most of the writing related to parallel
blocks is unchanged from the previous version of this AI, and was mostly
written by Tucker. The updates I have are mostly to do with parallel loops
and reduction.

I have attempted to put in wording for these changes, but the wording is not
polished. The wording hopefully does capture and describe the intent at least.

The main hope is to gain some feedback, to see if this direction makes sense,
or if other directions should be pursued.

It would probably also be good to discuss whether this parallel loop proposal
is an improvement over what was written in the previous version of the AI,
which discussed the use of parallel arrays.

****************************************************************

From: Brad Moore
Sent: Saturday, June 10, 2017  1:18 AM

[This is version /03 of the AI - Editor.]

Here is a bit of a redesign in the area of parallel loops, and a new approach,
parallel reduction expressions. The parallel block part is mostly unchanged.

The parallel loop design is much simpler. It now just does parallel iteration,
leaving parallel reductions to reduction expressions.

We can add reduction back to loops later, possibly in a different AI if we
decide we need it, but for the time being, it seems like reduction expressions
are a better fit for the problem.

Parallel reduction expressions are very similar to quantified expressions,
which is where I started thinking about this idea. It turns out that Tucker was
also thinking along these lines in Parasail, as Parasail apparently already has
this feature.

After some email exchange with the gang of 4, the more this idea evolved, the
more similar it became to the Parasail approach.

The basic idea is that a Reduction expression applies an operation using
iteration, where the result of the previous iteration is fed back as an
input into the next iteration.

Special syntax, "<value>", is used to indicate where this feedback occurs; it
also encloses the initial value to be used for the first iteration.

Here are a few examples that are in the AI, to give a basic idea.

    -- Sum of the elements of an array
    Sum : Integer := (for parallel Element of Arr => <0> + Element);

    -- Minimum value of an array
    Min : Integer :=
     (for parallel X of Arr => Integer'Min(<Integer'Last>,  X));

    -- Construct the alphabet
    Alphabet : Unbounded_String :=
      (for Letter in 'A' .. 'Z' => <Null_Unbounded_String> & Letter);

    -- Number of people who are 30 or older
    ThirtySomething : constant Natural :=
      (for P of parallel database =>
          <0> + (if Age(P) >= 30 then 1 else 0));

    -- Factorial function
    function Fact(N : Natural) return Natural is
      (for J in 1..N => <1> * J);

    -- Taxable income after deductions
    Taxable_Income : constant Float := Total_Income +
      (for Deduction of Deductions => <0.0> - Deduction);

    --  Compute Taylor expansion for Sin of X
    function Sin(X : Float; Num_Terms : Positive := 5) return Float is
      (for I in 1..Num_Terms =>
         <0.0> + (-1.0)**I * X**(2*I-1)/Float(Fact(2*I-1)));

    -- Compute Pi
    Number_Of_Steps : constant := 100_000;
    Step : constant := 1.0 / Number_Of_Steps;

    Pi : constant Long_Float := Step *
      (for I in parallel 1 .. Number_Of_Steps =>
          <0.0> + (4.0 / (1.0 + ((Long_Float (I) - 0.5) * Step)**2)));

    -- Print the sum of squares
    Put_Line ("Sum of Squares is" &
              Integer'Image(for I in 1 .. 10 => <0> + I**2));
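
As a purely sequential reading of the first example above, the expression
(for parallel Element of Arr => <0> + Element) behaves roughly like the
following expansion (illustrative only; Result is just a name for this sketch):

    declare
       Result : Integer := 0;            --  the <0> initial value
    begin
       for Element of Arr loop
          Result := Result + Element;    --  previous result fed back as input
       end loop;
       --  Result is the value of the reduction expression
    end;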

To see the similarity between Quantified expressions and the more general form,
Reduction expressions, consider:

    -- Quantified expressions about the graduating class

    All_Graduated : constant Boolean :=
      (for all Student of parallel Class => Passed_Exam (Student));

    Someone_Graduated : constant Boolean :=
      (for some Student of parallel Class => Passed_Exam (Student));


    -- The same result can be obtained by writing these as
    -- reduction expressions

    All_Graduated : constant Boolean :=
      (for Student of parallel Class =>
         <True> and Passed_Exam (Student));

    Someone_Graduated : constant Boolean :=
      (for Student of parallel Class =>
         <False> or Passed_Exam (Student));

I've taken a first crack at writing this up, but didn't spend a lot of time
wordsmithing. I'd hope to first get some feedback in Vienna to see if this idea
has merit.

****************************************************************

From: Randy Brukardt
Sent: Monday, June 12, 2017  5:35 PM

> Here is a bit of a redesign in the area of parallel loops, and a new
> approach, parallel reduction expressions. The parallel block part is
> mostly unchanged.

Here are a couple of barely informed comments on this.

First, a bit of a rant. It's disconcerting that we get a wildly different
proposal at each meeting. There's no opportunity to get the details right if we
change the basic proposal completely between every meeting.

In Pittsburgh, we spent a significant amount of time figuring out the right
model for reducers. I see here that all of the conclusions of that meeting have
been completely discarded in favor of what appears at first glance to be the
reducer-du-jour. Why should we even spend time trying to figure out this wildly
different proposal when it seems obvious that whatever we decide will be
discarded all over again next time?

I want to at least feel like we are making progress toward a solution, rather
than starting from nearly ground zero with each proposal. End of mini-rant.

...
> Special syntax, "<value>" is used to indicate where this feedback
> occurs, which also encloses the initial value to be used for the first
> iteration.

Regardless of the merits of the proposal itself (which I haven't thought about),
I don't think angle brackets are going to fly as syntax in Ada. There would be
substantial confusion with relational operators, which would likely make such
text unparsable. (We discussed this extensively with some of the assignment
target proposals; it didn't fly there and I don't see what could have changed in
8 months.)

You could use square or curly brackets for this, but in any case, such syntax
doesn't seem very Ada-like. I'd suggest using an attribute or "magic" function
call as the appropriate syntax.

> Here are a few examples that are in the AI, to give a basic idea.
>     -- Minimum value of an array
>     Min : Integer :=
>      (for parallel X of Arr => Integer'Min(<Integer'Last>,  X));

     Min : Integer :=
      (for parallel X of Arr => Integer'Min(Integer'Last'Initial,  X));

or

     Min : Integer :=
      (for parallel X of Arr => Integer'Min(Initial(Integer'Last),  X));

Maybe there is a better idea out there (neither of these look very promising),
but whatever we do needs to look and feel like Ada.

****************************************************************

From: Brad Moore
Sent: Tuesday, June 13, 2017  9:21 AM

> First, a bit of a rant. It's disconcerting that we get a wildly
> different proposal at each meeting. There's no opportunity to get the
> details right if we change the basic proposal completely between every
> meeting.
>
> In Pittsburgh, we spent a significant amount of time figuring out the
> right model for reducers. I see here that all of the conclusions of
> that meeting have completely been discarded in favor of what appears
> on first glance the reducer-de-jour. Why should we even spend time
> trying to figure out this wildly different proposal when it seems
> obvious that whatever we decide will be discarded all over again next time?

I started with that proposal, but the more I tried to apply it to real problems,
the more problems I saw.

To describe the problem in a nutshell, for parallel reduction problems there are
two mostly independent algorithms to be expressed, which I call the "what" and
the "how".

The "what" answers the question: What is the application doing?

- This includes the application logic that mostly remains the same regardless
  of whether parallelism is being applied or not. It also includes an expression
  of the desire that the application code should be optimized to run in parallel.

The "how" answers the question: How is the application code implemented?
e.g. How does the optimization such as parallelism happen? How is chunking
performed, how are the results being reduced and combined? With parallelism,
this is complex enough that it can be viewed as a separate algorithm being
overlaid on top of the application's algorithm.

I found that syntax that tries to express both "how" and "what" ends up being
messy, and too constraining.

One problem I could see is that using array syntax to hold chunk results implies
that there is an array of chunk results that exists at some point in time and
can be iterated over and examined. In the parallelism libraries I've been
working with, that isn't the case; the chunk results come and go and get
combined during the computation. There typically never is a point in time where
all the chunk results exist.

Having an array of chunk results implies that the reduction does not occur until
after the looping has completed for all the chunks. This eliminates a
parallelism possibility where the reduction occurs in parallel during the
computation of the loop, rather than sequentially after the looping of all the
chunks is complete.

There are some algorithms where reducers can be written that depend on the
reduction occurring in parallel during the loop processing.

If we look at the example from the minutes, to sum the elements of an array we
have;

    My_Sum := 0;

    parallel [(Num_CPUs*2)]
       Sum : array (<>) of Integer := 0;
    for I in 1 .. 1_000_000 loop
       Sum (<>) := @ + A(I);
    then
       for S of Sum loop
          My_Sum := @ + S;
       end loop;
    end parallel loop;

The middle loop expands into two loops:
    for Chunk_Index in Sum'Range loop
       for I in Split (1 .. 1_000_000, Chunk_Index, Num_CPUs*2) loop
          Sum (Chunk_Index) := @ + A(I);
       end loop;
    end loop;

The parallel array of Sum, the "then" keyword, and the explanation for the
middle loop demonstrate the intent that the reduction occurs after the main
loop processing.

The point of parallelism is to make the application's logic run as fast as
possible. Why are we attempting to go with syntax that injects sequential
processing when it isn't needed? It would be better if more parallelism could
be applied where possible, and here it is possible.

Having an array of chunk results also implies that the chunking is more static
in nature, suggesting that the computation of chunks occurs before the loop is
executed. That rules out other implementation possibilities where the chunking
is more dynamic in nature, such as work-stealing algorithms where chunks can be
dynamically resized as workers become free and steal work from chunks being
processed by other workers. Why are we choosing syntax that forces implementors
into a limited range of possible approaches?

If the chunking is computed before the loop begins executing, then it implies
there can be quite a large number of chunk results if there are a lot of loop
iterations to be chunked, as the chunks should be smaller to allow for better
load balancing between the cores. The number of chunks would be a calculation
similar to:

     Number_of_Cores * x, where x is some factor based on the number of iterations.

This could require larger storage, for the case where the reduction result
is a larger data type such as a large matrix.

With the parallelism libraries I have been working with, which use dynamic
chunking, there only needs to be a single chunk result for each core.

These are just some of the problems I see with the design we came up with last
meeting.

I also find that there is a lot we are expecting the reader to understand:
- The concept of the parallel arrays
- That chunking is being applied
- That each chunk is going to a different element in the array
- That a second loop that looks like a nested loop is not actually nested, but
  is executed after the outer loop, etc


Being able to express the above problem as an expression such as:

    (for I of Arr => <0> + I)

seems more useful: it is a lot simpler to parse for the reader, allows
implementors more freedom, allows for better parallelism, etc.

It also seems like an expression is a better tool for the job, when it comes to
reduction problems.

The point of looping is to apply some processing multiple times.
The point of an expression is to produce a result.

I would argue that the point of reduction problems is mainly to produce a
result. Looping is needed as part of the implementation to produce the result,
but looping is not the main goal here. It is secondary.

Further, this reduction expression syntax can be compatible with whatever
parallel loop syntax we might decide to go with, including what we came up with
at the last ARG meeting. However, this syntax raises the question of whether we
even need reduction syntax to be combined with loop syntax at all.

> I want to at least feel like we are making progress toward a solution,
> rather than starting from nearly ground zero with each proposal. End
> of mini-rant.

I feel like this is considerable progress toward a solution. It's not that this is
just another alternative that more or less accomplishes the same thing as what
was proposed during the ARG meeting. To me, this feels like a new idea that is a
significant improvement over what we had before.

> ...
>> Special syntax, "<value>" is used to indicate where this feedback
>> occurs, which also encloses the initial value to be used for the
>> first iteration.
>
> Regardless of the merits of the proposal itself (which I haven't
> thought about), I don't think angle brackets is going to fly as syntax
> in Ada. There would be substantial confusion with relational
> operators, which would likely make such text unparsable. (We discussed
> this extensively with some of the assignment target proposals; it
> didn't fly there and I don't see what could have changed in 8 months.)

But we already have angle brackets in Ada; <> box notation appears in several
places in the syntax. This is just another case of box, except that we stick a
value in between.  Tucker has already indicated several other possible syntax
replacements here, so if we don't like the box notation, it shouldn't be too
hard to come up with something else.

I think we could potentially just use <> also, without providing an initial
value, and just say that there has to be at least one iteration, and the initial
value is the value of the first iteration.

> You could use square or curly brackets for this, but in any case, such
> syntax doesn't seem very Ada-like.

It's hard to see how <> isn't Ada-like, since it's already in Ada.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, June 13, 2017 12:59 PM

...
> One problem I could see, is that by using array syntax to hold chunk
> results, implies that there is an array of chunk results that exists
> in time that can be iterated over and examined. In the parallelism
> libraries I've been working with, that isn't the case, the chunk
> results come and go and get combined during the computation. There
> typically never is a point in time where all the chunk results exist.

Arrays in this context are just syntax; there's no reason that the array has to
be materialized. So I find this concern mainly one of reading too much into the
syntax. One hopes that the array *isn't* materialized, and the actual semantic
rules allow that.

However, we want the syntax to be as easy as possible to understand by a normal
sequential programmer, and they certainly will understand the use of arrays in
this context.

...
> Being able to express the above problem as an expression such as;
>
>     (for I of Arr => <0> + I)
>
> Seems more useful, a lot simpler to parse for the reader, allows
> implementors more freedom, allows for better parallelism, etc

This seems very limited to me, as it limits what can be parallelized with
reductions to things that can be written as a single expression. It seems as
though people will end up writing helper functions (which necessarily reduce
parallelism as the compiler can't see the contents) or extremely complex single
expressions (hard to read, hard to debug).

It also seems to indicate (along with the thought to add anonymous subprograms,
lets, and the like) that some people want to force Ada into being a functional
language. Since there is a vocal part of the Ada community that finds functional
languages to be evil, that could be a very controversial direction. (I'm
undecided on this point.)

> > ...
> >> Special syntax, "<value>" is used to indicate where this feedback
> >> occurs, which also encloses the initial value to be used for the
> >> first iteration.
> >
> > Regardless of the merits of the proposal itself (which I haven't
> > thought about), I don't think angle brackets is going to fly as
> > syntax in Ada. There would be substantial confusion with relational
> > operators, which would likely make such text unparsable. (We
> > discussed this extensively with some of the assignment target
> > proposals; it didn't fly there and I don't see what could have
> > changed in
> 8 months.)
>
> But we already have angle brackets in Ada, <> box notation appears in
> several places in the syntax. This is just another case of box, except
> that we stick a value in between.

Umm, no. <> is a lexical symbol. If you try to put "something in between", you
have (at least) three lexical symbols: "<" expression ">", and that has to be
parsed to be understood.

> Tucker has already indicated several other possible syntax
> replacements here, so if we dont like the box notation, it shouldn't
> be too hard to come up with something else.
>
> I think we could potentially just use <> also, without providing an
> intial value, and just say that there has to be at least one
> iteration, and the intial value is the value of the first iteration.

Almost anything is better than "<expr>". Maybe not "+expr+". :-)

****************************************************************

From: Brad Moore
Sent: Thursday, June 15, 2017  2:04 AM

>> One problem I could see, is that by using array syntax to hold chunk
>> results, implies that there is an array of chunk results that exists
>> in time that can be iterated over and examined. In the parallelism
>> libraries I've been working with, that isn't the case, the chunk
>> results come and go and get combined during the computation. There
>> typically never is a point in time where all the chunk results exist.
>
> Arrays in this context is just a syntax, there's no reason that it has
> to be materialized. So I find this concern mainly one about reading
> too much into the syntax. One hopes that the array *isn't*
> materialized, and the actual semantic rules allow that.

It's not just about the storage layout though. If it looks like an array, people
will tend to use it that way. This includes doing slices, copying, bit
operations (if it's an array of bits), etc.

I can imagine cases where one will want to iterate through the array initially
after loop processing, perhaps to do intermediate processing and adjustments to
the values before applying the reduction. Nested functions may be needed to do
this. For example, one might want to apply data smoothing to the values. The
more one tries to use this as an array, the more I think the danger is that the
array ends up having to be the implementation.

> However, we want the syntax to be as easy as possible to understand by
> a normal sequential programmer, and they certainly will understand the
> use of arrays in this context.

That part is pretty easy to understand, since it looks like existing Ada.
However, the chunking, the meaning of (<>), and the overall construct, I would
think will be quite foreign to sequential programmers, and existing Ada
programmers.

> ...
>> Being able to express the above problem as an expression such as;
>>
>>     (for I of Arr => <0> + I)
>>
>> Seems more useful, a lot simpler to parse for the reader, allows
>> implementors more freedom, allows for better parallelism, etc
>
> This seems very limited to me, as it limits what can be parallelized
> with reductions to things that can be written as a single expression.
> It seems as thought people will end up writing helper functions (which
> necessarily reduce parallelism as the compiler can't see the contents)
> or extremely complex single expressions (hard to read, hard to debug).

Similar to the rewriting of array syntax, I would think the compiler could also
do some rearranging of at least certain helper functions, such as inline
functions, to further optimize what actually gets generated by the compiler, if
it is needed.

> It also seems to indicate (along with the thought to add anonymous
> subprograms, lets, and the like) that some people want to force Ada into
> being a functional language. Since there is a vocal part of the Ada
> community that finds functional languages to be evil, that could be a very
> controversial direction. (I'm undecided on this point.)

I think an important point to consider is whether this syntax is appealing
enough to add to Ada. It seems to be useful for things other than parallelism;
for example, it is convenient to be able to express factorial as an expression
function, where parallelism likely isn't needed or wanted.

It doesn't necessarily need to be the only way to write reductions, just as
quantified expressions can also be expressed in Ada in other ways, such as via
looping and function calls. We can also have a parallel loop solution, such as
the Pittsburgh proposal as well.

I think it is important to discuss the parallel reduction syntax however, because
  a) If we do want to support this, this likely will impact the syntax proposal
     for parallel loop reductions
  b) Even if we decide to not support this, (at least not now), it offers some
     ideas that still could impact the parallel loop proposal.

In particular, I'm referring to the idea for the Reducer aspect, and Identity
aspect that can be applied to function declarations. This is useful to separate
the "what" from the "how" as I described in my previous email, where the "what"
syntax shows what application code to be parallelized, and the "how" syntax
which is separate, provides details on how this is to happen.

The reducer for "+" is always going to be "+" itself, so why force the
programmer to write this out every time one wants to add numbers in parallel,
when this can be baked into the standard in the form of aspects applied to the
function declarations?
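
As a sketch of what that could look like (the aspect names follow this
discussion, but the exact form is undecided, and these declarations are
illustrative rather than quoted from the AI):

    --  A reducer function: (T, T) -> T, with its identity value
    function "+" (Left, Right : Integer) return Integer
      with Identity => 0;

    --  A function that is not itself a reducer could name the function
    --  used to combine partial results from different chunks
    function "&" (Left  : Unbounded_String;
                  Right : Character) return Unbounded_String
      with Reducer => "&";  --  the Unbounded_String & Unbounded_String version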

This could eliminate the need for both the "then" clause as well as the need for
the parallel array, which could result in a much simpler syntax construct
involving only adding the word parallel to a loop. This also makes sense because
adding parallelism will, I think, in a lot of cases be a later step in the
development process of an application - the optimisation phase.

In that case you have existing sequential code to be parallelized, and it is
desirable to make minimal changes to the code to get the parallelism.

Further, one might find that adding the parallelism doesn't help performance, or
may even make things worse, in which case the programmer will want to back out
those changes. It would be desirable to be able to undo changes with minimal
effort. For example, just adding or removing the keyword "parallel" would be
appealing.

If we take the sequential code:

   declare
      Sum : Integer := 0;
   begin
      for I in 1 .. 1_000_000 loop
         Sum := @ + I;
      end loop;

      Put_Line ("Sum:" & Integer'Image (Sum));
   end;

And convert it to parallel just by adding the parallel keyword

   declare
      Sum : Integer := 0;
   begin
      for I in parallel 1 .. 1_000_000 loop
         Sum := @ + I;
      end loop;

      Put_Line ("Sum:" & Integer'Image (Sum));
   end;

I think that is closer to what programmers would want to be able to do.

We could say that as written, the compiler would reject this, because the global
aspect could indicate this is a data race for parallel operations.

We could then have a Boolean aspect, "Reduction" that can be applied to a
variable declaration that indicates that variable involves parallel reduction.
This would then allow the compiler to accept the loop, and provides better
feedback to the user that parallel reduction is occurring.

i.e.

   declare
      Sum : Integer with Reduction := 0;
   begin
      for I in parallel 1 .. 1_000_000 loop
         Sum := @ + I;
      end loop;

      Put_Line ("Sum:" & Integer'Image (Sum));
   end;


The compiler and the reader can see that the operation being applied to the
global result is

   function "+" (L, R : Integer) return Integer with Identity => 0;

Both can also see that this is also a reducer function because it accepts two
parameters of the same type, and returns a result of that same type, and because
it has the Identity aspect. So this shows how the results of the chunks are
combined. The compiler now has the necessary inputs to be able to implement the
parallelism, and the reader who is mostly only concerned with the application
logic can see that the code is to run in parallel, and in many cases, the
programmer will not care how this happens, but if he does he can look up the
reducer function for the loop, as well as the identity value for that function.

I would think the sequential programmer would prefer to see the above solution
for parallel loops, over the syntax proposed in Pittsburgh.

This also is more harmonious with the parallel reduction expression syntax as
currently written in the AI.

Note that writing this as a reduction expression isn't a data race because the
result is computed before it is assigned to the global value, so in that case
the Reduction aspect would not be needed on the Sum declaration, though it could
be optionally added as a confirming aspect.

i.e. the above code could be replaced with;

   declare
      Sum : constant Integer := (for I in parallel 1 .. 1_000_000 => <> + I);
   begin
      Put_Line ("Sum:" & Integer'Image (Sum));
   end;

or even just;

   Put_Line ("Sum:" & Integer'Image (for I in parallel 1 .. 1_000_000 => <> + I));

****************************************************************

From: Brad Moore
Sent: Saturday, June 17, 2017 11:53 PM

I think we can further simplify and clarify the Reduction Expression syntax, and
at the same time eliminate some problems.

Specifically, we can eliminate the Reducer aspect and Identity aspect by saying
that the top-level operation of the reduction expression must be a reducer
function (i.e., a function that takes two arguments of the same type and
returns a result of that same type).

In that case, we would not need to know of an identity value.
Rather, the result of the first iteration of any chunk effectively becomes the
initial value for that chunk. The accumulation back into the same result only
occurs if there are at least two iterations for a chunk.
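
For example, a single chunk of (for X of Arr => <> + X) could be read roughly as
follows (Chunk_First, Chunk_Last and Partial are illustrative names only):

    declare
       Partial : Integer;
       Seeded  : Boolean := False;
    begin
       for X of Arr (Chunk_First .. Chunk_Last) loop
          if not Seeded then
             Partial := X;              --  first iteration seeds this chunk
             Seeded  := True;
          else
             Partial := Partial + X;    --  accumulate within the chunk
          end if;
       end loop;
       --  The Partial results of the chunks are then combined using "+",
       --  the top-level reducer function of the expression.
    end;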

The expression inside the box is then not the initial value, but instead is the
default value, whose only purpose is to provide the result of the expression if
the reduction expression iterates over a null range, i.e. 0 iterations.

A <> with no initial value would signify that there is no default value for the
expression, which implies that an exception would be raised if a null range were
specified.

This actually is needed for operations such as set intersection, where the
intersection of 0 sets is not defined mathematically, or is at least difficult
to produce a value for, if we wanted it to mean the universe.

consider:

We want to find the intersection of animals that can be found in a set of zoos.

Set_Array(Vienna_Zoo).Include("Gorilla");
Set_Array(Vienna_Zoo).Include("Giraffe");
...
Set_Array(Calgary_Zoo).Include("Giraffe");
Set_Array(Calgary_Zoo).Include("Buffalo");


Common_Animals : constant Set := (for Set of Set_Array => Intersection(<>, Set));

It would be pretty much impossible to create an initial value for a zoo that
contains all known animals.

Since that case is only needed when there are zero zoos being examined, we do
not need to supply that value. Also in this case, the programmer knows that
there are 1 or more zoos in the array, so the default value will not be
needed, and an exception would not be raised.

Probably it would make sense that the exception raised would be
Constraint_Error, if one did try to iterate over a null range with a Reduction
expression that did not provide a default value.

This solves another problem. Yesterday Erhard brought up the example of matrix
to the power of N.

Now that the Identity aspect is not needed, the Identity Matrix is only needed
as a default value for the reduction expression. This can now be trivially
supplied by the programmer by a function call.

eg.

   Matrix_To_The_Nth_Power : constant Matrix := (for I in 1 .. N => <Identity (A)> * A);

Where, Identity (A) is a function of a Matrix abstraction that returns the
Identity matrix for the matrix, A.

If we know that N is of subtype Positive, we can further simplify this to,

   Matrix_To_The_Nth_Power : constant Matrix := (parallel for I in 1 .. N => <> * A);

since the Identity matrix would not be needed to evaluate the expression.

You may ask, what about the case in the previous examples as currently written
in the AI where the top-level operation is not a reducer function?

The answer is that it should be trivial to rewrite the expression so that a
reducer function is used.

Instead of;

    Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => Append(<>, Letter));

One could write;

    Alphabet : constant Unbounded_String := (for Letter in 'A' .. 'Z' => <> & To_Unbounded_String ("" & Letter));

Does this seem like a way forward?

****************************************************************

From: Brad Moore
Sent: Sunday, June 18, 2017  12:26 AM

Another problem that would benefit from this simplification is bitwise "and" for
modular integer types.

It otherwise would be difficult to produce an initial value for modular integer
types where the modulus is not a power of two. Attempting to set all bits to 1
could result in a value that is outside the range of values for the type.

With this simplification, it should not be a problem anymore, since we are only
"and"ing together supplied values that are all valid, and the result will also
be valid and belong to the subtype.

eg.

   (for Value of Arr => <> and Value)

****************************************************************

From: Tucker Taft
Sent: Sunday, June 18, 2017  2:02 AM

I would rather leave it the way we discussed yesterday, as these changes in my
view are making the construct more complex from the user’s point of view.  It is
easy enough for the compiler to recognize special cases.  It is a pain for the
user to have to obey additional restrictions.

****************************************************************

From: Brad Moore
Sent: Sunday, July  9, 2017  1:23 AM

I have been thinking about iteration schemes for for loops, and I think there
are 5 that we'd want to support in Ada. By iteration scheme, I am referring to
the combination of chunking and iteration direction.

The 5 are;

- Forward Sequential
- Reverse Sequential
- Forward Parallel
- Reverse Parallel
- All     Parallel


- Forward Sequential
Already exists in Ada. (A for loop without the reverse keyword) Sequential and
Forward are both implicit. Semantically, this is a single chunk, executed by a
single tasklet, with iteration from lowest to highest.

- Reverse Sequential
Already exists in Ada. (A for loop with the reverse keyword) Sequential is
implicit, Reverse is explicit. Semantically, also a single chunk, executed by a
single tasklet, with iteration from highest to lowest.

- Forward Parallel
Forward is implicit, parallel is explicit.
The for loops with the parallel keyword are what we were discussing in Vienna.
Semantically, multiple chunks, each corresponding to a tasklet where iteration
within each chunk is from lowest to highest.

- Reverse Parallel
Both Reverse and Parallel are explicit.
A for loop with both the parallel keyword, and the reverse keyword.
Semantically, multiple chunks, each corresponding to a tasklet where iteration
within each chunk is from highest to lowest.

- All Parallel
Both all and Parallel are explicit.
A for loop with the parallel keyword, and the "all" keyword.
Semantically, each iteration is a separate chunk, or alternatively viewed as
there not being any chunks. There is no order to the iteration; any iteration
can occur before any other.

The case for reverse parallel would admittedly be quite rare, just as reverse
sequential loops are quite rare today in Ada; however, it is not hard to come up
with an example.

Consider finding the highest index of an array whose element satisfies some
property of interest. If chunking is involved, it makes sense to search in
reverse order, because once you've found a candidate within a chunk, you know
that there is no need to search further within that chunk, and processing can be
abandoned. With forward iteration, one cannot be sure that the candidate found
is the highest in the chunk without searching the entire chunk.

Eg.

    --  Note: We do not want exit to pro-actively terminate processing in
    --  other chunks here. We just want to exit processing for the
    --  current chunk if any of the exit conditions hold true.
    --  Updater is a protected object in this example.

    --  Find maximum array index whose element satisfies some property

    for I in reverse parallel Arr'Range loop

       -- This first check causes other chunks to stop processing
       -- if a higher chunk has already found a result
       if I < Updater.Value then
          exit;

       -- This second check abandons the current chunk if a
       -- result is found, since any other finds in the same
       -- chunk will have a lower index
       elsif Satisfies (Arr (I)) then
          Updater.Update (I);
          exit;
       end if;
    end loop;

    Put_Line ("The answer is" & Integer'Image (Updater.Value));

The all parallel case is useful when the programmer wants to do manual chunking.
This is useful for instance in problems such as generating a cumulative sum
array from another array, as two loops are needed and generally both loops
should be chunked the same way (a rough sketch follows below).
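
For illustration only, here is one way such manual chunking might look, assuming
the proposed "all parallel" form; the names Num_Chunks, Chunk_Bounds, Partial,
and Cumulative are made up for this sketch and are not part of the proposal:

    --  Pass 1: each chunk sums its own slice, one tasklet per chunk
    for C in all parallel 1 .. Num_Chunks loop
       Partial (C) := 0;
       for I in Chunk_Bounds (C).First .. Chunk_Bounds (C).Last loop
          Partial (C) := @ + A (I);
       end loop;
    end loop;

    --  Sequential scan: turn the per-chunk sums into running offsets
    for C in 2 .. Num_Chunks loop
       Partial (C) := @ + Partial (C - 1);
    end loop;

    --  Pass 2: each chunk fills in its slice of the cumulative sum,
    --  using exactly the same chunk boundaries as pass 1
    for C in all parallel 1 .. Num_Chunks loop
       declare
          Running : Integer := (if C = 1 then 0 else Partial (C - 1));
       begin
          for I in Chunk_Bounds (C).First .. Chunk_Bounds (C).Last loop
             Running := @ + A (I);
             Cumulative (I) := Running;
          end loop;
       end;
    end loop;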

Another useful case for all parallel loops is when one wants to mix sequential
processing within the parallel processing, for instance if one wants to output
some results at the end of each iteration through the chunks. Consider the
following example to compute a 7 day weather forecast for various cities in the
world, as well as an average of all these values for each day.

    declare
       type Cities is (London,
                       Tokyo,
                       New_York,
                       Beijing,
                       Calgary,
                       Paris,
                       Los_Angeles);

       Today : constant Calendar.Time := Calendar.Clock;
       Forecast : array (Cities) of Float := (others => 0.0);

       Done_Parallelism, Done_Sequential :
         Synchronous_Barriers.Synchronous_Barrier
           (Release_Threshold => Forecast'Length);

       function Max_Temp
          (City : Cities; Day : Calendar.Time) return Float is
       begin
         ...
          --  Presumably some complex computer intensive calculation....
          return Result;
       end Max_Temp;

       Notified : Boolean := False;
       use type Calendar.Arithmetic.Day_Count;
    begin

       Day_Loop :
       for Day in 1 .. 7 loop -- Compute 7 day forecast

          City_Loop :
          for City in all parallel Cities'Range loop

             -- Calculated in parallel for each city
             Forecast (City) :=
               Max_Temp (City,
                         Today + Calendar.Arithmetic.Day_Count (Day));

             Synchronous_Barriers.Wait_For_Release (Done_Parallelism,
                                                    Notified);

             -- Transitioning from parallel to sequential
             if Notified then

                Output_Loop :
                for City in Cities'Range loop
                   Put (Cities'Image (City) &
                          Float'Image (Forecast (City)) & ' ');

                   -- Note use of parallel reduction expression

                   Put ("World Avg" &
                          Float'Image
                          ((for Temp of parallel Forecast =>
                            <0.0> + Temp)
                           / Float (Forecast'Length)));
                end loop Output_Loop;
             end if;

            -- Transitioning from Sequential back to Parallel
             Synchronous_Barriers.Wait_For_Release (Done_Sequential,
                                                    Notified);
          end loop City_Loop;
       end loop Day_Loop;

    end;

The synchronous barriers are used to transition between parallel and sequential
processing, and then back to parallel processing for each day of the
forecast. For this to work, each iteration in City_Loop must iterate in parallel
with each other since the Release_Threshold of the barriers needs to be known in
advance. The calculation of each forecast for each city for each day is done in
parallel, but presumably, the weather calculations are such that one cannot
proceed to the next day until all the weather forecasting is complete for the
previous day for all the cities.

I know in Vienna, we had taken a straw poll, which favored putting the keyword
parallel before the word "for" in the for loop. While this is a useful data
point, I think if we want to support "all parallel" loops, it makes a lot
more sense to have all parallel together after the word "in" or "of".

eg
    for I in all parallel 1 .. 10 loop

or

    for Element of all parallel List loop

rather than:

    all parallel for I in 1 .. 10 loop

or

    parallel all for I in 1 .. 10 loop


as it reads more like an English sentence, similar to reverse parallel loops.

    for I in reverse parallel 1 .. 10 loop

instead of:

    parallel for I in reverse 1 .. 10 loop

The other point that was not made in Vienna is that there is a case for saying
that the syntax for specifying the iteration scheme should be grouped together
in the same place in the syntax.

Having parallel at the beginning of the loop statement and reverse in the
middle of the loop statement means the reader has to look in two places to
determine the iteration scheme.

To recap some of my other arguments that were made in Vienna, which might
resonate better with the above considerations:

1. for ... loop  identifies the control construct which is the most important
   part of the syntax. That the iterations occur in parallel, reverse, forward,
   etc is secondary, which suggests that parallel is not the most important part
   of the syntax and therefore should not appear first.

2. This point was not specifically made in Vienna, but the parallel block is
   also a control structure, so in that case, it makes sense that parallel is
   the first word.

3. In Vienna, we decided that since parallel for loops with parallel at the
   front of the syntax would have some ambiguity with parallel blocks, we felt
   we had to add the begin keyword to the parallel blocks. I did not like that
   we were being forced to alter our other syntax on account of the choice of
   parallel keyword placement in for loops. In the examples I have used, I have
   found that any declarations typically need to have greater scope than the
   parallel block, so it is not clear to me that having a declarative part as
   part of the parallel block is particularly useful.

Comments?

****************************************************************

From: Tucker Taft
Sent: Sunday, July  9, 2017  8:08 AM

...
> Both Reverse and Parallel are explicit.
> A for loop with both the parallel keyword, and the reverse keyword.
> Semantically, multiple chunks, each corresponding to a tasklet where
> iteration within each chunk is from highest to lowest.
>

Given where we are placing the “parallel” word in the latest proposal discussed
by the ARG (at the beginning), there seems no particular reason to disallow
“reverse.”  But I don’t find your example very compelling:

...
> Consider finding the highest index of an array whose element
> satisfies some property of interest. If chunking is involved, it makes
> sense to want to search in reverse order, because once you've found a
> candidate, you know that there is no need to search further within the
> chunk, and processing can be abandoned. …

>
> Eg.
>
>   --  Note: We do not want exit to pro-actively terminate processing in
>   --  other chunks here. We just want to exit processing for the
>   --  current chunk if any of the exit conditions hold true.
>   --  Updater is a protected object in this example.
>
>   --  Find maximum array index whose element satisfies some property
>
>   for I in reverse parallel Arr'Range loop
>
>      -- This first check causes other chunks to stop processing
>      -- if a higher chunk has already found a result
>      if I < Updater.Value then
>         exit;
>
>      -- This second check abandons the current chunk if a
>      -- result is found, since any other finds in the same
>      -- chunk will have a lower index
>      elsif Satisfies (Arr (I)) then
>         Updater.Update (I);
>         exit;
>      end if;
>   end loop;
>
>   Put_Line ("The answer is" & Integer'Image (Updater.Value));

The semantics for exiting a loop early as currently proposed is that if any
iteration invokes an “exit,” all other iterations are abandoned.  So you
wouldn’t end up with the highest index overall.  You would end up with the one
that is the highest within its chunk, but which chunk that is would be somewhat
random.  And along the way you have added a synchronized operation in each
iteration (if I < Updater.Value), which seems highly likely to slow things down
significantly.

One place where "reverse parallel" definitely makes sense is in the proposed
map/reduce construct, where the “parallel” is really about doing the “map” part
in parallel, and then doing the “reduction” part in some kind of parallel tree,
preserving order.  So “reverse” has a well-defined meaning.  I don’t see
“reverse” having much meaning in the more conventional parallel iteration
schemes.  It just produces a bit of a bias within each chunk, but has no
particularly well-defined overall semantic effect.

...
> - All Parallel
> Both all and Parallel are explicit.
> A for loop with the parallel keyword, and the "all" keyword.
> Semantically, each iteration is a separate chunk, or alternatively
> viewed as there not being any chunks. There is no order to the
> iteration, any iteration can occur before any other.

I would say using “all” to signify a universal quantified expression as well as
this new idea is asking for confusion.

...
> The all parallel case is useful when the programmer wants to do manual
> chunking. This is useful for instance in problems such as generating a
> cumulative array sum of another array, as two loops are needed and
> generally both loops should be chunked the same way.
...
> The synchronous barriers are used to transition between parallel and
> sequential processing, and then back to to parallel processing for
> each day of the forecast. For this to work, each iteration in
> City_Loop must iterate in parallel with each other since the
> Release_Threshold of the barriers needs to be known in advance. The
> calculation of each forecast for each city for each day is done in
> parallel, but presumably, the weather calculations are such that one
> cannot proceed to the next day until all the weather forecasting is complete
> for the previous day for all the cities.

I guess I understand the semantics, but it just feels too complicated to me.
For something like this, two separate parallel loops, or an explicit array of
tasks might be more appropriate.  Trying to synchronize across tasklets is going
to be complex, and the semantics will be subject to a lot of caveats, I suspect.
Tasklets can generally use run-to-completion semantics, and that clearly won’t
work here with barriers in the middle.  If you want a barrier, you can end the
parallel loop, and then start a new one, which creates a natural, highly
visible, easy-to-understand barrier.

> I know in Vienna, we had taken a straw poll, which favored putting the
> keyword parallel before the word "for" in the for loop. While this is
> a useful data point, I think if we want to support "all parallel"
> loops, I find it makes a lot more sense to have all parallel together
> after the word "in" or "of”.  …

Personally I would rather not re-open this decision.  We have already gone round
and round on a number of these issues.  The usual rule is that only someone who
voted *for* the chosen approach can re-open the discussion.

****************************************************************

From: Brad Moore
Sent: Monday, July 10, 2017  2:24 PM

>> … - Reverse Parallel
>> Both Reverse and Parallel are explicit.
>> A for loop with both the parallel keyword, and the reverse keyword.
>> Semantically, multiple chunks, each corresponding to a tasklet where
>> iteration within each chunk is from highest to lowest.
>>
>
> Given where we are placing the “parallel” word in the latest proposal
> discussed by the ARG (at the beginning), there seems no particular reason to
> disallow “reverse.”  But I don’t find your example very compelling:

I will try to see if I can come up with some more compelling examples.
But I agree that reverse parallel loops are not likely to be used much.
It might be that the best use case is finding the max occurrence of something,
similar to the example I provided, which is not a common problem, but might be
common enough to allow the syntax to cover that case.

...
> The semantics for exiting a loop early as currently proposed is that if any
> iteration invokes an “exit,” all other iterations are abandoned.

Not quite, the AI adds this note:

"Note that aborting a tasklet need not be preemptive, but should prevent the
initiation of further nested parallel blocks or parallel loops."

So if the loop is running its chunks non-preemptively, then presumably the code
would work as I suggested.

I think there are two cases to consider.
1) A high number of iterations with small processing per iteration
      eg. (incrementing integer elements of a large array)
2) A small number of iterations with large processing per iteration.
      eg. (Calculating weather for a handful of cities)


In case 1, the majority of time is being spent doing the iteration. I suspect
that preemptive tasklet abortion in that case will introduce significant
overhead and interfere too much with the parallelism. This is better suited for
the non-preemptive case.

In case 2, the actual iteration is insignificant, and there is much to be gained
by preemptively aborting other tasklets.

I can imagine we might want to give the programmer control of this decision via
an aspect or pragma, such as Preemptive_Aborts but otherwise let the compiler
decide how to deal with such transfer of control.

> One place where "reverse parallel" definitely makes sense is in the proposed
> map/reduce construct, where the “parallel” is really about doing the “map” part
> in parallel, and then doing the “reduction” part in some kind of parallel tree,
> preserving order.

Agreed.

...
>> - All Parallel
>> Both all and Parallel are explicit.
>> A for loop with the parallel keyword, and the "all" keyword.
>> Semantically, each iteration is a separate chunk, or alternatively
>> viewed as there not being any chunks. There is no order to the
>> iteration, any iteration can occur before any other.
>
> I would say using “all” to signify a universal quantified expression as well
> as this new idea is asking for confusion.

I don't think this is a real problem, just as it isn't for using "all"
for dereferencing access objects, because the surrounding context is different.
As in English, "all" typically describes what follows:

for all X ...

    -- meaning for all values of X

for X in all parallel ....

     -- meaning for all parallel iterations

Also Quantified expressions and Loops statements appear in very different
contexts in Ada code.

> ...
>> The synchronous barriers are used to transition between parallel and
>> sequential processing, and then back to to parallel processing for
>> each day of the forecast. For this to work, each iteration in
>> City_Loop must iterate in parallel with each other since the
>> Release_Threshold of the barriers needs to be known in advance. The
>> calculation of each forecast for each city for each day is done in
>> parallel, but presumably, the weather calculations are such that one
>> cannot proceed to the next day until all the weather forecasting is complete
>> for the previous day for all the cities.
>
> I guess I understand the semantics, but it just feels too complicated to me.
> For something like this, two separate parallel loops, or an explicit array
> of tasks might be more appropriate.  Trying to synchronize across tasklets
> is going to be complex, and the semantics will be subject to a lot of
> caveats, I suspect.  Tasklets can generally use run-to-completion semantics,
> and that clearly won’t work here with barriers in the middle.  If you want a
> barrier, you can end the parallel loop, and then start a new one, which
> creates a natural, highly visible, easy-to-understand barrier.

To do this as two separate loops, you need to store much more state. In this
case, you'd need a two-dimensional array to store all the values for all days
of the forecast. Using barriers, you only need to track the current day's
forecast, so a one-dimensional array works here.

There are other similar problems that benefit from the use of barriers.

One is the n-body problem, where the masses of heavenly bodies and directions
of movement are tracked so that the orbital paths can be predicted. The
technique is to use small increments of time and then update the locations of
all the heavenly bodies taking gravitational pulls into account; the positions
of the heavenly bodies can then be output for each successive time increment.

Using barriers, only the current location of each body needs to be tracked.
Using your suggestion of two separate loops, one would have to store all the
updated movements in memory before outputting the results. This could
potentially, and literally, require an astronomical amount of storage.

Using barriers really is the only reasonable solution for this, that I can see.
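
For illustration, the stepping loop could have roughly the following shape,
mirroring the weather forecast example; this assumes the proposed "all parallel"
syntax, and the names (State, New_State, Compute_Step, Output_Positions,
Done_Compute, Done_Publish, Num_Steps, Notified) are made up for this sketch:

    for B in all parallel Bodies'Range loop
       for Step in 1 .. Num_Steps loop
          --  Read the shared previous state; write only this body's slot
          New_State (B) := Compute_Step (B, State);

          Synchronous_Barriers.Wait_For_Release (Done_Compute, Notified);
          if Notified then
             State := New_State;            --  one tasklet publishes the step
             Output_Positions (State);
          end if;

          --  Hold everyone until the new state has been published
          Synchronous_Barriers.Wait_For_Release (Done_Publish, Notified);
       end loop;
    end loop;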

Yet another example of this sort of problem is solving matrices in linear
algebra using Gaussian elimination.  As each column is processed, barriers
allow one to find the pivot point for the next column. I don't think I can
imagine how to solve this by breaking it into separate loops (not that I have
tried), but it seems like it would be very difficult. I think using barriers is
the best solution for that problem when solving in parallel.

I also think in the AI we already talk about calling protected objects from
tasklets.

" calls by different tasklets of the same task into the same protected object
are treated as different calls resulting in distinct protected actions;
therefore synchronization between tasklets can be performed using non-blocking
protected operations. Note that this is consistent with the current standard
which already supports multiple concurrent calls by a single task in the
presence of the asynchronous transfer of control capability (RM 9.7.4 (23))."

So I think the current AI design suggests that barriers should work as I
describe in the example.

>> I know in Vienna, we had taken a straw poll, which favored putting
>> the keyword parallel before the word "for" in the for loop. While
>> this is a useful data point, I think if we want to support "all
>> parallel" loops, I find it makes a lot more sense to have all
>> parallel together after the word "in" or "of”.  …
>
> Personally I would rather not re-open this decision.  We have already gone
> round and round on a number of these issues.  The usual rule is that only
> someone who voted *for* the chosen approach can re-open the discussion.

I wouldn't ordinarily have suggested reexamining this issue, but I would hope
that it should be OK for anyone to open a discussion if new ideas and arguments
are brought to light that were not considered as inputs in the original
decision.

****************************************************************

From: Tucker Taft
Sent: Monday, July 10, 2017  2:43 PM

>>> ...
>> The semantics for exiting a loop early as currently proposed is that if any
>> iteration invokes an “exit,” all other iterations are abandoned.
>
> Not quite, the AI adds this note:
>
> "Note that aborting a tasklet need not be preemptive, but should
> prevent the initiation of further nested parallel blocks or parallel loops."
>
> So if the loop is running its chunks non-preemptively, then presumably
> the code would work as I suggested.

That is certainly not something to be relied upon.  And even if things run
non-preemptively, there might be tasklets that have never been started, so
certainly those would not start after an “exit.”

> I think there are two cases to consider.
> 1) A high number of iterations with small processing per iteration
>     eg. (incrementing integer elements of a large array)
> 2) A small number of iterations with large processing per iteration.
>     eg. (Calculating weather for a handful of cities)
>
>
> In case 1, the majority of time is being spent doing the iteration. I
> suspect that preemptive tasklet abortion in that case will introduce
> significant overhead and interfere too much with the parallelism. This
> is better suited for the non-preemptive case.
>
> In case 2, the actual iteration is insignificant, and there is much to
> be gained by preemptively aborting other tasklets.
>
> I can imagine we might want to give the programmer control of this
> decision via an aspect or pragma, such as Preemptive_Aborts but
> otherwise let the compiler decide how to deal with such transfer of
> control.

This is adding yet more complexity to the model.  And as pointed out above,
if you have more tasklets than cores, some of them might not be started at
all before the “exit."

>>> ...
>>>
>> I guess I understand the semantics, but it just feels too complicated to
>> me.  For something like this, two separate parallel loops, or an explicit
>> array of tasks might be more appropriate.  Trying to synchronize across
>> tasklets is going to be complex, and the semantics will be subject to a lot
>> of caveats, I suspect.  Tasklets can generally use run-to-completion
>> semantics, and that clearly won’t work here with barriers in the middle.
>> If you want a barrier, you can end the parallel loop, and then start a new
>> one, which creates a natural, highly visible, easy-to-understand barrier.
>
> To do this as two separate loops, you need to store much more state.

You should factor in the amount of space needed to store the stacks of all of
the temporarily suspended tasklets.  You are requiring that all but one tasklet
be suspendible until they all reach the barrier.

****************************************************************

From: Randy Brukardt
Sent: Monday, July 10, 2017  3:16 PM

Just a few words on this, because I don't think it pays to get bogged down in
detail...

...
> So if the loop is running its chunks non-preemptively, then presumably
> the code would work as I suggested.
>
> I think there are two cases to consider.
> 1) A high number of iterations with small processing per iteration
>       eg. (incrementing integer elements of a large array)
> 2) A small number of iterations with large processing per iteration.
>       eg. (Calculating weather for a handful of cities)

This is the crux of the issue. There clearly are multiple termination approaches
that make sense for a parallel loop, and those are mostly independent of the
form of the loop itself. (And there's more than two!)

   Preemptive vs. not preemptive
   Order significant vs. not significant

Specifically, one possibility is the proposed rule that an exit of any tasklet
terminates the others. Another possibility is that the exit effectively works
the same as a sequential loop -- meaning that an exit terminates any tasklets
with indices above the exiting one, and the tasklets with indices below
continue (this would work well for the search application among others). Since
the latter is likely to have a higher overhead, one probably would want the
option to select the termination mechanism.

There are other details that might be worth controlling.

My overall point is that it is unlikely that the details could be controlled
solely with keywords; there are just too many. We need something like an aspect
specification for these loops so that the details can be controlled without
introducing loads of keywords (which will be hard to remember).

And the default should be that semantics are as close as reasonable to
sequential loops; many cases can use simpler termination but I think the default
should be fewest surprises.

...
> >> I know in Vienna, we had taken a straw poll, which favored putting
> >> the keyword parallel before the word "for" in the for loop. While
> >> this is a useful data point, I think if we want to support "all
> >> parallel" loops, I find it makes a lot more sense to have all
> >> parallel together after the word "in" or "of".  .
> >
> > Personally I would rather not re-open this decision.  We
> have already gone round and round on a number of these issues.  The
> usual rule is that only someone who voted *for* the chosen approach
> can re-open the discussion.
>
> I wouldn't ordinarily have suggested reexamining this issue, but I
> would hope that it should be oK for anyone to open a discussion if new
> ideas and arguments are brought to light that were not considered as
> inputs in the original decision.

There are always going to be "new ideas" for an AI like this one that is at the
bleeding edge. Allowing unlimited reopening only ensures that nothing ever will
be decided and no progress ever will be made. That's certainly the case for this
AI in the last year or so. Keep in mind that to stay on schedule this thing has
to be mostly frozen in little over a year. Going back to square one for every
decision ensures that there is almost no chance of that happening. (The only
reason IMHO that we didn't completely change parallel blocks at this last
meeting is that we barely talked about them.)

****************************************************************

From: Brad Moore
Sent: Monday, July 11, 2017  9:18 AM

>>>> ...
>>> The semantics for exiting a loop early as currently proposed is that if any
>>> iteration invokes an “exit,” all other iterations are abandoned.
>>
>> Not quite, the AI adds this note:
>>
>> "Note that aborting a tasklet need not be preemptive, but should
>> prevent the initiation of further nested parallel blocks or parallel loops."
>>
>> So if the loop is running its chunks non-preemptively, then
>> presumably the code would work as I suggested.
>
> That is certainly not something to be relied upon.  And even if things run
> non-preemptively, there might be tasklets that have never been started, so
> certainly those would not start after an “exit.”

I think if we did support some sort of non-preemptive mode, probably the most
sensible behaviour for that mode is to just let the tasklets complete on their
own, without any attempt to shut them down midstream. Any tasklets that hit the
exit, or other transfer of control would terminate early, but otherwise tasklets
generated at a POP would run to completion.

...
>> I can imagine we might want to give the programmer control of this
>> decision via an aspect or pragma, such as Preemptive_Aborts but
>> otherwise let the compiler decide how to deal with such transfer of
>> control.
>
> This is adding yet more complexity to the model.  And as pointed out above, if
> you have more tasklets than cores, some of them might not be started at all
> before the "exit."

I think there are a number of controls that the real time community might want,
for example. I think all of these should be the subject of a different AI
though, as they can be added later on their own merit, including an explicit
non-preemptive mode for tasklet termination, which I think should probably allow
starting after an "exit" (or continuing after an exit if tasklets get initially
created to cover the full iterations of the for loop, which tends to be how I
look at things.)

...
>>>>
>>> I guess I understand the semantics, but it just feels too complicated to me.
>>> For something like this, two separate parallel loops, or an explicit array
>>> of tasks might be more appropriate.  Trying to synchronize across tasklets
>>> is going to be complex, and the semantics will be subject to a lot of
>>> caveats, I suspect.  Tasklets can generally use run-to-completion
>>> semantics, and that clearly won’t work here with barriers in the middle.
>>> If you want a barrier, you can end the parallel loop, and then start a new
>>> one, which creates a natural, highly visible, easy-to-understand barrier.
>>
>> To do this as two separate loops, you need to store much more state.
>
> You should factor in the amount of space needed to store the stacks of all
> of the temporarily suspended tasklets.  You are requiring that all but one
> tasklet be suspendible until they all reach the barrier.

Yes, but that storage is bounded, and the programmer can always do manual
chunking if necessary to bring the number of chunks down to the number of
available cores, or a small multiple of that. The simulation can run
indefinitely in theory without having to worry about running out of storage.
This tasklet stack storage becomes insignificant for the n-body problem,
compared to the real bounds of N * state, where N is the number of bodies, and
state is the amount of state variables for each body.

With separate loops however, the storage needed for the simulation would be
unbounded, in the sense that the longer the simulation runs, the more storage
is needed, and you need much more of it.

The needs become N * state * time-increments, where time-increments are the
number of simulation steps that are being computed. The notion of running a
simulation for an unspecified amount of time becomes problematic, because of the
risk of running out of storage.
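
For a rough illustration (the numbers here are picked arbitrarily): with N =
1000 bodies and, say, 7 state variables of 8 bytes each, the barrier
formulation needs on the order of 56 KB of state no matter how long the
simulation runs, whereas keeping per-step state for a run of 1_000_000
time-increments would need on the order of 56 GB.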

I think this barrier pattern could be commonly applied to a lot of
simulation/modeling scenarios.

****************************************************************

From: Brad Moore
Sent: Tuesday, July 11, 2017  9:45 AM

> This is the crux of the issue. There clearly are multiple termination
> approaches that make sense for a parallel loop, and those are mostly
> independent of the form of the loop itself. (And there's more than
> two!)
>
>     Preemptive vs. not preemptive
>     Order significant vs. not significant

In the gang of four, we had discussed the idea of declaring a local subtype for
a loop iterator, and then applying aspects to that declaration to provide
specific controls.

eg.

declare
   subtype Loop_Iterator is Integer range 1 .. 100000
      with Executors => 10, Chunk_Size => 100,
           Load_Sensitivity, Preemptive_Termination => False;
begin
   parallel for I in Loop_Iterator'Range loop
      ....
   end loop;
end;

Adding these sort of specific controls should be a separate AI, and can be added
later independent of this AI.

We could say that the "all parallel" notion could be a similar aspect that is
specified this way, but to me at least it seems like this would be an
inconsistency, since all other iteration schemes are built right into the loop
syntax. I think the "all parallel" scheme is useful enough that it ought to be
visible at the same level as the parallel keyword, and the reverse keyword,
regardless of whether these two keywords appear adjacent to each other, or
separated as we have coming out of Vienna.

The question is how best to do this in the syntax. It seemed to fit nicely in
with the version of the AI going into Vienna. It doesn't seem to fit very well
with the version of the AI we'll have coming out of Vienna, but maybe we'll
either live with that, or come up with a better idea.

...
> Specifically, one possibility is the proposed rule that an exit of any
> tasklet terminated the others. Another possibility is that the exit
> effectively works the same as a sequential loop -- meaning that an
> exit terminates any tasklets with indicies above the exiting one, and
> the tasklets with indicies below continue (this would work well for
> the search application among others). Since the latter is likely to
> have a higher overhead, one probably would want the option to select
> the termination mechanism.
>
> There's other details that might be worth controlling.
>
> My overall point is that it is unlikely that the details could be
> controlled solely with keywords, there are just too many. We need
> something like an aspect specification for these loops so that the
> details can be controlled without introducing loads of keywords (which will be
> hard to remember).

I think an aspect specification like the one I described above would cover most
of these, but "all parallel" feels like it should be more integrated with the
loop syntax, both for consistency and because it affects semantic behaviour. I
think enough problems would need to rely on these semantics to justify that. Most other
controls such as number of executors, or chunk size do not really affect the
semantics. They just impact the speed of the processing.

> And the default should be that semantics are as close as reasonable to
> sequential loops; many cases can use simpler termination but I think
> the default should be fewest surprises.

Agreed.

****************************************************************

From: Tucker Taft
Sent: Tuesday, July 11, 2017  9:55 PM

>>> So if the loop is running its chunks non-preemptively, then
>>> presumably the code would work as I suggested.
>>
>> That is certainly not something to be relied upon.  And even if things run
>> non-preemptively, there might be tasklets that have never been started, so
>> certainly those would not start after an “exit.”
>
> I think if we did support some sort of non-preemptive mode, probably the most
> sensible behaviour for that mode is to just let the tasklets complete on their
> own, without any attempt to shut them down midstream. Any tasklets that hit
> the exit, or other transfer of control would terminate early, but otherwise
> tasklets generated at a POP would run to completion.  ...

I believe this is creating two significantly different semantics for a single
syntactic construct, which is bad news in my view.  You are heading toward
“programming by pragma” (or "by aspect"), I fear.  That is largely what OpenMP
does, and I believe we should avoid it.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 19, 2017  8:36 PM

...
> > And the default should be that semantics are as close as reasonable
> > to sequential loops; many cases can use simpler termination but I
> > think the default should be fewest surprises.
>
> Agreed.

(I think this is the only thing that I wrote that Brad agreed with, but don't
look behind that curtain! :-)

I had the thought that any loop exits have to be explicitly in the loop body
(they can't be in some subprogram that is called). Thus, we could make the
termination semantics depend upon that. (The same applies to AI12-0189-1, the
loop body as anonymous procedure, BTW).

The basic idea is that one could depend on the loop index value determined at
the point of an exit; if the loop finishes normally or exits abnormally, you
couldn't have such a dependence. Thus if there is no exit, the termination is a
bit simpler (not much, really); if there is, we require the loop to get the same
loop index as the similar sequential loop.

You could use an exception for a non-local exit, but that is so ugly I think it
would be OK to ignore it for the purposes of termination semantics.

Thus a loop like the following would work in parallel:

             parallel for I in 1 .. Limit loop
                if DB(I).Name = Name_to_Find then
                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
                   exit; -- Done looking.
                -- else continue
                end if;
             end loop;

P.S. Unrelated aside: I have to wonder about the wisdom of considering atomic
objects task-safe, since Ada doesn't have any form of atomic update. (The only
safe atomic writing operation is initialization.) Almost any use of an atomic
object would have a race condition. I originally wrote the above My_PO use as:
          if Result > I then
              Result := I;
          end if;
with Result an atomic object initialized to Integer'Last (something I do
sequentially all the time) -- but the above is a race condition and definitely
is NOT task-safe. The same would be true for any other test of an atomic object
followed by a write of that object -- but that is the natural sequential code.

It seems better (given the current facilities) to insist on a PO here (since the
natural code in a PO would do the right thing).
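
As a rough sketch of what I mean (purely illustrative -- My_PO and
Save_Found_Index are the names from the example above, and Found_Index is just
a hypothetical reader):

   protected My_PO is
      procedure Save_Found_Index (I : Integer);
      function Found_Index return Integer;
   private
      Result : Integer := Integer'Last;
   end My_PO;

   protected body My_PO is
      procedure Save_Found_Index (I : Integer) is
      begin
         if I < Result then
            Result := I; -- test and update happen inside one protected action
         end if;
      end Save_Found_Index;

      function Found_Index return Integer is
      begin
         return Result;
      end Found_Index;
   end My_PO;

Because the test and the assignment are both inside the protected action, there
is no window for another tasklet to sneak in between them, which is exactly the
window that the atomic version above leaves open.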

****************************************************************

From: Randy Brukardt
Sent: Wednesday, July 19, 2017  9:06 PM

...
> >>> So if the loop is running its chunks non-preemptively, then
> >>> presumably the code would work as I suggested.
> >> That is certainly not something to be relied upon.  And
> even if things run non-preemptively, there might be tasklets that have
> never been started, so certainly those would not start after an
> "exit."
> >
> > I think if we did support some sort of non-preemptive mode,
> probably the most sensible behaviour for that mode is to just let the
> tasklets complete on their own, without any attempt to shut them down
> midstream. Any tasklets that hit the exit, or other transfer of
> control would terminate early, but otherwise tasklets generated at a
> POP would run to completion.  ...
>
> I believe this is creating two significantly different semantics for a
> single syntactic construct, which is bad news in my view.  You are
> heading toward "programming by pragma"
> (or "by aspect"), I fear.  That is largely what OpenMP does, and I
> believe we should avoid it.

This is a real concern, but I have to wonder if we really have any choice.

If our primary goal is (as I hope it is) to make it as easy as possible to
convert sequential code into parallel code, the semantics have to be a close as
possible to the equivalent sequential loop. That means that there can't be any
erroneous execution from aborts, the loop index has be meaningful at a normal
exit, and so on. That's going to be more expensive than the straight
abort-everything-on-any-sort-of-completion (I'm skeptical that it would be much
more expensive in usual cases, but I also know that I know just enough to be
dangerous in this area). So people who need absolute performance may need some
other semantics (including abort, for instance).

I suspect that Brad's non-preemptive semantics make the most sense for the
default semantics for parallel loops (probably not for parallel blocks; the
problems are very different). That means that anyone that wants preemptive
semantics is going to have to use a pragma or aspect. But I don't think that is
a huge problem semantically; the difference only matters if the loop has a bug
already, or depends on the loop index at exit (in which case the preemptive
semantics is plain wrong and we could even make it illegal to use it in that
case as noted in my other message).

Anyway, back to real work...

****************************************************************

From: Brad Moore
Sent: Thursday, July 20, 2017  12:16 AM

> ...
>> > And the default should be that semantics are as close as reasonable
>> > to sequential loops; many cases can use simpler termination but I
>> > think the default should be fewest surprises.
>>
>> Agreed.
>
> (I think this is the only thing that I wrote that Brad agreed with,
> but don't look behind that curtain! :-)

I disagree. :-)
I'd have to check, but my overall sense was that I was agreeing with more of
what you said.

> I had the thought that any loop exits have to be explicitly in the
> loop body (they can't be in some subprogram that is called). Thus, we
> could make the termination semantics depend upon that. (The same
> applies to AI12-0189-1, the loop body as anonymous procedure, BTW).
>
> The basic idea is that one could depend on the loop index value
> determined at the point of an exit; if the loop finishes normally or
> exits abnormally, you couldn't have such a dependence. Thus if there
> is no exit, the termination is a bit simpler (not much, really); if
> there is, we require the loop to get the same loop index as the similar sequential loop.
>
> You could use an exception for a non-local exit, but that is so ugly I
> think it would be OK to ignore it for the purposes of termination semantics.
>
> Thus a loop like the following would work in parallel:
>
>             parallel for I in 1 .. Limit loop
>                if DB(I).Name = Name_to_Find then
>                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
>                   exit; -- Done looking.
>                -- else continue
>                end if;
>             end loop;

Maybe I am not fully understanding this idea, or maybe these semantics are the
ones I had in my mind all the time, and have not considered the alternate
semantics that you might otherwise have assumed?

> P.S. Unrelated aside: I have to wonder about the wisdom of considering
> atomic objects task-safe, since Ada doesn't have any form of atomic update.
> (The only safe atomic writing operation is initialization.) Almost any
> use of an atomic object would have a race condition. I originally
> wrote the above My_PO use as:
>          if Result > I then
>              Result := I;
>          end if;
> with Result an atomic object initialized to Integer'Last (something I
> do sequentially all the time) -- but the above is a race condition and
> definitely is NOT task-safe. The same would be true for any other test
> of an atomic object followed by a write of that object -- but that is
> the natural sequential code.
>
> It seems better (given the current facilities) to insist on a PO here
> (since the natural code in a PO would do the right thing).

Actually, if I were to write this, I was thinking that a PO would be used to
update a global volatile/atomic variable, but otherwise the global variable
could be read/tested directly.

Something like;

    parallel for I in 1 .. Limit loop
       if DB(I).Name = Name_to_Find and then I < Result then
          My_PO.Save_Found_Index (I); -- Save the lowest index found in Result.
          exit; -- Done looking.
       -- else continue
       end if;
    end loop;

The extra check "and then I < Result" would potentially provide an
improvement/short cut on performance, assuming it is cheaper to test an
atomic/volatile integer, than to always have to access the PO.

On a further aside, I recognise that one would ordinarily consider atomic as the
safe choice for such a variable declaration, but I found that declaring the
global variable to be volatile rather than atomic provided significant
performance improvements. I'm not sure if that is just an artefact of the
combination of compilers and target platforms I was using, or whether Ada
semantics somehow demands this, but if the type of the object is such that it
can be read atomically without putting memory fences or guards around the
access, e.g. via a single machine instruction, then I found I'd get consistently
correct results, with improved performance, over declaring the global as atomic.
Ada doesn't distinguish, but I wonder if there are certain data types that, when
declared to be volatile, are also atomic. We already know that variables declared
to be atomic are also volatile in Ada. Or maybe there are certain volatile types
that have some useful properties of atomic, but are somehow otherwise
lighter-weight than full atomic?

****************************************************************

From: Brad Moore
Sent: Thursday, July 20, 2017  12:37 AM

> I suspect that Brad's non-premptive semantics make the most sense for
> the default semantics for parallel loops (probably not for parallel
> blocks; the problems are very different). That means that anyone that
> wants preemptive semantics is going to have to use a pragma or aspect.

Not necessarily a pragma or aspect. See below.

> But I don't think that
> is a huge problem semantically; the difference only matters if the
> loop has a bug already, or depends on the loop index at exit (in which
> case the preemptive semantics is plain wrong and we could even make it
> illegal to use it in that case as noted in my other message).

I think the non-preemptive semantics probably also are less likely to introduce
erroneous or unexpected results more generally. But that is just my overall
sense.

I agree here that the non-preemptive behaviour might be the best
default.

Also, I am not advocating a programming by pragma view. The goal should be that
for most cases (> 90-99%), no pragmas would be necessary.

Note, we currently have other cases in Ada where two very different semantics
can be invoked with respect to preemptive and non-preemptive behaviour, and I
think we should try to be consistent with that already existing syntax if
possible.

In particular, we have;
   requeue
and
   requeue with abort

I am now thinking that we could use a "with abort" on a parallel loop, to get
the preemptive semantics, but otherwise the semantics would be non-preemptive.

eg.

parallel with abort
  for I in 1 .. limit loop
     Long_Calculation (I);
  end loop;

Here, for example, an exception raised in Long_Calculation would be expected to
preempt other parallel iterations involving the same call.

This might also be a good way to tie in the "all" for the "all parallel" loops
that I was describing in recent emails, with the current parallel keyword
placement. a "with all", would imply that any iteration can be expected to
potentially execute in parallel with any other iteration. (ie. no chunking)

parallel with all
   for I in 1 .. limit loop
     Long_Calculation (I);
   end loop;

****************************************************************

From: Randy Brukardt
Sent: Thursday, July 20, 2017  1:32 AM

...
> > Thus a loop like the following would work in parallel:
> >
> >             parallel for I in 1 .. Limit loop
> >                if DB(I).Name = Name_to_Find then
> >                   My_PO.Save_Found_Index (I); -- Save the lowest index found.
> >                   exit; -- Done looking.
> >                -- else continue
> >                end if;
> >             end loop;
>
> Maybe I am not fully understanding this idea, or maybe these semantics
> are the ones I had in my mind all the time, and have not considered
> the alternate semantics that you might otherwise have assumed?

I think you and I agree on this point, but Tucker seems to be using strict
parallel block semantics for this. And that wouldn't work for this loop, since
at the first exit, all of the other tasklets would be aborted (by the parallel
block semantics). So it would be non-deterministic which exit was ultimately
executed (assuming there is more than one).

It's pretty clear to me that "exit" is not like an exception, and should be
treated differently. (Not as certain about gotos, since ones that go clear out of the
loop may or may not be like an exit.) A block doesn't have anything like exit
(although it could be emulated with a goto).

...
> Actually, if I were to write this, I was thinking that a PO would be
> used to update a global volatile/atomic variable, but otherwise the
> global variable could be read/tested directly.
>
> Something like;
>
>     parallel for I in 1 .. Limit loop
>        if DB(I).Name = Name_to_Find and then I < Result then
>           My_PO.Save_Found_Index (I); -- Save the lowest index found in Result.
>           exit; -- Done looking.
>        -- else continue
>        end if;
>     end loop;
>
> The extra check "and then I < Result" would potentially provide an
> improvement/short cut on performance, assuming it is cheaper to test
> an atomic/volatile integer, than to always have to access the PO.

A PO is supposed to be a monitor, and it's rather abstraction-breaking to put the
monitored object outside of it. (Yes, you sometimes have to do that, but I don't
recommend it if you don't have to do so.)

> On a further aside, I recognise that one would ordinarily consider
> atomic as the safe choice for such a variable declaration, but I found
> that declaring the global variable to be volatile rather than atomic
> provided significant performance improvements. I'm not sure if that is
> just an artefact of the combination of compilers and target platforms
> I was using, or Ada semantics somehow demands this, but if the type of
> the object is such that it can be read atomically without putting
> memory fences or guards around the access, eg. such as via a single
> machine instruction, then I found I'd get consistently correct
> results, with improved performance, over declaring the global as
> atomic. Ada doesn't distinguish, but I wonder if there are certain
> data types that when declared to be volatile, are also atomic. We
> already know that variable declared to be atomic are also volatile in
> Ada.
> Or maybe there are certain volatile types that have some useful
> properties of atomic, but are somehow otherwise lighter-weight than
> full atomic?

I think you just got lucky. My understanding is that fences are needed to ensure
memory consistency across cores. So you can only see problems if you have two
tasks running simultaneously on two different cores -- and the OS might try to
avoid this particular situation. I don't think there are any memory reads or
writes that work in that situation without a fence; besides, ALL atomic objects
use a single instruction on most implementations.

Volatile is definitely not enough for any data that needs to be synchronized
between multiple tasks (or tasklets); certainly not from a language perspective.
Only atomic or a PO synchronizes data (you can use one of those to synchronize
volatile data). And I don't think experimental results mean much here, because
some of these situations are pretty rare -- but Murphy would say that they'll
only strike at the worst time.

And Atomic isn't really enough, either, since real problems always require some
sort of update.

****************************************************************

From: Jeffrey Cousins
Sent: Thursday, July 20, 2017  11:44 AM

>I know in Vienna, we had taken a straw poll, which favored putting the
>keyword parallel before the word "for" in the for loop. While this is a
>useful data point, I think if we want to support "all parallel" loops, I
>find it makes a lot more sense to have all parallel together after the word
>"in" or "of".
> eg
>    for I in all parallel 1 .. 10 loop
>
> or
>
>    for Element of all parallel List loop


Much as I sympathise with the logic of Brad’s argument, I think we should stick
with what was agreed at Vienna for the time being, whilst we sort out what
behaviour we want.

Terminating loops early seems to be a particular sticking point.

Only yesterday we had a case where one of our apps has M tasks checksumming N
files, where N >> M.  Currently when the abort button is clicked, no new files
are checksummed, but checksumming the files already in progress is allowed to
complete.  (Non-pre-emptive).  The user would like the current checksumming to
be aborted immediately (pre-emptive; maybe he wants to checksum a different set
of files a.s.a.p.).

But I think it equally likely that there will be a case where finishing the
current processing cleanly is the imperative.

So I think we need to provide both options (somehow).

Brad also wrote:

>Note, we currently have other cases in Ada where two very different
>semantics can be invoked with respect to preemptive and non-preemptive behaviour,
>and I think we should try to be consistent with that already existing syntax
>if possible.
>
>In particular, we have;
>   requeue
>and
>   requeue with abort
>
>I am now thinking that we could use a "with abort" on a parallel loop, to get
>the preemptive semantics, but otherwise the semantics would be non-preemptive.
>
>eg.
>
>parallel with abort
>  for I in 1 .. limit loop
>     Long_Calculation (I);
>  end loop;

No doubt someone will point out holes in the above, but I quite like the above
suggestion.

****************************************************************

From: Erhard Ploedereder
Sent: Thursday, July 20, 2017  4:31 PM

> If our primary goal is (as I hope it is) to make it as easy as
> possible to convert sequential code into parallel code, the semantics
> have to be a close as possible to the equivalent sequential loop.

No. Particularly, if the syntax distinguishes the "new" semantics (or else, why
introduce new language features?)


Canonical example:
> That means that .... the loop index has to be meaningful at a normal exit, and so on.

>             parallel for I in 1 .. Limit loop
>                 if DB(I).Name = Name_to_Find then
>                    My_PO.Save_Found_Index (I); -- Save the lowest index found.
>                    exit; -- Done looking.
>                 -- else continue
>                 end if;
>              end loop;

NO, again, to the comment in the example. Consider an array with duplicates.
Canonical semantics gives you the lowest index indeed. If you want that, do not
write "parallel"!

If you do write "parallel":
A parallel loop surely should be allowed to give you some found index but
whether it is the lowest is completely non-deterministic (I hope).

(Execution-wise, I am open to both models of "last one wins" or "exit terminates
the parallel execution".)

****************************************************************

From: Randy Brukardt
Sent: Thursday, July 20, 2017  6:26 PM

> > If our primary goal is (as I hope it is) to make it as easy as
> > possible to convert sequential code into parallel code, the
> > semantics have to be as close as possible to the equivalent sequential loop.
>
> No. Particularly, if the syntax distinguishes the "new"
> semantics (or else, why introduce new language features?)

It's much easier to convert from one to the other if the semantics doesn't
change much, if at all. Indeed, I'm not sure there is any point to having a
parallel loop construct at all if this isn't true, because the existing Ada
facilities are just fine (as demonstrated by Brad's libraries) for people who
are already comfortable with writing parallel code. It's the majority of
programmers who don't understand concepts like race conditions that need help.

> Canonical example:
> > That means that .... the loop index has to be meaningful at a normal
> > exit, and so on.
>
> >             parallel for I in 1 .. Limit loop
> >                 if DB(I).Name = Name_to_Find then
> >                    My_PO.Save_Found_Index (I); -- Save the lowest index found.
> >                    exit; -- Done looking.
> >                 -- else continue
> >                 end if;
> >              end loop;
>
> NO, again, to the comment in the example. Consider an array with
> duplicates. Canonical semantics gives you the lowest index indeed. If
> you want that, do not write "parallel"!

So you think it is OK to make it impossible to write a significant proportion of
real problems in parallel? You're essentially saying that this search has to run
4 (or 8 or whatever) times slower, or the programmer has to create their own
solution using Ada tasks. That's insane.

> If you do write "parallel":
> A parallel loop surely should be allowed to give you some found index
> but whether it is the lowest is completely non-deterministic (I hope).
>
> (Execution-wise, I am open to both models of "last one wins"
> or "exit terminates the parallel execution".)

I don't think that the cost of the model I'm suggesting is significantly
different than either of these. (I again note that I know just enough about such
things to be dangerous, so there might be some magic solution that I cannot
imagine.) For each of the models, you have between 2 and 4 possible outcomes
when a tasklet finishes a chunk (normal completion, exit, goto, exception; some
of these might be treated the same in some models, but surely the first has to
be treated differently in all models).

So the only difference between a model that is index-aware and one that is not
is that the indexes have to be known when the chunk finishes. They're obviously
known when the chunk starts, so in the worst case, the starting index has to be
saved with the other tasklet information. That's hardly a hardship, and the loop
termination code is essentially the same in any of these models.

Ergo, I don't see much reason to make the semantics non-deterministic.
Non-determinism is evil; sometimes you can't avoid it, but it should never be
employed if there is an option. And in this case, there is an option.

****************************************************************

From: Brad Moore
Sent: Friday, July 21, 2017 12:30 AM

> I suspect that Brad's non-premptive semantics make the most sense for
> the default semantics for parallel loops (probably not for parallel
> blocks; the problems are very different).

A parallel block is effectively just a parallel loop with a small number of
iterations that has been unrolled. One should be able to convert source code
from either construct to the other (at least for simpler loops with a small
number of iterations).
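
For instance (syntax only approximate, and Do_A/Do_B are just placeholder
calls), a two-arm block such as

   parallel
      Do_A;
   and
      Do_B;
   end parallel;

corresponds roughly to the unrolled loop

   parallel for I in 1 .. 2 loop
      case I is
         when 1 => Do_A;
         when 2 => Do_B;
      end case;
   end loop;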

If anything, I see an even greater need for non-preemptive semantics for
parallel blocks.

A parallel loop is typically applying similar processing iteratively. A parallel
block can also be that, (for example calculating Fibonacci recursively), but it
can also be suitable for completely independent processing. In one branch you
might want to exit the processing of that branch early, but (I think more
typically) without affecting the processing in the other branches.

eg.

Note: I am not using the exact syntax that we arrived at in Vienna, only because
I cannot remember exactly what it was, until I see the minutes.

parallel

   exit when Do_A_Step1 = Failure;
   exit when Do_A_Step2 = Failure;
   exit when Do_A_Step3 = Failure;
   exit when Do_A_Step4 = Failure;
   Result := Do_A_Step5;

and
   Do_B;
end parallel

-- Activity B has completed here, regardless of whether Activity A finished
-- successfully.

If one wants the processing for activity A to kill the processing for activity
B, then I think the programmer ought to explicitly ask for those semantics. The
default should be the more deterministic semantics.

Again, the "with abort" clause seems like a consistent and natural way to do
this.

parallel with abort

   exit when Do_A_Step1 = Failure;
   exit when Do_A_Step2 = Failure;
   exit when Do_A_Step3 = Failure;
   exit when Do_A_Step4 = Failure;
   Result := Do_A_Step5;

and
   Do_B;
end parallel

-- A failure in activity A may have caused activity B to terminate early.

****************************************************************

From: Brad Moore
Sent: Thursday, July 20, 2017  12:40 AM

>> I know in Vienna, we had taken a straw poll, which favored putting the
>> keyword parallel before the word "for" in the for loop. While this is a
>> useful data point, I think if we want to support "all parallel" loops, I
>> find it makes a lot more sense to have all parallel together after the word
>> "in" or "of".
>>
>> eg
>>    for I in all parallel 1 .. 10 loop
>>
>> or
>>
>>    for Element of all parallel List loop
>
> Much as I sympathise with the logic of Brad’s argument, I think we
> should stick with what was agreed at Vienna for the time being, whilst
> we sort out what behaviour we want.
> Terminating loops early seems to be a particular sticking point.

I have actually changed my view here. I am now in favour of having the parallel
keyword for loops at the beginning, as we discussed in Vienna. The reason is
that I find that if we do want to specify controls on the parallelism, such as
"with abort", or "with all", or anything else we come up with, applying these to
the keyword parallel is the best way I've seen so far to do this, in my opinion.
If "parallel" is located in the middle of the loop syntax, I don't think this
would work (too ugly and messy), but when parallel is at the beginning, this
provides a "header" that describes the parallelism without interfering with the
loop logic.

****************************************************************

From: Randy Brukardt
Sent: Thursday, July 20, 2017  1:22 PM

> If anything, I see an even greater need for non-preemptive semantics
> for parallel blocks.
>
> A parallel loop is typically applying similar processing iteratively.
> A parallel block can also be that, (for example calculating Fibonacci
> recursively), but it can also be suitable for completely independent
> processing. In one branch you might want to exit the processing of
> that branch early, but (I think more typically) without affecting the
> processing in the other branches.

Well, we don't have "exit" for blocks (it would be very incompatible); I suppose
we could have it only for parallel blocks but that seems like asking for trouble
with the forces of consistency. One could use a goto for that effect, but that
seems less likely than  an exit from a loop.

So, in the block case I was thinking mainly about exceptions; and exceptions
usually (not always) represent bugs. I don't think it makes much sense to
continue buggy code -- indeed I thought that it made sense to use some sort of
"terminate as soon as practical" semantics for exceptions in all cases. It's
only exit (and similar gotos) that it makes sense to treat non-preemptively.

(And I was thinking that "non-preemptive" simply meant that the tasklet ran to
the next task dispatching point, not that it necessarily ran to completion. A
queued and blocked tasklet might never finish otherwise, especially in the
exception case.)

...
> Again, the "with abort" clause seems like a consistent and natural way
> to do this.
>
> parallel with abort
>

Yes, that syntax seems OK. Better than some of your other recent ideas. ;-)

****************************************************************

From: Tucker Taft
Sent: Tuesday, July 25, 2017  4:36 PM

I am quite strongly against having multiple interpretations for "exit" from a
parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
have some convincing to do.  The interpretation I recommend is that an “exit” or
“return” (or “raise”) within a parallel loop causes all other iterations to be
terminated as soon as possible.  For a “return,” there is a “race” to see which
iteration reaches a return first, and that is resolved  using some kind of
synchronization, and then only the “return” that “wins” is allowed to proceed to
evaluate its return expression (presuming it is returning from a function).
Similarly, if an iteration would propagate an exception, then again, a
synchronization is performed against other operations that can terminate the
loop, and the winner gets to determine the net effect, be it a simple exit, an
exception, a return, or (if we permit it) a goto.  Essentially any operation
that would normally terminate a loop goes through a “two-phase” commit, where
the first phase requests the “right” to decide the fate of the loop, and then
the winner determines what happens next.
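
A minimal sketch of the kind of arbitration I have in mind (Loop_Fate, Outcome
and Try_Claim are purely illustrative names, not proposed syntax or library
units):

   type Outcome is (Still_Running, Exited, Returned, Raised);

   protected Loop_Fate is
      -- First phase: the first terminating construct to claim the loop "wins".
      procedure Try_Claim (Wanted : Outcome; Won : out Boolean);
      function Decided return Outcome;
   private
      Fate : Outcome := Still_Running;
   end Loop_Fate;

   protected body Loop_Fate is
      procedure Try_Claim (Wanted : Outcome; Won : out Boolean) is
      begin
         if Fate = Still_Running then
            Fate := Wanted;
            Won  := True;  -- second phase: the winner performs the exit/return/raise
         else
            Won := False;  -- some other terminator already decided the loop's fate
         end if;
      end Try_Claim;

      function Decided return Outcome is
      begin
         return Fate;
      end Decided;
   end Loop_Fate;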

I have not seen any convincing (to me!) argument to allow unstarted iterations
to continue processing, as this seems to defeat one of the whole points of a
parallel “divide-and-conquer” approach.  I can see some arguments to not
terminating iterations that are already executing, if there is danger of them
leaving some object visible outside the iteration in an abnormal state.  In
ParaSail, if a loop includes an operation that could result in an early exit, it
cannot manipulate a variable that is declared outside the loop unless it is
protected.  This doesn’t quite work in Ada since any parallel loop could have an
iteration die due to the failure of a run-time check.  But to some extent that
is already a problem because we recognize that the failure of a run-time check
can always leave certain variables in an abnormal state.  For a parallel loop, a
run-time check failure could leave any variable manipulated by any iteration in
an abnormal state.  But if you look at 11.6(6/3), even a “regular” loop that
dies due to a failed run-time check can have the same effect, since the Ada RM
has always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9 —
aka RM 9(11)) and instruction re-ordering.  So we can effectively ignore failed
run-time checks, and only worry about explicit “raise” as relevant to this
discussion.  So what it effectively means is that the action of propagating a
“raise” out of an iteration is just like a “return” or an “exit” in terms of
having to “compete" to decide which terminating construct gets to determine the
“fate” of the parallel loop.

The bottom line to me is that we should terminate parallel loops as soon as
possible, be it from an exit, return, raise (or goto or requeue for that
matter), using a two-phase approach that first determines which one will
actually decide the fate of the loop as a whole, and then a second step that
actually does the deed.

****************************************************************

From: Tullio Vardanega
Sent: Wednesday, July 26, 2017  7:43 AM

I think I understand the semantics as Tuck proposes and I find it to make sense
in the specific context of parallel computation. To me, wanting exactly the same
source code to operate identically within and without a parallel construct
requires writing it to that end, which means understanding the implications of
the most demanding side. In a sense, this is no different (only more complex)
than it was for concurrency: the code that you embed in concurrent threads must
be amenable to concurrent semantics else problems occur. I suppose that what
matters most here is that the programmer has the means (for syntax and
semantics) to ensure that the externally visible effect of a parallel construct
"stabilizes" before it can be used outside of it. To my reading, Tuck's
"two-phase commit" notion does that quite fine, including the
"as-soon-as-possible" part of it.

****************************************************************

From: Brad Moore
Sent: Thursday, July 27, 2017  12:26 PM

> I am quite strongly against having multiple interpretations for "exit" from a
> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
> have some convincing to do.

Unfortunately, I am not yet convinced. I think if we want to have only one
interpretation for Exit, then that interpretation should be non-preemptive. Let's
see if I can convince you.

First, however, I have some findings to report that should be of interest.

I have added preemptive termination to most of my Paraffin libraries (except the
Ravenscar ones) so that if any exception occurs, it preemptively terminates
other tasklets.

I found that adding this, (using asynchronous transfer of control), did not add
any significant performance hit to any of the libraries with GNAT at least. This
applies even to problems such as adding the sum of a large range of integers,
which I had mentioned previously might be a case where preemptive abort might
introduce significant overhead.

So, for exceptions, I think it makes sense to have preemptive termination,
generally. However, I think exit, return, and goto should be handled
differently.

Here are some arguments

1. I think it is quite a common problem to be searching for the first or last
   occurrence of something in an array or container. For example, consider an
   array or container of events stored in chronological order.

I am thinking of answering questions such as:
   When is the first time that condition X occurred? Or when was the last time
   that condition Y occurred?

I find it quite objectionable if we have to say, sorry, but you cannot do such
searches efficiently in native Ada using parallelism. You should either use
libraries such as Brad Moore's Paraffin libraries, or call out to another
language such as C or C++ and write that code in OpenMP.

If Exit is preemptive, then we cannot continue searching in other tasklets to
find better or correct answers. We effectively have to not use Exit at all, and
search the full range of iterations, which can be a waste of time if we already
know the answer.

We can use exceptions as the mechanism for preemptive abort. If a programmer
wants preemptive termination, they could raise their own exception in the loop,
and write an exception handler at the appropriate level to handle that. We could
even define a language defined exception, such as
Parallelism_Termination_Exception, which could be defined as an exception that
is implicitly handled prior to the completion of the loop, and can be explicitly
raised by the user if preemptive termination is desired.

We still need a mechanism for non-preemptive abort to solve problems such as the
above however, and exit, return, and gotos could provide that mechanism.
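
To make the distinction concrete (a sketch only -- Parallelism_Termination_Exception
is just the name suggested above, and Found/Save_Result are placeholders):

   parallel
    for I in 1 .. Limit loop
       if Found (I) then
          Save_Result (I);
          exit; -- non-preemptive: other tasklets keep looking for a better answer
       end if;
    end loop;

versus

   parallel
    for I in 1 .. Limit loop
       if Found (I) then
          Save_Result (I);
          raise Parallelism_Termination_Exception; -- preemptive: terminate the loop now
       end if;
    end loop;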

2. Having Exit and Return treated effectively as an exception and causing
   preemptive termination is very different from sequential Ada, and likely to
   introduce a lot of unexpected errors into users' programs. For instance, I
   found that adding the preemptive termination to Paraffin introduced at least
   one bug where aborting the code that would ultimately have serviced an entry
   left another task waiting on that entry indefinitely. Another example could
   be where a loop iteration creates
   dependent tasks, and aborting the loop iteration might shortcut interactions
   with those dependent tasks, leaving those tasks in a state where they cannot
   terminate, and where the aborted iteration is left in an abort-deferred
   operation, effectively deadlocking.

   I think when the user's code explicitly uses a select then abort statement,
   the user is fully aware of the asynchronous transfer of control, and can take
   care to ensure that the preemptive behaviour makes sense. I think it would be
   undesirable however to have the compiler implicitly introducing such
   asynchronous transfers of control, where the user is likely in many cases to
   not even be aware of this, which could lead to difficult to reproduce
   failures in deployed code.

3. I think we should also be supporting parallelism in Ravenscar compliant code.
   I have Paraffin libraries that do this, which shows that it can be done.
   However, in Ravenscar, I believe all termination needs to be non-preemptive,
   including exceptions, since Ravenscar has the restrictions:
   No_Abort_Statements, No_Dependence => Ada.Asynchronous_Task_Control,
   No_Select_Statements.

It seems pretty clear that preemptive aborting behaviour is not supported in a
simpler Ravenscar runtime. I think this is OK, as you still get the benefits of
divide and conquer with non-preemptive termination, just maybe not as responsive
for early exits, but that is traded off with better deterministic behaviour
which is generally desired in Ravenscar. The main point here is that if we have
Exit, goto, and return as non-preemptive, then we have consistent, single
semantic interpretations for these constructs for all of Ada. If we make these
preemptive, then we are actually forcing two interpretations if we want these to
work in Ravenscar also.

Another suggestion, if we really do want preemptive exit, return, or goto, then
perhaps we need a different flavour of syntax for those.

I had previously suggested that a "with abort" clause could be added to the
parallel keyword of the loop. Rather than do that, I think it would be better to
attach that clause to the actual Exit, return, or goto statement, as in;

parallel
    for I in 1 .. N loop
       if Arr (I) > X then
          exit with abort;
       end if;
    end loop;

It is clearer to tie the "with abort" closer to where the preemptive termination
is desired.


> The interpretation I recommend is that an “exit” or “return” (or
> “raise”) within a parallel loop causes all other iterations to be
> terminated as soon as possible.  For a “return,” there is a “race” to
> see which iteration reaches a return first, and that is resolved using
> some kind of synchronization, and then only the “return” that “wins” is
> allowed to proceed to evaluate its return expression (presuming it is
> returning from a function).  Similarly, if an iteration would propagate
> an exception, then again, a synchronization is performed against other
> operations that can terminate the loop, and the winner gets to determine
> the net effect, be it a simple exit, an exception, a return, or (if we
> permit it) a goto.  Essentially any operation that would normally
> terminate a loop goes through a “two-phase” commit, where the first
> phase requests the “right” to decide the fate of the loop, and then the
> winner determines what happens next.

A two-phase commit is pretty much the only possibility that I see.
Paraffin has always had such a two-phase commit for handling exceptions,
except that until just recently, the exception handling was non-preemptive.
If multiple exceptions were raised in different tasklets, the
implementation decides which one wins, prior to re-raising that
exception and propagating it out into the user's code. The two-phase
commit idea is orthogonal to this discussion about how Exit statements
are treated I think.

> I have not seen any convincing (to me!) argument to allow unstarted iterations
> to continue processing, as this seems to defeat one of the whole points of a
> parallel "divide-and-conquer" approach.

You do get benefits of divide and conquer regardless of whether termination
is preemptive or not. The preemptive termination can be viewed as an
additional optimization that can be added on top of the divide and
conquer optimization, but only if it does not introduce errors!


> I can see some arguments to not terminating iterations that are
> already executing, if there is danger of them leaving some object
> visible outside the iteration in an abnormal state.  In ParaSail, if a
> loop includes an operation that could result in an early exit, it cannot
> manipulate a variable that is declared outside the loop unless it is
> protected.  This doesn’t quite work in Ada since any parallel loop could
> have an iteration die due to the failure of a run-time check.  But to
> some extent that is already a problem because we recognize that the
> failure of a run-time check can always leave certain variables in an
> abnormal state.  For a parallel loop, a run-time check failure could
> leave any variable manipulated by any iteration in an abnormal state.
> But if you look at 11.6(6/3), even a “regular” loop that dies due to a
> failed run-time check can have the same effect, since the Ada RM has
> always permitted implicit parallelization (e.g. see NOTE 1 in chapter 9
> — aka RM 9(11)) and instruction re-ordering.  So we can effectively
> ignore failed run-time checks, and only worry about explicit “raise” as
> relevant to this discussion.  So what it effectively means is that the
> action of propagating a “raise” out of an iteration is just like a
> “return” or an “exit” in terms of having to “compete" to decide which
> terminating construct gets to determine the “fate” of the parallel loop.

I don't think this last sentence follows from the previous, particularly
in the case of Exit. In Paraffin, an Exit statement is simply treated as
an Exit statement in the user's code. It causes that tasklet to exit,
without affecting any others, with no more special treatment than other
sequential statements in Ada. There is no need to decide which exit wins
prior to leaving the loop, as there is no result or exception being
propagated out of the loop. The implementation requires no special
treatment of this.
Exceptions do require special treatment, since a specific exception
should be propagated out of the loop.

For a return statement, there is a stronger case for preemptive
behaviour, but I think considering the other arguments, it should be
left non-preemptive to side with Exit statements, rather than side with
the preemptive nature of exceptions.

Gotos should probably not be too relevant to this discussion, if we even
support them at all. I suspect this feature would rarely be used or
useful in a parallel loop, but they are similar to return statements.

>
> The bottom line to me is that we should terminate parallel loops as soon as
> possible, be it from an exit, return, raise (or goto or requeue for that
> matter), using a two-phase approach that first determines which one will
> actually decide the fate of the loop as a whole, and then a second step that
> actually does the deed.

The bottom line to me is that the user should be able to write programs
that execute as fast as possible. Sometimes terminating loops as soon as
possible comes at a greater expense than being able to use a more
optimal algorithm. The user should be able to choose either option.

Below I have execution times showing a case where the non-preemptive
algorithm beats the preemptive one.

A two-phase approach is recommended for sure, but Exits do not need to
be involved in the second stage of the commit. They do not contribute to
any result associated with the loop. Return results do need to be
considered in the second phase, but I would think that any exception
raised should trump any return result.

Here are some more results comparing performance using preemptive vs
non-preemptive termination.

I have a simple program that scans an array of 600_000_000 integers
looking for the value at the highest index of the array that satisfies
some property. In this case, the element at Arr'Last is the one that
satisfies the query.

Using a reverse parallel loop with non-preemptive exit, the elapsed time
for the loop is 0.13 seconds.

The parallel loop looks like;

      Maximum_Index : Index := Index'First with Volatile;

      parallel
       for I in reverse Start .. Finish loop
          if I < Maximum_Index then
             exit;
          elsif Satisfies (Arr (I)) and then I > Maximum_Index then
             Maximum_Index := I;
          end if;
       end loop;

If I take out the reverse keyword, I still get the correct result, but
the elapsed time becomes 1.4 seconds, since with forward iteration the
highest chunk has to be iterated through in full, as each iteration
improves upon the previous value of Maximum_Index.

If we assume that Exit must be preemptive, then we cannot use Exit,
because it gives us an incorrect result, so I have to rewrite the loop
as the following, if we assume that reverse parallel loops are still allowed:

     parallel
       for I in reverse Start .. Finish loop
          if Satisfies (Arr (I)) and then I > Maximum_Index then
             Maximum_Index := I;
          end if;
       end loop;

This produces an elapsed time of 1.3 seconds.

If we say that reverse parallel loops are not allowed, then I have to
remove the reverse keyword, and the elapsed time becomes 3.8 seconds.

So in this particular example, it appears that the decision to not
support non-preemptive Exit results in a program that runs 9.8 times
slower on my computer than the version that does allow non-preemptive
Exit, for reverse parallel loops.  For the non-reverse loops, the
preemptive version runs 2.7 times slower.

The preemptive forward loop runs 29.2 times slower than the
non-preemptive reverse parallel loop.

****************************************************************

From: Tucker Taft
Sent: Thursday, July 27, 2017  4:29 PM

>> I am quite strongly against having multiple interpretations for "exit" from a
>> parallel loop (e.g. "preemptive" and "non-preemptive"), but I can see that I
>> have some convincing to do.
>
> Unfortunately, I am not yet convinced. I think if we want to have only
> one interpretation for Exit, then that interpretation should be
> non-preemptive. Lets see if I can convince you.
> ...
>
> Here are some arguments
>
> 1. I think it is quite a common problem to be searching for the first
> or last occurrence of something in an array or container. For example,
> consider an array or container of events stored in chronological order.
>
> I should think answering question such as;  When is the first time
> that conditions X occurred?, or when was the last time that conditions
> Y occurred?
>
> I find it quite objectionable if we have to say, sorry, but you cannot
> do such searches efficiently in native Ada using parallelism. You
> should either use libraries such as Brad Moore's Paraffin libraries,
> or call out to another language such as C or C++ and write that code
> in OpenMP.

There are many ways to search efficiently, including in parallel. If you want
the lowest-numbered element that satisfies a particular criteria, then that
should be explicit in the algorithm, rather than something that is hidden in an
assumption about how non-preemptive chunked parallel iteration works.  It seems
unwise to make requirements about the order in which things are done, e.g.
whether lower-numbered or higher-numbered iterations are processed first,
whether "exit" ends one chunk but not all chunks, whether exit stops new chunks
from starting, etc.  This could be seen as defeating the purpose of parallel
iteration, and forcing the semantics to be restricted in what seem somewhat
unnatural ways just for this particular case.

> If Exit is preemptive, then we cannot continue searching in other
> tasklets to find better or correct answers. We effectively have to not
> use Exit at all, and search the full range of iterations, which can be
> a waste of time if we already know the answer. ...

In my view, "exit" means the entire loop is done, and trying to imply that under
certain circumstances it means "well not really if there is a better answer yet
to be found" seems like it would require a compiler that can guess your
intentions.  I suspect what would be most efficient for such a problem is a more
explicit divide and conquer, not something where you are implicitly relying on
chunking to accomplish what you want.

For example, if you want the item with the lowest index which satisfies some
predicate, you would want all of the cores looking at the first few items in the
array, and then if nothing is found there, you would have them all start looking
at the next few items in the array.  It would be less efficient to break the
array into big chunks, and have half of the cores looking at the last half of
the array.

Here is an example algorithm that would devote all of the cores to search the
beginning of the array first.  This is just one example, but I suspect it would
be faster than one that made various assumptions about chunking and assumed that
"exit" didn't really mean “exit” when there were better answers waiting.  My
point is that having multiple meanings for "exit" adds a lot of complexity
without providing sufficient general benefit.

   Assume we have "N" cores, and an array A where we are trying to find the
   lowest-indexed element of A that satisfies the predicate P.

function Find_First (A : Elem_Arr;
                     P : access function (E : Elem_Type) return Boolean)
   return Integer
is
   N : constant Positive := Num_CPUs;
   Lowest_Index : array (1 .. N) of Integer := (others => A'Last + 1);
   Found : Integer := A'Last + 1 with Atomic;
       --  Atomic flag used to stop other threads early
begin
   parallel for I in 0 .. N-1 loop
      -- Find A(J) with the lowest-valued J satisfying P(A(J)) by looking
      -- at A(A'First+I), A(A'First + I + N), A(A'First + I + 2*N), ...
      declare
         J : Integer := A'First + I;
      begin
         while J <= A'Last and then J < Found loop
             if P(A(J)) then
                 -- This is the lowest-numbered element
                 -- satisfying P with index of form A'First + I + K*N
                 Lowest_Index(I + 1) := J;
                 Found := J;  --  Stop others as soon as possible.
                    -- Note that there is a benign race condition here, since
                    -- we are merely trying to get other threads to stop if further
                    -- searching would be a waste of time.  "Found" is not presumed
                    -- to be the overall lowest value.
                 exit;
             end if;
             J := J + N;
          end loop;
      end;
   end loop;

   -- Use proposed "map/reduce" construct to find the overall minimum index
   return (for I in 1 .. N => Integer'Min(<A'Last + 1>, Lowest_Index(I)));
end Find_First;

****************************************************************

From: Brad Moore
Sent: Tuesday, August 1, 2017  2:08 AM

> There are many ways to search efficiently, including in parallel. If you want
> the lowest-numbered element that satisfies a particular criteria, then that
> should be explicit in the algorithm, rather than something that is hidden in
> an assumption about how non-preemptive chunked parallel iteration works.  It
> seems unwise to make requirements about the order in which things are done,
> e.g. whether lower-numbered or higher-numbered iterations are processed first,
> whether “exit” ends one chunk but not all chunks, whether exit stops new chunks
> from starting, etc.  This could be seen as defeating the purpose of parallel
> iteration, and forcing the semantics to be restricted in what seem somewhat
> unnatural ways just for this particular case.

I see your point, but I think treating an exit like an exception is also
asking for trouble. Your arguments have moved me somewhat though. I now feel
that if we cannot have both preemptive and non-preemptive exits, returns, and
gotos, then we shouldn't allow them to transfer control out of a
parallel loop at all. I think they introduce too much in the way of confusion,
unexpected behaviour, and semantic problems.

If one really wants to preemptively terminate a parallel loop, they can always
raise an exception and handle it at the appropriate level. Otherwise I think
gotos, returns, and exits should only be allowed to exit loops that do not have
the parallel keyword. This also simplifies the parallelism proposal, I should
think.

> For example, if you want the item with the lowest index which satisfies some
> predicate, you would want all of the cores looking at the first few items in
> the array, and then if nothing is found there, you would have them all start
> looking at the next few items in the array.  It would be less efficient to
> break the array into big chunks, and have half of the cores looking at the
> last half of the array.

That can be tricky to program, though, without introducing too much overhead
from the cores having to interact with each other to get the next search
region, and it depends on the probability of finding things in the bottom half
of the array. If what you are looking for is rare enough, the chances increase
that it will only be found in the upper half of the array. But yes, it could be
an alternate strategy that might work well enough in certain situations.
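
As a rough illustration of that interaction cost, here is a minimal sketch in
current Ada (the array contents, Block_Size, and the Satisfies predicate are
just illustrative assumptions) of worker tasks pulling successive low-to-high
search regions from a shared protected object; each region costs one protected
call, which is the overhead being traded against searching too far:

   with Ada.Text_IO;

   procedure Find_Lowest_Demo is

      Arr : constant array (1 .. 100_000) of Integer :=
        (50_000 => 7, others => 0);

      function Satisfies (X : Integer) return Boolean is (X > 5);

      Block_Size : constant := 1_000;          -- size of each handed-out region
      Not_Found  : constant Integer := Arr'Last + 1;

      --  Hands out regions in increasing index order, and remembers the
      --  lowest matching index reported so far.
      protected Control is
         procedure Next (First, Last : out Integer);
         procedure Report (Match : Integer);
         function Best return Integer;
      private
         Cursor : Integer := Arr'First;
         Lowest : Integer := Not_Found;
      end Control;

      protected body Control is
         procedure Next (First, Last : out Integer) is
         begin
            First  := Cursor;
            Last   := Integer'Min (Arr'Last, Cursor + Block_Size - 1);
            Cursor := Last + 1;
         end Next;

         procedure Report (Match : Integer) is
         begin
            Lowest := Integer'Min (Lowest, Match);
         end Report;

         function Best return Integer is
         begin
            return Lowest;
         end Best;
      end Control;

      task type Searcher;

      task body Searcher is
         First, Last : Integer;
      begin
         loop
            Control.Next (First, Last);   -- one protected call per region
            exit when First > Arr'Last or else First > Control.Best;
            for I in First .. Last loop
               if Satisfies (Arr (I)) then
                  Control.Report (I);     -- regions beyond this index are given up
                  exit;
               end if;
            end loop;
         end loop;
      end Searcher;

   begin
      declare
         Workers : array (1 .. 4) of Searcher;
      begin
         null;   -- the block waits here for all Workers to terminate
      end;
      Ada.Text_IO.Put_Line ("Lowest match at index" &
                            Integer'Image (Control.Best));
   end Find_Lowest_Demo;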

> Here is an example algorithm that would devote all of the cores to search the
> beginning of the array first.  This is just one example, but I suspect it
> would be faster than one that made various assumptions about chunking and
> assumed that "exit" didn't really mean "exit" when there were better answers
> waiting.  My point is that having multiple meanings for "exit" adds a lot of
> complexity without providing sufficient general benefit.

This approach can be summarized as manual chunking, where the programmer writes
an outer parallel loop that executes a number of chunks in any order, and an
inner sequential loop over the chunk boundaries, in which you can then use
sequential semantics, including reverse loops, with non-preemptive exits.

I think, given the syntax we are proposing, manual chunking is likely the only
solution that works for this problem if we do not have non-preemptive exits and
want to write all the code inline. A point worth mentioning is that manual
chunking can perhaps eliminate the need for reverse parallel loops, since one
can use the inner sequential loop to provide the reverse searching.

I think I have a better solution for this problem though, because there are a
number of criticisms I see here.

For starters, I dislike that the programmer is having to make decisions about
how many chunks are needed, and the values of the chunk boundaries.

The reasons are;

   1) Choosing the number of cores as the number of chunks likely gives too
      coarse a chunk size. If one core finishes its chunk early, there are no
      opportunities to steal or seek work from the other cores. It is better to
      choose a smaller chunk size to provide better load balancing, but the
      programmer's choice is unlikely to be as good as what a compiler or
      library writer might choose.

   2) Manual chunking in this manner is a form of static parallelism, where the
      decisions on chunk boundaries are made before loop execution begins. One
      ideally shouldn't have to commit to static chunking decisions,
      as dynamic chunking algorithms are possible that might be beneficial in
      cases where the amount of time to evaluate a single iteration can vary
      wildly between iterations.

   3) Having to write the code to convert back and forth from chunk index to
      actual index is messy and error prone. It is more difficult to both read
      and write.


I think this particular problem is best solved using a library approach, in
combination with the anonymous loop body AI.

A library approach can provide the outer parallel loop that corresponds to the
outer loop of the manual chunking solution, without the programmer having to
deal with decisions about chunking, or converting from chunk numbers to real
index values. A library also can provide more sophisticated parallelism
strategies that might be too difficult for a programmer to write from scratch.

If we take a library approach to find the maximum index that satisfies some
condition, using anonymous loop bodies from that AI we could write:

       declare
         package Search is new
           Parallel.Loops.Static_Reduce (Loop_Index  => Index,
                                         Result_Type => Index,
                                         Reducer     => Index'Max,
                                         Identity    => Index'First,
                                         Result      => Maximum_Index);
        begin
         for (Start, Finish, Maximum_Index) of Search.Parallel_Loop loop
            for I in reverse Start .. Finish loop
               if I < Maximum_Index then
                  exit;
               elsif Satisfies (Arr (I)) and then I > Maximum_Index then
                  Maximum_Index := I;
               end if;
            end loop;
         end loop;
       end;

Note that the Maximum_Index of the Parallel_Loop anonymous procedure is an
in out parameter. The value coming in is the best value so far for that executor.
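
For reference, here is one possible shape for such a library: a purely
hypothetical generic specification (this is only a sketch, not the actual
interface of Paraffin or of any existing package, and it assumes a parent
package Parallel.Loops already exists) that would support the usage above:

   generic
      type Loop_Index is range <>;
      type Result_Type is private;
      with function Reducer (Left, Right : Result_Type) return Result_Type;
      Identity : Result_Type;
      Result   : in out Result_Type;  -- receives the final reduced value
   package Parallel.Loops.Static_Reduce is

      --  Process is called once per chunk; its Partial parameter starts as
      --  Identity for each executor and carries that executor's best value
      --  so far.  The library combines the per-executor partials with
      --  Reducer and stores the final value in Result.
      procedure Parallel_Loop
        (Process : not null access procedure
                     (Start, Finish : Loop_Index;
                      Partial       : in out Result_Type);
         From    : Loop_Index := Loop_Index'First;
         To      : Loop_Index := Loop_Index'Last);

   end Parallel.Loops.Static_Reduce;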

Otherwise, to do this with explicit manual chunking, I would write:

         declare
            Number_Of_Chunks : constant Positive := 50;
            Iterations : constant Positive
              := Positive (Arr'Last - Arr'First + 1);
            Chunk_Size : constant Positive :=
              Iterations / Number_Of_Chunks;
            Results : array (1 .. Number_Of_Chunks) of Index :=
               (others => Index'First);
         begin
            parallel
              for I in 1 .. Number_Of_Chunks loop
                 for J in reverse Arr'First +
                    (Index (I) - 1) * Index (Chunk_Size)
                 .. (if I = Number_Of_Chunks then Arr'Last
                     else Arr'First + Index (I * Chunk_Size) - 1) loop

                    if Satisfies (Arr (J)) then
                       Results (I) := J;
                       exit;
                    end if;
                 end loop;
              end loop;

           Maximum_Index :=
              (for Element of Results => Index'Max (<Index'First>, Element));

         end;

I can tell you that writing the second version took quite a bit more time and
mental effort than the first. I think the first one is also a lot more readable,
because it provides a higher level of abstraction. Its code focuses only on the
actual algorithm, whereas the second version mixes the algorithm with a
simplistic chunking algorithm.

Note: The exit in the last example is non-preemptive, since the inner loop is
not a parallel loop. Looking at that exit statement, one might worry for a
moment whether it is a preemptive exit or a non-preemptive exit. If we
disallow exits from transferring control out of a parallel loop, then suddenly
that exit feels "safer" to me, since no exit would have preemptive abortive
behaviour to worry about.

Otherwise, I can imagine one might want to scan all the source code in a system
looking for preemptive exits when doing code reviews, and feeling the need to
provide extra analysis to show that any found are safe.

****************************************************************

From: Tucker Taft
Sent: Tuesday, August 1, 2017  12:45 PM

I think you have confused the issue of "exit" meaning "exit" with the issue of
whether "abort" semantics or "as soon as possible" are used to terminate the
other iterations.

I can understand the concern about using "abort" semantics, but after an exit
by one iteration, when any other iteration hits a "natural" stopping point, or
hasn't started at all, it should terminate.

So let's separate these two issues, if possible.  What I mostly object to is the
notion of giving meaning to multiple exits from the same parallel loop.  I don’t
mind waiting a bit for iterations that are already running, for the loop to
actually terminate.  But trying to define what it means to have multiple exits
is a mistake in my view.  I believe the two-phase approach is implementable, and
I believe should be used for any kind of "early" termination of a parallel loop,
be it exception propagation, exit, return, goto, or requeue.  That is, if an
iteration wants to terminate the loop, it "requests" a termination, and then if
it “wins” the race, it determines the fate.  How quickly the loop actually
finishes may vary, but it should be as soon as possible, and certainly no
additional iterations should be started.

So no need to bring up the "boogie" man of "abort."  I could imagine some aspect
that might affect how quickly termination happens, but it shouldn’t affect the
basic semantics of the loop, which are “terminate as soon as possible.”

****************************************************************

From: Brad Moore
Sent: Wednesday, August 2, 2017  9:03 AM

> So let's separate these two issues, if possible.

OK, I can buy this. But we really now have three potential ways to exit a loop,
which I would call hard preemption, soft preemption, and non-preemption. Hard
preemption is the abortive exit, likely involving asynchronous transfer of
control. Soft preemption is a polled exit that only occurs on iteration
boundaries (grain or chunk boundaries, most likely). Non-preemption wouldn't be
directly supported in syntax, but if one does manual chunking, either inline or
via a library call, an exit exits a sequential inner loop and is therefore
non-preemptive. The searching example of the previous email represents a loop
that must involve manual chunking, resulting in a non-preemptive exit.

I think hard preemption should probably be reserved for exceptions, which would
allow breaking out of a loop like the following on a quad-core machine; with
only soft preemption, breaking out would likely not result in any savings in
execution time:

     parallel
        for I in 1 .. 4 loop
            if Flip_Coin then
               raise Get_Me_Out_Of_Here;
            else
               Do_12_Hour_Calculation;
            end if;
        end loop;
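
For what it's worth, this kind of hard preemption is already expressible in
current Ada using asynchronous transfer of control. Here is a minimal sketch
(the coin flip and the stand-in for the 12-hour calculation are illustrative
assumptions), where any worker that decides to bail out causes the long
calculations in the other workers to be abandoned:

   with Ada.Numerics.Float_Random;

   procedure Hard_Preemption_Demo is

      procedure Do_Long_Calculation is
      begin
         delay 10.0;   -- stand-in for the 12-hour calculation
      end Do_Long_Calculation;

      --  Becomes open once any worker decides the computation should stop.
      protected Stop is
         procedure Signal;
         entry Wait;
      private
         Signalled : Boolean := False;
      end Stop;

      protected body Stop is
         procedure Signal is
         begin
            Signalled := True;
         end Signal;

         entry Wait when Signalled is
         begin
            null;
         end Wait;
      end Stop;

      task type Worker;

      task body Worker is
         Gen : Ada.Numerics.Float_Random.Generator;   -- task-local generator
         function Flip_Coin return Boolean is
           (Ada.Numerics.Float_Random.Random (Gen) < 0.5);
      begin
         Ada.Numerics.Float_Random.Reset (Gen);
         if Flip_Coin then
            Stop.Signal;               -- plays the role of raise Get_Me_Out_Of_Here
         else
            select
               Stop.Wait;              -- triggering alternative
               --  Abandoned: some worker signalled Stop
            then abort
               Do_Long_Calculation;    -- aborted mid-flight when Stop opens
            end select;
         end if;
      end Worker;

      Workers : array (1 .. 4) of Worker;

   begin
      null;   -- the procedure waits here for all Workers to terminate
   end Hard_Preemption_Demo;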

All parallel loops involving chunking are actually nested loops at the
implementation level, where the outer loop is the parallel loop provided
by the implementation, and the inner sequential loop is a callout to user code.
So in this sense, an exit from the inner loop should be treated as a
non-preemptive exit. But we want to create the illusion that there is only a
single loop at the syntax level, so an exit would involve soft preemption, to
more closely mirror the sequential behaviour in which no further iterations
occur once an exit statement is executed. Parallel execution is not exactly
like that, as execution likely continues after an exit, and further iterations
can be started, particularly if a tasklet involves multiple iterations and soft
preemption polling occurs at tasklet boundaries. But we want to at least try to
approximate the sequential behaviour as best we safely can.
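
To make that implementation view concrete, here is a rough sketch in current
Ada (hypothetical names and a fixed two-worker split; a real implementation
would be compiler-generated) of what a chunked parallel loop with a
soft-preemptive exit might expand to, with the user-level exit becoming the
setting of an atomic flag that is polled at chunk boundaries:

   with Ada.Text_IO;

   procedure Chunked_Exit_Demo is

      Arr : constant array (1 .. 1_000) of Integer := (500 => 42, others => 0);

      Num_Chunks : constant := 10;
      Chunk_Size : constant := 100;           -- Arr'Length / Num_Chunks

      Stop  : Boolean := False with Atomic;   -- the user-level "exit" sets this
      Found : Integer := 0 with Atomic;

      task type Chunk_Worker (First_Chunk, Last_Chunk : Natural);

      task body Chunk_Worker is
      begin
         Outer : for C in First_Chunk .. Last_Chunk loop
            exit Outer when Stop;             -- soft-preemption poll at a chunk boundary
            for I in Arr'First + (C - 1) * Chunk_Size
                     .. Arr'First + C * Chunk_Size - 1 loop
               if Arr (I) = 42 then
                  Found := I;
                  Stop  := True;              -- what the source-level "exit" becomes
                  exit Outer;
               end if;
            end loop;
         end loop Outer;
      end Chunk_Worker;

   begin
      declare
         --  Two "tasklets", each given half of the chunks
         W1 : Chunk_Worker (1, Num_Chunks / 2);
         W2 : Chunk_Worker (Num_Chunks / 2 + 1, Num_Chunks);
      begin
         null;                                -- wait for both workers to finish
      end;
      Ada.Text_IO.Put_Line ("Found at index" & Integer'Image (Found));
   end Chunked_Exit_Demo;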

I actually have these three exit strategies implemented in Paraffin now. I had
forgotten that I had implemented soft preemption some time ago, where a special
exception can be raised from a user callback and is handled by the library which
sets a flag that is polled at grain/tasklet boundaries to effect early
termination.

Any other exception raised in a Paraffin library call results in a hard
preemptive abort involving asynchronous transfer of control, which is a feature
I just added.

Any exit statement in a user provided callback is treated as a regular exit
statement in Ada, which is just a non-preemptive exit, which would also work
well for manual chunking examples involving the anonymous loop body procedure
AI.

So I am happy now with this treatment of exit, return, and gotos if we say it
has the soft-preemption model.

I do think we probably want to distinguish between exceptions and the other
forms of transfer of control out of the loop (exit, gotos, returns, requeues),
and consider treating exceptions as the hard-abortive case, particularly for
exceptions such as Constraint_Error at least, but probably all exceptions for
consistency's sake.

****************************************************************

From: Jeff Cousins
Sent: Thursday, August 3, 2017  11:16 AM

It's good to see this progressing even if it's taking a lot of to and fro.
Tucker's ideas sound desirable, but are the others from the implementors' side
happy that the "some kind of synchronization" won't be an undue implementation
burden?

****************************************************************

From: Randy Brukardt
Sent: Thursday, August 3, 2017  8:54 PM

> It's good to see this progressing even if it's taking a lot of to and fro.
> Tucker's ideas sound desirable, but are the others from the
> implementors' side happy that the "some kind of synchronization" won't
> be an undue implementation burden?

To answer your question directly, I don't think there is any alternative. What
Tucker describes is a bit more complex than the model I had in mind, but they're
pretty similar. That is, there is a loop "manager" (for the lack of a better
term) that determines the chunks, creates (or more likely, requests existing)
tasklets for execution, gives the tasklets chunks, and then waits for the
tasklets to finish their chunk (or complete some other way, such as exit or
propagate an exception). That waiting implies some sort of synchronization, but
it seems to me to be the cheapest kind available. The manager would generally be
asleep waiting for tasklets, and would only wake up when one completes - it
would then either assign another chunk (if there aren't any, just sleeping
waiting for the others), or if the loop needs to complete (exit or exception),
wait for the other tasklets to reach a termination point (or abort them - in
Janus/Ada, these are effectively the same but probably not in most
implementations).

I definitely would allow iteration termination at any task dispatching point
(especially for tasklets that are blocked), lest a blocked tasklet prevent loop
termination indefinitely.

With this sort of structure, I don't really see any issue with having exit work
deterministically. The only requirement is to ensure that any iterations to the
left of the iteration that exits (that is, with lower indexes) run to
completion. There certainly isn't any requirement to start new tasklets after an
exit; you just have to assign chunks to tasklets in left-to-right order (and
it would be a lot of extra work to do anything else). Thus, if an exit occurs,
any chunk never assigned to a tasklet can be immediately discarded (it
necessarily comes after the exit). Only tasklets logically before the exit would
need to continue.

Exceptions are different, of course: they terminate everything as soon as
possible, as the loop is not going to provide an answer. Those are rather
different use-cases.

----

I want to step back a minute and think a bit about the uses for these constructs
and our actual goals for them. I'm getting concerned that there can be no real
uses of the parallel loop construct.

We've already eliminated reduction loops from using the parallel loop (they have
their own construct now). That's a substantial number of the loops that I've
written that would be sensibly parallel.

Almost all of the other loops that I've written or maintained are searches of
some sort. Depending upon the rules adopted, it's quite likely that many of
those could not use the parallel loop construct. First of all, most of the
search loops I write are intentionally not going to execute long, and many are
walking through linked lists; probably the parallel overhead would be far more
than any possible savings from parallel execution. (The cost to walk the list to
find chunks would be more than the comparison costs.) More of the loops wouldn't
work if a non-deterministic solution for exit is adopted; you wouldn't be able
to find the *first* match or the *next* match as opposed to some random match.
Coding this using a non-deterministic loop is very expensive - you want to avoid
any further iterations above any match but continue to run the ones below. The
only way I could imagine doing it would require pre-checking on every iteration
if something is found below: but that would require synchronization, so a lock
or fence would be needed -- most likely costing more than the iteration itself.
It probably wouldn't be worth doing, meaning that one would have to search the
entire space regardless of where or when any matches are found. And, depending
on the likelihood of matches in the first third of the data, that probably means
a parallel solution would perform worse than a sequential one.

There must be some searches that would work with non-deterministic exit, but I
can't think of any at the moment.

So, if a loop is neither a reduction nor a search (that is, no early exits), what
else could it do? Something like a matrix multiply comes to mind, but that is
very likely to violate the static safety check on the loop contents. (The
compiler probably can't prove that the array elements being written are
independent for each iteration, and certainly any fully static check based on
Global alone cannot do that - you need information about how the array is
indexed.) So, what can such a loop do? Is there enough to justify the extensive
work on this feature (as opposed to parallel blocks and parallel reductions)??

----

Going back to first principles. The most important thing about these constructs
is that they tell the compiler that the user doesn't care about the details of
which iteration runs first, or last, or in between. Because of exceptions (and
finalization), an Ada compiler cannot parallelize any code itself (unless of
course it can prove the absence of such things, but that's hard and unusual).
Most of the time, the programmer doesn't really care about the state of the loop
when an exception is raised; the loop has a bug and we're not going to use the
results of the loop at all. OTOH, a "normal" exit (whether via an exit
statement, return statement, or a goto) usually just means that we have the
needed results and don't need to continue.

We're also looking to make checks to ensure that these constructs are "safe",
that is free of race conditions and conflicts. The vast majority of programmers
are not going to be able to successfully avoid these things (even if they know
to look for them) -- I know I can't. Ideally, I want to have a programming
construct that gives the benefit of parallel execution without any of the pain.

For all of these reasons, I much prefer a model where exceptions and exits are
treated differently, and exits give the same answer as a sequential loop. In
such a model, almost all searching loops can be written in parallel. In the
"first exit wins" scenario, only a few searching loops could be written in
parallel. Since most other loops are reductions or would be prohibited by the
parallel safety checks, that model seems necessary for parallel loops to be
useful at all.

****************************************************************

From: Tucker Taft
Sent: Monday, August 14, 2017  8:46 AM

> ...
> With this sort of structure, I don't really see any issue with having
> exit work deterministically. The only requirement are to ensure that
> any iterations on the left of the iteration of the exit (that is, with lower
> indexes) run to completion. There certainly isn't any requirement to
> start new tasklets after an exit; you just have to assign chunks to
> tasklets in a left-to-right order (and it would be a lot of extra work
> to do anything else). Thus, if an exit occurs, any chunk not ever
> assigned to a tasklet can be immediately discarded (it necessarily
> comes after the exit). Only tasklets logically before the exit would
> need to continue. ...

I am not convinced we should imply there is any ordering of iterations in a
parallel loop. This just adds complexity to the implementation with little
benefit in my mind.  If you have a “many-core” machine, then it is quite
possible that all of the iterations (or at least all of the chunks) get going
simultaneously, so what value is there to promising ordering in some particular
cases?  As I suggested in an earlier e-mail, if you really want the
lowest-indexed element that satisfies your predicate, then you wouldn’t want to
use “normal” chunking, but rather you would want to bias things to have multiple
cores looking at the early part of the range first.  In any case, are there
really that many cases where you want the lowest-indexed item?  More likely in
my view is that you would either want any item, or you would want all items that
satisfy the predicate.  If you want all items, then of course “exit” would not
be used, and you would build up a list of all desired items.  If you want any
item, then the first you find should "win".

****************************************************************

From: Randy Brukardt
Sent: Monday, August 14, 2017  9:23 PM

At the risk of continuing to beat this same horse...

> ... If you
> have a "many-core" machine, then it is quite possible that all of the
> iterations (or at least all of the chunks) get going simultaneously,
> so what value is there to promising ordering in some particular cases?

The benefit is obvious: loop algorithms work as they are written. No mental
gymnastics needed.

> ...  As I suggested in an
> earlier e-mail, if you really want the lowest-indexed element that
> satisfies your predicate, then you wouldn't want to use "normal"
> chunking, but rather you would want to bias things to have multiple
> cores looking at the early part of the range first.

That was a terrible idea:
(1) It inverts the looping algorithm, obscuring the actual operative code;
(2) It forces the use of complex code to get parallel operation, defeating
    the purpose of making getting better performance easy;
(3) It assumes a particular chunking mechanism, which makes the code much less
    portable (at least if performance matters).

To elaborate on (3): A good compiler will tailor the chunking to the iteration
being executed (specifically whether it is balanced so the iterations probably
take the same amount of time, or if they are obviously unbalanced, and whether
or not there are any explicit exits), the hardware (the number of cores and how
they are mapped to Ada tasks), and possibly do that at runtime (especially as
the tasks that are running aren't known until the moment the loop is launched).
An algorithm using what Brad called "manual chunking" is going to end up
fighting that compiler-generated chunking, and that is going to scrub off
performance. You'd be better off using a library like Paraffin to implement such
a loop -- but in that case, there is no longer enough value to the syntactic
parallel loop to bother with defining it.

But an algorithm like the one you propose makes a lot of sense as the
implementation generated by the *compiler* when the code contains explicit
exits. It can do that with all of the knowledge needed and adding no portability
issues; there's surely no requirement that a chunk run consecutive iterations!

> ... In any case, are there really that many cases where you want the
> lowest-indexed item?

Yes, this is pretty much the only thing I search for. Usually, I need to be able
to restart the search if the found item eventually turns out to not work for
some complex reason. That would be difficult if some random result was returned.

> More likely in my view is
> that you would either want any item, or you would want all items that
> satisfy the predicate.  If you want all items, then of course "exit"
> would not be used, and you would build up a list of all desired items.

Which you can't do with a parallel loop; you have to use a reduction expression
for that.

> If you want any item, then the first you find should "win".

The only case where it makes sense to me to want "any item" is when you are
looking for the absence of something (in such a case there shouldn't be any
match). That's pretty rare in my view.

I still think that "exit" should produce a deterministic answer; there's no
problem if you don't need a deterministic answer -- don't use "exit" in that
case. The goal here is to make it as easy as possible to write a parallel loop.
If you have to know lots of unusual patterns in order to get anything to work,
then only experts will be able to write parallel code. But that's already true;
it's not the least bit hard for an Ada real-time expert to write parallel
algorithms (as Brad has proved). The whole point of these features is to bring
these programming capabilities to the rest of us. If we fail to do that, then
these features become essentially useless (and a massive amount of work at
that).

****************************************************************

From: Brad Moore
Sent: Tuesday, August 15, 2017  10:12 AM

> I want to step back a minute and think a bit about the uses for these
> constructs and our actual goals for them. I'm getting concerned that
> there can be no real uses of the parallel loop construct.

I have been thinking about this also, and I am thinking we probably should
consider dropping parallel loop syntax from the proposal.

The discussion about find first, find last and preemptive exits has shown
that the anonymous loop body AI coupled with a parallelism library call is a
better solution to these problems than the parallel loop syntax solutions we
have been discussing.

Once you come to that realization, it opens a bit of a slippery slope, which
suggests that once you need parallelism libraries to solve certain problems,
they are probably also the best way to solve other manual chunking problems.

This includes what I was calling the "parallel all" loops, where an outer
sequential loop encloses an inner nested loop that is desired to be executed in
parallel. The best performance for these sorts of problems, of which I have
encountered several, comes from using barriers with manual chunking to push the
parallelism out to an outer loop that encloses the sequential loop, and using
barriers to transition between the parts of the algorithm that need to run
sequentially and the other parts that need to run in parallel.
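
For the curious, the general shape of that barrier technique is expressible in
current Ada with Ada.Synchronous_Barriers. Here is a minimal sketch (made-up
names and data; the "sequential phase" is just a placeholder for something like
an Advance step), where each worker owns a slice of the data, and all workers
meet at the barrier between the parallel phase and the sequential phase:

   with Ada.Synchronous_Barriers;

   procedure Barrier_Phases_Demo is

      Num_Workers : constant := 4;
      Steps       : constant := 100;

      Sync : Ada.Synchronous_Barriers.Synchronous_Barrier
               (Release_Threshold => Num_Workers);

      Data : array (1 .. 1_000) of Integer := (others => 0);

      task type Worker (Id : Natural);

      task body Worker is
         Slice_Size : constant Positive := Data'Length / Num_Workers;
         First : constant Positive := Data'First + (Id - 1) * Slice_Size;
         Last  : constant Positive := First + Slice_Size - 1;
         Notified : Boolean;
      begin
         for Step in 1 .. Steps loop
            --  Parallel phase: each worker updates only its own slice.
            for I in First .. Last loop
               Data (I) := Data (I) + 1;
            end loop;

            --  Everyone meets here; exactly one worker is "notified" and
            --  performs the sequential phase on behalf of the whole group.
            Ada.Synchronous_Barriers.Wait_For_Release (Sync, Notified);
            if Notified then
               null;   -- sequential phase (e.g. an Advance step) would go here
            end if;

            --  Second barrier so nobody races into the next parallel phase
            --  while the sequential phase is still in progress.
            Ada.Synchronous_Barriers.Wait_For_Release (Sync, Notified);
         end loop;
      end Worker;

   begin
      declare
         W1 : Worker (1);
         W2 : Worker (2);
         W3 : Worker (3);
         W4 : Worker (4);
      begin
         null;   -- wait here for all workers to finish their Steps
      end;
   end Barrier_Phases_Demo;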

It felt admittedly a bit like trying to force a square peg into a round hole to
graft the "parallel all" syntax onto the parallel loop syntax.

If you take all the manual chunking problems and reduction problems out of
the picture, what's left?

Not much, as I would say easily more than half of the parallel loop problems I
have encountered that are not reductions are of the manual chunking variety.

I have only two representative problems left that would fit under parallel loop
syntax.

  1. Find any
  2. Combining arrays/containers

The Find any problems are the ones that Tucker was talking about where a
preemptive exit would be needed to break out of the loop.

I think these sorts of problems are actually better served by quantified
expressions, which as you may recall are actually a form of reduction
expression.

A quantified expression is a boolean expression that a compiler should break out
of evaluating once it has found a case that determines the value of the
expression. This also applies for parallel quantified expressions. In this case,
the programmer doesn't even need to write the exit statement. It is expected
that the compiler will implicitly do this.

Consider determining if a number is a prime number.

Is_Prime : constant Boolean :=
   (parallel for all I in 2 .. Integer(Sqrt(N)) => N mod I /= 0);

Whether the parallel keyword is present or not, a compiler should stop further
evaluation once it finds a factor of N.
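
For reference, here is an ordinary sequential Is_Prime with the same
early-termination behaviour the quantified expression is expected to have (a
sketch only; the later examples simply assume some such function exists):

   with Ada.Numerics.Elementary_Functions;

   function Is_Prime (N : Integer) return Boolean is
      use Ada.Numerics.Elementary_Functions;
   begin
      if N < 2 then
         return False;
      end if;
      for I in 2 .. Integer (Sqrt (Float (N))) loop
         if N mod I = 0 then
            return False;   -- stop at the first factor found
         end if;
      end loop;
      return True;
   end Is_Prime;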

Sometimes you want a non-boolean value as the result of what you are looking
for. A quantified expression works here also, except that a side effect of the
function is needed to store the result. As an output, you still need a boolean
result as well, to indicate whether you found what you were looking for.

Consider finding any prime value within a range of integers.

Prime_Number : Integer := 0 with Atomic;

function Test_And_Store_Prime (N : Integer) return Boolean is
begin
  if Is_Prime (N) then
     Prime_Number := N;
     return True;
  else
     return False;
  end if;
end Test_And_Store_Prime;

Prime_Found : constant Boolean :=
    (parallel for some I in Start .. Finish => Test_And_Store_Prime (I));

if Prime_Found then
   Put_Line (Integer'Image (Prime_Number) & " is a prime number");
end if;

There will be some naysayers who dislike seeing a side-effect in a quantified
expression, but I think for this particular problem it feels good enough to use
here.

That leaves combining arrays/containers, however those can be written as a
reduction expression also.

Array_Result : array (ArrA'First .. ArrA'Last) of Integer :=
   (for I in ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I)));
Where concatenation is the reducing operator.

Perhaps this looks inefficient, though who is to say how the compiler can
optimize this. It is otherwise a simple enough expression to read or write, I
think.

A more visually appealing way to do this would be to allow the keyword parallel
with the new array aggregate syntax, which eliminates the concatenation
semantics.

Array_Result : array (ArrA'First .. ArrA'Last) of Integer :=
    (parallel for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I)));

See RM 4.3.3 5.1/5 for the array initialization syntax.

I have not yet been able to think of any other cases where parallel loop syntax
would be needed, so it seems to me that we should not introduce parallel loop
syntax given that there are no use cases for parallelism problems that can't be
better solved with reduction expressions or anonymous loop body syntax combined
with parallelism library calls.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 16, 2017  12:22 AM

...
> The discussion about find first, find last and preemptive exits has
> led to show that the anonymous loop body AI coupled with a parallelism
> library call is a better solution to these problems than the parallel
> loop syntax solutions we have been discussing.

There's a couple of problems using the "anonymous loop body" AI in this way:
(1) Without some additional mechanism, we'd lose the safety checks of the
    parallel loop. They don't make any sense in the context of the "anonymous
    loop body" proposal per-se. So I think we'd still need a "parallel" keyword
    or something like that to trigger the safety checks. (Without them, parallel
    operations are just a giant race condition; no human programmer could avoid
    them.)
(2) There's no certainty that the "anonymous loop body" proposal even makes it
    into the Standard. Several people have expressed strong reservations about
    that proposal.

As an aside, I'd like to note that the "exit" problem with the "anonymous loop
body" proposal is in fact very similar to the problems that arise with the
parallel loop proposal. In both cases, implementing exit/return/goto is very
painful, because you have to get control, somehow save whatever is supposed to
happen after the loop, do some actions, and then proceed to doing the after the
loop actions. The only place where the "anonymous loop body" is a bit worse is
that a user-defined routine can see some of what happens there; the parallel
loop is all-compiler generated.

...
> If you take out all the manual chunking problems and reductions
> problems out of the picture, whats left?
>
> Not much, as I would say easily more than half of the parallel loop
> problems I have encountered that are not reductions are of the manual
> chunking variety.

Exactly my point.

> I have only two representative problems left that would fit under
> parallel loop syntax.
>
>   1. Find any
>   2. Combining arrays/containers
>
> The Find any problems are the ones that Tucker was talking about where
> a preemptive exit would be needed to break out of the loop.

I think "Find any" per-se is rather unlikely, although they make sense in the
case of an "ensure none" predicate.

The array combination cases are most likely going to trigger the safety checks
(unless we build in some special holes, and even then only a few special forms
would work). The compiler is probably not going to be able to tell that
different iterations write different components unless the array index is of a
special, recognized form. (It definitely will not work with the raw static
Global check, since the entire array is considered a single object, which
necessarily overlaps with itself.)

> I think these sorts problems are actually better served by quantified
> expressions, which as you may recall are actually a form of reduction
> expression.

Makes sense. Certainly "ensure none" is the sort of case that quantified
expressions were designed for.

...
> That leaves combining arrays/containers, however those can be written
> as a reduction expression also.
>
> Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (for I in
> ArrA'First .. ArrA'Last => <""> & (ArrA(I) + ArrB(I))); Where
> concatenation is the reducing operator.

They have to be written this way (or some other reduction expression), as
otherwise the array objects would be overlapping (making a parallel loop
illegal).

> Perhaps this looks inefficient, though whose is to say how the
> compiler can optimize this. It is otherwise a simple enough expression
> to read or write, I think.
>
> A more visually appealing way to do this would be to allow the keyword
> parallel with the new array aggregate syntax, which eliminates the
> concatenation semantics.
>
> Array_Result : array (ArrA'First .. ArrA'Last) of Integer := (parallel
> for I in ArrA'First .. ArrA'Last => (ArrA(I) + ArrB(I)));
>
> See RM 4.3.3 5.1/5 for the array initialization syntax.

Right, and this probably is a better way to write this expression rather than
writing a loop in the first place.

> I have not been able to yet think of any other cases where parallel
> loop syntax would be needed, so it seems to me that we should not
> introduce parallel loop syntax given that there are no use cases for
> parallelism problems that cant be better solved with reduction
> expressions or anonymous loop body syntax combined with parallelism
> library calls.

Agreed, so long as we find a way to make the safety checks that we have in
parallel loops.

****************************************************************

From: Tucker Taft
Sent: Thursday, August 17, 2017  7:43 PM

> ...
>
>> I have only two representative problems left that would fit under
>> parallel loop syntax.
>>
>>  1. Find any
>>  2. Combining arrays/containers
>>
>> The Find any problems are the ones that Tucker was talking about
>> where a preemptive exit would be needed to break out of the loop.
>
> I think "Find any" per-se is rather unlikely, although they make sense
> in the case of an "ensure none" predicate.

I don’t agree, based on my experience with ParaSail.  YMMV.

> The array combination cases are most likely going to trigger the
> safety checks (unless we build in some special holes, and even then
> only a few special forms would work). The compiler is probably not
> going to be able to tell that different iterations write different
> components unless the array index is of a special, recognized form.
> (It definitely will not work with the raw static Global check, since
> the entire array is considered a single object, which necessarily overlaps
> with itself.) ...

This is very well traveled area.  Compilers have been recognizing the special
case of independent loop iterations for decades.  Given:

   parallel for I in A'Range loop
       C(I) := A(I) + B(I);
   end loop;

Many, many compilers are smart enough to understand that this is safe.  I agree
the loop-body procedure construct is very nice, but you still end up wanting to
do the safety checks, and something as simple as the above, whether the user is
driving the “chunking” or the compiler does it by itself, has exactly the same
safety-check requirements.

We have been talking with existing Ada customers recently about planned Ada 202X
enhancements, and this is the one that gets them drooling every time.  They
*really* want parallel loops.  Of course if they are too restrictive, they won’t
solve the problem, but I believe with the proposed reduction expression, the
above simple parallel loop, and the nice loop-body procedure, we will give them
what they need to be quite happy.  I really believe we need all three.

****************************************************************

From: Randy Brukardt
Sent: Thursday, August 17, 2017  9:32 PM

...
> > The array combination cases are most likely going to trigger the
> > safety checks (unless we build in some special holes, and even then
> > only a few special forms would work). The compiler is probably not
> > going to be able to tell that different iterations write different
> > components unless the array index is of a special, recognized form.
> > (It definitely will not work with the raw static Global
> check, since
> > the entire array is considered a single object, which
> necessarily overlaps with itself.) ...
>
> This is very well traveled area.  Compilers have been recognizing the
> special case of independent loop iterations for decades.  Given:
>
>    parallel for I in A'Range loop
>        C(I) := A(I) + B(I);
>    end loop;
>
> Many, many compilers are smart enough to understand that this is safe.

Surely. But that's not the point. The point is that we have to have Ada Legality
Rules which make this sort of thing legal and portable. The proposed rules
using the Global aspect certainly would not allow the above. Unless you have
been secretly planning some much more complex rules to allow such things, loops
like the above couldn't be legal Ada.

Note that I'm not against allowing loops like the above, but that has to be
portably defined. (The above case: a single dimensioned target that doesn't
overlap anything else, and is indexed directly by the loop parameter, is clearly
safe and could be defined that way, but hardly anything else could be. Is this
case alone enough for the extra complication???)
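
To make the distinction concrete, here is a small sketch (illustrative types and
data only) of the one form that seems clearly provable, next to two forms that a
purely Global-based legality check could not accept:

   procedure Safety_Check_Sketch is
      type Vec is array (1 .. 100) of Integer;
      A, B, C : Vec := (others => 0);
      P : constant array (1 .. 100) of Integer := (others => 1);  -- data-dependent indices
   begin
      --  Provable: each iteration writes only C(I), indexed directly by the
      --  loop parameter, and C overlaps nothing else being used.
      for I in C'Range loop          -- a candidate for "parallel"
         C (I) := A (I) + B (I);
      end loop;

      --  Not provable from Global information alone: the write target depends
      --  on data (P(I)), so two iterations might write the same component.
      for I in C'Range loop
         C (P (I)) := A (I) + B (I);
      end loop;

      --  Also not provable: neighbouring iterations touch overlapping components.
      for I in C'First .. C'Last - 1 loop
         C (I) := C (I) + C (I + 1);
      end loop;
   end Safety_Check_Sketch;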

...
> I agree the loop-body procedure construct is very nice, but you still
> end up wanting to do the safety checks, and something as simple as the
> above, whether the user is driving the "chunking" or the compiler does
> it by itself, has exactly the same safety-check requirements.

Not exactly, as noted above. But I definitely agree that whatever construct is
used needs safety checks.

> We have been talking with existing Ada customers recently about
> planned Ada 202X enhancements, and this is the one that gets them
> drooling every time.  They *really* want parallel loops.  Of course if
> they are too restrictive, they won't solve the problem, but I believe
> with the proposed reduction expression, the above simple parallel
> loop, and the nice loop-body procedure, we will give them what they
> need to be quite happy.  I really believe we need all three.

Probably because no one will be able to grok reduction expressions. :-) It took
me a three hour ARG presentation to get any idea...

People naturally understand parallel loops. At least they think they do, until
they find out that exit doesn't work sensibly. :-)

****************************************************************

From: Brad Moore
Sent: Friday, August 18, 2017  12:01 AM

I have some more thoughts towards answering the question on whether we should
eliminate parallel loop syntax from our proposal.

Firstly, I had said in the previous email that find-first and find-last problems
should be written using parallel libraries and the anonymous loop proposal.

I think these can also be written well enough using reduction expressions.

Consider finding the highest prime number in a range of integers. One could
express this using the following reduction expression:

Highest_Prime_Found : constant Integer :=
   (parallel for I in reverse Start .. Finish =>
      Integer'Max(<Integer'First>, (if Is_Prime(I) then I else Integer'First)));

A simpler compiler might only apply divide and conquer here, which means each
chunk fully executes. This should still cut the execution time down from the
sequential version by a factor of the number of cores.

A smarter compiler might recognize that when the reduction is 'Max or 'Min and
the expression result is the loop iterator itself in such a pattern, then once
a higher index is found to be prime there is no longer a need to keep searching
the lower indices. Indeed, a smarter compiler might recognize this as a special
case and insert a completely custom algorithm optimized for find-first/find-last
problems.

So there appears to be no need to rely on library packages nor parallel loop
syntax for these problems.

Now consider another one of the manual chunking problems where I was suggesting
using barriers would be the best approach when the outer loop must be a
sequential loop that happens to enclose a parallel loop.

Consider a Newton package that simulates the movement of heavenly bodies in a
solar system. A procedure Update_Position can be called that considers the mass,
gravity, direction, and movement of neighbouring bodies to calculate the
position of a body for the next time increment. An Advance function might be
needed to swap internal buffers for the next and current values for the next
time increment.

One could express this using a parallel loop inside a sequential loop that
advances time.

for Current_Time in Now .. Time + 100_000_000 loop

   for parallel all Heavenly_Body in Solar_System'Range loop
      Newton.Update_Position (Heavenly_Body);
   end loop;

   Newton.Advance;
end loop;

This doesn't work well with reduction expressions if the data arrays/containers
are hidden behind an abstraction in a package body.

However, if one were to modify the Update_Position subprogram to be a function
that returns a boolean status to indicate the success of the calculation, then
one can use a reduction expression here. The Newton.Advance function might raise
an exception if the boolean input parameter is false, indicating some failure
occurred, such as two planets colliding, that isn't handled by the simulation.

for Current_Time in Now .. Time + 100_000_000 loop

   Newton.Advance (for parallel all Heavenly_Body in Solar_System'Range
                     => Newton.Update_Position (Heavenly_Body));
end loop;

So, if we did not have parallel loop syntax, it might influence people more to
write functions for parallelism problems, rather than procedures, but I wouldn't
say that's a bad thing necessarily.

A smarter compiler might recognize the pattern of a parallel reduction
expression inside an enclosing sequential loop and treat that as a special case
using barriers to get optimal performance, or it may be that the compiler's
default implementation just
works well here. If the programmer finds the performance is poor, he could
always write the problem himself using barriers and use the anonymous loop
procedure with a parallelism library.

The point again though is that I don't think we need to rely on the availability
of parallelism libraries for this problem, nor do we need parallel loop syntax,
unless we feel strongly that one should be able to use a procedure instead of a
function for the Update_Position call.

The one other manual chunking problem I have that is quite different from the
others is to generate a cumulative sum of an array, where each element of the
result array holds the corresponding element of the input array plus all the
values preceding it in the input array. This solution involves three loops. The
first loop is a parallel loop that gets the sum of each chunk and stores that
value in a partial-sum intermediate array.

The second loop is a sequential loop that performs a cumulative sum operation on
the partial sum intermediate array.

The third loop is a parallel loop that takes the corresponding partial result
values as a starting point for each chunk, and then applies the cumulative sum
to each chunk.

Using manual chunking I can write the first loop using a nested reduction
expression where the outer parallel reduction expression concatenates sum values
calculated from the inner reduction expression for each chunk.

However, the third loop seems to pretty much need to be a parallel loop.

declare
    Chunk_Size : constant Natural := Natural'Min(Arr'Length, 10_000);
    Chunks     : constant Natural := Arr'Length / Chunk_Size;

    function Chunk_Start (Chunk : Natural) return Index_Type is
      (Arr'First + (Chunk - 1) * Chunk_Size);

    function Chunk_Finish (Chunk : Natural) return Index_Type is
      (if Chunk = Chunks then Arr'Last
       else Arr'First + Chunk * Chunk_Size - 1);

    -- Create the Partial sums in parallel
    -- Create the partial sums in parallel
    Partial    : array (1 .. Chunks) of Element_Type :=
       (parallel for I in 1 .. Chunks => <""> &
           (for J in Chunk_Start (I) .. Chunk_Finish (I) => <0> + Arr (J)));
begin

    -- Sequential carry partial sums through array
    for I in 2 .. Partial'Length loop
       Partial (I) := @ + Partial (I - 1);
    end loop;

    for parallel I in 1 .. Chunks loop
      declare
        Sum : Element_Type := (if I = 1 then 0 else Partial (I - 1));
      begin
        for J in Chunk_Start (I) .. Chunk_Finish (I) loop
           Sum := @ + Arr (J);
           Cum (J) := Sum;
        end loop;
      end;
    end loop;
end;

However, as this is a manual chunking problem, I find the code written using
anonymous loop procedures and a library call is quite a bit simpler and easier
to read, and requires less know-how from the programmer to decide how the loops
should be chunked. The chunking logic is hidden inside the library call.

declare
    package Scan is new Parallel.Loops.Static (Loop_Index => Positive);
    Partial : array (1 .. Scan.Worker_Count) of Element_Type := (others => 0);
begin

    -- Get the sum of each chunk
    for (Start, Finish) in Scan.Parallel loop
      declare
         Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id;
      begin
          for I in Start .. Finish loop
             Partial (Worker) := @ + Arr (I);
          end loop;
       end;
    end loop;

    --  Sequential exclusive scan phase
    for I in 2 .. Scan.Worker_Count loop
       Partial (I) := @ + Partial (I - 1);
    end loop;

    for (Start, Finish) in Scan.Parallel loop
      declare
         Worker : constant Parallel.Worker_Id := Scan.Get_Worker_Id;
         Sum : Integer := (if Worker = 1 then 0 else Partial (Worker - 1));
      begin
         for I in Start .. Finish loop
            Sum := @ + Arr (I);
            Cum (I) := Sum;
          end loop;
      end;
    end loop;
end;

So I would say this is the one case where a parallel library call with anonymous
loop procedure appears to be the best solution, particularly if it means that we
can simplify the parallelism proposal and not have parallel loop syntax.

Tucker did have a proposal involving parallel array syntax that was designed
specifically for this problem. At the time we were considering using that as the
reduction solution, but since reduction expressions appear to be a better
solution, it seems unlikely that we would want to bring back parallel array
syntax if it is only useful for this one problem.

So in summary, I still think most parallel looping problems can be handled with
reduction expressions.

There may be a few odd cases involving manual chunking where one might want to
use a parallelism library with the anonymous loop procedure.

My sense is that there are not enough of these problems needing parallelism
libraries to warrant standardizing a set of libraries for this, given that
libraries are available, and one can also write Ada wrapper libraries around
OpenMP calls, for example, although standardizing a set of parallelism libraries
could be another thing to consider.

While there are some special cases where parallel loop syntax could be useful
for manual chunking problems, a reduction expression seems to be a good enough
alternative; otherwise, using an anonymous loop body with a library call is
another viable approach.

That leaves me still thinking that parallel loop syntax is not worth adding at
this point, which would simplify the AI to just cover reduction expressions and
parallel blocks. But it would be good to get others to weigh in as well, of
course.

****************************************************************

From: Brad Moore
Sent: Tuesday, August 29, 2017  10:01 AM

> ...
>> The discussion about find first, find last and preemptive exits has
>> led to show that the anonymous loop body AI coupled with a
>> parallelism library call is a better solution to these problems than
>> the parallel loop syntax solutions we have been discussing.
> There's a couple of problems using the "anonymous loop body" AI in this way:
> (1) Without some additional mechanism, we'd lose the safety checks of
> the parallel loop. They don't make any sense in the context of the
> "anonymous loop body" proposal per-se. So I think we'd still need a
> "parallel" keyword or something like that to trigger the safety
> checks. (Without them, parallel operations are just a giant race
> condition; no human programmer could avoid
> them.)

I have been thinking about this. I think the safety checks could be provided if
we could allow the global aspect on formal subprograms and formal access to
subprograms, with an addition to the global syntax to indicate a subrange of an
array.

with global => array(x .. y)

Which would indicate that there is global usage of an array over the indices
x .. y. The array might be the name of a specific array, or just the "array"
keyword, which would mean any array.

This might be used with a parallel loop library such as the following.

with System.Storage_Elements;

generic
    type Loop_Index is range <>;
package Parallel is

    procedure Parallel_Loop
      (Process : not null access procedure (Start, Finish : Loop_Index)
                   with global => (Input  => array (Start .. Finish),
                                   In_Out => array (Start .. Finish),
                                   Out    => array (Start .. Finish));
       From : Loop_Index := Loop_Index'First;
       To   : Loop_Index := Loop_Index'Last)
    with global => null;

end Parallel;

The idea would be that the compiler would reject any call to Parallel_Loop whose
Process callback references any globals other than slices of an array indexed by
the parameters of the call.

The compiler would also reject the code if there was an attempt to reference the
array using indices that are outside the range of the specified limits, or not
derived from the specified parameters.

Note that the Parallel_Loop library call itself does not involve any globals.

The compiler would not reject calls involving a Process callback that didn't
reference any globals. That is, the actual is not required to reference globals,
but if it does, it cannot reference globals outside of the specification.
This should eliminate the race conditions that we are worried about, I think.

Example usage of the above:

   -- Anonymous loop body call
   for (Start, Finish) of Parallel_Loop loop

       -- Sequential inner loop
       for I in Start .. Finish loop
          C(I) := A(I) + B(I);           -- OK
       end loop;

       C(1) := 3; -- Illegal, Index is not derived from Start or Finish
   end loop;

> (2) There's no certainty that the "anonymous loop body" proposal even
> makes it into the Standard. Several people have expressed strong
> reservations about that proposal.
>
> As an aside, I'd like to note that the "exit" problem with the
> "anonymous loop body" proposal is in fact very similar to the problems
> that arise with the parallel loop proposal. In both case, implementing
> exit/return/goto is very painful, because you have to get control,
> somehow save whatever is supposed to happen after the loop, do some
> actions, and then proceed to doing the after the loop actions. The
> only place where the "anonymous loop body" is a bit worse is that a
> user-defined routine can see some of what happens there; the parallel loop
> is all-compiler generated.

Actually, this is not a problem for exit, because with the anonymous loop body
the user's code appears as a sequential loop nested inside the outer anonymous
loop. Since the inner loop is a sequential loop, exit works just as it does
today: it only exits the inner sequential loop.
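
For example (reusing the Parallel_Loop example from above; the early-exit
condition here is made up purely for illustration):

   for (Start, Finish) of Parallel_Loop loop
      for I in Start .. Finish loop
         exit when A(I) = 0;   -- exits only this inner sequential loop
         C(I) := A(I) + B(I);
      end loop;
   end loop;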

I suspect we'd probably want to make returns and gotos illegal from inside an
anonymous loop body.

So the anonymous loop body needs no knowledge that there is parallelism
involved. It's just a library call that happens to hide the parallelism inside
the library call.

****************************************************************

From: Tucker Taft
Sent: Tuesday, August 29, 2017  2:05 PM

Although SPARK only allows names of stand-alone objects in Global annotations,
for Ada we allow any object name, presumably including an indexed component or a
slice.  We also have proposed in AI12-0079 to allow them on formal subprograms
and access-to-subprogram, so I think the cases you mentioned are consistent with
the current AI.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 30, 2017  10:16 PM

How do the static checks work in such a case, as the bounds in question would
typically be non-static? For instance, in Brad's

  with global => array(x .. y)

X and Y probably would be outer loop indexes, or parameters to the current
subprogram.

****************************************************************

From: Tucker Taft
Sent: Wednesday, August 30, 2017  10:26 PM

You just need to prove two slices don’t overlap, which is equivalent to
proving the high bound of one is less than the low bound of the other. This
is the kind of thing that static analysis and more advanced proof tools are
pretty good at doing! This is not significantly harder than eliminating
run-time checks for array indexing when the bounds are not known at
compile-time, which is something that many Ada compilers can do based on other
information available at the point of usage.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 30, 2017  10:40 PM

OK, but again these are Legality Rules, not something that compilers are doing
on their own. The entire set of rules has to be defined formally and required
of all Ada compilers. What "static analysis" and "advanced proof tools" can or
can't do is irrelevant (unless of course we allowed what is legal or not to be
implementation-defined -- when I previously proposed that for exception
contracts, the idea seemed to be as effective as a lead balloon).

****************************************************************

From: Randy Brukardt
Sent: Wednesday, August 30, 2017  11:02 PM

> I have some more thoughts towards answering the question on whether we
> should eliminate parallel loop

...

Let me throw out another classic problem (one that I recently wrote for a hobby
program): the problem of determining correlations and other statistics between
two parallel arrays of data.

The core of the problem is calculating 6 pieces of data: the sum of the
elements of each array, the sum of the squares of the elements of each array,
the sum of the products of the matching elements of the two arrays, and the
sum of the squares of those products.

The sequential solution is a single loop creating all 6 items:

     for I in 1 .. MAX loop
        A_Sum := @ + A(I);
        A_Sum_Sqr := @ + A(I)*A(I);
        B_Sum := @ + B(I);
        B_Sum_Sqr := @ + B(I)*B(I);
        Prod_Sum := @ + A(I)*B(I);
        Prod_Sum_Sqr := @ + (A(I)*B(I))**2;
     end loop;

MAX is over 3000 for this particular problem, so it would seem that a parallel
solution might add some performance.

These are clearly reduction expressions. However, we have to write 6 separate
reduction expressions, which means 6 times the loop overhead. Also, the above
is full of common subexpressions; it's likely that a compiler would calculate
and load A(I) and B(I) once each, and possibly eliminate some of the other
operations as well (in particular, the A*B product). None of these
optimizations can be done with separate reduction expressions, so we'll end up
with 5 times the memory reading as well (each A(I) and B(I) is read 5 times in
the sequential loop).
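
For concreteness, the rewrite would be six separate expressions along the
following lines (sketched with one possible form of the reduction-expression
notation; the exact syntax doesn't matter here), each of which is its own loop
over the data:

   A_Sum        := (for I in 1 .. MAX => <0.0> + A(I));
   A_Sum_Sqr    := (for I in 1 .. MAX => <0.0> + A(I)*A(I));
   B_Sum        := (for I in 1 .. MAX => <0.0> + B(I));
   B_Sum_Sqr    := (for I in 1 .. MAX => <0.0> + B(I)*B(I));
   Prod_Sum     := (for I in 1 .. MAX => <0.0> + A(I)*B(I));
   Prod_Sum_Sqr := (for I in 1 .. MAX => <0.0> + (A(I)*B(I))**2);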

Thus, a parallel rewrite into 6 reduction expressions is likely to be
disappointing. With the extra parallel execution overhead, it would likely
take 8 cores to improve the performance over the sequential loop, and the
performance would be worse for 4 cores or less.

This is not a theoretical concern: Janus/Ada doesn't do many floating point
optimizations; recompiling the program with GNAT (which presumably does do
those optimizations) improved the performance by a large factor (more than 10,
as I recall).

How would you parallelize this loop??

****************************************************************

From: Tucker Taft
Sent: Thursday, August 31, 2017  11:02 PM

I do not believe we should eliminate parallel loop syntax.  The question that
relates more directly to this example is whether we need special reduction
syntax beyond the “map/reduce” construct.  This example argues that in cases
where you have multiple accumulators, you might want some way of declaring
them so that you get one per chunk.

Alternatively, we encourage the use of an accumulator “record” for this sort
of computation.  E.g.:

   type Accums is record
       A_Sum, A_Sum_Sqr : Float;
       B_Sum, B_Sum_Sqr : Float;
       Prod_Sum, Prod_Sum_Sqr : Float;
   end record;

   function Update_Accums (Orig : Accums; A_Elem, B_Elem : Float) return Accums is
   begin
       return (A_Sum        => Orig.A_Sum + A_Elem,
               A_Sum_Sqr    => Orig.A_Sum_Sqr + A_Elem * A_Elem,
               B_Sum        => Orig.B_Sum + B_Elem,
               B_Sum_Sqr    => Orig.B_Sum_Sqr + B_Elem * B_Elem,
               Prod_Sum     => Orig.Prod_Sum + A_Elem * B_Elem,
               Prod_Sum_Sqr => Orig.Prod_Sum_Sqr + (A_Elem * B_Elem) ** 2);
   end Update_Accums;

   Result : constant Accums :=
      (for I in 1 .. MAX => Update_Accums (<(others => 0.0)>, A(I), B(I)));

****************************************************************

From: Tucker Taft
Sent: Thursday, August 31, 2017  3:19 PM

>> You just need to prove two slices don't overlap, which is equivalent
>> to proving the high bound of one is less than the low bound of the
>> other.  This is the kind of thing that static analysis and more
>> advanced proof tools are pretty good at doing!  This is not
>> significantly harder than eliminating run-time checks for array
>> indexing when the bounds are not known at compile-time, which is
>> something that many Ada compilers can do based on other information
>> available at the point of usage.
>
> OK, but again these are Legality Rules, not something that compilers
> are doing on their own. The entire set of rules have to be defined
> formally and required of all Ada compilers. What "static analysis" and
> "advanced proof tools" can or can't do is irrelevant (unless of course
> we allowed what is legal or not to be implementation-defined -- when I
> previously proposed that for exception contracts, the idea seemed to
> be as effective as a lead balloon).

This is a general problem with data race detection, I would say.  My suggestion
is to define what we consider to be a "potential data race" in a way that
run-time detection can be used, but make it clear that compilers are permitted
to give compile-time errors if they can show that the "potential data race"
exists in a given instance.  This is somewhat analogous to the "potentially
blocking operation," but in this case, we require the equivalent of
pragma Detect_Blocking.  And of course a compiler can omit the run-time check if
it can prove there is no potential data race.  For anything in between,
presumably many compilers would provide warnings.

****************************************************************

From: Randy Brukardt
Sent: Thursday, August 31, 2017  3:38 PM

That's effectively making the checks implementation-defined (whether a
particular piece of code will compile is impl-def). Not sure how well that
will fly. Ada usually requires a run-time check in such instances and does
not allow a compile-time check (the latter because of the mess that comes
up in conditional code). Anyway, food for thought.

****************************************************************

From: Tucker Taft
Sent: Thursday, August 31, 2017  8:17 PM

Perhaps you are right.  No big reason to treat this differently from any other
run-time check.  Most compilers already produce a warning when a run-time check
is certain to fail, and often provide a mode where certain classes of warnings
are treated as errors, so that would have much the same net effect.  In
generic bodies, we often define things to raise Program_Error under certain
circumstances, but for compilers that do macro-expansion, almost all of these
end up as compile-time warnings, which are often treated as errors by the user.

****************************************************************

From: Brad Moore
Sent: Tuesday, October 10, 2017  12:15 AM

I am trying to write up my homework for AI12-0119-1 (parallel operations),
with an eye to possibly writing up AI12-0197-4 (Coroutines and Channels),
and have come across some observations and ideas that I think improve on
where we left off at Vienna, modified to account for some new issues.

The basic change is that instead of parallel "begin ... end" for parallel
blocks, I am planning to substitute "do ... end do".

Eg.

    parallel do
       Foo;
    and
       Bar;
    end do;

My reasoning for this is:

1) The sequence of statements in each branch should just be a
  sequence-of-statements, rather than a handled-sequence-of-statements,
  because I find that attempting to put an exception handler in this
  construct is very confusing; I recall now that this is one of the main
  reasons why the gang-of-four settled on syntax modeled after the select
  statement rather than a begin statement for parallel blocks.

For example, if one writes:

    parallel begin
       Foo;
    and
       Bar;
    exception
       when others =>
         Put_Line ("Exception Caught!");
    end;

    I think this is very confusing. Is the exception handler only catching
    exceptions for the call to Bar, or is it also catching exceptions for
    the call to Foo? If we say that exception handlers are not allowed,
    then the programmer is forced either to enclose the construct in a
    block statement or equivalent, or to use block statements within each
    arm of the construct.  I find either approach to be much clearer.

     e.g.

     begin

           parallel do
              Foo;
           and
              Bar;
           end do;

     exception
        when others =>
          Put_Line("Exception Caught!");
     end;

or

       parallel
       do
           begin
              Foo;
           exception
              when others =>
                Put_Line ("Bad Foo!");
           end;

        and

           begin
             Bar;
           exception
              when others =>
                 Put_Line ("Bad Bar!");
           end;

        end do;

The use of "do" has the benefit that "begin" from Vienna, had in that you can
remove the keyword parallel and still have a working construct.
The parallel keyword is a modifier. In the previous parallel block proposal
this would not work as parallel was the name of the construct. If you remove
"parallel", you have a bunch of code that needs to be cleaned up before it
will compile.

Secondly, I don't think there is much use in having a declarative region in
this construct. Generally, you need the results to outlive the construct, so
that you can do something useful with them.
This again suggests that the construct should be enclosed in another construct
such as a block statement.

e.g.

    declare
       X : Integer;  -- Do Foo & Bar get their own copies of these?
       Y : Float;
    parallel
    begin
       X := Foo;
    and
       Y := Bar;
    exception  --  Is this handler for both Foo & Bar, or just Bar?
      when others =>
        Put_Line("Exception Caught!);
    end;

    Put_Line (????);  -- !!! I can't print out the results !!!

In this example, two results have been calculated, but if you want to examine
both of these results at the same time, you cannot, because the scope of the
results has ended before you can examine them both.

Also, having a declarative region on a parallel begin is confusing. I think some
users would wonder whether each arm of the parallel begin gets its own local
instance of those declarations, or whether the declarations are shared between
all the arms.

Eliminating the declarative region removes this confusion; the region is also
not likely to be missed, because it is not useful to begin with.

If one instead writes:

declare
   X : Integer;
   Y : Float;
begin

    parallel do
       X := Foo;
    and
       Y := Bar;
    end do;

    Put_Line ("X=" & Integer'Image(X) & ", Y=" & FLoat'Image(Y));

exception
    when others =>
      Put_Line("Exception Caught!");
end;

I find this to be unambiguous, and it provides the scope needed to examine the
results of both arms.

So it becomes clearer to me that this construct is quite different from a
block statement, and therefore it probably should have its own distinct
syntax, rather than trying to make the block statement accommodate this usage,
which feels like a hack to me and like it would make a bit of a mess of the
syntax.

Why "do"?  The reserved keyword do is relatively unused in Ada, appearing only
in the selective accept statement. It seems to fit better than begin in terms
of English semantics.

The construct really is a list of things to be done.
   e.g.
       Do X and Y and Z.

   reads better than a list of things to be begun I think.

       Begin X and Y and Z.

   Begin really is more tied to a declarative region, where you
   declare a bunch of things, then you need to specify where to "begin"
   executing. Since this construct doesn't seem to need a declarative
   region, there is no need to indicate where the execution begins.

   Also many syntax constructs in Ada have the form
      name
      end name;

      e.g.

      loop
      end loop;

      if
      end if;

      select
      end select;

      case
      end case;

      record
      end record;

      If one sees "end" without looking at the above, the assumtion is that it
      corresponds to a normal block statement.

      If one sees "end do" it provides better feedback that the preceding code
      has different semantics.

      I suspect that the reason "begin" is not paired with
      "end begin" mostly because end begin looks like a weird oxymoron
      that would also be confusing to readers. "end do" does not seem
      to have this problem.

      Finally, I think the coroutine concept of AI12-0197 pretty much falls
      out of this for free. If the keyword "parallel" is not present,
      then the semantics could be that each arm gets its own executor,
      similar to when the parallel keyword is present, but each executor
      is tied to the same core; thus each arm of the construct executes
      concurrently, but not in parallel, as only one arm can be executing
      at one time. If you want more parallelism, simply add the parallel
      keyword.

      I don't think channels are needed, or anything else; one can use
      existing capabilities in Ada to provide the channels.
      For example, an instance of Ada.Containers.Unbounded_Synchronized_Queues
      could be used to provide a channel.
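
      As an illustration (just a sketch; the instantiation below is one way
      to get a queue of Integers from the standard containers, and the name
      Integer_Queue_Interfaces is made up here), the Integer_Queues package
      used in the example below could be declared as:

with Ada.Containers.Synchronized_Queue_Interfaces;
with Ada.Containers.Unbounded_Synchronized_Queues;

package Integer_Queue_Interfaces is new
   Ada.Containers.Synchronized_Queue_Interfaces (Element_Type => Integer);

package Integer_Queues is new
   Ada.Containers.Unbounded_Synchronized_Queues
      (Queue_Interfaces => Integer_Queue_Interfaces);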

      Here is an example of coroutines involving two producers and one
      consumer.  The first arm is an endless loop that produces integer
      values, the second arm is a bounded loop that produces higher
      integer values, and the third arm is the consumer which will end
      up pulling values from both the other arms.

      The Exit statement of the consumer is used here to terminate the
      construct, which will abort the endless loop of the 1st arm.
      There may be reasons why adding "Exit" to block statements would
      not fit very well with the syntax. I suspect there would be less reason
      to disallow Exit in a do construct.

      Alternatively, this could be a goto, return, or exception, which
      is treated as a transfer of control out of the construct, which
      we've already discussed.

declare
   Queue : Integer_Queues.Queue;
begin

    do
       declare
          X : Integer := 0;
       begin
          loop    -- endless loop
             Queue.Enqueue (X);
             X := X + 1;
          end loop;
       end;

    and

       for I in 1 .. 3 loop
          Queue.Enqueue (1_000_000 + I);
       end loop;

    and

       declare
          Value : Integer;
       begin
          for I in 1 .. 5 loop
             Queue.Dequeue (Value);
             Put_Line ("Value=" & Integer'Image (Value));
          end loop;

          Exit; -- or raise exception???
       end;

    end do;

    I've tried this code in Paraffin, using Ada's asynchronous
    transfer of control mechanism, and I see the expected results:

    Value= 1000001
    Value= 1000002
    Value= 1000003
    Value= 0
    Value= 1

The last two values might be output before the first three depending on
which arm gets to execute first. I think we could go for more deterministic
behaviour and say that the initial order of execution is top down. Once all
have had their initial start, they proceed in a concurrent fashion dependent
on the behaviour of the "channel" and the blocking of the arm branches.

    Anyway, I think these are compelling reasons for using "do" rather
    than "begin", so I will write my homework up that way, unless someone
    can convince me otherwise in the meantime.

    If we decide to go back to "begin", I don't think it will be a big
    change to go that way. I just wanted to present these ideas earlier
    so that it won't be as much of a surprise when we meet in Boston,
    and to possibly receive comments earlier.

****************************************************************

From: Randy Brukardt
Sent: Tuesday, October 10, 2017  12:42 AM

...
> I just wanted to present these ideas earlier so that it wont be as
> much a surprise when we meet in Boston, and to possibly receive
> comment earlier.

With the homework deadline about 40 hours from now, there isn't much time for
comment and AI update. And last-minute flurries of mail are an issue for me too,
as it takes time to file e-mail threads. Ergo, at this point, just do it.

(Also, do as I say, not as I do... ;-)

>Finally, I think the coroutine concept of ai-0197 pretty much fall out
>of this for free. If the keyword "parallel" is not present, then the
>semantics could be that each arm gets its own executor, similar to when
>the parallel keyword is present, but each executor is tied to the same
>core, thus each arm of the construct executes concurrently, but not in
>parallel, as only one arm can be executing at one time.

OK, but what is providing the "yield" semantics? Without that, it's just stupid
tasking. And it can't require a lot of work to write, or else it hardly solves
anything that can't already be solved with an existing task.

> I dont think channels are needed or anything else, one can use
> existing capabilities in Ada to provide the channels.
> For example, Ada.Containers.Synchronous_Queues could be used to
> provide a channel.

I don't think that is what Tucker had in mind -- if normal task communication
would work, there'd be no need to propose such an idea in the first place. (Nor
any reason for coroutines, but I digress.) In any case, I want to find out
exactly what he had in mind, and trying to preempt him is not helpful.

> There may be reasons why adding "Exit" to block statements would not
> fit very well with syntax. I suspect there would be less reason to
> disallow Exit in a do construct.

Exit fits fine with the syntax of a block statement. But allowing it is wildly
incompatible, consider the Ada 95 code:

     loop
         begin
             exit when ...;
         end;
     end loop;

If block statements had exit, this exit would exit the block, rather than the
loop. That's clearly not the intention of the writer of the Ada 95 code.

As you note, there's no compatibility problem with "do", but I don't think it is
a particularly good idea to allow that in one construct but not in a similar
one. Is this really necessary? It seems like a rare need, and a goto works fine
(even if clunky).

****************************************************************

From: John Barnes
Sent: Tuesday, October 10, 2017  1:53 AM

I rather like do.

****************************************************************

From: Tullio Vardanega
Sent: Tuesday, October 10, 2017  2:25 AM

So do I.

****************************************************************

From: Brad Moore
Sent: Tuesday, October 10, 2017  9:24 AM

> Ergo, at this point, just
> do it.

Interesting choice of words. If you had said, just begin it, it would probably
lower the chances of me getting it done. ;)

>> Finally, I think the coroutine concept of ai-0197 pretty much fall
>> out of this for free. If the keyword "parallel" is not present, then
>> the semantics could be that each arm gets its own executor, similar
>> to when the parallel keyword is present, but each executor is tied to
>> the same core, thus each arm of the construct executes concurrently,
>> but not in parallel, as only one arm can be executing at one time.
> OK, but what is providing the "yield" semantics? Without that, it's
> just stupid tasking. And it can't require a lot of work to write, lest
> it hardly solve anything that can't already be solved with an existing task.

We could have two subtype attributes, 'Consume and 'Yield, allowed only in "do"
statements, that semantically read from and write to an implicit buffer or queue
of that subtype. The scope of the implicit queue would be tied to the scope of
the do statement. This would be expected to work regardless of whether the do
statement has the parallel keyword or not. The default length of the queue could
be 1, but perhaps could be controlled by an aspect on the subtype declaration.

e.g.

do
   for I in 1 .. 1_000_000 loop
      Integer'Yield(I);
   end loop;
and
   for I in 1 .. 1_000_000 loop
      Float'Yield(Sqrt(Float(I)));
   end loop;
and
   for I in 1 .. 10 loop
      Put_Line ("The Square Root of" & Integer'Image(Integer'Consume) & " is"
                & Float'Image(Float'Consume));
   end loop;
   goto Done;
end do;

<<Done>>

****************************************************************

From: Brad Moore
Sent: Wednesday, October 11, 2017  4:01 PM

Here is my homework for this AI. [This is version /04 of the AI - Editor.]

I have applied major rework to the reduction expression construct, since it was
very sketchy to begin with.

I have also renamed parallel blocks to concurrent blocks, using the do ... and
... end do syntax I described in a recent email. If the parallel keyword is not
present, then the do statement executes concurrently within the same task, which
is safer and more useful than just applying sequential execution.

This will also provide a good starting point for AI12-0197, coroutines, since
most of the construct is already in place.

For parallel loops, the biggest change was to move the parallel keyword from
the middle of the syntax to before the "for" keyword.

****************************************************************

From: Randy Brukardt
Sent: Wednesday, October 11, 2017  4:01 PM

> Here is my homework for this AI.

I fixed some minor issues in this AI (without making a separate version):

The reference to AI12-0064-1 is stale (that alternative is abandoned), and the
aspect is named Nonblocking.

The paragraph starting "Any transfer of control..." had two left-over uses of
"parallel sequences", which I replaced by "concurrent sequences" to match the
other changes (including earlier in the same sentence).

The paragraph starting "Note that the same rules..." talks about "parallel
blocks", which was otherwise changed to "concurrent blocks".

There was a missing ) in the first paragraph of the Legality Rules for 4.5.9.

There was a stray "Modify 5.5.2", which was removed.

Some of the formatting was unusually narrow (while it was unusually wide last
time - can't win, I guess).


===

Comment: Examples in the RM need to be complete, typically by depending on
declarations from previous examples. None of the examples in 4.5.9 or 5.6.1 seem
to do that. (A couple might be stand-alone, which is OK, but I didn't check
carefully.) That needs to be fixed.

Comment: You added "reducer" aspects as needed to Ada.Strings and the like. But
don't the containers need something similar? I could do that in AI12-0112-1, but
I'd have to know what is needed (and I'm not the right person to figure that
out).

****************************************************************

From: Brad Moore
Sent: Wednesday, October 11, 2017  11:26 PM

...
> I fixed some minor issues in this AI (without making a separate version):
...
> There was a missing ) in the first paragraph of the Legality Rules for 
> 4.5.9.

Thanks for catching these, Randy.

> There was a stray "Modify 5.5.2", which was removed.
>
> Some of the formatting was unusually narrow (while it was unusually 
> wide last time - can't win, I guess).

I don't know how wide is good. I had my text editor set at a 72-character
margin by default and tried to stick to that. Would 80 characters be
better?

> ===
>
> Comment: Examples in the RM need to be complete, typically by 
> depending on declarations from previous examples. None of the examples 
> in 4.5.9 or 5.6.1 seem to do that. (A couple might be stand-alone, 
> which is OK, but I didn't check carefully.) That needs to be fixed.
>
> Comment: You added "reducer" aspects as needed to Ada.Strings and the like.
> But don't the containers need something similar? I could do that in 
> AI12-0112-1, but I'd have to know what is needed (and I'm not the 
> right person to figure that out).

I thought I had gone through the containers in this AI. It should be there.
For instance, see

  Modify A.18.2 (15/2-18/2)
and so on....

****************************************************************

From: Randy Brukardt
Sent: Wednesday, October 11, 2017  11:41 PM

...
> I dont know how wide is good. I had my text editor set at a
> 72 character margin by default and tried to stick to that. 
> Would 80 characters be better?

The usual is either 79 or 80 (depends on whether a program or I am doing it).
Some of the old text seemed to be about 100 characters, and the new was 72.
Tough to compare.

****************************************************************
