Version 1.4 of ai12s/ai12-0119-1.txt
!standard 5.5.2 (2/3) 16-10-03 AI12-0119-1/02
!class Amendment 14-06-20
!status work item 14-06-20
!status received 14-06-17
!priority Medium
!difficulty Hard
!subject Parallel operations
!summary
New syntax and semantics to facilitate parallelism via Parallel Loops
and Parallel Blocks
!problem
The increased presence of parallel computing platforms brings concerns to the
general purpose domain that were previously prevalent only in the specific
niche of high-performance computing. As parallel programming technologies
become more prevalent in the form of new emerging programming languages and
extensions of existing languages, safety concerns need to
consider the paradigm shift from sequential to parallel behavior.
Ada needs mechanisms whereby the compiler is given the necessary semantic
information to enable the implicit and explicit parallelization of code.
After all, the current Ada language only allows parallel execution when
it is semantically neutral (see 1.1.4(18) and 11.6(3/3)) or when it is
expressed explicitly in the code as tasks.
!proposal
This proposal depends on the facilities for aspect Global (AI12-0079-1)
and for aspect Potentially_Blocking (AI12-0064-1).
Those proposals allow the compiler to statically determine where
parallelism may be introduced without introducing data races.
This proposal informally introduces the semantic notion of a Parallel
OPportunity (POP) and a tasklet to the language. POPs are places in
program text where work can be spawned to parallel-executing workers
that work in concert to correctly execute the algorithm. Tasklets are
the notional (logical) units within a task that are executed in
parallel with each other.
The goals of this proposal are:
- To permit existing programs to benefit from parallelism through
minimal restructuring of sequential programs into ones that permit
more parallel execution;
- To support the development of complete programs that maximize
parallelism;
- To efficiently guide the compiler in the creation of effective
parallelism (without oversubscription) with minimal input from the
programmer;
- To support the parallel reduction of results calculated by parallel
computation;
- To avoid syntax that would make a POP erroneous or produce noticeably
different results if executed sequentially vs in parallel.
An important part of this model is that if the compiler is not able to
verify that the parallel computations are independent, then a warning
will be issued at compile time (see AI12-0079-1 and AI12-0064-1 for
the ways in which this can be determined).
Note that in this model the compiler will identify any code where a
potential data race occurs (following the rules for concurrent access
to objects given in RM 9.10(23)), and will point out where objects cannot
be guaranteed to be independently addressable. If this is not determinable
at compile time, the compiler may insert run-time checks to detect data
overlap.
Another issue is the underlying runtime. In this model, tasklets are
orthogonal to tasks. Regardless of implementation, tasklets are
considered to execute in the semantic context of the task where they
have been spawned, which means that any operation that identifies a
task, such as those in Task_Identification, will identify the task in
which the tasklet is spawned. On the other hand, calls by different
tasklets of the same task into the same protected object are treated as
different calls resulting in distinct protected actions; therefore
synchronization between tasklets can be performed using non-blocking
protected operations. Note that this is consistent with the current
standard which already supports multiple concurrent calls by a single
task in the presence of the asynchronous transfer of control capability
(RM 9.7.4 (23)).
There are two areas of this proposal; one that introduces capabilities
for parallel blocks and another that introduces capabilities for
parallel loops.
Parallel Blocks
---------------
Parallel blocks may be used to specify that two or more parts of an
algorithm may be executed in parallel with each other.
Semantics: A parallel block statement encloses two or more sequences of
statements (two or more "parallel sequences") separated by the reserved
word "and". Each parallel sequence represents a separate tasklet, but
all within a single Ada task. Task identity remains that of the
enclosing Ada task, and a single set of task attributes is shared
between the tasklets.
With respect to the rules for shared variables (see RM 9.10(13)), two
actions occurring within two different parallel sequences of the same
parallel block are not automatically sequential, so execution can be
erroneous if one such action assigns to an object, and the other reads
or updates the same object or a neighboring object that is not
independently addressable from the first object. Appropriate use of
atomic, protected, or task objects (which as a group we will call
synchronized objects) can avoid erroneous execution.
In addition, the new Global and Potentially_Blocking aspects may be
specified to enable the static detection of such problems at compile
time (see AI12-0079-1 and AI12-0064-1).
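As an informal illustration (not part of the proposed wording), the following
sketch shows how a protected object, one kind of synchronized object, might be
used to combine results safely from the two parallel sequences of a parallel
block; the names Totals, Search_First_Half, and Search_Second_Half are
hypothetical.
   protected Totals is
      procedure Add (Count : in Natural);
      function Total return Natural;
   private
      Sum : Natural := 0;
   end Totals;
   protected body Totals is
      procedure Add (Count : in Natural) is
      begin
         Sum := Sum + Count;
      end Add;
      function Total return Natural is
      begin
         return Sum;
      end Total;
   end Totals;
   ...
   parallel
      -- Each call to Add is a distinct protected action, so the two
      -- updates of Sum cannot conflict.
      Totals.Add (Search_First_Half);
   and
      Totals.Add (Search_Second_Half);
   end parallel;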
Any transfer of control out of one parallel sequence will initiate the
aborting of the other parallel sequences not yet completed. Once all
other parallel sequences complete normally or abort, the transfer of
control takes place. If multiple parallel sequences attempt a transfer
of control before completing, one is chosen arbitrarily and the others
are aborted.
If an exception is raised by any of the parallel sequences, it is
treated similarly to a transfer of control, with the exception being
propagated only after all the other sequences complete normally or due
to abortion. If multiple parallel sequences raise an exception before
completing, one is chosen arbitrarily and the others are aborted. The
parallel block completes when all of the parallel sequences complete,
either normally or by being aborted.
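For example (a sketch only, using the proposed syntax; Found_In_Left and
Found_In_Right are hypothetical functions), a return statement in one parallel
sequence causes the other sequence to be aborted before the transfer of
control takes place:
   function Contains (Key : Integer) return Boolean is
   begin
      parallel
         if Found_In_Left (Key) then
            return True;  -- initiates the aborting of the other sequence
         end if;
      and
         if Found_In_Right (Key) then
            return True;
         end if;
      end parallel;
      return False;
   end Contains;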
Note that aborting a tasklet need not be preemptive, but should prevent
the initiation of further nested parallel blocks or parallel loops.
Parallel Loops
--------------
A common approach to parallelizing loops is to break the loop iterations
into chunks where each chunk consists of a subset of the entire
iteration set, and may be as small as a single iteration, up to as large
as the entire number of iterations, or somewhere in-between. In Ada,
each chunk becomes a separate tasklet.
This proposal gives the programmer some degree of control over the
parallelization of for loops into appropriately sized chunks, but
without requiring that they specify the exact chunk size or the number
of chunks. In addition, to deal with data dependences, this proposal
supports per-thread copies of relevant data, and a mechanism of reducing
these multiple copies down to a final result at the end of the
computation. This mechanism is called reduction.
The multiple copies are combined two at a time. The combination thus involves
either a function that accepts two partial results and returns a combined
result, or a procedure that accepts two partial results, where at least one of
the parameters is an in out parameter that receives a partial result and
contains the combined result upon return. The subprogram that is used to
combine the results is called a reducer. A reducer identifies an associative
operation. A reducer does not need to be commutative, as it is expected that
the implementation will support non-commutative reductions, such as vector
concatenation.
Reduction result variables must be initialized in their declarations.
The multiple copies of the partial results are each initialized to the same
value that was assigned to the result variable in its declaration.
It is important that this initial value does not affect the result of the
computation. This value is called the Identity value, and depends upon the
operation applied by the reducer. For example, if the reduction result type is
Integer, and the reducer is addition, then the Identity value needs to be zero,
since adding zero to any value does not affect the value. Similarly, if the
reducer is multiplication, then the Identity value needs to be one, since
multiplying any Integer value by 1 does not affect the value. If the result
type is a vector and the reducer is concatenation, then the Identity value
needs to be an empty vector, since concatenating an empty vector to another
vector does not affect the value of the other vector.
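As an informal illustration of this model, the following standard (sequential)
Ada sketch shows the effect being described: each partial result starts at the
Identity value for "+", each chunk of the iteration accumulates into its own
copy, and a final step applies the reducer to combine the copies. The array
Arr and the chunk bounds are assumptions made for the illustration.
   declare
      Arr        : constant array (1 .. 1_000) of Integer := (others => 1);
      Num_Chunks : constant := 4;
      Chunk_Size : constant := 250;  -- Arr'Length / Num_Chunks
      Partial    : array (1 .. Num_Chunks) of Integer
                     := (others => 0);  -- the Identity value for "+"
      Sum        : Integer := 0;
   begin
      for C in Partial'Range loop      -- each chunk could run in parallel
         for I in (C - 1) * Chunk_Size + 1 .. C * Chunk_Size loop
            Partial (C) := Partial (C) + Arr (I);
         end loop;
      end loop;
      for C in Partial'Range loop      -- the reducer, "+", combines the
         Sum := Sum + Partial (C);     -- partial results
      end loop;
      -- Sum now equals the sum of all elements of Arr
   end;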
To specify the reducer for a result variable, a new aspect, Reducer, may be
applied to the declaration of the result variable. The Reducer aspect denotes
either a function with the following specification:
function reducer_name (L, R : in Result_Type) return Result_Type;
or a procedure with the following specification:
procedure reducer_name (Target, Source : in out Result_Type);
We could allow the compiler to implicitly determine reducers for simpler
operations such as integer addition, based on the operations applied inside
the loop, but it was thought to be better to require all reducers to be
explicitly specified by the programmer for each reduction result. This makes
it clearer to the reader that parallel reductions are being applied,
and should make it easier for the compiler as well. Only one Reducer aspect
may be applied to a single result declaration.
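For instance (a sketch only; the type Histogram and the procedure Merge are
hypothetical), a procedure reducer might be chosen when combining partial
results in place is cheaper than returning a new value:
   type Histogram is array (Character) of Natural;
   procedure Merge (Target, Source : in out Histogram) is
   begin
      for C in Histogram'Range loop
         Target (C) := Target (C) + Source (C);
      end loop;
   end Merge;
   Counts : Histogram := (others => 0) with Reducer => Merge;
   -- an all-zero histogram is the Identity value for Merge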
To indicate that a loop is a candidate for parallelization, the reserved
word "parallel" may be inserted immediately after the word "for"
in a "for" loop. Such a loop will be broken into chunks, where each chunk is
processed sequentially. If reduction is involved in the loop, then the
names of the reduction result variables are specified in a comma separated
list enclosed by parens immediately following the reserved word "parallel".
The parens are not specified if reduction is not involved in the loop.
The reduction list serves two purposes: it assists the reader in identifying
what reductions, if any, are to occur during loop execution, and it notifies
the compiler that references from within the loop to the listed variables,
which are declared outside the loop, are not expected to be a source of data
races, and that special treatment in the form of reduction management code
needs to be generated. It also indirectly informs the compiler about the data
types that need to be involved in the reduction.
For example, a parallel loop that calculates the minimum, maximum, and sum of
all values in an array:
   declare
      Min_Value : Integer := Integer'Last  with Reducer => Integer'Min;
      Max_Value : Integer := Integer'First with Reducer => Integer'Max;
      Sum       : Integer := 0             with Reducer => "+";
   begin
      for parallel (Min_Value, Max_Value, Sum) I in 1 .. 1_000_000 loop
         Min_Value := Integer'Min(@, Arr(I));
         Max_Value := Integer'Max(@, Arr(I));
         Sum := @ + Arr(I);
      end loop;
   end;
Note that the same rules presented for parallel blocks above apply to
the update of shared variables and the transfer of control to a point
outside of the loop, and for this purpose each iteration (or chunk) is
treated as equivalent to a separate sequence of a parallel block.
This AI could also attempt to define parallel loops for generalized loop
iteration, but that is a big enough topic that it deserves its own AI.
This AI is big enough as it is, so it will only deal with for loops with
loop_parameter_specifications, but a similar approach could probably be
applied to deal with iterator_specifications.
Additional Capabilities for Parallel Loops
------------------------------------------
There are certain problems where additional syntax may be needed to support
the parallelism. In most cases, the choice of parallelism implementation does
not matter. For such cases, typically a dynamic approach that allows for
load balancing of the cores can be applied, where if an executor assigned to
a core completes its work earlier than the others, it can "steal" or obtain
work from the other executors. For certain problems however, the programmer
needs to be able to specify that dynamic load balancing needs to be disabled,
to ensure that work is evenly divided among the available executors, and that
each executor will only process a single chunk of the loop iterations. An
example of such a problem is the prefix scan, otherwise known as a cumulative
sum, where a result array is generated from an input array such that each
element of the result contains the sum of the corresponding element of the
input array and the previous element of the result array.
The most obvious solution is to perform two passes. The first pass
is a parallel loop that breaks the input array into chunks and creates an
intermediate array holding the final cumulative sum of each chunk.
This intermediate array is then processed sequentially, using the cumulative
sum algorithm, so that it holds the cumulative sum of the per-chunk sums
from the first pass.
The second pass is then another parallel loop that adds the cumulative sums
from the intermediate array to each corresponding chunk as determined in
pass 1. It is important that the chunking for both parallel loops is the same,
and that each chunk is associated with a single executor.
To support this, another aspect called Parallelism can be applied to a
type declaration or subtype declaration for a discrete_subtype_definition,
where the discrete_subtype_definition is to be used as part of a
loop_parameter_specification. Such a declaration is called a
parallel_loop_iteration_subtype. The Parallelism aspect can be specified as
either "Dynamic" or "Fixed", where Dynamic indicates that the chunking can occur
dynamically during the execution of the loop, and "Fixed" indicates that the
chunking of the loop is determined prior to loop execution, where each
available executor is given approximately the same number of loop iterations to
process, and each executor is assigned only a single chunk of the iteration.
The default for the Parallelism aspect is implementation defined.
In addition it will be necessary to be able to query the implementation to
determine the number of executors that are associated with a parallel loop.
This can be obtained by querying a new attribute that is associated with a
parallel_loop_iteration_subtype, called Executor_Count. The implementation
can normally be relied upon to choose a reasonable value for this attribute,
but it would also be useful to allow the programmer to specify this value via
the attribute, or via an aspect of the same name.
Finally, it would also be necessary to be able to associate the currently
executing chunk with a numeric executor id value. This would allow the
programmer to specify where intermediate results are to be placed for the
cumulative sum problem for example.
To support this, a new attribute, Executor_Id, may be applied to the
defining_identifier of a loop_parameter_specification. The Executor_Id value
is of type universal_integer and is positive, ranging from 1 to the
value of Executor_Count.
With these additional features, the above algorithm for cumulative sum can be
programmed as:
   declare
      subtype Loop_Iterations is Integer range Arr'First .. Arr'Last
        with Parallelism => Fixed,
             Executor_Count => System.Multiprocessors.Number_Of_CPUs + 1;
        -- (Integer is assumed here as the index type of Arr)
      -- Note: + 1 is used above because in the 2nd parallel loop below, the
      -- first core exits early, leaving an idle core. We would like to keep
      -- all cores busy.
      Intermediate : array (1 .. Loop_Iterations'Executor_Count) of Integer
        := (others => 0);
      Cum : array (Loop_Iterations) of Integer := (others => 0);
        -- the result array of cumulative sums
      Sum : Integer := 0;
   begin
      -- Pass 1: parallel loop computing, for each chunk, the running sum
      -- within the chunk (in Cum) and the total of the chunk (in Intermediate)
      for parallel I in Loop_Iterations'Range loop
         Intermediate(I'Executor_Id) := @ + Arr(I);
         Cum(I) := Intermediate(I'Executor_Id);
      end loop;
      -- Sequential loop converting the per-chunk sums into cumulative sums
      -- of the chunks
      for I of Intermediate loop
         Sum := @ + I;
         I := Sum;
      end loop;
      -- Pass 2: parallel loop adding the cumulative sum of the preceding
      -- chunks to each element; the first chunk needs no adjustment
      -- (this assumes Arr'First = 1)
      for parallel I in Loop_Iterations'Range loop
         if I = 1 then
            exit;
         end if;
         Cum(I) := @ + Intermediate(I'Executor_Id - 1);
      end loop;
   end;
!wording
Append to Introduction (28)
"A parallel block statement requests that two or more sequences of
statements should execute in parallel with each other."
Replace 5.5(3/3) with
iteration_scheme ::= while condition
| for [parallelism_specification] loop_parameter_specification
| for iterator_specification
parallelism_specification ::= parallel [reduction_list]
reduction_list ::= (name {, name})
Add after 5.5(5)
Legality Rules
Each name of the reduction_list shall designate a directly visible
object_declaration of a definite, nonlimited subtype.
(Static Semantics)
A reduction_list specifies that each declaration denoted by each name given
in the list has the Reducer aspect (see 9.12.xx)
Append to 5.5(9/4)
[A loop_statement with a parallelism_specification allows the executions of
the sequence_of_statements for different values of the loop parameter to
proceed in parallel with each other.]
If a parallelism_specification has been specified, then there may be more than
one thread of control, each with its own loop parameter that covers a
non-overlapping subset of the values of the discrete_subtype_definition.
Each thread of control proceeds independently and concurrently between the
points where they interact with other tasks and with each other.
Prior to executing the sequence_of_statements, each thread of control declares
and elaborates variables local to the thread of control corresponding to each
name in the reduction_list of the parallelism_specification. The type of each
thread-local variable is the same as that of the variable designated by the
name in the reduction_list. The initial value of each of these thread-local
variables is the same as was assigned, in its declaration, to the variable
designated by the name in the reduction_list. As each thread of control
completes execution of its subset of the values of the
discrete_subtype_definition, the thread-local variables are reduced into the
variables designated by the corresponding names in the reduction_list, by
applying the reducer operation identified by the Reducer aspect of the
designated variable.
AARM - An implementation should statically treat the
sequence_of_statements as being executed by separate threads of control,
but whether they actually execute in parallel or sequentially should be a
determination that is made dynamically at run time, dependent on factors such
as the available computing resources.
Examples after 5.5(20)
Example of a parallel loop without reduction
   for parallel I in Buffer'Range loop
      Buffer(I) := Arr1(I) + Arr2(I);
   end loop;
Example of a parallel loop with reduction_list
   Alphabet : Character_Vectors.Vector := Character_Vectors.Empty_Vector
     with Reducer => Character_Vectors."&";
   for parallel (Alphabet) Letter in 'A' .. 'Z' loop
      Alphabet.Append(Letter);
   end loop;
"5.6.1 Parallel Block Statements
[A parallel_block_statement encloses two or more sequence_of_statements
where all the sequence_of_statements can execute in parallel with each
other.]
Syntax
parallel_block_statement ::=
   parallel
      sequence_of_statements
   and
      sequence_of_statements
  {and
      sequence_of_statements}
   end parallel;
Static Semantics
Each sequence_of_statements represents a separate thread of control that
proceeds independently and concurrently between the points where they
interact with other tasks and with each other.
AARM - An implementation should statically treat each
sequence_of_statements as a separate thread of control, but whether they
actually execute in parallel or sequentially should be a determination
that is made dynamically at run time, dependent on factors such as the
available computing resources.
Examples
Example of a parallel block statement:
   parallel
      Foo(Z);
   and
      Bar(X, Y);
   and
      Put_Line ("Executing Foo and Bar in parallel with Other_Work");
      Other_Work;
   end parallel;
[Brad's comment: I'm not sure we want the following paragraph?]
Bounded (Run-Time) Errors
It is a bounded error to invoke an operation that is potentially
blocking within a sequence_of_statements of a parallel block or loop
statement with a parallelism_specification."
Add C.7.1 (5.1)
The Task_Id value associated with each sequence_of_statements of a
parallel_block_statement or of a loop statement with a parallelism_specification
is the same as that of the enclosing statement.
AARM - Each sequence_of_statements of a parallel block or parallel loop is
treated as though it is executed by the task that encountered the
parallel block or parallel loop statement.
Change 9.10 (13)
"Both actions occur as part of the execution of the same task {unless
they are each part of a different sequence_of_statements of a parallel
block statement or loop statement with a parallelism_specification.}"
New section 9.12 Executors and Tasklets
A task may distribute execution across different physical processors in
parallel, where each execution is a separate thread of control that proceeds
independently and concurrently between the points where they
interact with other tasks and with each other. Each separate thread of control
of a task is an Executor, and the execution that each executor performs between
synchronization points is a tasklet. When a task distributes its execution to a set
of executors, it cannot proceed with its own execution until all the executors
have completed their respective executions.
A parallel block statement or loop statement with a parallelism_specification
may assign a set of executors to execute the construct, if extra computing
resources are available.
Each executor is associated with a value that identifies the executor.
This value is the executor id, and it is of the type universal_integer, in the
range 1 to the number of executors of the task assigned to the parallel
block statement or loop statement with a parallelism_specification.
For a prefix S that denotes a type or subtype that is used as the
discrete_subtype_definition of a loop_parameter_specification, the following
attribute is defined:
S'Executor_Count Yields the number of executors assigned to execute the
loop. If no executors are assigned to the loop, 1 is returned. The value of
this attribute is of the type universal_integer.
Executor_Count may be specified for such a type declaration or subtype
declaration via an attribute_definition_clause.
For a type declaration or subtype declaration of a discrete_subtype_definition,
the following language-defined representation aspects may be specified:
Parallelism The aspect Parallelism specifies the underlying parallelism
semantics of a loop statement with a loop_parameter_specification that
references the discrete_subtype_definition of the type declaration or subtype
declaration. A Parallelism aspect specified as Fixed indicates that each
executor will execute only a single subset of the discrete_subtype_definition
for the loop_parameter_specification of the loop, and that this subset
assignment occurs prior to loop execution.
A Parallelism aspect specified as Dynamic indicates that each executor can
execute multiple subsets of the discrete_subtype_definition for the
loop_parameter_specification of the loop, and that the subset assignment may
occur during the execution of the loop.
Reducer The aspect Reducer either denotes a function with the following
specification:
function reducer(L, R : in result_type) return result_type
or denotes a procedure with the following specification:
procedure reducer (Target, Source : in out result_type)
where result_type statically matches the subtype of the type declaration or
subtype declaration.
Only one Reducer aspect may be applied to a single declaration.
For a prefix X that denotes the defining_identifier of a
loop_parameter_specification, the following attribute is defined:
X'Executor_Id Yields the id of the executor associated with the
attribute reference. The value of this attribute is of type universal_integer,
and positive and not greater than the number of executors that are assigned by
the implementation to execute the parallel loop. If no executors are assigned
to the loop, then 1 is returned.
Add bullet after 13.1 (8.x)
Executor_Count clause
[Brad's comment: I'm not sure we want this paragraph change]
Modify H.5 (1/2)
"The following pragma forces an implementation to detect potentially
blocking operations within a protected operation {or within a
sequence_of_statements of a parallel_block_statement or loop statement with
a parallelism_specification}"
[Brad's comment: I'm not sure we want this paragraph change]
H.5 (5/2)
"An implementation is required to detect a potentially blocking
operation within a protected operation {or within a
sequence_of_statements of a parallel_block_statement or within a
sequence_of_statements of a loop_statement with a parallelism_specification},
and to raise Program_Error (see 9.5.1). "
[Brad's comment: I'm not sure we want this paragraph change]
H.5 (6/2)
"An implementation is allowed to reject a compilation_unit if a
potentially blocking operation is present directly within an entry_body
or the body of a protected subprogram{, a sequence_of_statements of a
parallel_block_statement, or a sequence_of_statements of a loop_statement
with a parallelism_specification}. "
[Editor's note: The following are automatically generated but they're
left here to provide the summary AARM note for that generator. The K.2
items are completely automatically generated; the normative wording and
the summary wording must be identical (and the normative wording needs
to work in the summary). They cannot be different!]
After K.1 (22.1/4)
Executor_Count Number of executors to be associated with a parallel loop
that references a discrete_subtype_definition with this aspect specified in
its loop_parameter_specification. See 9.12.xx
After K.1 (40/3)
Parallelism Indicates the parallelism model applied to a parallel loop.
See 9.12.xx.
After K.1 (49/3)
Reducer Subprogram to combine two partial results from a parallel loop
computation. See 9.12.xx
!discussion
There is a continuing trend of exponential growth of computational
elements embedded on a single chip. This has led to significant
challenges for software designers and implementers. Prior to 2005, the
increase in transistor count corresponded to a similar increase in CPU
clock speed, which boosted performance of sequential algorithms. Since
then, CPU clock speeds have leveled off for reasons of practicality, and
chip manufacturers have instead moved towards multicore technologies as
a means of achieving performance increases.
As parallel technologies become more prevalent in the form of new
emerging hardware platforms, safety concerns need to consider the
paradigm shift from sequential to parallel behavior. This paradigm
was long confined to the specialized domain of high-performance computing,
but it is now required in all domains of computing systems.
It is important for the programmer to indicate places in the code
where parallelism is desired, which allows the reader to better
understand the algorithm, and allows the programmer to receive
confirmation from the compiler whether the desired parallelism is safe,
or even worthwhile.
This proposed model does not define syntax for the explicit
parallelization of individual subprogram calls, since such
parallelization can be performed implicitly by the compiler, when it
knows that the calls are free of side-effects. This is facilitated by
annotations identifying global variable usage on subprogram
specifications (see AI12-0079-1).
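For instance (a sketch only, using the Global aspect notation proposed in
AI12-0079-1), a function whose specification promises that it neither reads
nor writes any global variables is a candidate for such implicit
parallelization of its calls; Scale is a hypothetical function:
   function Scale (X : Integer) return Integer
     with Global => null;
   -- calls such as Scale (A) + Scale (B) could then be evaluated in parallel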
Parallel Blocks
---------------
Example:
   declare
      X, Y : Integer;
      Z : Float;
   begin
      parallel
         X := Foo(100);
      and
         Z := Sqrt(3.14) / 2.0;
         Y := Bar(Z);
      end parallel;
      Put_Line("X + Y=" & Integer'Image(X + Y));
   end;
In this example, the calculation of Z and Y occur sequentially with
respect to each other, but in parallel with the calculation of X. Note
that the compiler, using the rules specified in AI12-0079-1, may
complain if the parallel sequences might have conflicting global
side-effects.
The parallel block construct is flexible enough to support recursive
usage as well, such as:
   function Fibonacci (N : Natural) return Natural is
      X, Y : Natural;
   begin
      if N < 2 then
         return N;
      end if;
      parallel
         X := Fibonacci (N - 2);
      and
         Y := Fibonacci (N - 1);
      end parallel;
      return X + Y;
   exception
      when others =>
         Log ("Unexpected Error");
         raise;
   end Fibonacci;
We considered allowing the parallel block to be preceded with an
optional declare part, and followed with optional exception handlers,
but it was observed that it was more likely to be useful to have objects
that are shared across multiple parallel sequences to outlive the
parallel block, and that having exception handlers after the last
parallel sequence could easily be misconstrued as applying only to the
last sequence. Therefore we reverted to the simpler syntax proposed
above. This simpler syntax is also more congruous with the syntax for
select statements. Because there are no local declarations, there was
also no point in having a statement_identifier (block label) for a
parallel block. This is actually not entirely true. Allowing an exit
statement to replace a goto that targets the location following the
end of the parallel block seems useful. To allow such a statement,
a block label might be needed to identify the parallel block that is
being exited. It was felt that the need for allowing an exit statement
in a parallel block could be the subject of a separate AI.
Parallel Loops
--------------
In most compute-intensive applications, a significant proportion of the
computation time is spent in loops, either iterating over
arrays/container data structures, or systematically searching a large
solution space. To benefit from parallel hardware, the computation
associated with a loop should be spread across the available processors.
One approach, presuming the iterations of the loop have no data
dependences between them, is to treat each iteration of the loop as a
separate tasklet, and then have the processors work away on the set of
tasklets in parallel. However, this introduces overhead from the queuing
and de-queuing of work items, and the communication of results from each
work item. Furthermore, there often are data dependences between
iterations, and creating a separate work item for each iteration can
introduce excessive synchronization overhead to deal safely with these
interdependences. Therefore, it is common to break large arrays, and/or
the loops that iterate over them, into chunks (or slices or tiles),
where each chunk is processed sequentially, but multiple chunks can be
processed in parallel with one another.
Note: Allowing the parallel keyword on while loops was considered, but
was discarded since while loops cannot be easily parallelized, because
the control variables are inevitably global to the loop.
For example, here is a simple use of a parallelized loop, with an
array of partial sums (with one element per chunk), which are
then summed together (sequentially) to compute an overall sum for the
array:
   declare
      subtype Loop_Iterations is Integer range Arr'Range
        with Parallelism => Fixed;
        -- (Integer is assumed here as the index type of Arr)
      Partial_Sum : array (1 .. Loop_Iterations'Executor_Count) of Float
        := (others => 0.0);
      Sum : Float := 0.0;
   begin
      for parallel I in Loop_Iterations'Range loop
         Partial_Sum(I'Executor_Id) := @ + Arr(I);
      end loop;
      for J in Partial_Sum'Range loop
         Sum := Sum + Partial_Sum(J);
      end loop;
      Put_Line ("Sum over Arr = " & Float'Image (Sum));
   end;
In this example, the programmer has declared the Partial_Sum array with
one element per executor (each element initialized to 0.0), using the
Executor_Count attribute to determine its bounds. The implementation
chooses an appropriate number of executors (and hence chunks) for the
parallelized loop in which the array is used.
In the above case, we see I'Executor_Id indicating we are
accumulating the sum into a different element of the Partial_Sum in each
distinct chunk of the loop.
To illustrate the use of common chunking between consecutive loops,
here is a pair of parallelized loops that produce a new array that is
the cumulative sum of the elements of an initial array. The
arrays Partial_Sum and Adjust are used to carry data from the first
parallelized loop to the second parallelized loop:
   declare
      subtype Iterations is Integer range Arr'Range
        with Parallelism => Fixed;
        -- (Integer is assumed here as the index type of Arr)
      Partial_Sum : array (1 .. Iterations'Executor_Count) of Float
        := (others => 0.0);
      Adjust : array (Partial_Sum'Range) of Float
        := (others => 0.0);
      Cumulative_Sum : array (Arr'Range) of Float := (others => 0.0);
   begin
      -- First pass: compute, in parallel, the cumulative sum within each
      -- chunk, and the total of each chunk in Partial_Sum
      for parallel I in Iterations'Range loop
         Partial_Sum(I'Executor_Id) := @ + Arr(I);
         Cumulative_Sum(I) := Partial_Sum(I'Executor_Id);
      end loop;
      -- Sequential pass: compute the adjustment for each chunk, which is
      -- the sum of the totals of all preceding chunks
      for J in Partial_Sum'First .. Partial_Sum'Last - 1 loop
         Adjust(J + 1) := Adjust(J) + Partial_Sum(J);
      end loop;
      -- Second pass: apply, in parallel, each chunk's adjustment
      for parallel I in Iterations'Range loop
         Cumulative_Sum(I) := @ + Adjust(I'Executor_Id);
      end loop;
      -- Print the results
      Put_Line("Arr, Cumulative_Sum");
      for I in Cumulative_Sum'Range loop
         Put_Line(Float'Image(Arr(I)) & ", " &
                  Float'Image(Cumulative_Sum(I)));
      end loop;
   end;
Note that this feature eliminated the need to reference two different
elements of the same array (element I and element I - 1) within any of
the parallel loop bodies. This reduces expression complexity and
eliminates data race issues at chunk boundaries, where element I - 1
could refer to an element of another chunk.
Note also that chunking is not explicit in parallelized loops, and in
the above example, the compiler is free to use as few or as many chunks
as it decides is best, though it must use the same number of chunks in
the two consecutive parallelized loops because both
loop_parameter_specifications share the same discrete_subtype_definition.
The programmer could exercise more control
over the chunking by explicitly specifying Executor_Count
for the Iterations declaration, rather than allowing it to default.
For example, if the programmer wanted these parallelized loops to be broken
into "N" chunks, then the declaration could have been:
   declare
      subtype Iterations is Integer range Arr'Range
        with Parallelism => Fixed,
             Executor_Count => N;
   ...
Automatic Reduction
-------------------
As illustrated above by the first example, it will be common for the
partial results of a parallel computation to be combined during processing,
using an appropriate reduction operation. In that example, the Partial_Sum
array is reduced by "+" into the single Sum value.
The use of the Reducer aspect can eliminate the need to write
the final reduction loop in the first example, and instead we could have
written simply:
   declare
      Sum : Float := 0.0 with Reducer => "+";
   begin
      for parallel (Sum) I in Loop_Iterations'Range loop
         Sum := @ + Arr(I);
      end loop;
      Put_Line ("Sum over Arr = " & Float'Image (Sum));
   end;
The reducer operation identified by the Reducer aspect is applied automatically
as each chunk completes execution.
!ASIS
** TBD.
!ACATS test
ACATS B-Test and C-Test are needed for these features.
!appendix
From: Tucker Taft
Sent: Tuesday, June 17, 2014 10:18 AM
Here is the paper the "gang of 4" is submitting to HILT 2014 with a relatively
detailed proposal for parallel programming in Ada 202x:
http://bit.ly/parallada-gangof4
If there is time at the upcoming ARG meeting, some discussion and feedback from
ARG members would be appreciated.
****************************************************************
From: Randy Brukardt
Sent: Tuesday, June 17, 2014 12:52 PM
Are we going to get formal proposals for some/all of these features? For
instance, we have an open AI on global in/out (AI12-0079-1) which is certainly
more broadly applicable than just parallelism. There is also an AI for
Potentially_Blocking, AI12-0026-1, misnamed "Task_Safe", which is useful for any
program including tasking. In both of these cases, an important part of the
proposals will be how they interact with predefined and language-defined
subprograms. (If they're not specified for predefined "+", nothing useful could
be done, for example.)
As with all of these sorts of things, the devil is in the details. The basic
ideas look fine, but we'll never really know without detailed proposals that can
be tested against.
****************************************************************
From: Tucker Taft
Sent: Tuesday, June 17, 2014 1:54 PM
True, but it is not worth making a more detailed proposal if the overall
reaction is "yuck"! ;-)
****************************************************************
From: Randy Brukardt
Sent: Tuesday, June 17, 2014 3:10 PM
Well, it looks great to me, but that's mainly because all of the details are
missing. A lot of stuff looks great without the details (it's how modeling
works, after all). The "yuck" only comes when the details are fleshed out...
****************************************************************
From: Bob Duff
Sent: Tuesday, June 17, 2014 5:52 PM
Sounds like there are some good ideas in the paper, but I am very strongly
opposed to ARG getting "detailed proposals" about how to do parallelism, or even
wasting time discussing the matter. ARG should be moving more to the
"standardize existing practise" way of doing. And that goes double for features
whose ONLY purpose is efficiency -- such as parallelism.
We (ARG) don't need detailed proposals in that area; we need an experimental
implementation, and MEASUREMENTS of the efficiency of realistic programs. (And
don't try to show me speedups of the recursive Fibonacci -- that's downright
dishonest!) Until then, ARG should avoid wasting any time on the matter. Once
we have proven practise, then ARG should take charge of standardizing it.
My comments above don't apply so much to things like Globals, which are more
about correctness/safety than efficiency. ARG could reasonably discuss Globals,
although even there, an experimental implementation would be helpful.
Regarding those: You can put Globals and Potentially_Blocking on a package.
That's good. I think it should apply to child and nested packages of that
package. I think you also want a way to give a default for Overlaps_Storage: "no
overlaps unless specified, in this package and children".
****************************************************************
From: Robert Dewar
Sent: Wednesday, June 18, 2014 2:52 AM
> Sounds like there are some good ideas in the paper, but I am very
> strongly opposed to ARG getting "detailed proposals" about how to do
> parallelism, or even wasting time discussing the matter. ARG should
> be moving more to the "standardize existing practise" way of doing.
> And that goes double for features whose ONLY purpose is efficiency --
> such as parallelism.
I am in full agreement with this.
At the last RTW, which I attended, I was struck by the fact that the RTW was
debating removing some features in Annex D which no one has ever implemented,
yet alone found useful in actual practice.
The RM is not a research proposal :-)
****************************************************************
From: Alan Burns
Sent: Wednesday, June 18, 2014 5:59 AM
The policy currently in Ada has been implemented by the Ada run-time from
Santandar, and is supported in EDF operating systems (not Ada), though I agree
these are mainly research and experimental platforms, not mainstream industrial
platforms.
****************************************************************
From: Robert Dewar
Sent: Wednesday, June 18, 2014 10:39 AM
But I am talking about the specific Ada features, which for sure have not been
used by anyone in an Ada program :-)
****************************************************************
From: Tucker Taft
Sent: Wednesday, June 18, 2014 8:38 AM
> The RM is not a research proposal :-)
I agree that many language standards come after the fact, and are used to try to
tame a wildly varying set of ad hoc extensions. That has never been the story
with Ada. There are relatively few extensions (other than as packages, pragmas,
or attributes) that Ada vendors or users try in any serious way, before the Ada
Rapporteur Group begins to talk seriously about them. The ideal is that the ARG
gets the ball rolling, indicates a direction we are serious about, and then
vendors try prototyping the ideas, and then we come around to picking among the
"serious" ideas based on prototyping results. I really don't see anyone beside
the ARG being taken seriously at this point as a fountain for "ideas worth
prototyping" (apologies to TED ;-).
****************************************************************
From: Stephen Michell
Sent: Sunday, June 22, 2014 10:24 PM
Attached is a report that ARG requested from Brad, Tucker and myself in
investigating how OpenMP could be used to implement Ada parallelism. For
discussion at the ARG meeting this week.
-----
Report to WG 9/ARG
on the use of OpenMP for Parallelism in Ada
Stephen Michell
with contributions from
Luis Miguel Pinho
Brad Moore
Tucker Taft
Foreword
At the ARG meeting November 2014, ideas were presented about ways to
implement parallelism in Ada, and ways to interoperate with other
language systems while using the multicore capabilities of modern
computers. We were requested to investigate OpenMP from a technical
perspective as well as a business perspective as a potential solution
to implementing multicore parallelism in Ada.
OpenMP Characteristics
OpenMP is a set of libraries and language extensions that permit a
user to access the capabilities of multicore systems from within
certain languages - namely Fortran, C and C++.
OpenMP's parallelism is "explicit parallelism" - i.e. all aspects of
the parallelism are specified by compiler directives in the code that
invoke library routines at the place of the directive through OpenMP API's.
OpenMP lets the user specify code segments (usually "for loops",
blocks and functions) that can be executed by threads (possibly
operating on multiple cores) that are co-ordinated to implement an
algorithm. The language extensions are converted into library calls to
interact with a runtime to implement the algorithm.
In trying to implement OpenMP on an Ada system, one would note that
the concurrency model is quite different than Ada's. The
"threads" that are created interact with other threads that use OpenMP
primitives. This means that they will not interact with Ada tasks
using Ada tasking primitives.
For the interoperability with Fortran, C and C++ systems that are
using OpenMP, an Ada OpenMP capability would be useful, whether or not
it interacted with Ada tasks. This brings us to the business model.
Business Model
To implement An OpenMP for Ada, one must participate in the OpenMP
community. One could develop compiler directives, either as aspects or
pragmas and convert calls to meet the API, but to have it
"standardized", one must be a full participant in the OpenMP group.
One can participate in OpenMP as a university, as an individual or as
a corporate participant. AdaCore was approached to see if they were
interested in joining OpenMP (and paying fees), but declined. There is
no point joining as individuals, as individuals cannot propose new
language specifications for standardization.
This leaves participation through a university. Barcelona
Supercomputing Group is a participant, and Miguel Pinho has approached them about
participating through Barcelona. Barcelona is receptive, but
participation requires that Miguel find a PhD student willing to
participate. To date there has been no final resolution.
There is a GNU OpenMP binding (GOMP) that supports the GNU Fortran, C
and C++ GNU compilers. Ada could define a thin binding to that
interface. This would still leave the challenge of defining Ada syntax
for the Open MP compiler directives and determining how such a library
would interact with the Ada runtime (Tasking is not the only issue -
elaboration issues are also a potential issue).
Conclusion
There are business reasons and technical reasons for creating an
OpenMP Ada implementation. If an interface with OpenMP through
Barcelona Supercomputing Group becomes possible, further exploration should
happen. Otherwise, no further action in this area is anticipated.
****************************************************************
From: Brad Moore
Sent: Monday, October 3, 2016 10:09 AM
[This is version /02 of the AI - Editor.]
Here is my attempt to refine ideas from the gang of 4 working on parallelism,
so we might have something more to talk about on this topic at the upcoming
ARG meeting. There are many ideas to choose from, so I chose what I thought
were the most promising, I'm not sure if all 4 of us would agree, though it is
possible that we might be in agreement, as we haven't yet gotten around to
deciding on the best way forward. Most of the writing related to parallel
blocks is unchanged from the previous version of this AI, and was mostly
written by Tucker. The updates I have are mostly to do with parallel loops
and reduction.
I have attempted to put in wording for these changes, but the wording is not
polished. The wording hopefully does capture and describe the intent at least.
The main hope is to gain some feedback, to see if this direction makes sense,
or if other directions should be pursued.
It would probably also be good to discuss whether this parallel loop proposal
is an improvement over what was written in the previous version of the AI,
which discussed the use of parallel arrays.
****************************************************************
Questions? Ask the ACAA Technical Agent