Version 1.3 of ai12s/ai12-0346-1.txt
!standard 5.5(2/3) 20-01-12 AI12-0346-1/01
!standard 5.5.2(5/4)
!standard 5.5.2(7/3)
!class Amendment 19-10-11
!status work item 19-10-11
!status received 19-10-07
!priority Medium
!difficulty Medium
!subject Ada and OpenMP
!summary
** TBD.
!problem
Systems like OpenMP are commonly used to provide the framework for
lightweight threading, GPU usage, and similar capabilities. (OpenMP directly
supports C and Fortran.) Since these frameworks are already in industrial
usage, it would be useful if Ada (particularly the new parallelism features)
could use them.
!proposal
The following proposal is just one possible way to map Ada 202X to
OpenMP. This is the one proposed for GNAT, though it has not
been approved. The author believes it is a good approach, and
matches what the IRTAW and other OpenMP experts such as Luis Miguel Pinho
recommend. Be that as it may, it is not appropriate to mandate
it as part of the Ada standard, but it is hopefully useful
as an example that has been prototyped extensively and has been shown
to provide useful speedups. It is not clear what, if anything, we
should put into the Ada RM. A separate technical report might
be appropriate, perhaps augmented with some implementation advice
somewhere in the RM. In any case, here goes...
For the mapping to OpenMP, we are recommending a layered mapping
to OpenMP (or other light-weight-threading (LWT) scheduler), where upon
seeing the syntax for a parallel construct, the compiler generates calls
on a top layer (dubbed “System.Parallelism” for now). This layer is
independent of the particular LWT scheduler that will be controlling the
light-weight threads that are spawned as a result of the parallel
constructs. Below System.Parallelism will be a package dubbed
“System.LWT”, which provides the LWT-scheduler-independent API, and
implements it using a “plug-in” architecture. Specific LWT schedulers
would be children of this package, for example “System.LWT.OpenMP”, and
one of them would be “plugged in” to System.LWT and handle the various
calls through the API.
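
As a very rough sketch of this plug-in architecture (all names and
profiles here are hypothetical simplifications, not the prototype's
actual interface), System.LWT might look something like:

   package System.LWT is

      type Root_Data is abstract tagged limited null record;
      procedure LWT_Body (Data : in out Root_Data) is abstract;
      --  Each light-weight thread is represented by a wrapper function
      --  plus a data object; LWT_Body is the dispatching "code" of the
      --  thread (see the section on exceptions below).

      type Plugin is abstract tagged limited null record;
      procedure Spawn_LWT
        (Sched : in out Plugin;
         Data  : access Root_Data'Class) is abstract;
      --  A specific LWT scheduler (e.g. System.LWT.OpenMP) extends
      --  Plugin and overrides primitives such as Spawn_LWT (the real
      --  profiles would carry more context than shown here).

      procedure Install (Sched : access Plugin'Class);
      --  Called at elaboration by the chosen scheduler to "plug itself
      --  in"; if Install is never called, System.LWT falls back to a
      --  purely sequential implementation.

   end System.LWT;
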
The user will determine which particular LWT scheduler gets linked into
the program by “with”ing a package (e.g. Interfaces.OpenMP) and
declaring a “control” object of a type declared in that package (e.g.
Interfaces.OpenMP.OMP_Parallel), in the task body for the Ada tasks (or
the main subprogram for the environment task) where it is desired to
have multiple threads of control. Discriminants of the control object
can be used to control the level of parallelism desired (e.g. “Control :
OMP_Parallel (Num_Threads => 5);”), as well as potentially other options
that should apply by default across all parallel constructs in the given
Ada task. This approach is modeled on the “#pragma omp parallel” of
OpenMP which creates a “parallel region” in which work-sharing or other
forms of parallelism can be used. In the absence of any LWT scheduler
linked into the program, the System.LWT package would fall back to a
purely sequential implementation. The Interfaces.OpenMP package might
have other subprograms intended to be called directly by the user, in
particular those that are part of the “official” OpenMP API, such as
“omp_get_thread_num” or “omp_get_team_size” (for the full API, see the
Runtime Library Routines section in
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf).
We also have proposed that record extensions of “Ada.Aspects”
(suggested name) could be used to pass in options to parallel
constructs, in a kind of generalized “aspect specification.”
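
For illustration, the control object described above might be used like
this in a main subprogram (a sketch; the exact declarations in
Interfaces.OpenMP may differ):

   with Interfaces.OpenMP;
   procedure Main is
      Control : Interfaces.OpenMP.OMP_Parallel (Num_Threads => 5);
      --  Establishes an OpenMP "parallel region" for the environment
      --  task; its finalization at the end of Main ends the region.
   begin
      null;  --  parallel constructs here may use up to 5 server threads
   end Main;
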
A prototype implementation of this structure is available, and
can be provided as a zip archive.
Interaction with Ada tasks
We propose that each Ada task defines its own parallelism region, if
any, recognizing that many Ada tasks will need no parallelism, and might
continue to serve special roles in a real-time environment, such as
managing particular hardware devices, or specific jobs of the real-time
system, at a particular real-time priority. For each Ada task that does
want internal parallelism, the expectation is that the underlying LWT
scheduler (e.g. OpenMP) will start up (or re-use) additional
(heavy-weight) “kernel threads” to act as servers for the LWTs that will
be spawned somewhere within the Ada task. Each of these servers will
run at the same priority as the enclosing Ada task, and will share the
Ada “identity” and “attributes” of that Ada task. The LWTs served by
these server threads will in turn get their Ada task “identity” and
“attributes” from these servers.
Each light-weight thread is run in the context of an “LWT group,” which
maps quite directly to an OpenMP “taskgroup.” All of the LWTs spawned
during the scope of an LWT group are automatically awaited at the end of
the LWT group. This ensures that the scope where an LWT is defined
isn’t exited while such LWTs are still running and potentially making
up-level references to objects from the enclosing scope.
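
Schematically (reusing the hypothetical System.LWT names from the sketch
above, plus an invented LWT_Group type and Spawn operation), the rule
amounts to:

   declare
      Group : System.LWT.LWT_Group;      --  cf. an OpenMP "taskgroup"
      Left, Right : aliased Walk_Data;   --  some extension of Root_Data
   begin
      System.LWT.Spawn (Group, Left'Access);   --  each LWT runs
      System.LWT.Spawn (Group, Right'Access);  --  LWT_Body on its data
      --  <rest of the enclosing construct>
   end;  --  implicit wait: both LWTs complete before Left and Right
         --  (and anything else in this scope) cease to exist
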
In past discussions in the ARG and elsewhere we had presumed that
vendor-specific pragmas or aspects could be used to communicate
additional options to the underlying LWT scheduler. However, more
recently we have discussed including a kind of “generalized aspect”
syntax as part of the syntax for parallel constructs. So far we have
discussed this in detail only on Ada-Comment, but at the upcoming ARG
meeting we will probably discuss it further. The general idea is to
allow for a variant of the aspect-specification syntax after the
“parallel” reserved word, and allow it to be based on any record
extension of a special Ada.Aspects.Root_Aspect tagged type. A parameter
of type “access Root_Aspect’Class” could then be included in the various
System.LWT and System.Parallelism APIs, allowing the underlying LWT
scheduler to receive the options and use them as it sees fit (see below
for an example of how such a parameter might appear). The advantage of
this approach is that a user-provided LWT scheduler might be substituted
for one provided by the compiler vendor, and it could also take
advantage of the generalized aspect syntax without any need to add
special handling into the compiler. This also allows us to reach
additional options for some future OpenMP standard without further
additions to the compiler.
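
For instance (a sketch, with Ada.Aspects as the suggested name from
above; the child package Interfaces.OpenMP.Options and its component
names are purely invented for illustration):

   package Ada.Aspects is
      type Root_Aspect is abstract tagged null record;
   end Ada.Aspects;

   with Ada.Aspects;
   package Interfaces.OpenMP.Options is
      type Schedule_Kind is (Static, Dynamic, Guided);
      type Loop_Options is new Ada.Aspects.Root_Aspect with record
         Schedule   : Schedule_Kind := Static;
         Chunk_Size : Natural := 0;   --  0 means "let OpenMP choose"
      end record;
   end Interfaces.OpenMP.Options;

A parallel loop could then pass an aggregate of Loop_Options via the
generalized aspect syntax, and System.LWT.OpenMP would receive it
through the "access Root_Aspect'Class" parameter and map it onto the
corresponding OpenMP scheduling controls.
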
This layered approach, along with the parallelism control object and the
generalized aspects defined by types from the Interfaces.OpenMP package,
should allow users to reach the OpenMP features of interest, to
accommodate evolution of the OpenMP standard, and to use other LWT
schedulers, which might come from, say, an RTOS vendor. If the user
chooses to use no LWT scheduler, the sequential fallback that is part of
System.LWT takes over whenever there is no LWT scheduler “plugged in.”
Handling secondary stack, exceptions, and transfers of control out of
the code for a light-weight thread
Independent of which particular LWT scheduler is present (if any), the
code for a particular light-weight thread is defined by a function
pointer and a data object. For Ada, the data object will typically be a
tagged data object, and the function will be a simple wrapper that
merely invokes a special “LWT_Body” dispatching operation on the object,
and handles all exceptions propagated by the body (similar to the way a
wrapper around an Ada task body handles all exceptions). Normal
secondary-stack usage and exception raising and handling can be performed
inside the LWT_Body, because light-weight threads run on a given server
until completion or cancelation. They aren’t swapped back and forth, so
there is no added complexity in stack or exception management.
Effectively, exceptions are being raised and handled on the stack of the
server, in the usual way.
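
A minimal sketch of such a wrapper, reusing the hypothetical Root_Data
and LWT_Body names from the System.LWT sketch above (Report_Failure is a
stand-in for whatever mechanism records the exception and initiates
cancelation of the group):

   procedure LWT_Wrapper (Data : access Root_Data'Class) is
   begin
      LWT_Body (Data.all);   --  dispatching call to this LWT's code
   exception
      when E : others =>
         --  Attempt to cancel the enclosing LWT group; if this LWT is
         --  the first to do so, record E for re-raise at group end.
         Report_Failure (Data, E);
   end LWT_Wrapper;
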
When an exception is propagated out of an LWT_Body, or if the code for
an LWT_Body has an explicit transfer of control out of the code for the
light-weight thread, an attempt is made to cancel the other threads in
the LWT group. The first LWT to attempt cancelation receives an
indication of “success” from this attempt. Later LWTs of the same group
making such an attempt will not receive the “success” indicator.
Both OpenMP and Ada 202X allow cancelation to be implemented using a
polling approach, with various well-defined “cancelation points.” When
code within an LWT_Body detects that the enclosing LWT group has been
canceled, it generally just returns from the LWT_Body. The LWT that
successfully initiated the cancelation records, in a variable visible at
the point where the LWT group ends, what action should be taken. The
code at the end of the LWT
group propagates the exception, or continues any other transfer of
control, after waiting for all of the LWTs within the group to complete
their execution.
Expansions for proposed Ada 202X features
Expansion for Ada 202X parallel block:
The OpenMP-recommended approach to supporting a sequence of blocks to be
(potentially) run in parallel is to create a loop around a switch/case
statement. The “GOMP” implementation indicates the same approach in:
https://gcc.gnu.org/onlinedocs/libgomp/Implementing-SECTIONS-construct.html
We suggest the same approach for Ada. Here is an example expansion:
   procedure Walk (Tree : Tree_Ptr) is
      --
   begin
      if Tree = null then
         return;
      elsif Tree.Kind = Leaf then
         Process (Tree);
      else
         parallel
         do
            Walk (Tree.Left);
         and
            Walk (Tree.Right);
         end do;
      end if;
   end Walk;
expands into:
   procedure Walk (Tree : Tree_Ptr) is
      --
   begin
      if Tree = null then
         return;
      elsif Tree.Kind = Leaf then
         Process (Tree);
      else
         parallel
         for _I in 1 .. 2 loop
            case _I is
               when 1 =>
                  Walk (Tree.Left);
               when 2 =>
                  Walk (Tree.Right);
            end case;
         end loop;
      end if;
   end Walk;
which then expands further as a parallel loop, as described below. This
approach makes it easy to turn parallelism on and off, and requires the
creation of only one out-of-line procedure independent of the number of
arms in the parallel block statement.
Expansions for Ada 202X parallel loop:
In this description, we show an optional intermediate step where GNAT
might use a pragma Par_Loop so that parallel loops could be specified
while remaining compilable by older Ada compilers, analogous to the way
the "Pre" aspect expands into pragma Precondition in GNAT.
Ada 202X defines the notion of a “chunk specification” which can give a
user-specified name to the index used to identify a chunk. When using a
pragma instead of syntax, there would be no way to specify the
chunk-index name, so the value of the chunk index can be referenced
inside the body of a parallel loop by calling the intrinsic
parameterless function System.Parallelism.Chunk_Index, which will always
return 1 in a version of the System.Parallelism package for use with
sequential-only implementations. If there is an explicit "chunk
parameter" in the chunk specification, references to the chunk parameter
could be replaced by a computation based on the result of a call to this
intrinsic Chunk_Index function.
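
For example (illustrative only; Partial, F, A, and B are assumed to be
declared elsewhere, with Partial an array of per-chunk accumulators),
the explicit-syntax loop

   parallel (Chunk in 1 .. 8)
   for I in A .. B loop
      Partial (Chunk) := Partial (Chunk) + F (I);
   end loop;

would correspond to the pragma-based form

   pragma Par_Loop (8);
   for I in A .. B loop
      Partial (System.Parallelism.Chunk_Index) :=
        Partial (System.Parallelism.Chunk_Index) + F (I);
   end loop;

(here the chunk range starts at 1, so the chunk parameter maps directly
onto Chunk_Index; a range with a different lower bound would need an
offset computation).
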
   parallel (Num_Chunks)
   for ... loop
      <loop body>
   end loop;
expands into:
   pragma Par_Loop (Num_Chunks);
   for ... loop
      <loop body>
   end loop;
which expands further according to the kind of for-loop immediately
following the pragma Par_Loop:
(1) parallel loop over a range of values:
   pragma Par_Loop (Num_Chunks);
   for I in S range A .. B loop
      <loop body>
   end loop;
expands into:
   declare
      procedure I__Loop_Body
        (I__Low, I__High : Longest_Integer; I__Chunk_Index : Positive) is
      begin
         for I in S'Val (I__Low) .. S'Val (I__High) loop
            <loop body>
         end loop;
      end I__Loop_Body;
   begin
      System.Parallelism.Par_Range_Loop
        (S'Pos (A), S'Pos (B), Num_Chunks,
         Loop_Body => I__Loop_Body'Access);
   end;
(2) parallel loop over an array:
   pragma Par_Loop (Num_Chunks);
   for C of Arr loop
      <loop body>
   end loop;
expands into:
   pragma Par_Loop (Num_Chunks);
   for C__Index in Arr'Range loop
      declare
         C renames Arr (C__Index);
      begin
         <loop body>
      end;
   end loop;
which then expands according to expansion (1) above for a loop over a
range. Note that a loop over a multidimensional array would be
transformed effectively into a loop over a conceptually flattened array,
as is done in the sequential loop case.
(3) parallel loop over a generalized iterator:
   pragma Par_Loop (Num_Chunks);
   for C in Iterator loop
      <loop body>
   end loop;
expands into:
   declare
      package Inst renames <some instantiation of Ada.Iterator_Interfaces>;
      procedure Par_Iterator_Inst is new Inst.Par_Iterator_Loop;
      procedure C__Loop_Body
        (C__Iterator    : Inst.Parallel_Iterator'Class;
         C__Chunk_Index : Positive)
      is
         C : Inst.Cursor := C__Iterator.First (C__Chunk_Index);
      begin
         while Inst.Has_Element (C) loop
            <loop body>
            C := C__Iterator.Next (C, C__Chunk_Index);
         end loop;
      end C__Loop_Body;
   begin
      Par_Iterator_Inst
        (Iterator, Num_Chunks, Loop_Body => C__Loop_Body'Access);
   end;
(4) parallel loop over a container:
   pragma Par_Loop (Num_Chunks);
   for E of Container loop
      <loop body>
   end loop;
expands into:
   pragma Par_Loop (Num_Chunks);
   for E__Cursor in Container'Default_Iterator (Container) loop
      declare
         E renames Container (E__Cursor);
      begin
         <loop body>
      end;
   end loop;
which then expands according to (3) above for a loop over an iterator.
System.Parallelism package and Ada.Iterator_Interfaces.Par_Iterator_Loop child

The System.Parallelism package spec might contain (at least) the
following:
   with Ada.Aspects;
   package System.Parallelism is

      type Longest_Integer is range System.Min_Int .. System.Max_Int;
      --
      --
      --
      --

      procedure Par_Range_Loop
        (Low, High  : Longest_Integer;
         Num_Chunks : Positive;
         Aspects    : access Ada.Aspects.Root_Aspect'Class := null;
         Loop_Body  : access procedure
           (Low, High : Longest_Integer; Chunk_Index : Positive));

      function Chunk_Index return Positive
        with Convention => Intrinsic;

   end System.Parallelism;
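
To illustrate how simple the sequential fallback can be, a body for
Par_Range_Loop when no LWT scheduler is plugged in might be little more
than (a sketch):

   procedure Par_Range_Loop
     (Low, High  : Longest_Integer;
      Num_Chunks : Positive;
      Aspects    : access Ada.Aspects.Root_Aspect'Class := null;
      Loop_Body  : access procedure
        (Low, High : Longest_Integer; Chunk_Index : Positive)) is
   begin
      --  Run the entire range as a single chunk on the calling thread;
      --  this matches Chunk_Index always returning 1 in the
      --  sequential-only version of the package.
      Loop_Body (Low, High, Chunk_Index => 1);
   end Par_Range_Loop;
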
A child unit of Ada.Iterator_Interfaces, Par_Iterator_Loop, could be
provided as follows. Note that GNAT might want to instantiate this just
once for each instantiation of Ada.Iterator_Interfaces, rather than at
each parallel loop over an iterator.
   with Ada.Aspects;
   generic
   procedure Ada.Iterator_Interfaces.Par_Iterator_Loop
     (Iterator   : Parallel_Iterator'Class;
      Num_Chunks : Positive;
      Aspects    : access Ada.Aspects.Root_Aspect'Class := null;
      Loop_Body  : access procedure
        (Iterator : Parallel_Iterator'Class; Chunk_Index : Positive));
Questions and Answers about Ada 202X parallelism and mapping to OpenMP
Here are some questions and answers, derived from initial comments on the document.
Question: What sort of changes to the language reference manual are
needed to add support for light-weight parallelism (LWP)?
Answer: Ada 2012 and earlier versions presume a one-to-one mapping
between Ada tasks and logical threads of control. In Ada 202X, we allow
a single Ada task to have multiple logical threads of control. So in
places where Ada 2012 and before talked about tasks, we introduce this
distinction. See the third paragraph of the introduction for more on this.
Question: What is the impact in terms of performance for users if LWP is
not used?
Answer: Performance impact is minimal on code that makes no use of
parallelism. The Ada Current_Task function returns the same "Task_Id"
value for all server threads working on behalf of a given Ada task. This
is different from what happens now, but should be very close in terms of
efficiency. See "Interaction with Ada tasks" section.
Changing the priority of a task needs to be spread to all of the servers
working on its behalf, but this can likely be handled without any
overhead on tasks making no use of parallelism, as priority changes are
not expected to propagate instantly. There are well-defined points,
basically at the beginning and end of a parallel construct, where any
priority change in the "parent" task can be propagated to the server
threads. Tasks with no parallelism would never encounter parallel
constructs, and hence would not need to worry about this.
Question: What is the impact in terms of Ada semantics and performance
for users if LWP is not implemented by the compiler?
Answer: In the current proposed approach, if no LWT scheduler is
"plugged in" to the System.LWT package, then there is a "sequential
fallback" that effectively ignores the "parallel" and executes the loop
sequentially, using Ada 2012 semantics. A parallel block becomes
equivalent to a sequence of statements. See last paragraph of
"Interaction with Ada tasks" section.
Question: If an implementer wants to implement the core of Ada 202X
(setting the annexes aside), can it just recognize the "parallel"
keyword and do nothing more for LWP (except perhaps issuing a warning
that parallel is ignored), or are there some semantic checks that still
need to be carried out by the front end? Same question for the run-time.
Answer: From a dynamic semantics point of view, an implementer could do
as you suggest. From the static semantics point of view, there are
legality checks associated with potentially conflicting uses of
variables visible across a parallel construct that are controlled by a
"Conflict_Check_Policy":
http://www.ada-auth.org/standards/2xaarm/html/AA-9-10-1.html
There are three levels:
  - No checks (“No_Conflict_Checks”);
  - Shared-object checks (“Known_Conflict_Checks”): this is based on the
    “known to denote the same object” check performed on OUT parameters
    of functions;
  - Synchronized-object checks (“All_Conflict_Checks”): this is based on
    disallowing any up-level references to non-synchronized objects.
Also, there are separate policies for tasking (which defaults to “No
checks”) and for parallel constructs (which defaults to
“Synchronized-object checks”).
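
For example, a compilation unit that wants the subtler shared-object
level for its parallel constructs could specify:

   pragma Conflict_Check_Policy (Known_Conflict_Checks);
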
One suggestion is to move the more complex “shared-object” checks to an
annex, and perhaps the tasking-related checks as well. The goal is to
preserve some basic, default, safety checks in the core, so that Ada
programs will by default not become less safe when the word “parallel”
is inserted in front of a loop. A sophisticated user could decide to
turn off the checks, or choose the subtler “shared-object” checks, but
the early user of Ada 202X LWP will not be introducing “silent” bugs by
adding “parallel.”
Question: Does OpenMP provide a light-weight-thread scheduler?
Answer: Yes, OpenMP provides a light-weight-thread scheduler, and that
is the one we are talking about in this document. OpenMP generally
relies on the underlying operating system to schedule the heavier-weight
"server threads," sometimes called "kernel threads” that actually
execute the lighter-weight OpenMP “tasks.”
Question: Do Ada 202X LWP features easily map to OpenMP without
requiring a complex executive?
Answer: Yes. OpenMP has a notion of a "parallel" team of servers, which
corresponds to what we are initiating on behalf of those Ada tasks where
parallelism is desired; light-weight "task"s, which map to Ada 202X
"logical threads of control"; and a "taskloop" construct that maps
directly to Ada 202X "parallel for I in ...".
From an implementation point of view, the "System.LWT.OpenMP" package,
which provides the actual "plug-in" for OpenMP, is about 650 lines of
code in the prototype, most of which are either declarations of OpenMP
functions that are being imported, calls on such functions, or code
generating debugging output. There is almost no logic beyond these
call-throughs to the "GOMP" library (GNU's library for OpenMP and
OpenACC). The layers above that (System.Parallelism and System.LWT) are
also quite simple, again mostly just call-throughs to the next layer down.
Question: What are the implications if a user calls one of the routines
in Interfaces.OpenMP that correspond to the OpenMP API?
Answer: Most of these routines are queries (omp_get_...), with no side
effects. The few routines that have side effects are generally about
setting certain parameters that affect OpenMP scheduling decisions, but
should not interfere with Ada-level semantics at all.
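
As an illustration (a sketch only; the prototype's actual spec will
differ in detail), Interfaces.OpenMP could provide these mostly as thin
imports of the C-level OpenMP routines:

   with Interfaces.C;
   package Interfaces.OpenMP is

      type OMP_Parallel (Num_Threads : Natural) is tagged limited private;
      --  The parallelism control object described in the proposal above

      --  Thin bindings to the official OpenMP API (queries):
      function omp_get_thread_num return Interfaces.C.int
        with Import, Convention => C,
             External_Name => "omp_get_thread_num";
      function omp_get_num_threads return Interfaces.C.int
        with Import, Convention => C,
             External_Name => "omp_get_num_threads";

   private
      type OMP_Parallel (Num_Threads : Natural) is tagged limited
        null record;
      --  The real version would presumably be controlled, creating the
      --  parallel region on initialization and ending it on finalization.
   end Interfaces.OpenMP;
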
-----
Survey of Light-Weight Parallelism Features in Comparable Languages
As part of this report, there was a request to survey the light-weight
parallelism (LWP) features of other comparable languages. For this
survey, we reviewed the features of Go, Rust, C++, Java, and C#. We
have an appendix with more detail, but first is a summary of the LWP
features of these languages. Some languages have only (chunked)
data-parallel features (a parallel loop over an array or other
homogeneous data structure); some have only “fork/join” parallelism,
which is good for irregular computations involving recursive
divide-and-conquer algorithms, often over tree-like structures. Several
languages have both loop/stream-oriented and divide-and-conquer-oriented
parallelism.
Go
Since the beginning, Go has had “goroutines” which are syntactic
constructs that represent light-weight threads which can be spawned at
any point by simply writing “go <function call>;”. This makes an
asynchronous function call. Typically goroutines communicate using
“channels” which are essentially FIFO queues, with a user-specified
level of buffering. Goroutines can be called from a loop, but they
would typically be considered a variant of “fork/join” parallelism.
Go does no static safety checking, but communicating through channels
rather than shared variables avoids most data races. Goroutines are not
implicitly awaited; a program that needs to wait for them (for example,
before leaving the function from which they were spawned) must do so
explicitly, typically via channels or a sync.WaitGroup.
Rust
Rust originally focused on safe multi-threading, where the threads were
heavy-weight “kernel” threads.
More recently, the Rust ecosystem gained a light-weight parallelism
library called “Rayon,” which seems to be growing in popularity relative
to heavy-weight threading.
Rust’s safety is based on avoiding shared mutable global state, plus a
somewhat complicated “borrowing” mechanism and the notion of lifetimes;
the net result is an ownership-based model allowing compile-time checks
that prevent threads (light- or heavy-weight) from having conflicts due
to concurrent unsynchronized access to shared data.
C++
C++ added heavy-weight threading in C++11.
C++ added light-weight parallelism in C++17, in the form of parallel
algorithms: parallel iteration, parallel quantifiers (any-of, all-of),
parallel sorting, etc.
A fork/join capability is planned for a future C++ standard.
Java
Java has had heavyweight threading from the beginning.
Java has a general streaming capability, and more recently it has been
enhanced to support light-weight parallelism, which gives
parallelization across homogeneous data structures.
Java has specific parallelism operations on arrays.
Java also has a fork/join library that provides light-weight parallelism
for “irregular” divide-and-conquer style computations.
C#
C# has heavyweight threading provided by System.Threading.
C# has lightweight threading based on System.Threading.Tasks, which
includes both loop-oriented capabilities and ad-hoc light-weight task
creation.
More details are available in the appendix mentioned above.
!wording
** TBD.
!discussion
The OpenMP session at ARG meeting #62 showed that implementing the proposed
parallel constructs of Ada on top of OpenMP is feasible. In many of the
semantic choices, OpenMP arrived at the same solution that the ARG did
(particularly in the case of termination semantics), so the amount of extra
mapping needed seems fairly minimal.
The Ada Standard probably should not directly reference the OpenMP standard;
it should remain possible to implement Ada in a variety of ways (including on
bare machines). Moreover, it's not clear that ISO rules would allow referencing
OpenMP at all; generally, we can only reference other ISO standards. Thus, any
OpenMP-specific mechanisms have to be implementation-defined.
Ada does need additional mechanisms for tuning (many of the capabilities of
OpenMP are aimed at tuning particular parallel code). In the case of
declarations and overall configuration, aspects and pragmas have the correct
semantics (including the ability to define implementation-defined aspects
and pragmas). It is possible that some of these aspects and pragmas can be
generalized enough to put them into the standard (for instance, the target
aspect to specify where a data object is stored could be language-defined
if we decide to have a standard parallelism library which would necessarily
include types/objects for defining target capabilities. The details of
the supported targets would, of course, remain implementation-defined).
We do need a way to specify tuning mechanisms for individual parallel
constructs. Chunk_specifications provide this for one such parameter, but
there are many others that are relevant. We could use pragmas for this
purpose, but that is not ideal for the same reasons that we have moved away
from entity-specific pragmas for declarations in favor of aspect
specifications.
[There's lots more to say here, but your editor is not the person to say it.]
!ASIS
[Not sure. It seems like some new capabilities might be needed, mainly for
aspect specifications on parallel code, but I didn't check - Editor.]
!ACATS test
ACATS B- and C-Tests are needed to check that the new capabilities are
supported.
!appendix
From: Tucker Taft
Sent: Friday, October 4, 2019 8:38 PM
has shared the following presentation:
Ada 202X Lightweight Parallelism and OpenMP.pptx
https://drive.google.com/file/d/156vq44aK2FF60cbd_I8hIOXmqEGjl1H7/view?usp=sharing_eil&ts=5d97f3fe
Here are some slides to help organize our discussion of Ada 202X and OpenMP
on Sunday. They are based on slides I presented at Ada-Europe in June, plus
some new slides talking about a possible "layered" approach to supporting
light-weight parallelism. Note that we are in the middle of a comprehensive
review of all Ada 202X features, but hopefully our OpenMP discussion can be
largely independent of the details of that. Looking forward to our discussion
on Sunday!
****************************************************************
From: Edward Fish
Sent: Saturday, October 5, 2019 1:34 PM
I believe these are the ones I found:
Safe Parallelism: Compiler Analysis Techniques for Ada and OpenMP
-- http://www.cister.isep.ipp.pt/docs/safe_parallelism__compiler_analysis_techniques_for_ada_and_openmp/1378/view.pdf
OpenMP parser for Ada
-- https://www.researchgate.net/profile/Przemyslaw_Stpiczynski/publication/309265124_OpenMP_parser_for_Ada/links/5807538f08ae5ad18817fae5/OpenMP-parser-for-Ada.pdf
OpenMP tasking model for Ada: safety and correctness
-- https://www.auto.tuwien.ac.at/~blieb/AE2017/presentations/AE_17_v3.pdf
****************************************************************
From: Tucker Taft
Sent: Sunday, January 12, 2020 8:35 PM
Here is an extract from a document developed internally to AdaCore as a
mapping from Ada 202X to OpenMP. It has been reviewed by various folks
involved with OpenMP, and seems to be consistent with what those attending
the International Real-Time Ada Workshop are recommending as an approach.
We need to think about how we might want to refer to this sort of a mapping
in the Ada RM, or some other technical specification. For now, this is
hopefully useful background reading for such a discussion.
[This is version /01 of the AI - Editor.]
****************************************************************
From: Jeff Cousins
Sent: Tuesday, January 14, 2020 1:10 PM
Thanks for that.
The paragraph beginning
>Handling secondary stack, exceptions, and transfers of control out of
>the code for a light-weight thread
seems incomplete.
(Also missing full stops/periods after "beginning" and "structures" in the
Java section).
****************************************************************
From: Tucker Taft
Sent: Tuesday, January 14, 2020 2:27 PM
>>Thanks for that.
>>The paragraph beginning
>
>Handling secondary stack, exceptions, and transfers of control out of
>the code for a light-weight thread
That is actually intended to be a section heading, not a paragraph... ;-)
>(Also missing full stops/periods after "beginning" and "structures" in the
>Java section).
OK, thanks.
****************************************************************
Questions? Ask the ACAA Technical Agent