!standard 5.5(2/3)          20-01-12 AI12-0346-1/01
!standard 5.5.2(5/4)
!standard 5.5.2(7/3)
!class Amendment 19-10-11
!status work item 19-10-11
!status received 19-10-07
!priority Medium
!difficulty Medium
!subject Ada and OpenMP
!summary
** TBD.
!problem
Systems like OpenMP are commonly used to provide the framework for lightweight threading, GPU usage, and similar capabilities. (OpenMP directly supports C, C++, and Fortran.) Since these frameworks are already in industrial use, it would be useful if Ada (particularly the new parallelism features) could use them.
!proposal
The following proposal is just one possible way to map Ada 202X to OpenMP. This is the one proposed for GNAT, though it has not been approved. The author believes it is a good approach, and matches what the IRTAW and other OpenMP experts such as Luis Miguel Pinho recommend. Be that as it may, it is not appropriate to mandate it as part of the Ada standard, but it is hopefully useful as an example that has been prototyped extensively and has been shown to provide useful speedups. It is not clear what, if anything, we should put into the Ada RM. A separate technical report might be appropriate, perhaps augmented with some implementation advice somewhere in the RM. In any case, here goes...
For the mapping to OpenMP, we are recommending a layered mapping to OpenMP (or other light-weight-threading (LWT) scheduler), where upon seeing the syntax for a parallel construct, the compiler generates calls on a top layer (dubbed “System.Parallelism” for now). This layer is independent of the particular LWT scheduler that will be controlling the light-weight threads that are spawned as a result of the parallel constructs. Below System.Parallelism will be a package dubbed “System.LWT”, which provides the LWT-scheduler-independent API, and implements it using a “plug-in” architecture. Specific LWT schedulers would be children of this package, for example “System.LWT.OpenMP”, and one of them would be “plugged in” to System.LWT and handle the various calls through the API.
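To make this layering concrete, here is a minimal sketch of how the plug-in architecture might be expressed. Apart from the unit names System.LWT and System.LWT.OpenMP themselves, the type and subprogram names below are hypothetical and are not being proposed for standardization:

   package System.LWT is
      -- LWT-scheduler-independent API (sketch only)

      type Plugin is abstract tagged limited null record;

      procedure Spawn_LWT
        (Sched  : in out Plugin;
         To_Run : access procedure) is abstract;
      -- Hand one light-weight thread to the plugged-in scheduler

      procedure Register_Plugin (Sched : access Plugin'Class);
      -- Called by a specific LWT scheduler (e.g. System.LWT.OpenMP) to
      -- plug itself in; if no scheduler ever registers itself,
      -- System.LWT falls back to a purely sequential implementation.

   end System.LWT;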
The user will determine which particular LWT scheduler gets linked into the program by “with”ing a package (e.g. Interfaces.OpenMP) and declaring a “control” object of a type declared in that package (e.g. Interfaces.OpenMP.OMP_Parallel), in the task body of each Ada task (or the main subprogram for the environment task) where it is desired to have multiple threads of control. Discriminants of the control object can be used to control the level of parallelism desired (e.g. “Control : OMP_Parallel (Num_Threads => 5);”), as well as potentially other options that should apply by default across all parallel constructs in the given Ada task. This approach is modeled on the “#pragma omp parallel” of OpenMP, which creates a “parallel region” in which work-sharing or other forms of parallelism can be used. In the absence of any LWT scheduler linked into the program, the System.LWT package would fall back to a purely sequential implementation. The Interfaces.OpenMP package might have other subprograms intended to be called directly by the user, in particular those that are part of the “official” OpenMP API, such as “omp_get_thread_num” or “omp_get_team_size” (for the full API, see the Runtime Library Routines section in https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf). We also have proposed that record extensions of a root type declared in “Ada.Aspects” (suggested name) could be used to pass in options to parallel constructs, in a kind of generalized “aspect specification.”
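For example, an Ada task wanting internal parallelism might look roughly as follows (a sketch only, assuming the enclosing unit “with”s Interfaces.OpenMP; Worker, Data, and Process are placeholder names):

   task body Worker is
      Control : Interfaces.OpenMP.OMP_Parallel (Num_Threads => 5);
      -- Establishes an OpenMP parallel region for this Ada task;
      -- parallel constructs below are served by up to 5 threads.
   begin
      parallel
      for I in Data'Range loop
         Process (Data (I));
      end loop;
   end Worker;
   -- Control is finalized here, ending the parallel region.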
A prototype implementation of this structure is available, and can be provided as a zip archive.
Interaction with Ada tasks
We propose that each Ada task defines its own parallelism region, if any, recognizing that many Ada tasks will need no parallelism, and might continue to serve special roles in a real-time environment, such as managing particular hardware devices, or specific jobs of the real-time system, at a particular real-time priority. For each Ada task that does want internal parallelism, the expectation is that the underlying LWT scheduler (e.g. OpenMP) will start up (or re-use) additional (heavy-weight) “kernel threads” to act as servers for the LWTs that will be spawned somewhere within the Ada task. Each of these servers will run at the same priority as the enclosing Ada task, and will share the Ada “identity” and “attributes” of that Ada task. The LWTs served by these server threads will in turn get their Ada task “identity” and “attributes” from these servers.
Each light-weight thread is run in the context of an “LWT group,” which maps quite directly to an OpenMP “taskgroup.” All of the LWTs spawned during the scope of an LWT group are automatically awaited at the end of the LWT group. This ensures that the scope where an LWT is defined isn’t exited while such LWTs are still running and potentially making up-level references to objects from the enclosing scope.
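At the System.LWT level this might look roughly as follows (a sketch only; the LWT_Group type, the Spawn operation, and the Do_Left/Do_Right procedures are hypothetical names):

   declare
      Group : System.LWT.LWT_Group;  -- maps to an OpenMP "taskgroup"
   begin
      Spawn (Group, Do_Left'Access);   -- LWTs spawned within the
      Spawn (Group, Do_Right'Access);  -- scope of Group
      -- ... the spawning thread can do other work here ...
   end;
   -- Group is finalized here: all LWTs spawned within it are awaited,
   -- so up-level references from those LWTs remain safe.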
In past discussions in the ARG and elsewhere we had presumed that vendor-specific pragmas or aspects could be used to communicate additional options to the underlying LWT scheduler. However, more recently we have discussed including a kind of “generalized aspect” syntax as part of the syntax for parallel constructs. So far we have discussed this in detail only on Ada-Comment, but at the upcoming ARG meeting we will probably discuss it further. The general idea is to allow for a variant of the aspect-specification syntax after the “parallel” reserved word, and allow it to be based on any record extension of a special Ada.Aspects.Root_Aspect tagged type. A parameter of type “access Root_Aspect’Class” could then be included in the various System.LWT and System.Parallelism APIs, allowing the underlying LWT scheduler to receive the options and use them as it sees fit (see below for an example of how such a parameter might appear). The advantage of this approach is that a user-provided LWT scheduler might be substituted for one provided by the compiler vendor, and it could also take advantage of the generalized aspect syntax without any need to add special handling into the compiler. This also allows us to reach additional options for some future OpenMP standard without further additions to the compiler.
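As an illustration (the types below are hypothetical, and the surface syntax for attaching such a generalized aspect to a parallel construct is still under discussion), an OpenMP-oriented options record might be declared and used as follows:

   -- Hypothetical declarations, e.g. in Interfaces.OpenMP:
   type Schedule_Kind is (Static, Dynamic, Guided);
   type OMP_Options is new Ada.Aspects.Root_Aspect with record
      Schedule   : Schedule_Kind := Static;
      Chunk_Size : Natural       := 0;  -- 0 means "let OpenMP choose"
   end record;

   -- One possible use at a parallel construct (syntax illustrative only);
   -- the LWT scheduler would receive these options via the
   -- "access Root_Aspect'Class" parameter and interpret them as it sees fit:
   parallel with OMP_Options'(Schedule => Dynamic, Chunk_Size => 64)
   for I in A'Range loop
      A (I) := F (A (I));
   end loop;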
This layered approach, along with the parallelism control object and the generalized aspects defined by types from the Interfaces.OpenMP package, should allow users to reach the OpenMP features of interest, and to accommodate evolution of the OpenMP standard as well as the ability to use other LWT schedulers, which might come from, say, an RTOS vendor. If the user chooses to use no LWT scheduler, a sequential fallback that is part of System.LWT will be used whenever there is no LWT scheduler “plugged in.”
Handling secondary stack, exceptions, and transfers of control out of the code for a light-weight thread
Independent of which particular LWT scheduler is present (if any), the code for a particular light-weight thread is defined by a function pointer and a data object. For Ada, the data object will typically be a tagged data object, and the function will be a simple wrapper that merely invokes a special “LWT_Body” dispatching operation on the object, and handles all exceptions propagated by the body (similar to the way a wrapper around an Ada task body handles all exceptions). Normal secondary stack and exception raising and handling can be performed inside the LWT_Body, because light-weight threads run on a given server until completion or cancelation. They aren’t swapped back and forth, so there is no added complexity in stack or exception management. Effectively, exceptions are being raised and handled on the stack of the server, in the usual way.
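In outline (the names below are hypothetical), the tagged data object and its wrapper might look like this:

   -- In System.LWT (sketch only):
   type Root_Data is abstract tagged null record;
   procedure LWT_Body (D : Root_Data) is abstract;
   -- Overridden for each kind of parallel construct; contains the
   -- user's code for one light-weight thread.

   -- The "function pointer" actually handed to the LWT scheduler:
   procedure LWT_Wrapper (D : Root_Data'Class) is
   begin
      LWT_Body (D);  -- dispatches to the construct-specific body
   exception
      when others =>
         -- Record the exception and attempt to cancel the other LWTs
         -- of the enclosing group (hypothetical operation):
         Note_Exception_And_Cancel_Group (D);
   end LWT_Wrapper;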
When an exception is propagated out of an LWT_Body, or if the code for an LWT_Body has an explicit transfer of control out of the code for the light-weight thread, an attempt is made to cancel the other threads in the LWT group. The first LWT to attempt cancelation receives an indication of “success” from this attempt. Later LWTs of the same group making such an attempt will not receive the “success” indicator. Both OpenMP and Ada 202X allow cancelation to be implemented using a polling approach, where there are various well-defined “cancelation points.” When code within LWT_Body detects that the enclosing LWT group has been canceled, it generally just returns from the LWT_Body. The LWT that successfully initiated the cancelation records, in a variable visible at the point where the LWT group ends, what action should be taken. The code at the end of the LWT group propagates the exception, or continues any other transfer of control, after waiting for all of the LWTs within the group to complete their execution.
Expansions for proposed Ada 202X features
Expansion for Ada 202X parallel block:
The approach recommended for OpenMP to support a sequence of blocks to be (potentially) run in parallel is to create a loop around a switch/case statement. The “GOMP” implementation documents the same approach at:
https://gcc.gnu.org/onlinedocs/libgomp/Implementing-SECTIONS-construct.html
We suggest the same approach for Ada. Here is an example expansion:
   procedure Walk (Tree : Tree_Ptr) is
      -- Walk nodes of Tree in parallel
   begin
      if Tree = null then
         return;
      elsif Tree.Kind = Leaf then
         Process (Tree);
      else
         parallel do
            Walk (Tree.Left);
         and
            Walk (Tree.Right);
         end do;
      end if;
   end Walk;
expands into:
   procedure Walk (Tree : Tree_Ptr) is
      -- Walk nodes of Tree in parallel
   begin
      if Tree = null then
         return;
      elsif Tree.Kind = Leaf then
         Process (Tree);
      else
         parallel
         for _I in 1 .. 2 loop
            case _I is
               when 1 => Walk (Tree.Left);
               when 2 => Walk (Tree.Right);
            end case;
         end loop;
      end if;
   end Walk;
which then expands further as a parallel loop, as described below. This approach makes it easy to turn parallelism on and off, and requires the creation of only one out-of-line procedure independent of the number of arms in the parallel block statement.
Expansions for Ada 202X parallel loop:
In this description, we show an optional intermediate step where GNAT might use a pragma Par_Loop so that parallel loops could be specified while remaining compilable by older Ada compilers, analogous to the way the "Pre" aspect expands into pragma Precondition in GNAT.
Ada 202X defines the notion of a “chunk specification” which can give a user-specified name to the index used to identify a chunk. When using a pragma instead of syntax, there would be no way to specify the chunk-index name, so the value of the chunk index can instead be referenced inside the body of a parallel loop by calling the intrinsic parameterless function System.Parallelism.Chunk_Index, which will always return 1 in a version of the System.Parallelism package for use with sequential-only implementations. If there is an explicit "chunk parameter" in the chunk specification, references to the chunk parameter could be replaced by a computation based on the result of a call to this intrinsic Chunk_Index function.
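For instance (a sketch only; Partial and A are placeholder names), a loop with an explicit chunk parameter and its pragma-based equivalent might look like:

   -- Source form, with an explicit chunk parameter:
   parallel (Chunk in 1 .. 8)
   for I in A'Range loop
      Partial (Chunk) := Partial (Chunk) + A (I);
   end loop;

   -- Pragma-based form: references to Chunk inside the loop body become
   -- calls on the intrinsic (here the chunk range starts at 1, so no
   -- further adjustment is needed):
   pragma Par_Loop (8);
   for I in A'Range loop
      Partial (System.Parallelism.Chunk_Index) :=
        Partial (System.Parallelism.Chunk_Index) + A (I);
   end loop;

In general, then: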
   parallel (Num_Chunks)
   for ... loop
      <loop body>
   end loop;
expands into:
   pragma Par_Loop (Num_Chunks);
   for ... loop
      <loop body>
   end loop;
which expands further according to the kind of for-loop immediately following the pragma Par_Loop:
(1) parallel loop over a range of values:

   pragma Par_Loop (Num_Chunks);
   for I in S range A .. B loop
      <loop body>
   end loop;
expands into:
   declare
      procedure I__Loop_Body
        (I__Low, I__High : Longest_Integer;
         I__Chunk_Index  : Positive) is
      begin
         for I in S'Val (I__Low) .. S'Val (I__High) loop
            <loop body>
         end loop;
      end I__Loop_Body;
   begin
      System.Parallelism.Par_Range_Loop
        (S'Pos (A), S'Pos (B), Num_Chunks,
         Loop_Body => I__Loop_Body'Access);
   end;
(2) parallel loop over an array:

   pragma Par_Loop (Num_Chunks);
   for C of Arr loop
      <loop body>
   end loop;
expands into:
   pragma Par_Loop (Num_Chunks);
   for C__Index in Arr'Range loop
      declare
         C renames Arr (C__Index);
      begin
         <loop body>
      end;
   end loop;
which then expands according to expansion (1) above for a loop over a range. Note that a loop over a multidimensional array would be transformed effectively into a loop over a conceptually flattened array, as is done in the sequential loop case.
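For instance (illustrative only, and assuming integer index subtypes), a parallel loop over a two-dimensional array Matrix could be rewritten as a loop over a single flattened range:

   pragma Par_Loop (Num_Chunks);
   for K in 0 .. Matrix'Length (1) * Matrix'Length (2) - 1 loop
      declare
         E renames Matrix (Matrix'First (1) + K / Matrix'Length (2),
                           Matrix'First (2) + K mod Matrix'Length (2));
      begin
         <loop body>
      end;
   end loop;

which again expands according to expansion (1) above.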
(3) parallel loop over a generalized iterator:

   pragma Par_Loop (Num_Chunks);
   for C in Iterator loop
      <loop body>
   end loop;
expands into:
   declare
      package Inst renames <some instantiation of Ada.Iterator_Interfaces>;
      procedure Par_Iterator_Inst is new Inst.Par_Iterator_Loop;

      procedure C__Loop_Body
        (C__Iterator    : Inst.Parallel_Iterator'Class;
         C__Chunk_Index : Positive)
      is
         C : Inst.Cursor := C__Iterator.First (C__Chunk_Index);
      begin
         while Inst.Has_Element (C) loop
            <loop body>
            C := C__Iterator.Next (C, C__Chunk_Index);
         end loop;
      end C__Loop_Body;
   begin
      Par_Iterator_Inst (Iterator, Num_Chunks, Loop_Body => C__Loop_Body'Access);
   end;
(4) parallel loop over a container:

   pragma Par_Loop (Num_Chunks);
   for E of Container loop
      <loop body>
   end loop;
expands into:
   pragma Par_Loop (Num_Chunks);
   for E__Cursor in Container'Default_Iterator (Container) loop
      declare
         E renames Container (E__Cursor);
      begin
         <loop body>
      end;
   end loop;
which then expands according to (3) above for a loop over an iterator.
System.Parallelism package and Ada.Iterator_Interfaces.Par_Iterator_Loop child

The System.Parallelism package spec might contain (at least) the following:
   package System.Parallelism is

      type Longest_Integer is range System.Min_Int .. System.Max_Int;
      -- Not worrying about unsigned ranges with upper bound
      -- > System.Max_Int for now. Could be handled by having a version
      -- of Par_Range_Loop that operates on unsigned integers.

      procedure Par_Range_Loop
        (Low, High  : Longest_Integer;
         Num_Chunks : Positive;
         Aspects    : access Ada.Aspects.Root_Aspect'Class := null;
         Loop_Body  : access procedure
                        (Low, High   : Longest_Integer;
                         Chunk_Index : Positive));

      function Chunk_Index return Positive
        with Convention => Intrinsic;

   end System.Parallelism;
A child unit of Ada.Iterator_Interfaces, Par_Iterator_Loop, could be provided as follows. Note that GNAT might want to instantiate this just once for each instantiation of Ada.Iterator_Interfaces, rather than at each parallel loop over an iterator.
   generic
   procedure Ada.Iterator_Interfaces.Par_Iterator_Loop
     (Iterator   : Parallel_Iterator'Class;
      Num_Chunks : Positive;
      Aspects    : access Ada.Aspects.Root_Aspect'Class := null;
      Loop_Body  : access procedure
                     (Iterator    : Parallel_Iterator'Class;
                      Chunk_Index : Positive));
Questions and Answers about Ada 202X parallelism and mapping to OpenMP
Here are some questions and answers, derived from initial comments on the document.
Question: What sort of changes to the language reference manual are needed to add support for light-weight parallelism (LWP)?
Answer: Ada 2012 and earlier versions presume a one-to-one mapping between Ada tasks and logical threads of control. In Ada 202X, we allow a single Ada task to have multiple logical threads of control. So in the places where Ada 2012 and before talked about tasking, we introduce this distinction. See the third paragraph of the introduction for more on this.
Question: What is the impact in terms of performance for users if LWP is not used?
Answer: The performance impact is minimal on code that makes no use of parallelism. The Ada Current_Task function returns the same "Task_Id" value for all server threads working on behalf of a given Ada task. This is different from what happens now, but should be very close in terms of efficiency. See the "Interaction with Ada tasks" section.
Changing the priority of a task needs to be spread to all of the servers working on its behalf, but this can likely be handled without any overhead on tasks making no use of parallelism, as priority changes are not expected to propagate instantly. There are well-defined points, basically at the beginning and end of a parallel construct, where any priority change in the "parent" task can be propagated to the server threads. Tasks with no parallelism would never encounter parallel constructs, and hence would not need to worry about this.
Question: What is the impact in terms of Ada semantics and performance for users if LWP is not implemented by the compiler?
Answer: In the currently proposed approach, if no LWT scheduler is "plugged in" to the System.LWT package, then there is a "sequential fallback" that effectively ignores the "parallel" keyword and executes the loop sequentially, using Ada 2012 semantics. A parallel block becomes equivalent to a sequence of statements. See the last paragraph of the "Interaction with Ada tasks" section.
Question: If an implementer wants to implement the core of Ada 202X (keeping the annexes aside), can it just recognize the "parallel" keyword and do nothing more for LWP (except perhaps issuing a warning that parallel is ignored), or are there some semantic checks that still need to be carried out by the front end? Same question for the run-time.
Answer: From a dynamic semantics point of view, an implementor could do as you suggest. From the static semantics point of view, there are legality checks associated with potentially conflicting uses of variables visible across a parallel construct that are controlled by a "Conflict_Check_Policy":
http://www.ada-auth.org/standards/2xaarm/html/AA-9-10-1.html
There are three levels:
  - No checks (“No_Conflict_Checks”);
  - Shared-object checks (“Known_Conflict_Checks”);
    This is based on the “known to denote the same object” check
    performed on OUT parameters of functions.
  - Synchronized-object checks (“All_Conflict_Checks”).
    This check is based on disallowing any uplevel references to
    non-synchronized objects.
Also, there are separate policies for tasking (which defaults to “No checks”) and for parallel constructs (which defaults to “Synchronized-object checks”).
One suggestion is to move the more complex “shared-object” checks to an annex, and perhaps the tasking-related checks as well. The goal is to preserve some basic, default, safety checks in the core, so that Ada programs will by default not become less safe when the word “parallel” is inserted in front of a loop. A sophisticated user could decide to turn off the checks, or choose the subtler “shared-object” checks, but the early user of Ada 202X LWP will not be introducing “silent” bugs by adding “parallel.”
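For instance, such a user might select the subtler checks, or disable checking altogether, with the configuration pragma (see 9.10.1, referenced above, for the full set of policy identifiers):

   pragma Conflict_Check_Policy (Known_Conflict_Checks);
   -- or, to disable the conflict checks entirely:
   pragma Conflict_Check_Policy (No_Conflict_Checks);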
Question: Does OpenMP provide a light-weight-thread scheduler?
Answer: Yes, OpenMP provides a light-weight-thread scheduler, and that is the one we are talking about in this document. OpenMP generally relies on the underlying operating system to schedule the heavier-weight "server threads" (sometimes called "kernel threads") that actually execute the lighter-weight OpenMP "tasks."
Question: Do Ada 202X LWP features easily map to OpenMP without requiring a complex executive?
Answer: Yes. OpenMP has a notion of a "parallel" team of servers, which corresponds to what we are initiating on behalf of those Ada tasks where parallelism is desired; light-weight "task"s, which map to Ada 202X "logical threads of control"; and a "taskloop" construct that maps directly to Ada 202X "parallel for I in ...".
From an implementation point of view, the "System.LWT.OpenMP" package, which provides the actual "plug-in" for OpenMP, is about 650 lines of code in the prototype, most of which are either declarations of OpenMP functions that are being imported, calls on such functions, or code generating debugging output. There is almost no logic beyond these call-throughs to the "GOMP" library (GNU's library for OpenMP and OpenACC). The layers above that (System.Parallelism and System.LWT) are also quite simple, again mostly just call-throughs to the next layer down.
Question: What are the implications if a user calls one of the routines in Interfaces.OpenMP that correspond to the OpenMP API?
Answer: Most of these routines are queries (omp_get_...), with no side effects. The few routines that have side effects are generally about setting certain parameters that affect OpenMP scheduling decisions, but should not interfere with Ada-level semantics at all.
-----
Survey of Light-Weight Parallelism Features in Comparable Languages
As part of this report, there was a request to survey the light-weight parallelism (LWP) features of other comparable languages. For this survey, we reviewed the features of Go, Rust, C++, Java, and C#. We have an appendix with more detail, but first here is a summary of the LWP features of these languages. Some languages have only (chunked) data-parallel features (a parallel loop over an array or other homogeneous data structure); some have only “fork/join” parallelism, which is good for irregular computations involving recursive divide-and-conquer algorithms, often over tree-like structures. Several languages have both loop/stream-oriented and divide-and-conquer-oriented features.
Go
Since the beginning, Go has had “goroutines”, which are syntactic constructs representing light-weight threads that can be spawned at any point by simply writing “go <function call>;”; this makes an asynchronous function call. Typically goroutines communicate using “channels”, which are essentially FIFO queues with a user-specified level of buffering. Goroutines can be spawned from a loop, but they would typically be considered a variant of “fork/join” parallelism.
Go does no compile-time safety checking of shared data, but using channels for communication is always safe. Goroutines are not implicitly awaited; completion is typically coordinated explicitly via channels or a wait group, and any goroutines still running when the main function returns are simply abandoned.
Rust
Rust originally focused on safe multi-threading, where the threads were heavy-weight “kernel” threads.
More recently, the Rust ecosystem has added a light-weight parallelism library called “Rayon”, which seems to be growing in popularity relative to heavy-weight threading.
Rust’s safety is based on avoiding unsynchronized global variables, and on a somewhat complicated “borrowing” mechanism plus the notion of lifetimes; the net result is an ownership-based model allowing compile-time checks that prevent threads (light- or heavy-weight) from having conflicts due to concurrent unsynchronized access to shared data.
C++
C++ added heavy-weight threading in C++11.
C++ added light-weight parallelism in C++17, providing parallel versions of the standard algorithms (quantifiers such as any-of and all-of, sorting, etc.) selected via execution policies.
A fork/join capability is planned for a future C++ standard.
Java
Java has had heavyweight threading from the beginning. Java has a general streaming capability, and more recently it has been enhanced to support light-weight parallelism, which gives parallelization across homogeneous data structures.
Java has specific parallelism operations on arrays.
Java also has a fork/join library that provides light-weight parallelism for “irregular” divide-and-conquer style computations.
C#
Has heavyweight threading provided by System.Threading.
Has lightweight threading based on System.Threading.Tasks, which includes both loop-oriented capabilities and ad-hoc light-weight task creation.
More details are available.
!wording
** TBD.
!discussion
The OpenMP session at ARG meeting #62 showed that implementing the proposed parallel constructs of Ada on top of OpenMP is feasible. In many of the semantic choices, OpenMP arrived at the same solution that the ARG did (particularly in the case of termination semantics), so the amount of extra mapping needed seems fairly minimal.
The Ada Standard probably should not directly reference the OpenMP standard; it should remain possible to implement Ada in a variety of ways (including on bare machines). Moreover, it's not clear that ISO rules would allow referencing OpenMP at all; generally, we can only reference other ISO standards. Thus, any OpenMP-specific mechanisms have to be implementation-defined.
Ada does need additional mechanisms for tuning (many of the capabilities of OpenMP are aimed at tuning particular parallel code). In the case of declarations and overall configuration, aspects and pragmas have the correct semantics (including the ability to define implementation-defined aspects and pragmas). It is possible that some of these aspects and pragmas can be generalized enough to put them into the standard (for instance, the target aspect to specify where a data object is stored could be language-defined if we decide to have a standard parallelism library, which would necessarily include types/objects for defining target capabilities. The details of the supported targets would, of course, remain implementation-defined).
We do need a way to specify tuning mechanisms for individual parallel constructs. Chunk_specifications provide this for one such parameter, but there are many others that are relevant. We could use pragmas for this purpose, but that is not ideal for the same reasons that we have moved away from entity-specific pragmas for declarations in favor of aspect specifications.
[There's lots more to say here, but your editor is not the person to say it.]
!ASIS
[Not sure. It seems like some new capabilities might be needed, mainly for aspect specifications on parallel code, but I didn't check - Editor.]
!ACATS test
ACATS B- and C-Tests are needed to check that the new capabilities are supported.
!appendix

From: Tucker Taft
Sent: Friday, October 4, 2019  8:38 PM

has shared the following presentation:

Ada 202X Lightweight Parallelism and OpenMP.pptx

https://drive.google.com/file/d/156vq44aK2FF60cbd_I8hIOXmqEGjl1H7/view?usp=sharing_eil&ts=5d97f3fe

Here are some slides to help organize our discussion of Ada 202X and OpenMP 
on Sunday. They are based on slides I presented at Ada-Europe in June, plus 
some new slides talking about a possible "layered" approach to supporting 
light-weight parallelism. Note that we are in the middle of a comprehensive
review of all Ada 202X features, but hopefully our OpenMP discussion can be 
largely independent of the details of that. Looking forward to our discussion
on Sunday!

****************************************************************

From: Edward Fish
Sent: Saturday, October 5, 2019  1:34 PM

I believe these are the ones I found:


Safe Parallelism: Compiler Analysis Techniques for Ada and OpenMP 
-- http://www.cister.isep.ipp.pt/docs/safe_parallelism__compiler_analysis_techniques_for_ada_and_openmp/1378/view.pdf


OpenMP parser for Ada
-- https://www.researchgate.net/profile/Przemyslaw_Stpiczynski/publication/309265124_OpenMP_parser_for_Ada/links/5807538f08ae5ad18817fae5/OpenMP-parser-for-Ada.pdf


OpenMP tasking model for Ada: safety and correctness

-- https://www.auto.tuwien.ac.at/~blieb/AE2017/presentations/AE_17_v3.pdf 


****************************************************************

From: Tucker Taft
Sent: Sunday, January 12, 2020  8:35 PM

Here is an extract from a document developed internally to AdaCore as a 
mapping from Ada 202X to OpenMP.  It has been reviewed by various folks 
involved with OpenMP, and seems to be consistent with what those attending 
the International Real-Time Ada Workshop are recommending as an approach.

We need to think about how we might want to refer to this sort of a mapping 
in the Ada RM, or some other technical specification.  For now, this is 
hopefully useful background reading for such a discussion.

[This is version /01 of the AI - Editor.]

****************************************************************

From: Jeff Cousins
Sent: Tuesday, January 14, 2020  1:10 PM

Thanks for that.
The paragraph beginning

>Handling secondary stack, exceptions, and transfers of control out of
>the code for a light-weight thread

seems incomplete.


(Also missing full stops/periods after "beginning" and "structures" in the 
Java section).

****************************************************************

From: Tucker Taft
Sent: Tuesday, January 14, 2020  2:27 PM

>>Thanks for that.
>>The paragraph beginning
>
>Handling secondary stack, exceptions, and transfers of control out of
>the code for a light-weight thread

That is actually intended to be a section heading, not a paragraph... ;-)

>(Also missing full stops/periods after "beginning" and "structures" in the 
>Java section).

OK, thanks.

****************************************************************

