AI22-0096-1

!standard 5.5(2/3)                                   24-04-22  AI22-0096-1/02

!standard 5.5.2(5/4)

!standard 5.5.2(7/3)

!class ramification 24-02-13

!status Corrigendum 1-2022 24-05-02

!status ARG Approved 14-0-0  24-05-02

!status work item 24-02-13

!status received 24-02-13

!priority Medium

!difficulty Medium

!subject Ada and OpenMP

!summary

An AARM Implementation Note is created to reference this technical report on Ada 22 and OpenMP.

This technical report outlines a "plug-in" approach to supporting lightweight threading.  The notion of user-defined "generalized" aspects has been proposed (in another AI) as a way to provide tuning parameters to a plug-in, without the compiler having to be updated every time a new tuning parameter is supported by the underlying plug-in.

!issue

Systems like OpenMP are commonly used to provide the framework for lightweight threading, GPU usage, and similar capabilities. (OpenMP directly supports C and Fortran.) Since these frameworks are already in industrial usage, it would be useful if Ada (particularly the new parallelism features) could use them.

!recommendation

We suggest adding an AARM implementation note to refer to this technical report.

The proposal outlined in this technical report is just one possible way to map Ada 2022 parallelism features to OpenMP.  The author believes it is a good approach, and matches what the IRTAW and other OpenMP experts such as Luis Miguel Pinho recommend. Be that as it may, it is not appropriate to mandate it as part of the Ada standard, but it is hopefully useful as an example that has been prototyped extensively and has been shown to provide useful speedups.

For the mapping to OpenMP, we are recommending a layered mapping to OpenMP (or other light-weight-threading (LWT) scheduler), where upon seeing the syntax for a parallel construct, the compiler generates calls on a top layer (dubbed "System.Parallelism" herein). This layer is independent of the particular LWT scheduler that will be controlling the light-weight threads that are spawned as a result of the parallel constructs. Below System.Parallelism will be a package dubbed "System.LWT", which provides the LWT-scheduler-independent API, and implements it using a "plug-in" architecture. Specific LWT schedulers would be children of this package, for example "System.LWT.OpenMP", and one of them would be "plugged in" to System.LWT and handle the various calls through the API.

The user will determine which particular LWT scheduler gets linked into the program by "with"ing a package (e.g. Interfaces.OpenMP) and declaring a "control" object of a type declared in that package (e.g. Interfaces.OpenMP.OMP_Parallel), in the task body for the Ada tasks (or the main subprogram for the environment task) where it is desired to have multiple threads of control. Discriminants of the control object can be used to control the level of parallelism desired (e.g. "Control : OMP_Parallel (Num_Threads => 5);"), as well as potentially other options that should apply by default across all parallel constructs in the given Ada task. This approach is modeled on the "#pragma omp parallel" of OpenMP which creates a "parallel region" in which work-sharing or other forms of parallelism can be used. In the absence of any LWT scheduler linked into the program, the System.LWT package would fall back to a purely sequential implementation. The Interfaces.OpenMP package might have other subprograms intended to be called directly by the user, in particular those that are part of the "official" OpenMP API, such as "omp_get_thread_num" or "omp_get_team_size" (for the full API, see the Runtime Library Routines section in https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0. pdf). We also have proposed that record extensions of "Ada.Aspects" (suggested name) could be used to pass in options to parallel constructs, in a kind of generalized "aspect specification." ([Reference a GitHub issue here]See AI12-0355-1.)

A prototype implementation of this structure is available on GitHub, at:

https://github.com/parasail-lang/parasail/tree/main/lwt

Interaction with Ada tasks

We propose that each Ada task defines its own parallelism region, if any, recognizing that many Ada tasks will need no parallelism, and might continue to serve special roles in a real-time environment, such as managing particular hardware devices, or specific jobs of the real-time system, at a particular real-time priority. For each Ada task that does want internal parallelism, the expectation is that the underlying LWT scheduler (e.g. OpenMP) will start up (or re-use) additional (heavy-weight) "kernel threads" to act as servers for the LWTs that will be spawned somewhere within the Ada task. Each of these servers will run at the same priority as the enclosing Ada task, and will share the Ada "identity" and "attributes" of that Ada task. The LWTs served by these server threads will in turn get their Ada task "identity" and "attributes" from these servers.

Each light-weight thread is run in the context of an "LWT group," which maps quite directly to an OpenMP "taskgroup." All of the LWTs spawned during the scope of an LWT group are automatically awaited at the end of the LWT group. This ensures that the scope where an LWT is defined isn't exited while such LWTs are still running and potentially making up-level references to objects from the enclosing scope.

In past discussions in the ARG and elsewhere we had presumed that vendor-specific pragmas or aspects could be used to communicate additional options to the underlying LWT scheduler. However, more recently we have discussed including a kind of "generalized aspect" syntax as part of the syntax for parallel constructs. The general idea is to allow for a variant of the aspect-specification syntax after the "parallel" reserved word, and allow it to be based on any record extension of a special Ada.Aspects.Root_Aspect tagged type. A parameter of type "access Root_Aspect'Class" could then be included in the various System.LWT and System.Parallelism APIs, allowing the underlying LWT scheduler to receive the options and use them as it sees fit (see below for an example of how such a parameter might appear). The advantage of this approach is that a user-provided LWT scheduler might be substituted for one provided by the compiler vendor, and it could also take advantage of the generalized aspect syntax without any need to add special handling into the compiler. This also allows us to reach additional options for some future OpenMP standard without further additions to the compiler.

This layered approach, along with the parallelism control object and the generalized aspects defined by types from the Interface.OpenMP package, should allow users to reach the OpenMP features of interest, and to accommodate evolution of the OpenMP standard as well as the ability to use other LWT schedulers which might come from, say, an RTOS vendor. If the user chooses to use no LWT scheduler, a sequential fall back will be part of System.LWT whenever there is no LWT scheduler "plugged in".

Handling secondary stack, exceptions, and transfers of control out of the code for a light-weight thread

Independent of which particular LWT scheduler is present (if any), the code for a particular light-weight thread is defined by a function pointer and a data object. For Ada, the data object will typically be a tagged data object, and the function will be a simple wrapper that merely invokes a special "LWT_Body" dispatching operation on the object, and handles all exceptions propagated by the body (similar to the way a wrapper around an Ada task body handles all exceptions). Normal secondary stack and exception raising and handling can be performed inside the LWT_Body, because light-weight threads run on a given server until completion or cancellation. They aren't swapped back and forth, so there is no added complexity in stack or exception management. Effectively, exceptions are being raised and handled on the stack of the server, in the usual way.

When an exception is propagated out of an LWT_Body, or if the code for an LWT_Body has an explicit transfer of control out of the code for the light-weight thread, an attempt is made to cancel the other threads in the LWT group. The first LWT to attempt cancellation receives an indication of "success" from this attempt. Later LWTs of the same group making such an attempt will not receive the "success" indicator. Cancellation in OpenMP, and in Ada 2022, allows cancellation to be implemented using a polling approach, where there are various well-defined "cancellation points". When code within LWT_Body detects that the enclosing LWT group has been canceled, it generally just returns from the LWT_Body. The LWT that successfully initiated the cancellation records in a variable visible at the point where the LWT group ends, what action should be taken.  The code at the end of the LWT group propagates the exception, or continues any other transfer of control, after waiting for all of the LWTs within the group to complete their execution.

Expansions for Ada 2022 features

Expansion for Ada 2022 parallel block:

The OpenMP recommended approach to supporting a sequence of blocks to be (potentially) run in parallel is to create a loop around a switch/case statement. The "GOMP" implementation indicates the same approach in:

   https://gcc.gnu.org/onlinedocs/libgomp/Implementing-SECTIONS-construct.html

We suggest the same approach for Ada. Here is an example expansion:

procedure Walk (Tree : Tree_Ptr) is
    --  Walk nodes of Tree in parallel
begin
    if Tree = null then
         return;
    elsif Tree.Kind = Leaf then
         Process (Tree);
    else
        parallel
        do
            Walk (Tree.Left);
        and
           Walk (Tree.Right);
       end do;
    end if;
end Walk;

 

expands into:

procedure Walk (Tree : Tree_Ptr) is
    --  Walk nodes of Tree in parallel
begin
    if Tree = null then
         return;
    elsif Tree.Kind = Leaf then
         Process (Tree);
    else
        parallel
        for _I in 1 .. 2 loop
            case _I is
            when 1 =>
                Walk (Tree.Left);
            when 2 =>
               Walk (Tree.Right);
            end case;
       end loop;
    end if;
end Walk;

 

which then expands further as a parallel loop, as described below. This approach makes it easy to turn parallelism on and off, and requires the creation of only one out-of-line procedure independent of the number of arms in the parallel block statement.

Expansions for Ada 2022 parallel loop:

In this description, we show an optional intermediate step where the compiler might use a pragma Par_Loop so that parallel loops could be specified while remaining compilable by older Ada compilers, analogous to the way the "Pre" aspect expands into pragma Precondition in GNAT.

Ada 2022 defines the notion of a "chunk specification" which can give a user-specified name to the index used to identify a chunk. When using a pragma instead of syntax, there would be no way to specify the chunk-index name, so the value of the chunk index can be referenced when inside the body of a parallel loop by calling the intrinsic parameterless function System.Parallelism.Chunk_Index, which will always return 1 in a version of the System.Parallelism package for use with sequential-only implementations. If there is an explicit "chunk parameter" in the chunk specification, references to the chunk parameter could be replaced by a computation based on the result of a call to this intrinsic Chunk_Index function.

parallel (Num_Chunks)
for ... loop
   <loop body>
end loop;

 

expands into:

pragma Par_Loop(Num_Chunks);
for ... loop
   <loop body>
end loop;

 

which expands further according to the kind of for-loop immediately following the pragma Par_Loop:

Parallel loop over a range of values:

pragma Par_Loop(Num_Chunks);
for I in S range A..B loop
      <loop body>
end loop;

 

expands into:

declare
    procedure I__Loop_Body
      (I__Low, I__High : Longest_Integer; I__Chunk_Index : Positive) is
    begin
        for I in S'Val (I__Low) .. S'Val (I__High) loop
            <loop body>
        end loop;
    end I__Loop_Body;
begin
    System.Parallelism.Par_Range_Loop
       (S'Pos(A), S'Pos(B), Num_Chunks,
        Loop_Body => I__Loop_Body'Access);
end;

 

Parallel loop over an array:

pragma Par_Loop(Num_Chunks);
for C of Arr loop
   <loop body>
end loop;

 

expands into:

pragma Par_Loop(Num_Chunks);
for C__Index in Arr'Range loop
   declare
      C renames Arr(C__Index);
   begin
      <loop body>
   end;
end loop;

 

which then expands according to expansion (1) above for a loop over a range. Note that a loop over a multidimensional array would be transformed effectively into a loop over a conceptually flattened array, as is done in the sequential loop case.

Parallel loop over a generalized iterator:

pragma Par_Loop(Num_Chunks);
for C of Iterator loop
   <loop body>
end loop;

 

expands into:

declare
   package Inst renames <some instantiation of Ada.Iterator_Interfaces>;
   package Par_Iterator_Inst is new Inst.Par_Iterator_Loop;

   procedure C__Loop_Body
     (C__Iterator : Inst.Parallel_Iterator'Class;

      C__Chunk_Index : Positive) is
       C : Inst.Cursor := C__Iterator.First (C__Chunk_Index);
   begin
       while Inst.Has_Element(C) loop
           <loop body>
           C := C__Iterator.Next (C, C__Chunk_Index);
       end loop;
    end C__Loop_Body;

begin
    Par_Iterator_Inst
       (Iterator, Num_Chunks,  Loop_Body => C__Loop_Body'Access);
end;

 

Parallel loop over a container:

pragma Par_Loop(Num_Chunks);
for E of Container loop
    <loop body>
end loop;

 

expands into:

pragma Par_Loop(Num_Chunks);
for E__Cursor of Container'Default_Iterator(Container) loop
   declare
      E renames Container(E__Cursor);
    begin
       <loop body>
    end;
end loop;

 

which then expands according to (3) above for a loop over an iterator.

System.Parallelism package and Ada.Iterator_Interfaces.Par_Iterator_Loop child The System.Parallelism package spec might contain (at least) the following:

package System.Parallelism is
   type Longest_Integer is range System.Min_Int .. System.Max_Int;
      --  Not worrying about unsigned ranges
      --  with upper bound > System.Max_Int for now.
      --  Could be handled by having a version of Par_Range_Loop
      --  that operates on unsigned integers.

   procedure Par_Range_Loop
     (Low, High : Longest_Integer;
      Num_Chunks : Positive;
      Aspects : access Ada.Aspects.Root_Aspect'Class := null;
      Loop_Body : access procedure
        (Low, High : Longest_Integer; Chunk_Index : Positive));

   function Chunk_Index return Positive
      with Convention => Intrinsic;

end System.Parallelism;

 

A child unit of Ada.Iterator_Interfaces, Par_Iterator_Loop, could be provided as follows.  Note that the compiler might want to instantiate this just once for each instantiation of Ada.Iterator_Interfaces, rather than at each parallel loop over an iterator.

generic
procedure Ada.Iterator_Interfaces.Par_Iterator_Loop is
    (Iterator : Parallel_Iterator'Class;
     Num_Chunks : Positive;
     Aspects : access Ada.Aspects.Root_Aspect'Class := null;
     Loop_Body : access procedure
        (Iterator : Parallel_Iterator'Class; Chunk_Index : Positive));

 

Questions and Answers about Ada 2022 parallelism and mapping to OpenMP

Here are some questions and answers, derived from initial comments on the document.

Question: What sort of changes to the language reference manual were needed to add support for light-weight parallelism (LWP)?

Answer: Ada 2012 and before presumes a one-to-one mapping between Ada tasks and logical threads of control. In Ada 2022, we allow a single Ada task to have multiple logical threads of control. So places where Ada 2012 and before talked about tasking, we introduce this distinction. See third paragraph of introduction for more on this.

Question: What is the impact in terms of performance for users if LWP is not used?

Answer: Performance impact is minimal on code that makes no use of parallelism. The Ada Current_Task function returns the same "Task_Id" value for all server threads working on behalf of a given Ada task. This is different from what happens now, but should be very close in terms of efficiency. See "Interaction with Ada tasks" section.

Changing the priority of a task needs to be spread to all of the servers working on its behalf, but this can likely be handled without any overhead on tasks making no use of parallelism, as priority changes are not expected to propagate instantly. There are well-defined points, basically at the beginning and end of a parallel construct, where any priority change in the "parent" task can be propagated to the server threads. Tasks with no parallelism would never encounter parallel constructs, and hence would not need to worry about this.

Question: What is the impact in terms of Ada semantics and performance for users if LWP is not implemented by the compiler?

Answer: In the current proposed approach, if no LWT scheduler is "plugged in" to the System.LWT package, then there is a "sequential fallback" that effectively ignores the "parallel" and executes the loop sequentially, using Ada 2012 semantics. A parallel block becomes equivalent to a sequence of statements. See last paragraph of "Interaction with Ada tasks" section.

Question: If an implementer wants to implement the core of Ada 2022 (keeping annexes aside) can it just recognize the "parallel" keyword and do nothing more for LWP (except perhaps issuing a warning that parallel is ignored) or are there some semantic checks that still need to be carried out by the front end or in the run-time? Same question for the run-time.

Answer: From a dynamic semantics point of view, an implementor could do as you suggest. From the static semantics point of view, there are legality checks associated with potentially conflicting uses of variables visible across a parallel construct that are controlled by a "Conflict_Check_Policy":

http://www.ada-auth.org/standards/2022aarm/html/AA-9-10-1.html

There are three levels:

  1. No checks ("No_Conflict_Checks");
  2. Shared-object checks ("Known_Conflict_Checks"); This is based on the "known to denote the same object" check performed on OUT parameters of functions.
  3. Synchronized-object checks ("All_Conflict_Checks"). This check is based on disallowing any uplevel references to non-synchronized objects.

Also, there are separate policies for tasking (which defaults to "No checks") and for parallel constructs (which defaults to "Synchronized-object checks").

We considered moving the more complex "shared-object" checks to an annex, and perhaps the tasking-related checks as well. The overall goal has been to preserve some basic, default, safety checks in the core, so that Ada programs will by default not become less safe when the word "parallel" is inserted in front of a loop. A sophisticated user could decide to turn off the checks, or choose the subtler "shared-object" checks, but the early user of Ada 2022 LWP will not be introducing "silent" bugs by adding "parallel."

Question: Does OpenMP provide a light-weight-thread scheduler?

Answer: Yes, OpenMP provides a light-weight-thread scheduler, and that is the one we are talking about in this document. OpenMP generally relies on the underlying operating system to schedule the heavier-weight "server threads," sometimes called "kernel threads" that actually execute the lighter-weight OpenMP "tasks."

Question: Do Ada 2022 LWP features easily map to OpenMP without requiring a complex executive?

Answer: Yes. OpenMP has a notion of a "parallel" team of servers which corresponds to what we are initiating on behalf of those Ada tasks where parallelism is desired, light-weight "task"s which map to Ada 2022 "logical threads of control", and a "taskloop" construct that maps directly to Ada 2022 "parallel for I in ...".

From an implementation point of view, the "System_LWT.OpenMP" which provides the actual "plug-in" for OpenMP is about 650 lines of code in the prototype, most of which are either declarations of OpenMP functions that are being imported, calls on such functions, or code generating debugging output. There is almost no logic beyond these call-throughs to the "GOMP" library (GNU's library for OpenMP and OpenACC). The layers above that (System_Parallelism and System_LWT) are also quite simple, again mostly just call-throughs to the next layer down.

Question: What are the implications if a user calls one of the routines in Interfaces.OpenMP that correspond to the OpenMP API?

Answer: Most of these routines are queries (omp_get_...), with no side effects. The few routines that have side effects are generally about setting certain parameters that affect OpenMP scheduling decisions, but should not interfere with Ada-level semantics at all.


 

Survey of Light-Weight Parallelism Features in Comparable Languages

As part of this report, we chose to include a survey of the light-weight parallelism (LWP) features of other comparable languages. For this survey, we reviewed the features of Go, Rust, C++, Java, and C#. We prepared a separate document with more detail, while this document provides a summary of the LWP features of these languages. Some languages have only (chunked) data-parallel features (parallel loop over an array or other homogeneous data structure), some have only "fork/join" parallelism, which is good for irregular computations involving recursive divide-and-conquer algorithms, often over tree-like structures.  Several languages have both loop/stream-oriented and divide-and-conquer-oriented.

Go

Since the beginning, Go has had "goroutines" which are syntactic constructs that represent light-weight threads which can be spawned at any point by simply writing "go <function call>;". This makes an asynchronous function call. Typically goroutines communicate using "channels" which are essentially FIFO queues, with a user-specified level of buffering. Goroutines can be called from a loop, but they would typically be considered a variant of "fork/join" parallelism.

Go does no safety checking for data races, but using channels for communication is always safe. Goroutines are implicitly awaited at certain points, such as on leaving the function from which they are spawned.

Rust

Rust originally focused on safe multi-threading, where the threads were heavy-weight "kernel" threads.

More recently Rust introduced a light-weight parallelism library called "Rayon" and that seems to be growing in popularity relative to the heavy-weight threading.

Rust’s safety is based on having no global variables, and a somewhat complicated "borrowing" mechanism plus the notion of lifetimes, but the net result is an ownership-based model with restricted global variables, allowing compile-time checks that prevent threads (light- or heavy-weight) from having conflicts due to concurrent unsynchronized access to shared data.

C++

C++ added heavy-weight threading in C++ 11.

C++ added light-weight threading in C++ 17, providing parallel iterators, parallel quantifiers (any-of, all-of), parallel sorting, etc.

A fork/join capability is planned for a future C++ standard.

Java

Java has had heavyweight threading from the beginning. Java has a general streaming capability, and more recently it has been enhanced to support light-weight parallelism, which gives parallelization across homogeneous data structures

Java has specific parallelism operations on arrays.

Java also has a fork/join library that provides light-weight parallelism for "irregular" divide-and-conquer style computations.

C#

C# has heavyweight threading provided by System.Threading

C# has lightweight threading based on System.Threading.Tasks, which includes both loop-oriented capabilities and ad-hoc light-weight task creation.

!wording

Add after 9(1/5):

AARM Implementation Note: One approach to supporting the scheduling of tasks and their logical threads of control, including within the infrastructure provided by OpenMP, is provided in the technical report represented by AI22-0096-1 Tech Report on Ada and OpenMP.

!discussion

The OpenMP session at ARG meeting #62 showed that implementing the proposed parallel constructs of Ada on top of OpenMP is feasible. In many of the semantic choices, OpenMP arrived at the same solution that the ARG did (particularly in the case of termination semantics), so the amount of extra mapping needed seems fairly minimal.

The Ada Standard probably should not directly reference the OpenMP standard; it should remain possible to implement Ada in a variety of ways (including on bare machines). Moreover, it's not clear that ISO rules would allow referencing OpenMP at all; generally, we can only reference other ISO standards. Thus, any OpenMP-specific mechanisms have to be implementation-defined.

Ada does need additional mechanisms for tuning (many of the capabilities of OpenMP are aimed at tuning particular parallel code). In the case of declarations and overall configuration, aspects and pragmas have the correct semantics (including the ability to define implementation-defined aspects and pragmas). It is possible that some of these aspects and pragmas can be generalized enough to put them into the standard (for instance, the target aspect to specify where a data object is stored could be language-defined if we decide to have a standard parallelism library which would necessarily include types/objects for defining target capabilities. The details of the supported targets would, of course, remain implementation-defined).

We do need a way to specify tuning mechanisms for individual parallel constructs. Chunk_specifications provide this for one such parameter, but there are many others that are relevant. This explains why we have added aspect_specifications to all parallel constructs, but have not at this point standardized any language-defined aspects to be used for this purpose.

!ACATS test

ACATS B- and C-Tests are needed to check that the new capabilities are

supported.

!appendix

[This technical report is based on AI12-0346-1.]

From: Tucker Taft

Sent: Friday, October 4, 2019  8:38 PM

has shared the following presentation:

Ada 202X Lightweight Parallelism and OpenMP.pptx

https://drive.google.com/file/d/156vq44aK2FF60cbd_I8hIOXmqEGjl1H7/view?usp=sharing_eil&amp;ts=5d97f3fe

Here are some slides to help organize our discussion of Ada 202X and OpenMP

on Sunday. They are based on slides I presented at Ada-Europe in June, plus

some new slides talking about a possible "layered" approach to supporting

light-weight parallelism. Note that we are in the middle of a comprehensive

review of all Ada 202X features, but hopefully our OpenMP discussion can be

largely independent of the details of that. Looking forward to our discussion

on Sunday!

****************************************************************

From: Edward Fish

Sent: Saturday, October 5, 2019  1:34 PM

I believe these are the ones I found:

Safe Parallelism: Compiler Analysis Techniques for Ada and OpenMP

-- http://www.cister.isep.ipp.pt/docs/safe_parallelism__compiler_analysis_techniques_for_ada_and_openmp/1378/view.pdf

OpenMP parser for Ada

-- https://www.researchgate.net/profile/Przemyslaw_Stpiczynski/publication/309265124_OpenMP_parser_for_Ada/links/5807538f08ae5ad18817fae5/OpenMP-parser-for-Ada.pdf

OpenMP tasking model for Ada: safety and correctness

-- https://www.auto.tuwien.ac.at/~blieb/AE2017/presentations/AE_17_v3.pdf

****************************************************************

From: Tucker Taft

Sent: Sunday, January 12, 2020  8:35 PM

Here is an extract from a document developed internally to AdaCore as a

mapping from Ada 202X to OpenMP.  It has been reviewed by various folks

involved with OpenMP, and seems to be consistent with what those attending

the International Real-Time Ada Workshop are recommending as an approach.

We need to think about how we might want to refer to this sort of a mapping

in the Ada RM, or some other technical specification.  For now, this is

hopefully useful background reading for such a discussion.

[This is version /01 of the AI - Editor.]

****************************************************************

From: Jeff Cousins

Sent: Tuesday, January 14, 2020  1:10 PM

Thanks for that.

The paragraph beginning

>Handling secondary stack, exceptions, and transfers of control out of

>the code for a light-weight thread

seems incomplete.

(Also missing full stops/periods after "beginning" and "structures" in the

Java section).

****************************************************************

From: Tucker Taft

Sent: Tuesday, January 14, 2020  2:27 PM

>>Thanks for that.

>>The paragraph beginning

>

>Handling secondary stack, exceptions, and transfers of control out of

>the code for a light-weight thread

That is actually intended to be a section heading, not a paragraph... ;-)

>(Also missing full stops/periods after "beginning" and "structures" in the

>Java section).

OK, thanks.

****************************************************************