!standard 5.5.2 (2/3)          14-10-13 AI12-0119-1/01
!class Amendment 14-06-20
!status work item 14-06-20
!status received 14-06-17
!priority Medium
!difficulty Hard
!subject Parallel operations
!summary

New syntax and semantics to facilitate parallelism via Parallel Loops
and Parallel Blocks.
!problem

The increased presence of parallel computing platforms brings concerns to the general purpose domain that were previously prevalent only in the specific niche of high-performance computing. As parallel programming technologies become more prevalent in the form of new emerging programming languages and extensions of existing languages, safety concerns need to consider the paradigm shift from sequential to parallel behavior.
Ada needs mechanisms whereby the compiler is given the necessary semantic information to enable the implicit and explicit parallelization of code. After all, the current Ada language only allows parallel execution when it is semantically neutral (see 1.1.4(18) and 11.6(3/3)) or when it is expressed explicitly in the code as tasks.
!proposal

This proposal depends on the facilities for aspect Global (AI12-0079-1) and for aspect Potentially_Blocking (AI12-0064-1). Those proposals allow the compiler to statically determine where parallelism may be introduced without introducing data races.
This proposal informally introduces the semantic notion of a Parallel OPportunity (POP) and a tasklet to the language. POPs are places in program text where work can be spawned to parallel executing workers that work in concert to correctly execute the algorithm. Tasklets are the notional (logical) units within a task that are executed in parallel with each other.
The goals of this proposal are:

- To permit existing programs to benefit from parallelism through
  minimal restructuring of sequential programs into ones that permit
  more parallel execution;

- To support the development of complete programs that maximize
  parallelism;

- To efficiently guide the compiler in the creation of effective
  parallelism (without oversubscription) with minimal input from the
  programmer;

- To support the parallel reduction of results calculated by parallel
  computations;

- To avoid syntax that would make a POP erroneous or produce noticeably
  different results if executed sequentially vs in parallel.
An important part of this model is that if the compiler is not able to verify that the parallel computations are independent, then a warning will be issued at compile time (see AI12-0079-1 and AI12-0064-1 for ways in which this can happen).
Note that in this model the compiler will identify any code where a potential data race occurs (following the rules for concurrent access to objects as specified in RM 9.10(23)), and point out where objects cannot be guaranteed to be independently addressable. If not determinable at compile time, the compiler may insert run-time checks to detect data overlap.
Another issue is the underlying runtime. In this model, tasklets are orthogonal to tasks. Regardless of implementation, tasklets are considered to execute in the semantic context of the task where they have been spawned, which means that any operation that identifies a task, such as those in Task_Identification, will identify the task in which the tasklet is spawned. On the other hand, calls by different tasklets of the same task into the same protected object are treated as different calls resulting in distinct protected actions; therefore synchronization between tasklets can be performed using non-blocking protected operations. Note that this is consistent with the current standard which already supports multiple concurrent calls by a single task in the presence of the asynchronous transfer of control capability (RM 9.7.4 (23)).
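For example, the following sketch (illustrative only; Counter is a
hypothetical protected object whose body is omitted) shows two tasklets
of the same task synchronizing through a non-blocking protected
operation:

   protected Counter is
      procedure Increment;            --  non-blocking protected operation
      function Value return Natural;
   private
      Count : Natural := 0;
   end Counter;

   ...

   parallel
      Counter.Increment;   --  a distinct protected action per tasklet
   and
      Counter.Increment;   --  the two increments are safely serialized
   end parallel;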
There are two areas to this proposal: one introduces capabilities for parallel blocks, and the other introduces capabilities for parallel loops.
parallel Blocks
---------------
parallel blocks may be used to specify that two or more parts of an algorithm may be executed in parallel with each other.
Semantics: A parallel block statement encloses two or more sequences of statements (two or more "parallel sequences") separated by the reserved word "and". Each parallel sequence represents a separate tasklet, but all within a single Ada task. Task identity remains that of the enclosing Ada task, and a single set of task attributes is shared among the tasklets.
With respect to the rules for shared variables (see RM 9.10(13)), two actions occurring within two different parallel sequences of the same parallel block are not automatically sequential, so execution can be erroneous if one such action assigns to an object, and the other reads or updates the same object or a neighboring object that is not independently addressable from the first object. The appropriate use of atomic, protected, or task objects (which as a group we will call synchronized objects) can be used to avoid erroneous execution.
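For example (an illustrative sketch, where X is an ordinary integer
variable visible to both sequences), the following parallel block is
potentially erroneous, because both parallel sequences update the same
unsynchronized object:

   parallel
      X := X + 1;   --  update of X in one tasklet ...
   and
      X := X * 2;   --  ... conflicts with this one: possibly erroneous
   end parallel;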
In addition, the new Global and Potentially_Blocking aspects may be specified to enable the static detection of such problems at compile time (see AI12-0079-1 and AI12-0064-1).
Any transfer of control out of one parallel sequence will initiate the aborting of the other parallel sequences not yet completed. Once all other parallel sequences complete normally or abort, the transfer of control takes place. If multiple parallel sequences attempt a transfer of control before completing, one is chosen arbitrarily and the others are aborted.
If an exception is raised by any of the parallel sequences, it is treated similarly to a transfer of control, with the exception being propagated only after all the other sequences complete normally or due to abortion. If multiple parallel sequences raise an exception before completing, one is chosen arbitrarily and the others are aborted. The parallel block completes when all of the parallel sequences complete, either normally or by being aborted.
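As an illustrative sketch of these rules (Search_Left and Search_Right
are hypothetical functions), a transfer of control in one sequence
aborts its sibling, and an exception propagates only after the other
sequences finish:

   parallel
      return Search_Left (Tree);    --  this return initiates the abort
                                    --  of the other sequence
   and
      return Search_Right (Tree);   --  if this raises instead, the
                                    --  exception propagates only after
                                    --  the sibling completes or aborts
   end parallel;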
Note that aborting a tasklet need not be preemptive, but should prevent the initiation of further nested parallel blocks or parallel loops.
parallel Loops
--------------
A common approach to parallelizing loops is to break the loop iterations into chunks where each chunk consists of a subset of the entire iteration set, and may be as small as a single iteration, up to as large as the entire number of iterations, or somewhere in-between. In Ada, each chunk becomes a separate tasklet.
This proposal gives the programmer some degree of control over the parallelization of for loops into appropriately sized chunks, but without requiring that they specify the exact chunk size or the number of chunks. In addition, to deal with data dependences, this proposal supports per-thread copies of relevant data, and a mechanism of reducing these multiple copies down to a final result at the end of the computation.
To indicate that a loop is a candidate for parallelization, the reserved word "parallel" may be inserted immediately after the word "in" or "of" in a "for" loop, at the point where the "reverse" reserved word is allowed. Such a loop will be broken into chunks, where each chunk is processed sequentially.
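For example (a sketch using the proposed syntax; Arr and Process are
placeholders), the reserved word appears exactly where "reverse" would
be allowed:

   for I in parallel Arr'Range loop    --  index form
      Process (Arr (I));
   end loop;

   for E of parallel Arr loop          --  element form
      Process (E);
   end loop;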
For data that is to be updated within such a parallelized loop, the notion of a parallel array is provided, which corresponds to an array with one element per chunk of a parallel loop.
When a parallel array is used in a parallelized loop, the programmer is not allowed to specify the specific index, but rather uses "<>" to indicate the "current" element of the parallel array, appropriate to the particular chunk being processed.
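As a small sketch of these two features together (a hypothetical
per-chunk maximum; Arr is a placeholder array of Float):

   Max_So_Far : array (parallel <>) of Float := (others => Float'First);
   ...
   for I in parallel Arr'Range loop
      --  <> denotes the element for the chunk executing this iteration
      Max_So_Far (<>) := Float'Max (Max_So_Far (<>), Arr (I));
   end loop;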
The user may explicitly control the number of chunks into which a parallelized loop is divided by specifying the bounds of the parallel array(s) used in the loop. All parallel arrays used within a given loop must necessarily have the same bounds.
If parallel arrays with the same bounds are used in two consecutive parallelized loops over the same container or range, then the two loops will be chunked in the same way. Hence, it is possible to pass data across consecutive loops through the elements of a parallel array that is common across the loops.
parallel arrays are similar to normal arrays, except that they are always indexed by Standard.Integer, and they are likely to be allocated more widely spaced than strictly necessary to satisfy the algorithm, to avoid sharing cache lines between adjacent elements. This wide spacing means that two parallel arrays might be interspersed, effectively turning a set of separate parallel arrays with common bounds, into an array of records, with one record per loop chunk, from a storage layout point of view.
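As a rough illustration in ordinary Ada of what such a layout might
look like (the record, the Alignment value, and Num_Chunks are all
assumptions for the sake of the sketch):

   type Chunk_Data is record
      Partial_Sum : Float := 0.0;
      Adjust      : Float := 0.0;
   end record
      with Alignment => 64;   --  e.g., one cache line per chunk

   Chunks : array (1 .. Num_Chunks) of Chunk_Data;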
Note that the same rules presented for parallel blocks above apply to the update of shared variables and the transfer of control to a point outside of the loop, and for this purpose each iteration (or chunk) is treated as equivalent to a separate sequence of a parallel block.
Automatic reduction of parallel arrays
--------------------------------------
It will be common for the values of a parallel array to be combined at the end of processing, using an appropriate reduction operator. Because this is a common operation, this proposal provides a language-defined attribute which will do this reduction, called "Reduced". This attribute can eliminate the need to write a final reduction loop.
The Reduced attribute will automatically reduce the specified parallel array using the operator that was used in the assignment statement that computed its value.
For large parallel arrays, this reduction can itself be performed in parallel, using a tree of computations. The reduction operator to be used can also be specified explicitly when invoking the Reduced attribute, using a Reducer and optionally an Identity parameter.
An explicit Reducer parameter is required when the parallelized loop contains multiple operations on the parallel array. More generally, the parameterized Reduced attribute with an explicit Reducer parameter may be applied to any array, and then the entire parallel reduction operation will be performed.
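Continuing the per-chunk maximum sketch above (and assuming, which this
proposal does not yet pin down, that an attribute function such as
Float'Max is acceptable as a Reducer):

   Overall_Max : constant Float :=
      Max_So_Far'Reduced (Reducer => Float'Max, Identity => Float'First);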
!wording

[note: this section is incomplete. More work is needed for wording]

Append to Introduction (28):

"A parallel block statement requests that two or more sequences of
statements should execute in parallel with each other."
"5.6.1 Parallel Block Statements
[A parallel_block_statement encloses two or more sequence_of_statements where all the sequence_of_statements can execute in parallel with each other.]
parallel_block_statement ::=
   parallel
      sequence_of_statements
   and
      sequence_of_statements
  {and
      sequence_of_statements}
   end parallel;
Static Semantics
Each sequence_of_statements represents a separate thread of control that proceeds independently and concurrently between the points where they interact with other tasks and with each other.
AARM - An implementation should statically treat each sequence_of_statements as a separate thread of control, but whether they actually execute in parallel or sequentially should be a determination that is made dynamically at run time, dependent on factors such as the available computing resources.
Example of a parallel block statement:
   parallel
      Foo(Z);
   and
      Bar(X, Y);
   and
      Put_Line ("Executing Foo and Bar in parallel with Other_Work");
      Other_Work;
   end parallel;
Bounded (Run-Time) Errors
It is a bounded error to invoke an operation that is potentially blocking within a sequence_of_statements of a parallel block. "
Add C.7.1 (5.1)
The Task_Id value associated with each sequence_of_statements of a parallel_block_statement is the same as that of the enclosing parallel_block_statement.
AARM - Each sequence_of_statements of a parallel block is treated as though it is executing as the task that encountered the parallel block statement.
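A sketch of the intent, using the proposed syntax (Outer is a local
constant introduced only for illustration):

   declare
      use Ada.Task_Identification;
      Outer : constant Task_Id := Current_Task;
   begin
      parallel
         pragma Assert (Current_Task = Outer);
         null;   --  same task identity in this sequence ...
      and
         pragma Assert (Current_Task = Outer);
         null;   --  ... and in this one
      end parallel;
   end;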
Change 9.10 (13)
"Both actions occur as part of the execution of the same task {unless they are each part of a different sequence_of_statements of a parallel block statement}"
Modify H.5 (1/2)
"The following pragma forces an implementation to detect potentially blocking operations within a protected operation {or within a sequence_of_statements of a parallel_block_statement}"
H.5 (5/2)
"An implementation is required to detect a potentially blocking operation within a protected operation {or within a sequence_of_statements of a paralell_block_statment}, and to raise Program_Error (see 9.5.1). "
H.5 (6/2)
"An implementation is allowed to reject a compilation_unit if a potentially blocking operation is present directly within an entry_body or the body of a protected subprogram {or a sequence_of_statements of a parallel_block_statement}. "
!discussion

There is a continuing trend of exponential growth of computational elements embedded on a single chip. This has led to significant challenges for software designers and implementers. Prior to 2005, the increase in transistor count corresponded to a similar increase in CPU clock speed, which boosted performance of sequential algorithms. Since then, CPU clock speeds have leveled off for reasons of practicality, and chip manufacturers have instead moved towards multicore technologies as a means of achieving performance increases.
As parallel technologies become more prevalent in the form of new emerging hardware platforms, safety concerns need to consider the paradigm shift from sequential to parallel behavior. It is a paradigm that has long been confined to the specialized domain of high-performance computing, but that is now required in all domains of computing systems.
It is important for the programmer to indicate places in the code where parallelism is desired, which allows the reader to better understand the algorithm, and allows the programmer to receive confirmation from the compiler whether the desired parallelism is safe, or even worthwhile.
This proposed model does not define syntax for the explicit parallelization of individual subprogram calls, since such parallelization can be performed implicitly by the compiler, when it knows that the calls are free of side-effects. This is facilitated by annotations identifying global variable usage on subprogram specifications (see AI12-0079-1).
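For example (a sketch: F, A, B, and Z are placeholders, and the Global
aspect syntax is as proposed by AI12-0079-1), a compiler could evaluate
the two calls below in parallel on its own:

   function F (X : Integer) return Integer
      with Global => null;   --  no side-effects visible to the compiler
   ...
   Z := F (A) + F (B);       --  the two calls are free of side-effects,
                             --  so they may be evaluated in parallel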
parallel Blocks
---------------
example:

   declare
      X, Y : Integer;
      Z : Float;
   begin
      parallel
         X := Foo(100);
      and
         Z := Sqrt(3.14) / 2.0;
         Y := Bar(Z);
      end parallel;

      Put_Line("X + Y=" & Integer'Image(X + Y));
   end;
In this example, the calculations of Z and Y occur sequentially with respect to each other, but in parallel with the calculation of X. Note that the compiler, using the rules specified in AI12-0079-1, may complain if the parallel sequences might have conflicting global side-effects.
The parallel block construct is flexible enough to support recursive usage as well, such as:
   function Fibonacci (N : Natural) return Natural is
      X, Y : Natural;
   begin
      if N < 2 then
         return N;
      end if;

      parallel
         X := Fibonacci (N - 2);
      and
         Y := Fibonacci (N - 1);
      end parallel;

      return X + Y;
   exception
      when others =>
         Log ("Unexpected Error");
   end Fibonacci;
We considered allowing the parallel block to be preceded by an optional declare part, and followed by optional exception handlers, but it was observed that objects shared across multiple parallel sequences are more likely to be useful if they outlive the parallel block, and that exception handlers after the last parallel sequence could easily be misconstrued as applying only to that last sequence. Therefore we reverted to the simpler syntax proposed above, which is also more congruous with the syntax for select statements.

Because there are no local declarations, there was also no point in having a statement_identifier (block label) for a parallel block. This is not entirely true, however: allowing an exit statement to replace a goto that targets the location following the end of the parallel block seems useful, and to allow such a statement, a block label might be needed to identify the parallel block being exited. It was felt that the need for allowing an exit statement in a parallel block could be the subject of a separate AI.
parallel Loops
--------------
In most compute-intensive applications, a significant proportion of the computation time is spent in loops, either iterating over arrays/container data structures, or systematically searching a large solution space. To benefit from parallel hardware, the computation associated with a loop should be spread across the available processors.
One approach, presuming the iterations of the loop have no data dependences between them, is to treat each iteration of the loop as a separate tasklet, and then have the processors work away on the set of tasklets in parallel. However, this introduces overhead from the queuing and de-queuing of work items, and the communication of results from each work item. Furthermore, there often are data dependences between iterations, and creating a separate work item for each iteration can introduce excessive synchronization overhead to deal safely with these interdependences. Therefore, it is common to break large arrays, and/or the loops that iterate over them, into chunks (or slices or tiles), where each chunk is processed sequentially, but multiple chunks can be processed in parallel with one another.
Note: Allowing the parallel keyword on while loops was considered, but was discarded since while loops cannot be easily parallelized, because the control variables are inevitably global to the loop.
For example, here is a simple use of a parallelized loop, with a parallel array of partial sums (with one element per chunk), which are then summed together (sequentially) to compute an overall sum for the array:
   declare
      Partial_Sum : array (parallel <>) of Float := (others => 0.0);
      Sum : Float := 0.0;
   begin
      for I in parallel Arr'Range loop
         Partial_Sum(<>) := Partial_Sum(<>) + Arr(I);
      end loop;

      for J in Partial_Sum'Range loop
         Sum := Sum + Partial_Sum(J);
      end loop;

      Put_Line ("Sum over Arr = " & Float'Image (Sum));
   end;
In this example, the programmer has merely specified that the Partial_Sum array is to be a parallel array (with each element initialized to 0.0), but has not specified the actual bounds of the array, using "<>" instead of an explicit range such as "1 .. Num_Chunks". In this case, the compiler will automatically select the appropriate bounds for the array, depending on the number of chunks chosen for the parallelized loops in which the parallel array is used.
In the above case, we see "Partial_Sum(<>)" indicating that we are accumulating the sum into a different element of Partial_Sum in each distinct chunk of the loop.
To illustrate the use of common chunking used between consecutive loops, here is a pair of parallelized loops that produce a new array that is the cumulative sum of the elements of an initial array. The parallel arrays Partial_Sum and Adjust are used to carry data from the first parallelized loop to the second parallelized loop:
   declare
      Partial_Sum : array (parallel <>) of Float := (others => 0.0);
      Adjust : array (parallel Partial_Sum'Range) of Float := (others => 0.0);
      Cumulative_Sum : array (Arr'Range) of Float := (others => 0.0);
   begin
      --  Produce cumulative sums within chunks
      for I in parallel Arr'Range loop
         Partial_Sum(<>) := Partial_Sum(<>) + Arr(I);
         Cumulative_Sum(I) := Partial_Sum(<>);
      end loop;

      --  Compute adjustment for each chunk
      for J in Partial_Sum'First .. Partial_Sum'Last - 1 loop
         Adjust(J + 1) := Adjust(J) + Partial_Sum(J);
      end loop;

      --  Adjust elements of each chunk appropriately
      for I in parallel Arr'Range loop
         Cumulative_Sum(I) := Cumulative_Sum(I) + Adjust(<>);
      end loop;

      --  Display result
      Put_Line("Arr, Cumulative_Sum");
      for I in Cumulative_Sum'Range loop
         Put_Line(Float'Image(Arr(I)) & ", " & Float'Image(Cumulative_Sum(I)));
      end loop;
   end;
Note that this feature eliminates the need to reference two different elements of the same array (element I and element I - 1) within any of the parallel loop bodies. This reduces expression complexity and eliminates data race issues at chunk boundaries, where element I - 1 could refer to an element of another chunk.
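For contrast, here is an illustrative sketch of the formulation this
feature avoids, which reads across chunk boundaries:

   for I in parallel Arr'Range loop
      --  unsafe: element I - 1 may belong to a different chunk,
      --  creating a data race at chunk boundaries (and is in any case
      --  out of range when I = Arr'First)
      Cumulative_Sum (I) := Cumulative_Sum (I - 1) + Arr (I);
   end loop;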
Note also that chunking is not explicit in parallelized loops, and in the above example, the compiler is free to use as few or as many chunks as it decides is best, though it must use the same number of chunks in the two consecutive parallelized loops because they share parallel arrays with common bounds. The programmer could exercise more control over the chunking by explicitly specifying the bounds of Partial_Sum, rather than allowing it to default. For example, if the programmer wanted these parallelized loops to be broken into "N" chunks, then the declarations could have been:
   Partial_Sum : array (parallel 1 .. N) of Float := (others => 0.0);
   Adjust : array (parallel Partial_Sum'Range) of Float := (others => 0.0);
   ...
Automatic Reduction of parallel arrays
--------------------------------------

As is illustrated above by the first example, it will be common for the values of a parallel array to be combined at the end of processing, using an appropriate reduction operator. In this case, the Partial_Sum parallel array is reduced by "+" into the single Sum value.
The use of the 'Reduced attribute can eliminate the need to write the final reduction loop in the first example; instead we could have written simply:

   Put_Line ("Sum over Arr = " & Float'Image (Partial_Sum'Reduced));
The Reduced attribute will automatically reduce the specified parallel array using the operator that was used in the assignment statement that computed its value -- in this case the "+" operator appearing in the statement:

   Partial_Sum(<>) := Partial_Sum(<>) + Arr(I);
To illustrate the specification of the 'Reduced attribute, including the optional Identity parameter, consider:
 Put_Line ("Sum over Arr = " &     Float'Image (Partial_Sum'Reduced(Reducer => "+",
Identity => 0.0)));
The parameter names are optional, so this could have been:

   Put_Line ("Sum over Arr = " &
             Float'Image (Partial_Sum'Reduced("+", 0.0)));
Since we also allow the 'Reduced attribute to be applied to an array, the first example could have been completely replaced with simply:
 Put_Line ("Sum over Arr = " & Float'Image (Arr'Reduced("+", 0.0)));
!example

** TBD.
!ACATS test
ACATS B-Test and C-Test are needed for these features.

!appendix

From: Tucker Taft
Sent: Tuesday, June 17, 2014  10:18 AM

Here is the paper the "gang of 4" is submitting to HILT 2014 with a relatively
detailed proposal for parallel programming in Ada 202x:

If there is time at the upcoming ARG meeting, some discussion and feedback from
ARG members would be appreciated.


From: Randy Brukardt
Sent: Tuesday, June 17, 2014  12:52 PM

Are we going to get formal proposals for some/all of these features? For
instance, we have an open AI on global in/out (AI12-0079-1) which is certainly
more broadly applicable than just parallelism. There is also an AI for
Potentially_Blocking, AI12-0026-1, misnamed "Task_Safe", which is useful for any
program including tasking. In both of these cases, an important part of the
proposals will be how they interact with predefined and language-defined
subprograms. (If they're not specified for predefined "+", nothing useful could
be done, for example.)

As with all of these sorts of things, the devil is in the details. The basic
ideas look fine, but we'll never really know without detailed proposals that can
be tested against.


From: Tucker Taft
Sent: Tuesday, June 17, 2014   1:54 PM

True, but it is not worth making a more detailed proposal if the overall
reaction is "yuck"!  ;-)


From: Randy Brukardt
Sent: Tuesday, June 17, 2014   3:10 PM

Well, it looks great to me, but that's mainly because all of the details are
missing. A lot of stuff looks great without the details (it's how modeling
works, after all). The "yuck" only comes when the details are fleshed out...


From: Bob Duff
Sent: Tuesday, June 17, 2014   5:52 PM

Sounds like there are some good ideas in the paper, but I am very strongly
opposed to ARG getting "detailed proposals" about how to do parallelism, or even
wasting time discussing the matter.  ARG should be moving more to the
"standardize existing practise" way of doing.  And that goes double for features
whose ONLY purpose is efficiency -- such as parallelism.

We (ARG) don't need detailed proposals in that area; we need an experimental
implementation, and MEASUREMENTS of the efficiency of realistic programs.  (And
don't try to show me speedups of the recursive Fibonacci -- that's downright
dishonest!)  Until then, ARG should avoid wasting any time on the matter.  Once
we have proven practice, then ARG should take charge of standardizing it.

My comments above don't apply so much to things like Globals, which are more
about correctness/safety than efficiency.  ARG could reasonably discuss Globals,
although even there, an experimental implementation would be helpful.

Regarding those: You can put Globals and Potentially_Blocking on a package.
That's good.  I think it should apply to child and nested packages of that
package. I think you also want a way to give a default for Overlaps_Storage: "no
overlaps unless specified, in this package and children".


From: Robert Dewar
Sent: Wednesday, June 18, 2014  2:52 AM

> Sounds like there are some good ideas in the paper, but I am very
> strongly opposed to ARG getting "detailed proposals" about how to do
> parallelism, or even wasting time discussing the matter.  ARG should
> be moving more to the "standardize existing practice" way of doing.
> And that goes double for features whose ONLY purpose is efficiency --
> such as parallelism.

I am in full agreement with this.

At the last RTW, which I attended, I was struck by the fact that the RTW was
debating removing some features in Annex D which no one has ever implemented,
let alone found useful in actual practice.

The RM is not a research proposal :-)


From: Alan Burns
Sent: Wednesday, June 18, 2014  5:59 AM

The policy currently in Ada has been implemented by the Ada run-time from
Santandar, and is supported in EDF operating systems (not Ada), though I agree
these are mainly research and experimental platforms, not mainstream industrial


From: Robert Dewar
Sent: Wednesday, June 18, 2014  10:39 AM

But I am talking about the specific Ada features, which for sure have not been
used by anyone in an Ada program :-)


From: Tucker Taft
Sent: Wednesday, June 18, 2014  8:38 AM

> The RM is not a research proposal :-)

I agree that many language standards come after the fact, and are used to try to
tame a wildly varying set of ad hoc extensions.  That has never been the story
with Ada.  There are relatively few extensions (other than as packages, pragmas,
or attributes) that Ada vendors or users try in any serious way, before the Ada
Rapporteur Group begins to talk seriously about them.  The ideal is that the ARG
gets the ball rolling, indicates a direction we are serious about, and then
vendors try prototyping the ideas, and then we come around to picking among the
"serious" ideas based on prototyping results.  I really don't see anyone beside
the ARG being taken seriously at this point as a fountain for "ideas worth
prototyping" (apologies to TED ;-).


From: Stephen Michell
Sent: Sunday, June 22, 2014  10:24 PM

Attached is a report that ARG requested from Brad, Tucker  and myself in
investigating how OpenMP could be used to implement Ada parallelism. For
discussion at the ARG meeting this week.


				Report to WG 9/ARG 
 		  on the use of OpenMP for Parallelism in Ada

Stephen Michell
with contributions from 
Luis Miguel Pinho
Brad Moore
Tucker Taft


At the ARG meeting November 2014, ideas were presented about ways to
implement parallelism in Ada, and ways to interoperate with other
language systems while using the multicore capabilities of modern
computers. We were requested to investigate OpenMP from a technical
perspective as well as a business perspective as a potential solution
to implementing multicore parallelism in Ada.

OpenMP Characteristics

OpenMP is a set of libraries and language extensions that permit a
user to access the capabilities of multicore systems from within
certain languages - namely Fortran, C and C++.

OpenMP's parallelism is "explicit parallelism" - i.e., all aspects of
the parallelism are specified by compiler directives in the code that
invoke library routines at the place of the directive through OpenMP APIs.

OpenMP lets the user specify code segments (usually "for loops",
blocks and functions) that can be executed by threads (possibly
operating on multiple cores) that are co-ordinated to implement an
algorithm. The language extensions are converted into library calls to
interact with a runtime to implement the algorithm.

In trying to implement OpenMP on an Ada system, one would note that
the concurrency model is quite different from Ada's. The "threads"
that are created interact with other threads that use OpenMP
primitives. This means that they will not interact with Ada tasks
using Ada tasking primitives.

For interoperability with Fortran, C and C++ systems that are
using OpenMP, an Ada OpenMP capability would be useful, whether or not
it interacted with Ada tasks. This brings us to the business model.

Business Model

To implement OpenMP for Ada, one must participate in the OpenMP
community. One could develop compiler directives, either as aspects or
pragmas, and convert calls to meet the API, but to have it
"standardized", one must be a full participant in the OpenMP group.

One can participate in OpenMP as a university, as an individual or as
a corporate participant. AdaCore was approached to see if they were
interested in joining OpenMP (and paying fees), but declined. There is
no point joining as individuals, as individuals cannot propose new
language specifications for standardization. 

This leaves participation through a university. Barcelona
Supercomputing Group is a participant, and Miguel Pinho has approached them about
participating through Barcelona. Barcelona is receptive, but
participation requires that Miguel find a PhD student willing to
participate. To date there has been no final resolution.

There is a GNU OpenMP implementation (GOMP) that supports the GNU
Fortran, C and C++ compilers. Ada could define a thin binding to that
interface. This would still leave the challenge of defining Ada syntax
for the OpenMP compiler directives and determining how such a library
would interact with the Ada runtime (tasking is not the only issue -
elaboration is also a potential issue).


There are business reasons and technical reasons for creating an
OpenMP Ada implementation. If an interface with OpenMP through
Barcelona Supercomputing Group becomes possible, further exploration should
happen. Otherwise, no further action in this area is anticipated.

