Version 1.5 of ai05s/ai05-0117-1.txt

Unformatted version of ai05s/ai05-0117-1.txt version 1.5
Other versions for file ai05s/ai05-0117-1.txt

!standard C.6(16)          10-10-18 AI05-0117-1/02
!class Amendment 08-10-16
!status work item 08-10-16
!status received 08-07-29
!priority Medium
!difficulty Hard
!subject Memory barriers and Volatile objects
!summary
Pragma Volatile has its implementation advice changed to ensure that it supports the required efficient serialisation on multiprocessor platforms.
!problem
Some Lock-free and wait-free algorithms rely on a specific order of loads and stores of some variables. However, modern multi-core architectures do not guarantee this ordering between processors unless special instructions are used. Ada should require some behavior which will allow these algorithms to be programmed in a straightforward manner.
!proposal
Weaken the definition of Volatile so that it provides the required ordering but does not require all read/writes to be performed to actual memory locations.
!wording
Delete C.6 (16)
Delete Implementation Note C.6(16.a).
Add a new Implementation Note to AARM (or Implementation Advice to LRM):
On a multiprocessor, any read or update of a volatile object should involve the use of an appropriate memory barrier to ensure that all tasks on all processors see the same order of updates to volatile variables.
!discussion
After considering a new pragma (Coherent) it was decided to modify the current definition of pragma Volatile to remove the requirement that caches etc cannot be used with volatile variables. This brings the Ada definition of volatile closer to that used in other languages.
Paragraph C.6(16) is removed as 1.1.3 (13) already makes it clear that any read or update of a volatile object has an external effect.
!example
The following will ensure that task 2 does get the value 42.
Data : Integer; pragma Volatile (Data);
Flag : Boolean := False; pragma Volatile (Flag);
in task 1:
Data := 42; Flag := True;
in task 2:
loop exit when Flag; end loop; Copy := Data;
!ACATS test
!appendix

!topic Memory barriers and pragma Atomic / Volatile
!reference Ada 2005 RM RM95-1.1.3(13)
!from Santiago Urueña-Pascual 2008-07-28
!keywords sequential consistency, relaxed consistency, transactional memory
!discussion

Some lock-free and wait-free algorithms rely in a specific order of
loads and stores of some variables:

Data : Integer_64;
pragma Volatile (Data);

Flag : Boolean;
pragma Atomic (Flag);


task 1:

 Data := ...;
 Flag := True;


task 2:

 loop
   exit when Flag;
 end loop;
 Copy := Data;


This code is guaranteed to work in a single processor. However, in a
multiprocessor, virtually all modern CPU architectures allow a relaxed
memory consistency model, a microarchitecture optimization to reorder
internally some loads and stores.[1] For example, in the CPU executing
task 1 the second assignment could be written to L1 cache before the
first assignment. The CPU views internally its own execution sequentially,
but other CPUs of the system could view a different order, and thus
task 2 could read an obsolete value of 'Data'.

Those architectures provide specific machine instructions to disable
that optimization, called memory barriers or memory fences.[2][3] For
example, inserting an adequate memory barrier between the two assignments
of task 1 and before reading 'Data' in task 2 means that the programmer
explicitly needs an specific order, at the cost of reduced performance.
It is worth noting that not all lock-free or wait-free algorithms require
memory barriers (highly architecture dependent).

It seems that the Ada Reference Manual requires that the order of any read
or update of an Atomic or Volatile object is guaranteed. That means that
memory barriers must be inserted by the compiler. Is that intended?

Current programming languages have taken different approaches. For example,
the C keyword 'volatile' just disables some compiler optimizations,[4] so
either the programmer should use the adequate OS synchronization API, like
POSIX mutex, or insert the adequate memory barriers (e.g. when programming
a specific spinlock implementation). It seems that future versions of C++
will provide sequential consistency by default, but a new low-level
mechanism is added for expert programmers mastering the relaxed consistency
model.[5]

Some projects, like LLVM or the Linux kernel, provide a portable API for
memory barriers:

package Machine_Code is
   procedure Memory_Fence;         -- x86: mfence   Alpha: mf   PowerPC: sync
   procedure Read_Memory_Fence;    -- x86: lfence   Alpha: mf   PowerPC: sync
   procedure Write_Memory_Fence;   -- x86: sfence   Alpha: wmb  PowerPC: sync
   procedure IO_Memory_Fence;      -- PowerPC: eieie

   procedure Read_Dependence_Memory_Fence;  -- Alpha only: mf
end;

It seems impossible for the compiler to decide whether a memory barrier is
actually needed, depending on the semantics of the specific algorithm, so
probably a brute force approach is required (inserting memory barriers before
and after all volatile accesses, even if not required) even if this kills
performance in current CPUs. It should be noted that there are some
architectures that require a different type of memory barrier for hardware
devices.

However, a sophisticated compiler could try to generate code for some protected
objects using no locks, just atomic instructions. In this case, the compiler
could probably be able to analyse all the protected subprograms, and decide
whether memory barriers instructions are required (and, if needed, just the
minimum number of them). It is also worth noting that there is a separate
paradigm in modern architectures called 'transactional memory'.[6] A
sophisticated compiler could also generate code for protected objects without
any lock, with mutual exclusion directly guaranteed by this very new hardware.
But, does the Reference Manual allow these advanced protected object
implementations?

== References ==

[1] Shared Memory Consistency Models: A Tutorial
     -- http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
[2] Memory Ordering in Modern Microprocessors, Part I
     -- http://www.linuxjournal.com/article/8211
[3] Memory Ordering in Modern Microprocessors, Part II
     -- http://www.linuxjournal.com/article/8212
[4] volatile -- http://www.airs.com/blog/archives/154
[5] Memory Consistency Models: Convergence at last!
     -- http://www.cse.iitk.ac.in/users/mtworkshop07/workshop/DayIII/adve.sarita.pdf
[6] Sun slots transactional memory into Rock
     -- http://www.theregister.co.uk/2007/08/21/sun_transactional_memory_rock/


****************************************************************

From: Robert A. Duff
Date: Thursday, July 31, 2008  10:34 AM

...
> This code is guaranteed to work in a single processor.

Atomic and volatile are hard to reason about, but I think the above is
guaranteed to work, no matter how many processors.

Ada doesn't even have a built-in way to query or control the number of
processors.

...
> It seems that the Ada Reference Manual requires that the order of any
> read or update of an Atomic or Volatile object is guaranteed. That
> means that memory barriers must be inserted by the compiler. Is that intended?

I think so.

...
> It seems impossible for the compiler to decide whether a memory
> barrier is actually needed, depending on the semantics of the specific
> algorithm, so probably a brute force approach is required (inserting
> memory barriers before and after all volatile accesses, even if not
> required) even if this kills performance in current CPUs. It should be
> noted that there are some architectures that require a different type
> of memory barrier for hardware devices.

I think any interfacing to hardware devices has to be implementation dependent.

> However, a sophisticated compiler could try to generate code for some
> protected objects using no locks, just atomic instructions. In this
> case, the compiler could probably be able to analyse all the protected
> subprograms, and decide whether memory barriers instructions are
> required (and, if needed, just the minimum number of them). It is also
> worth noting that there is a separate paradigm in modern architectures
> called 'transactional memory'.[6] A sophisticated compiler could also
> generate code for protected objects without any lock, with mutual
> exclusion directly guaranteed by this very new hardware. But, does the
> Reference Manual allow these advanced protected object implementations?

I think so.

> == References ==

Thanks for the interesting refs.

****************************************************************

From: Alan Burns
Date: Thursday, February 18, 2010  7:15 AM

I was charged with looking at AI-117 - Memory barriers and Volatile objects

AI05-0117 asks the question - on a modern multicore architecture can lock-free
and wait-free algorithms be implemented in a straightforward way using the
current provisions of Ada?

My opinion is a guarded, yes. But a weaker form of Volatile would be useful.

Lock-free and wait-free algorithms do not use protected types or other forms of
synchronisation but employ shared variables.

To work correctly these shared variables must not have their load and store
instructions reordered. Reordering, in general, is however used extensively to
improve performance. It is not however necessary for load and store operations
to be performed directly on memory as any intermediate cache will ensure
consistency. Reordering can be the result of compiler or hardware optimisations.

Hardware provides memory barriers (MBs) to prevent reordering.
Different forms of MBs are supported on different hardware. There are readMBs
and writeMBs - the readMB being more efficient if the shared variable is only
read.

Ada proves Volatile and Atomic pragmas:

"For a volatile object all reads and updates of the object as a whole are
performed directly to memory.

Implementation Note: This precludes any use of register temporaries, caches, and
other similar optimizations for that object."

Its is clear that this will prevent reordering.

So, if the compiler controls its own reordering and places MB instructions after
each assess to a volatile object the correct behaviour will ensue.

There is a question of efficiency. Most lock-free and wait-free algorithms make
use of a only a few shared variables and local variables can be used where
necessary to reduce the actual updates to the volatile ones. The only
inefficiency would therefore seem to be that Volatile prohibits the use of
caching, whereas this is not strictly required for these algorithms. A new
pragma, No_Reordering, could be provided that is weaker than Volatile.

The AI noted the possibility of providing an API to the operations on the MBs.
This is not recommended as there is no single set of operations; hardware
solutions differ. Other languages provide various forms of volatile (eg C, C++
and Java).

****************************************************************

From: Alan Burns
Date: Friday, April 23, 2010  2:37 AM

A question about where to place the definition of the pragma.

To recap - the point of this pragma is to ensure that the compiler/processor
does not reorder memory assesses (so that various useful algorithms will work on
multiprocessors).

This seems easy to ensure by defining any read or write on a coherent object to
be an 'external effect' (1.1.3(13)). This with 1.1.3(15) about ordering would
seem to be sufficient.

my reading of this is that if the program needs to ensure that:
write(x);
write(y);
does indeed occur in that order then BOTH x and y would need to be defined as
coherent - agree? or is it sufficient just to define x as coherent?

This pragma could be defined in C.6 with atomic and volatile, but to give it
equal status would require considerable rewriting. But as coherence is only
really needed with multiprocessor systems then I feel it would be easier to
define it in the new Multiprocessor section of the real-time annex - agree?

****************************************************************

From: Jean-Pierre Rosen
Date: Friday, April 23, 2010  2:53 AM

> This seems easy to ensure by defining any read or write on a coherent
> object to be an 'external effect' (1.1.3(13)).

Why isn't volatile sufficient (it is an external effect for volatile)?

****************************************************************

From: Alan Burns
Date: Friday, April 23, 2010  4:25 AM

This was discussed at last meeting - volatile is too strong, yes it gives what
is required, but it also requires all writes to be direct to memory (effectively
bypassing cache). This is not necessary as cache coherence is fine, the key is
that the complier does not reorder. See discussion in AI

****************************************************************

From: Robert Dewar
Date: Friday, April 23, 2010  11:49 AM

> A question about where to place the definition of the pragma.

I definitely think this pragma must be optional, i.e. the compiler is free to
reject it if it cannot be guaranteed, like pragma Atomic. I am pretty sure we
would just reject it all the time. Because we have no way of constraining the
code generator in this respect.

****************************************************************

From: Robert Dewar
Date: Friday, April 23, 2010  11:50 AM

> This was discussed at last meeting - volatile is too strong, yes it
> gives what is required, but it also requires all writes to be direct
> to memory (effectively bypassing cache). This is not necessary as
> cache coherence is fine, the key is that the complier does not
> reorder. See discussion

Ah, OK, so in fact this is easy to implement, we just treat it as Volatile, so I
withdraw my objection.

****************************************************************

From: Randy Brukardt
Date: Friday, April 23, 2010  1:01 PM

Right, that works, but as we're determined, that requires generating memory
barriers in many places for multicore/multiprocessor architectures. Which is a
real performance drag.

I don't know about GNAT specifically, but I would guess that what the majority
of compilers have been calling "Volatile" is really "Coherent". So I think you
are right that Coherent can be implemented as (current) Volatile, but Volatile
will often need more work to be implemented correctly. Of course, that depends
on whether customers care; I don't think there is any practical way for portable
tests (like the ACATS) to verify if this is done properly -- it requires
examining the emitted code for each architecture. So customer demand will be the
only incentive to implement Volatile correctly.

****************************************************************

From: Robert Dewar
Date: Saturday, April 24, 2010  11:47 AM

To me, if the Ada notion of volatile does not match C's notion of volatile then

a) it is junk

b) GNAT will not implement it, regardless of customer demand!

****************************************************************

From: Bob Duff
Date: Friday, April 23, 2010  9:09 PM

I'm not a C language lawyer, but my impression is that nobody really understands
precisely what "volatile" in C means. And I've heard that the committee
deliberately left it vague, so implementations can do more-or-less what they
like.

I also read a paper where they tested various compilers, and found that nobody
implements volatile in C properly (for whatever semantics the authors of that
paper think is "proper").

I had a conversation with somebody who was griping about the lack of clarity in
the C standard w.r.t. volatile, and I pointing him to the Ada definition, and I
think he said the Ada version was much superior.

The C definition has something to do with setjump/longjump (in part).  And C
doesn't have threads.

This is all rumor and hearsay -- sorry.

****************************************************************

From: Robert Dewar
Date: Sunday, April 25, 2010  4:56 AM

In practice GNAT does (and will for ever) give the same semantics for Volatile
that gcc does. People use volatile all the time both in GNAT and GNU C, without
any problems or reports of difficulty for 15 years, so I don't see that there is
a problem here that needs solving, or that a more precise solution will be of
any use.

I am not aware at all of the confusion over volatile of which Bob Duff presents
rumor and hearsay :-)

However, I am not concerned too much, sonds like 10 mins work to add pragma
Coherent and make it mean the same as Volatile.

****************************************************************

From: Bob Duff
Date: Sunday, April 25, 2010  8:14 AM

> I am not aware at all of the confusion over volatile of which Bob Duff
> presents rumor and hearsay :-)

The paper I mentioned is here:

    http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf

    "Volatiles Are Miscompiled, and What to Do about It"

The rest of what I said remains rumor and hearsay.

****************************************************************

From: Alan Burns
Date: Monday, April 26, 2010  5:21 AM

> However, I am not concerned too much, sonds like 10 mins work to add
> pragma Coherent and make it mean the same as Volatile.

Just to recap - leaving aside 'atomic' there are two properties that 'volatile'
is aimed at:

1 - all rights/read go directly to the memory location - needed (only) when the
memory location is a register for some external device.

2 - control reordering (bound what reordering can take place due to compiler or
processor behaviour).

Not sure what GNAT does, but C seems to be aimed more at the second property.
Ada's 'volatile' is concerned with the first property. So GNAT C may not give
Ada's semantics - it will ensure no reordering but will leave it to memory
manager to decide when, for example, values can just stay in cache.

With multicore there is a real need to get control over reordering. Hence the
introduction, in Ada, of Coherent - this would be easy to implement as it just
requires the insertion of calls to a memory barrier which all such hardware
seems to support.

So I would guess that GNAT 'Volatile' implements Ada's new 'Coherent'.
But this leaves open the question of how to get Ada's Volatile.

I'm told that Java's volatile is defined in terms of Lamport's 'happens before'
operation - hence it is concerned with ordering.

PS the discussions did not however tell me where to define Coherent?

****************************************************************

From: Robert Dewar
Date: Monday, April 26, 2010  12:01 PM

> 1 - all rights/read go directly to the memory location - needed (only)
> when the memory location is a register for some external device.

it is completely bogus to use pragma Volatile for this purpose, since in
practice you will be making the non-portable assumption that a load or store
uses a single appropriate machine instruction. Even the use of Atomic is dubious
for this purpose! Device registers need to be accessed with very specific
instructions. The ONLY legitimate way to ensure this is to write the appropriate
ASM inserts. In practice Atomic may work fine for most cases, but I would NEVER
use Volatile for this purpose.

For example, if you have a 32-bit hardware register, and you set one bit, then
several sequences of instructions are possible on a 386 (use bit instructions,
use load/store, load/store byte rather than word etc). The use of Atomic pretty
much guarantees a 32-bit read followed by a 32-bit write. If that's what you
want it will probably work, but you cannot guarantee this from the RM. If you
use Volatile, all bets are off as to the exact instructions generated.

> 2 - control reordering (bound what reordering can take place due to
> compiler or processor behaviour).

Actually one of the most common uses of Volatile is to make sure that two
independent tasks correctly access a common version without making private
copies, e.g. a circular buffer where the input and output pointers are atomic,
and the actual buffer contents is volatile.

> Not sure what GNAT does, but C seems to be aimed more at the second
> property. Ada's 'volatile' is concerned with the first property. So
> GNAT C may not give Ada's semantics - it will ensure no reordering but
> will leave it to memory manager to decide when, for example, values
> can just stay in cache.

As I said, to me if Volatile in Ada does not do what Volatile in C does, then it
is confusing and useless.

> With multicore there is a real need to get control over reordering.
> Hence the
> introduction, in Ada, of Coherent - this would be easy to implement as
> it just requires the insertion of calls to a memory barrier which all
> such hardware seems to support.

Please do not assume that "instructions available on hardware" = "easy to
implement". This equation is simply false in large modern compiler systems.

> So I would guess that GNAT 'Volatile' implements Ada's new 'Coherent'.
> But this leaves open the question of how to get Ada's Volatile.

If Volatile is supposed to mean something different from Volatile in C, then I
would say, you can't expect to "get" it at all. As I noted before, we never had
a customer query or complaint in this area for the entire history of GNAT.

> PS the discussions did not however tell me where to define Coherent?

Since it's virtually the same issue as Volatile, I would put it in the same
place.

****************************************************************

From: Tucker Taft
Date: Monday, April 26, 2010  12:22 PM

>> 1 - all rights/read go directly to the memory location - needed
>> (only) when the memory location is a register for some external device.
>
> it is completely bogus to use pragma Volatile for this purpose, since
> in practice you will be making the non-portable assumption that a load
> or store uses a single appropriate machine instruction. Even the use
> of Atomic is dubious for this purpose! Device registers need to be
> accessed with very specific instructions. The ONLY legitimate way to
> ensure this is to write the appropriate ASM inserts. In practice
> Atomic may work fine for most cases, but I would NEVER use Volatile
> for this purpose.

I agree.  You generally need to use pragma Atomic if you want to guarantee a
single instruction.  Volatile operations are allowed to result in multiple
instructions.

> For example, if you have a 32-bit hardware register, and you set one
> bit, then several sequences of instructions are possible on a 386 (use
> bit instructions, use load/store, load/store byte rather than word
> etc). The use of Atomic pretty much guarantees a 32-bit read followed
> by a 32-bit write. If that's what you want it will probably work, but
> you cannot guarantee this from the RM. If you use Volatile, all bets
> are off as to the exact instructions generated...

I don't happen to agree that it is always necessary to use special instructions.
A lot of device drivers have been written in the "Unix" era without ever using
assembler. I suppose some of these depended on the "|=" C operator, so Ada comes
up short there...

Of course if the device is not memory mapped, but instead uses some kind of
built-in "port," then you clearly need to resort to a special "intrinsic"
operation.

****************************************************************

From: Bob Duff
Date: Monday, April 26, 2010  12:48 PM

> As I said, to me if Volatile in Ada does not do what Volatile in C
> does, then it is confusing and useless.

I think you mean that if Volatile in GNAT does not do what Volatile in gcc C
does, then it is confusing and useless.  Because last time I looked at this part
of the C standard it was confusing and useless, but that doesn't mean a
particular C compiler can't do something useful.

Anyway, why are we adding a new feature, if in practice it will be implemented
the same as Volatile?

****************************************************************

From: Robert Dewar
Date: Monday, April 26, 2010  1:01 PM

> I think you mean that if Volatile in GNAT does not do what Volatile in
> gcc C does, then it is confusing and useless.  Because last time I
> looked at this part of the C standard it was confusing and useless,
> but that doesn't mean a particular C compiler can't do something
> useful.

Whether C volatile in the standard does anything useful, I have no idea.
But C volatile in practice is very useful, and works fine, and Ada need do no
better (in fact Ada already does better, from having Atomic -- a typical
implementation in C gets around this by treating volatile the same as atomic in
Ada if the type is suitable for that purpose (that of course is implementation
dependent, as it is in Ada).

> Anyway, why are we adding a new feature, if in practice it will be
> implemented the same as Volatile?

indeed.

****************************************************************

From: Tucker Taft
Date: Monday, April 26, 2010  1:13 PM

In practice, volatile works fine for mono-processors.
As we move toward multicore/multi-processors being the norm, volatile will
probably be insufficient.  I can believe GNAT will necessarily wait until
something akin to Coherent is handled by the GCC back end, but that doesn't mean
the distinction isn't important in the timeframe of Ada 2012.

****************************************************************


Questions? Ask the ACAA Technical Agent