Version 1.3 of ai05s/ai05-0117-1.txt

!standard C.6(23)          08-10-16 AI05-0117-1/00
!class Amendment 08-10-16
!status work item 08-10-16
!status received 08-07-29
!priority Medium
!difficulty Hard
!subject Memory barriers and Volatile objects
!summary
(See proposal.)
!problem
Some lock-free and wait-free algorithms rely on a specific order of loads and stores of some variables. However, modern multi-core architectures do not guarantee that ordering across processors unless special instructions are used. Ada should require behavior that allows these algorithms to be programmed in a straightforward manner.
More generally, the language ought to consider additional ways to support multi-core architectures.
!proposal
** TBD **
!wording
** TBD **
!discussion
** TBD **
!example
!ACATS test
!appendix

!topic Memory barriers and pragma Atomic / Volatile
!reference Ada 2005 RM RM95-1.1.3(13)
!from Santiago Urueña-Pascual 2008-07-28
!keywords sequential consistency, relaxed consistency, transactional memory
!discussion

Some lock-free and wait-free algorithms rely on a specific order of
loads and stores of some variables:

Data : Integer_64;
pragma Volatile (Data);

Flag : Boolean;
pragma Atomic (Flag);


task 1:

 Data := ...;
 Flag := True;


task 2:

 loop
   exit when Flag;
 end loop;
 Copy := Data;


This code is guaranteed to work on a single processor. However, on a
multiprocessor, virtually all modern CPU architectures allow a relaxed
memory consistency model, a microarchitectural optimization that internally
reorders some loads and stores.[1] For example, in the CPU executing
task 1 the second assignment could be written to the L1 cache before the
first assignment. Each CPU sees its own execution as sequential, but other
CPUs in the system could observe a different order, and thus task 2 could
read a stale value of 'Data'.

Those architectures provide specific machine instructions, called memory
barriers or memory fences, to disable that optimization.[2][3] For example,
inserting an appropriate memory barrier between the two assignments of
task 1, and another before the read of 'Data' in task 2, is how the programmer
explicitly requests a specific order, at the cost of reduced performance.
It is worth noting that not all lock-free or wait-free algorithms require
memory barriers (this is highly architecture dependent).
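
As an illustration, here is a compilable sketch of the example above with the
fences made explicit. The Write_Memory_Fence and Read_Memory_Fence subprograms
are assumed to come from a portable barrier package of the kind sketched
further below (called Machine_Code here); they are not part of standard Ada,
and the procedure name Lock_Free_Handoff and the value 42 are just
placeholders.

with Interfaces;   use Interfaces;
with Machine_Code; use Machine_Code;  -- hypothetical portable barrier API (see below)

procedure Lock_Free_Handoff is

   Data : Integer_64 := 0;
   pragma Volatile (Data);

   Flag : Boolean := False;
   pragma Atomic (Flag);

   task Producer;   -- "task 1" in the example above
   task Consumer;   -- "task 2" in the example above

   task body Producer is
   begin
      Data := 42;              -- publish the payload (placeholder value)
      Write_Memory_Fence;      -- make Data globally visible before Flag
      Flag := True;
   end Producer;

   task body Consumer is
      Copy : Integer_64;
   begin
      loop
         exit when Flag;       -- spin until the producer raises the flag
      end loop;
      Read_Memory_Fence;       -- do not read Data before Flag was seen True
      Copy := Data;
   end Consumer;

begin
   null;   -- the environment task simply awaits Producer and Consumer
end Lock_Free_Handoff;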

It seems that the Ada Reference Manual requires that the order of any read
or update of an Atomic or Volatile object is guaranteed. That means that
memory barriers must be inserted by the compiler. Is that intended?

Current programming languages have taken different approaches. For example,
the C keyword 'volatile' just disables some compiler optimizations,[4] so
the programmer must either use the appropriate OS synchronization API, such as
POSIX mutexes, or insert the appropriate memory barriers (e.g. when programming
a specific spinlock implementation). It seems that future versions of C++
will provide sequential consistency by default, while adding a new low-level
mechanism for expert programmers who master the relaxed consistency
model.[5]

Some projects, such as LLVM or the Linux kernel, provide a portable API for
memory barriers:

package Machine_Code is
   procedure Memory_Fence;         -- x86: mfence   Alpha: mb   PowerPC: sync
   procedure Read_Memory_Fence;    -- x86: lfence   Alpha: mb   PowerPC: sync
   procedure Write_Memory_Fence;   -- x86: sfence   Alpha: wmb  PowerPC: sync
   procedure IO_Memory_Fence;      -- PowerPC: eieio

   procedure Read_Dependence_Memory_Fence;  -- Alpha only: mb
end Machine_Code;
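
As a purely illustrative sketch, such a spec could be given a target-specific
body. The following assumes GNAT on x86-64 and uses GNAT's
System.Machine_Code.Asm inline-assembly facility; other compilers and targets
would need their own mechanism, and the instruction choices are only one
conservative mapping.

with System.Machine_Code;

package body Machine_Code is

   procedure Memory_Fence is
   begin
      System.Machine_Code.Asm ("mfence", Clobber => "memory", Volatile => True);
   end Memory_Fence;

   procedure Read_Memory_Fence is
   begin
      System.Machine_Code.Asm ("lfence", Clobber => "memory", Volatile => True);
   end Read_Memory_Fence;

   procedure Write_Memory_Fence is
   begin
      System.Machine_Code.Asm ("sfence", Clobber => "memory", Volatile => True);
   end Write_Memory_Fence;

   procedure IO_Memory_Fence is
   begin
      --  On x86 a full fence is a conservative choice for device memory.
      System.Machine_Code.Asm ("mfence", Clobber => "memory", Volatile => True);
   end IO_Memory_Fence;

   procedure Read_Dependence_Memory_Fence is
   begin
      --  x86 orders dependent reads by itself; only Alpha needs a real barrier here.
      null;
   end Read_Dependence_Memory_Fence;

end Machine_Code;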

It seems impossible for the compiler to decide whether a memory barrier is
actually needed, depending on the semantics of the specific algorithm, so
probably a brute force approach is required (inserting memory barriers before
and after all volatile accesses, even if not required) even if this kills
performance in current CPUs. It should be noted that there are some
architectures that require a different type of memory barrier for hardware
devices.

However, a sophisticated compiler could try to generate code for some protected
objects using no locks, just atomic instructions. In that case, the compiler
could probably analyse all the protected subprograms and decide whether memory
barrier instructions are required (and, if so, insert only the minimum number
of them). It is also worth noting that there is a separate paradigm in modern
architectures called 'transactional memory'.[6] A sophisticated compiler could
also generate code for protected objects without any lock, with mutual
exclusion guaranteed directly by this new hardware. But does the Reference
Manual allow these advanced protected object implementations?
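
For instance, a protected object as simple as the counter below (a single
scalar component, and operations that map onto one read-modify-write) is the
kind of case where an implementation could, in principle, use an atomic
fetch-and-add instruction instead of a lock. Nothing in this sketch is a new
language feature; the question is only whether the RM permits that
implementation strategy.

package Counters is

   protected Counter is
      procedure Increment;             -- candidate for an atomic fetch-and-add
      function  Value return Natural;  -- candidate for a plain atomic load
   private
      Count : Natural := 0;
   end Counter;

end Counters;

package body Counters is

   protected body Counter is

      procedure Increment is
      begin
         Count := Count + 1;
      end Increment;

      function Value return Natural is
      begin
         return Count;
      end Value;

   end Counter;

end Counters;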

== References ==

[1] Shared Memory Consistency Models: A Tutorial
     -- http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
[2] Memory Ordering in Modern Microprocessors, Part I
     -- http://www.linuxjournal.com/article/8211
[3] Memory Ordering in Modern Microprocessors, Part II
     -- http://www.linuxjournal.com/article/8212
[4] volatile -- http://www.airs.com/blog/archives/154
[5] Memory Consistency Models: Convergence at last!
     -- http://www.cse.iitk.ac.in/users/mtworkshop07/workshop/DayIII/adve.sarita.pdf
[6] Sun slots transactional memory into Rock
     -- http://www.theregister.co.uk/2007/08/21/sun_transactional_memory_rock/


****************************************************************

From: Robert A. Duff
Date: Thursday, July 31, 2008  10:34 AM

...
> This code is guaranteed to work on a single processor.

Atomic and volatile are hard to reason about, but I think the above is
guaranteed to work, no matter how many processors.

Ada doesn't even have a built-in way to query or control the number of
processors.

...
> It seems that the Ada Reference Manual requires that the order of any
> read or update of an Atomic or Volatile object is guaranteed. That
> means that memory barriers must be inserted by the compiler. Is that intended?

I think so.

...
> It seems impossible for the compiler to decide whether a memory
> barrier is actually needed, depending on the semantics of the specific
> algorithm, so probably a brute force approach is required (inserting
> memory barriers before and after all volatile accesses, even if not
> required) even if this kills performance in current CPUs. It should be
> noted that there are some architectures that require a different type
> of memory barrier for hardware devices.

I think any interfacing to hardware devices has to be implementation dependent.

> However, a sophisticated compiler could try to generate code for some
> protected objects using no locks, just atomic instructions. In that case,
> the compiler could probably analyse all the protected subprograms and
> decide whether memory barrier instructions are required (and, if so,
> insert only the minimum number of them). It is also worth noting that
> there is a separate paradigm in modern architectures called
> 'transactional memory'.[6] A sophisticated compiler could also generate
> code for protected objects without any lock, with mutual exclusion
> guaranteed directly by this new hardware. But does the Reference Manual
> allow these advanced protected object implementations?

I think so.

> == References ==

Thanks for the interesting refs.

****************************************************************

From: Alan Burns
Date: Thursday, February 18, 2010  7:15 AM

I was charged with looking at AI-117 - Memory barriers and Volatile objects

AI05-0117 asks the question - on a modern multicore architecture can lock-free
and wait-free algorithms be implemented in a straightforward way using the
current provisions of Ada?

My opinion is a guarded yes. But a weaker form of Volatile would be useful.

Lock-free and wait-free algorithms do not use protected types or other forms of
synchronisation but employ shared variables.

To work correctly, these shared variables must not have their load and store
instructions reordered. Reordering, however, is used extensively to improve
performance. It is not necessary for load and store operations to be performed
directly on memory, as any intermediate cache will ensure consistency.
Reordering can be the result of compiler or hardware optimisations.

Hardware provides memory barriers (MBs) to prevent reordering.
Different forms of MBs are supported on different hardware. There are read MBs
and write MBs - a read MB being more efficient if the shared variable is only
read.

Ada provides Volatile and Atomic pragmas:

"For a volatile object all reads and updates of the object as a whole are
performed directly to memory.

Implementation Note: This precludes any use of register temporaries, caches, and
other similar optimizations for that object."

It is clear that this will prevent reordering.

So, if the compiler controls its own reordering and places MB instructions after
each access to a volatile object, the correct behaviour will ensue.

There is a question of efficiency. Most lock-free and wait-free algorithms make
use of only a few shared variables, and local variables can be used where
necessary to reduce the actual updates to the volatile ones. The only
inefficiency would therefore seem to be that Volatile prohibits the use of
caching, even though this prohibition is not strictly required by these
algorithms. A new pragma, No_Reordering, could be provided that is weaker than
Volatile.
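
A sketch of how such a pragma might be applied to the earlier example: ordering
would still be enforced for Data, but the implementation would remain free to
keep it in a (coherent) cache. The pragma name and semantics are only the
proposal above, not existing Ada.

Data : Integer_64;
pragma No_Reordering (Data);   -- hypothetical: loads/stores kept in order, caching allowed

Flag : Boolean;
pragma Atomic (Flag);          -- existing Ada: indivisible and ordered access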

The AI noted the possibility of providing an API for the MB operations.
This is not recommended, as there is no single set of operations; hardware
solutions differ. Other languages provide various forms of volatile (e.g. C, C++,
and Java).

****************************************************************

