!standard C.6.5(0) 20-02-06 AI12-0364-1/01 !class Amendment 20-02-06 !status work item 20-02-06 !status received 19-12-18 !priority Low !difficulty Easy !subject Add a modular atomic arithmetic package !summary Add a modular atomic arithmetic package. !problem A number of common algorithms use modular types in tasking/parallel applications. For instance, an index to a ring buffer typically would be a modular type. However, we don't have atomic arithmetic operations for modular types. !proposal (See Summary.) !wording Modify C.6.4(1/5): The language-defined generic package System.Atomic_Operations.Arithmetic provides operations to perform arithmetic atomically on objects of {signed} integer types. C.6.5 The Package System.Atomic_Operations.Modular_Arithmetic The language-defined generic package System.Atomic_Operations.Modular_Arithmetic provides operations to perform arithmetic atomically on objects of modular types. Static Semantics The generic library package System.Atomic_Operations.Modular_Arithmetic has the following declaration: generic type Atomic_Type is mod <> with Atomic; package System.Atomic_Operations.Modular_Arithmetic with Pure, Nonblocking is procedure Atomic_Add (Item : aliased in out Atomic_Type; Value : Atomic_Type) with Convention => Intrinsic; procedure Atomic_Subtract (Item : aliased in out Atomic_Type; Value : Atomic_Type) with Convention => Intrinsic; function Atomic_Fetch_And_Add (Item : aliased in out Atomic_Type; Value : Atomic_Type) return Atomic_Type with Convention => Intrinsic; function Atomic_Fetch_And_Subtract (Item : aliased in out Atomic_Type; Value : Atomic_Type) return Atomic_Type with Convention => Intrinsic; function Is_Lock_Free (Item : aliased Atomic_Type) return Boolean with Convention => Intrinsic; end System.Atomic_Operations.Modular_Arithmetic; The operations of this package are defined as follows: procedure Atomic_Add (Item : aliased in out Atomic_Type; Value : Atomic_Type) with Convention => Intrinsic; Atomically performs: Item := Item + Value; procedure Atomic_Subtract (Item : aliased in out Atomic_Type; Value : Atomic_Type) with Convention => Intrinsic; Atomically performs: Item := Item - Value; function Atomic_Fetch_And_Add (Item : aliased in out Atomic_Type; Value : Atomic_Type) return Atomic_Type with Convention => Intrinsic; Atomically performs: Tmp := Item; Item := Item + Value; return Tmp; function Atomic_Fetch_And_Subtract (Item : aliased in out Atomic_Type; Value : Atomic_Type) return Atomic_Type with Convention => Intrinsic; Atomically performs: Tmp := Item; Item := Item - Value; return Tmp; !discussion This package was originally omitted as we worried about the cost of the intrinsic "mod" that is needed unless the modulus happens to match that of the machine word. It has been pointed out that the general compare-and-swap operation can easily handle this, and current theory is that such operations are no more expensive than a direct fetch-and-add would be. [Editor's note: Tucker had suggested a reading routine for Test-and-Set, but that's unnecessary since the test-and-set object is a normal public type. Just reading the object will get the value. We could change it to be a private type and add a getter routine, but that seems like a lot of work of dubious benefit for what is already a low level package.] !ASIS No new ASIS capabilities needed (this is a package). !ACATS test ACATS C-Tests are needed to check that the new package is supported. !appendix From: Tucker Taft Sent: Wednesday, December 18, 2019 7:09 AM In recent use of System.Atomic_Operations packages, I saw the need for two additional things: 1) Unsigned_Arithmetic, e.g.: generic type Atomic_Type is mod <>; -- TBD: with Atomic; package System_Atomic_Operations.Unsigned_Arithmetic with Pure is procedure Atomic_Add (Item : aliased in out Atomic_Type; Value : Atomic_Type) ... --- There are situations where you want a counter that never overflows, because it is used for a unique id, but it is OK if after a very long period it re-visits the same values. 2) A way to test the state of a test-and-set flag, e.g. function Atomic_Test (Item : aliased Test_And_Set_Flag) return Boolean; -- Returns True if the Item is /= 0 at the time of the call; -- returns False otherwise --- This might seem like an inevitable race condition, but it is a useful way to test the "final" state of such a flag, after waiting for all of the concurrent activities to complete. As an example, we have used a test-and-set flag for a "Cancelation" flag. The logical thread that actually set the flag at a time when it was *not* previously set (i.e. the Test_And_Set returned False) did the work for canceling, but we would also like to know whether this actually took place, once all of the logical threads have completed. **************************************************************** From: Randy Brukardt Sent: Thursday, January 9, 2020 10:15 PM A comment on a older message: > In recent use of System.Atomic_Operations packages, I saw the need for > two additional things: > > 1) Unsigned_Arithmetic, e.g.: > > generic > type Atomic_Type is mod <>; -- TBD: with Atomic; > package System_Atomic_Operations.Unsigned_Arithmetic > with Pure is > > procedure Atomic_Add (Item : aliased in out Atomic_Type; > Value : Atomic_Type) ... My recollection is that we left this out on purpose. The problem here is that a modular add is really: Item := (Item + Value) mod Atomic_Type'Modulus; And that can't be done in hardware atomically unless the Modulus matches that of one of the machine words for the target (so that the mod can be omitted in favor of wrap-around). One could use one of the software techniques for making any operation atomic, but that would be a terrible implementation for the common case when the modulus is 2**16 or some other word-size. As such, a generic seemed like an incorrect mapping -- most possible actuals don't work (or everything is rather inefficient). We could revisit this decision, but in order to do that, we'd need some new information beyond "I saw the need", since we saw the need before and decided that the implementation was problematical enough to leave it out. (We've all "seen the need" for some feature or other over the years that didn't make the cut!) > There are situations where you want a counter that never overflows, > because it is used for a unique id, but it is OK if after a very long > period it re-visits the same values. Surely. We all (I hope) knew that, and *still* left it out. There are plenty of use cases for modular types, and pretty much anything that you could imagine could require an atomic version if you want to use it in parallel. My concern here is just revisiting every freaking decision because someone "sees the need". We'll be going nowhere fast if we do that. > 2) A way to test the state of a test-and-set flag, e.g. > > function Atomic_Test (Item : aliased Test_And_Set_Flag) return Boolean; > -- Returns True if the Item is /= 0 at the time of the call; > -- returns False otherwise > > --- > > This might seem like an inevitable race condition, but it is a useful > way to test the "final" state of such a flag, after waiting for all of > the concurrent activities to complete. As an example, we have used a > test-and-set flag for a "Cancelation" flag. The logical thread that > actually set the flag at a time when it was *not* previously set (i.e. > the Test_And_Set returned False) did the work for canceling, but we > would also like to know whether this actually took place, once all of > the logical threads have completed. Don't have any opinion on this one, although Brad might know if we originally had something like this and dropped it, or if it was never there. **************************************************************** From: Tucker Taft Sent: Thursday, January 9, 2020 10:36 PM > My recollection is that we left this out on purpose. The problem here is > that a modular add is really: > > Item := (Item + Value) mod Atomic_Type'Modulus; > > And that can't be done in hardware atomically unless the Modulus matches > that of one of the machine words for the target (so that the mod can be > omitted in favor of wrap-around). These are straightforward to implement using the already provided Compare_And_Exchange operations. In fact, that is how the other Atomic_Add and Atomic_Subtract are implemented in the version I created. Some hardware might have actual atomic fetch-and-add signed or unsigned, but it is simpler to just rely on the compare-and-exchange, which is much more general. ... > We could revisit this decision, but in order to do that, we'd need some new > information beyond "I saw the need", since we saw the need before and > decided that the implementation was problematical enough to leave it out. > (We've all "seen the need" for some feature or other over the years that > didn't make the cut!) I don't remember this discussion, though that doesn't prove much these days! In any case, I ended up implementing this for my own lock-free work-stealing queue, and did so using Compare-and-Exchange. I don't know whether you consider that new information... ... > Don't have any opinion on this one, although Brad might know if we > originally had something like this and dropped it, or if it was never there. It would definitely have been useful in a couple of use-cases. For my own purposes, I ended up implementing an Atomic_Fetch_And_Max/Min operation, which can be applied to Boolean to get roughly the same thing. **************************************************************** From: Randy Brukardt Sent: Friday, January 10, 2020 12:12 AM > These are straightforward to implement using the already provided > Compare_And_Exchange operations. In fact, that is how the other > Atomic_Add and Atomic_Subtract are implemented in the version I > created. Some hardware might have actual atomic fetch-and-add > signed or unsigned, but it is simpler to just rely on the > compare-and-exchange, which is much more general. I think the issue was that we didn't want to be depending on general mechanisms when most hardware has a direct atomic add/subtract (which presumably is much faster than compare-and-exchange). I suspect that the Add_and_Fetch operations are probably the same as compare-and-exchange. Admittedly, it's been a while since I grubbed around at that hardware level, so the costs of some of these operations may have changed. In any case, the reason we defined Atomic_Add was so that it could be mapped directly to hardware. Emulating it out of compare-and-exchange seems to rather defeat the purpose (anybody can do that without any help from us, assuming we've provided Atomic_Operations.Exchange). If the definition is such that is necessary to implement it, it seems dubious to provide. ... > I don't remember this discussion, though that doesn't prove much > these days! It's in the !discussion of the AI (AI12-0321-1), which I noticed after sending the original message. (I didn't want to add to the noise by sending another message to add a pointer to that.) I don't know if this came from an e-mail discussion or a meeting discussion (I don't have the energy to check). >In any case, I ended up implementing this for my own lock-free work-stealing >queue, and did so using Compare-and-Exchange. I don't know whether you >consider that new information... I suppose that's evidence that it does come up in practice. I personally have never doubted that, I just worry about the implementation. Anyway, I've left the decision on how to proceed here in our leader's hands (i.e. Steve). **************************************************************** From: Tucker Taft Sent: Friday, January 10, 2020 8:20 AM Actually Fetch-and-Add (FAA) and Compare-and-Swap (CAS) have almost identical performance on many machines, as reported in: https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf Here is a quote from the cited paper: "... Our analysis provides novel in-sights. It turns out that, contrary to the common view [22], the latency of CAS, FAA, and SWP is in most cases identical and sometimes (L1 on Haswell and L3/memory on AMD) CAS is faster than FAA...." In any case, fetch-and-add probably does *not* do overflow detection "natively," so in fact might be a better fit for unsigned operations. Because the modulus of a formal type is known statically, for compilers that do macro expansion of generics (or inlining, as you have suggested RR might do), the special case of mod 2**32 or mod 2**64 can be recognized and used to select FAA over CAS as the implementation approach for unsigned fetch-and-add, if FAA is significantly faster in a given target environment. None of this is a big deal, but it was interesting to me that the unsigned operations with wrap-around were exactly what was needed for the lock-free double-ended queue used by work-stealing, and so could show that having the unsigned variants of the atomic arithmetic might be justified. **************************************************************** From: Brad Moore Sent: Friday, January 10, 2020 1:52 PM Some history on AI12-0321-1, which originally started out as being part of AI12-0234-1 .... In the original version of the AI, there was support for both unsigned and signed arithmetic. When we split AI12-0234-1 into AI12-0321-1, the offering was simplified to only provide modular arithmetic, not signed arithmetic, I think because it was thought that the unsigned operations would be closer to being atomic operations supported by minimal hardware instructions. The discussion of the AI at that point said.... "It was considered whether signed arithmetic functions should be provided. While it is possible, the routines would be trickier to write because we would want Ada semantics on arithmetic overflow, which doesn't happen with the low-level gcc instrinsic functions. This would mean using the pre-fetch library calls to obtain the value before the operation, and then applying the operation again on a result variable in addition to the atomic variable, so that any overflow could be detected, and an exception raised. THe modular arithmetic functions are simpler, and can be simply mapped to the underlying calls, since arithmetic overflow handling is not needed. We could add signed arithmetic in the future, if the need is great enough." However, someone pointed out before the second version of the new AI that modular types are only simple if the modulus is a multiple of the machine byte size. For languages like C which mostly consider such types, that is fine, but for Ada with support for a richer modulus set, it gets similarly tricky to support atomic operations more generally on all modular types. I don't see it captured in the minutes, but my recollection is that it was felt that if we needed to choose between signed and unsigned support, that there would be a greater need for signed support. So for the second version of AI12-0234-1, the generic formal was changed to be a signed integer type instead of a modular integer type, and the discussion paragraph was reversed to say; "It was considered whether modular arithmetic functions should be provided. While it is possible, the routines would be trickier to write because in the cases where the modulus of the modular types is not a multiple of System.Storage_Unit. For such cases, updating an atomic variable might need to include a write back to the the atomic variable to handle the wrap around. With signed arithmetic, we need to generate Constraint_Error for overflow checks, so checking for overflow is needed. But if extra performance is needed, those checks can be suppressed." That is, the signed atomic versions had the potential to be more efficient than the modular versions, because the overflow checks could be suppressed, but for modular arithmetic, the wrap arounds could not be suppressed. So, I think the reason we only have signed arithmetic support currently is mostly due to a flow sequence of reasoning spread across different AI versions that was somewhat flawed; 1) Support for overflow checks needed for Ada for signed operations is over and above what is provided by gcc for C, and moves farther away from single instruction hardware operations, so simplify the AI to only provide modular support, since overflow checks are not needed. Then later... 2) Ada needs to support modular arithmetic for types where the modulus is not a multiple of the machine byte size, and this cannot be avoided, while checks for overflow can be avoided for signed arithmetic, so change the AI from modular arithmetic to signed arithmetic. I think if these details were known when AI12-0234-1 was split, I would have left support for both signed and unsigned arithmetic atomic operations. And I think it would worthwhile to support both. Offhand, I would think a relatively common case for unsigned atomic types might be iterating through a ring buffer with multiple readers, where the index of the ring buffer is a modular type. **************************************************************** From: Brad Moore Sent: Friday, January 10, 2020 2:53 PM >> Don't have any opinion on this one, although Brad might know if we >> originally had something like this and dropped it, or if it was never there. > > It would definitely have been useful in a couple of use-cases. For my > own purposes, I ended up implementing an Atomic_Fetch_And_Max/Min > operation, which can be applied to Boolean to get roughly the same thing. This was never a part of the AI, because it didn't exist in the gcc builtins from which the AI was based on. See https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html But I also am not understanding the need here. The Test_And_Set_Flag type is an atomic type. If you want to atomically read the value of that flag, does that not work by simply reading the variable? **************************************************************** From: Tucker Taft Sent: Friday, January 10, 2020 3:12 PM > This was never a part of the AI, because it didn't exist in the gcc > builtins from which the AI was based on. > > See https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html > > But I also am not understanding the need here. The Test_And_Set_Flag > type is an atomic type. If you want to atomically read the value of > that flag, does that not work by simply reading the variable? It is a private type. So I am looking for a way to get a Boolean value from it. Nothing earth shattering! In C it is not a private type, but rather a bool or a char, so you can just read it. By the way, you referenced an obsolete version of the GCC atomic builtins, I believe. I think this one is more up-to-date: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html **************************************************************** From: Brad Moore Sent: Friday, January 10, 2020 5:07 PM > It is a private type. So I am looking for a way to get a Boolean value from > it. Nothing earth shattering! In C it is not a private type, but rather a > bool or a char, so you can just read it. In the AI, I believe it is an implementation-defined public type, not a private type, so you should be able to convert to a Boolean type. But, I can see that it would be nice to have provided a convenience function that reads the value and coverts the value to a Boolean result. If we had that, then we probably could make the Test_And_Set_Flag type a private type, which makes more sense to me, would also would make for a better standard interface, since it hides the implementation-defined part. > By the way, you referenced an obsolete version of the GCC atomic > builtins, I believe. I think this one is more up-to-date: > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html Good to know, thanks! **************************************************************** From: Tucker Taft Sent: Friday, January 10, 2020 9:16 PM > In the AI, I believe it is an implementation-defined public type, not a > private type, so you should be able to convert to a Boolean type. My mistake! I had implemented it as a private type for some reason in my prototype. > But, I can see that it would be nice to have provided a convenience function > that reads the value and coverts the value to a Boolean result. > If we had that, then we probably could make the Test_And_Set_Flag type a > private type, which makes more sense to me, would also would make for a > better standard interface, since it hides the implementation-defined part. That would seem a bit cleaner. **************************************************************** From: Arnaud Charlet Sent: Saturday, January 11, 2020 4:19 AM > This was never a part of the AI, because it didn't exist in the gcc > builtins from which the AI was based on. > > See https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html Note that the above doc is obsolete and contains also obsolete builtins. The up to date document with the preferred builtins on which the GNAT implementation of this AI is based can be found here: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html **************************************************************** From: Arnaud Charlet Sent: Saturday, January 11, 2020 4:32 AM > In the AI, I believe it is an implementation-defined public type, not > a private type, so you should be able to convert to a Boolean type. Right, I am appending below the GNAT implementation for reference. -- package System.Atomic_Operations.Test_And_Set with Pure, Nonblocking is type Test_And_Set_Flag is mod 2 ** 8 with Atomic, Default_Value => 0, Size => 8; function Atomic_Test_And_Set (Item : aliased in out Test_And_Set_Flag) return Boolean with Convention => Intrinsic; procedure Atomic_Clear (Item : aliased in out Test_And_Set_Flag) with Convention => Intrinsic; function Is_Lock_Free (Item : aliased Test_And_Set_Flag) return Boolean with Convention => Intrinsic; private pragma Inline_Always (Atomic_Test_And_Set); pragma Inline_Always (Atomic_Clear); pragma Inline_Always (Is_Lock_Free); end System.Atomic_Operations.Test_And_Set; ****************************************************************