This issue serves as a design document and a discussion of how atomics will/should be implemented.
## CUDA Background
CUDA has had atomics basically forever in the form of a few functions like `atomicAdd`, `atomicCAS`, etc. See the docs on them here. It also has `_system` and `_block` variants of them.
This has always been the overwhelmingly popular way of doing atomic things in CUDA, and for a while it was the only way, until compute 7.x: sm_70 introduced the `.sem` qualifier on the `atom` PTX instruction, which allows users to specify a particular memory ordering for atomic operations.
CUDA decided to implement this by replicating `std::atomic` as its own thing called `cuda::std::atomic`. `atomic` provides a generic container for atomic operations on types such as `int`, offering atomic operations with user-specified orderings.
## Usage of `cuda::std::atomic`
Despite NVIDIA pushing for users to use `atomic`, it has not seen wide adoption, presumably for the following reasons:
- `cuda::std::atomic` is a mess of templates and inheritance, because CUDA wanted to make it compatible with the GPU, the CPU (with every compiler's weird atomic semantics), and user-defined types. This yields weird errors and confusing dependency graphs.
- Every CUDA example, sample, doc, tutorial, course, etc. uses `atomicAdd` and similar. Unless you are deeply knowledgeable about CUDA, you would not switch to `atomic`, if you even knew it existed.
- `atomic` has had a rocky past of sometimes working and sometimes not; for example, CUDA 10.2 had many issues with `std::atomic`.
- `atomic` for some reason does not support float add. I am totally unsure why, since the PTX ISA has instructions for it.
## Importance of great atomics
Atomics are at the core of many algorithms, so it is imperative for a project of this scale to implement them once and implement them well. Otherwise, users might be stuck with a poor implementation forever, as in CUDA's case. Therefore, I believe we should take our time with atomics and get them right the first time.
## Low level implementation
The low-level implementation of such atomics is not very difficult; it can mostly be taken from how `cuda::std::atomic` does it at the low level. It implements them in the following way: if the CUDA arch is >= 7.0, it uses specialized PTX instructions via inline asm:
```cpp
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acq_rel_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acq_rel.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acquire_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acquire.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_relaxed_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.relaxed.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_release_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.release.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_volatile_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
```
With `__ATOMIC_SEQ_CST` additionally containing a fence before it:
```cpp
switch (__memorder) {
  case __ATOMIC_SEQ_CST: __cuda_fence_sc_device(); // intentional fallthrough: fence, then acquire RMW
  case __ATOMIC_CONSUME:
  case __ATOMIC_ACQUIRE: __cuda_fetch_add_acquire_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_ACQ_REL: __cuda_fetch_add_acq_rel_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_RELEASE: __cuda_fetch_add_release_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_RELAXED: __cuda_fetch_add_relaxed_32_device(__ptr, __tmp, __tmp); break;
  default: assert(0);
}
```
This can very easily be replicated by us since we have full support for inline asm.
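For illustration, here is a minimal sketch of what one of these could look like on our side (the function name is hypothetical, and this assumes sm_70+ and nightly's NVPTX inline asm support):

```rust
use core::arch::asm;

/// Sketch of a relaxed, device-scoped 32-bit fetch-add, mirroring
/// `__cuda_fetch_add_relaxed_32_device` above.
///
/// # Safety
/// `ptr` must be valid, properly aligned, and not subject to
/// non-atomic concurrent writes.
#[inline(always)]
pub unsafe fn fetch_add_relaxed_u32_device(ptr: *mut u32, val: u32) -> u32 {
    let old: u32;
    asm!(
        "atom.add.relaxed.gpu.u32 {old}, [{ptr}], {val};",
        old = out(reg32) old,
        ptr = in(reg64) ptr,
        val = in(reg32) val,
    );
    old
}
```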
Otherwise, if the arch is less than 7.0, it "emulates" the orderings with barriers:
```cpp
switch (__memorder) {
  case __ATOMIC_SEQ_CST:
  case __ATOMIC_ACQ_REL: __cuda_membar_device(); // intentional fallthrough: barrier, then op + barrier
  case __ATOMIC_CONSUME:
  case __ATOMIC_ACQUIRE: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); __cuda_membar_device(); break;
  case __ATOMIC_RELEASE: __cuda_membar_device(); __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_RELAXED: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
  default: assert(0);
}
```
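The emulated path could be sketched similarly in Rust; here `membar_device` wraps `membar.gl`, and `fetch_add_volatile_u32_device` (a hypothetical wrapper over the unordered `atom.add.gpu.u32` above) is assumed to exist:

```rust
use core::arch::asm;

/// Device-wide memory barrier (`membar.gl`), the pre-sm_70 building block.
#[inline(always)]
pub unsafe fn membar_device() {
    asm!("membar.gl;");
}

/// Release fetch-add emulated on pre-7.0 hardware: barrier, then the
/// unordered RMW, mirroring the `__ATOMIC_RELEASE` case above.
/// `fetch_add_volatile_u32_device` is assumed to wrap `atom.add.gpu.u32`.
#[inline(always)]
pub unsafe fn emulated_fetch_add_release_u32_device(ptr: *mut u32, val: u32) -> u32 {
    membar_device();
    fetch_add_volatile_u32_device(ptr, val)
}
```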
The original code for all of this can be found in `CUDA_ROOT\include\cuda\std\detail\libcxx\include\support\atomic\atomic_cuda_generated.h` for CUDA 11.5, and in `CUDA_ROOT\include\cuda\std\detail\__atomic_generated` for older versions.
That file provides functions as intrinsics that the rest of libcu++ builds off of:
```cpp
template<class _Type, typename cuda::std::enable_if<sizeof(_Type)==4, int>::type = 0>
__device__ _Type __atomic_fetch_add_cuda(volatile _Type *__ptr, _Type __val, int __memorder, __thread_scope_device_tag) {
  /* ... */
}
```
## Rust Intrinsic implementation
I propose we follow a similar approach of raw unsafe intrinsics for:
- 32-bit and 64-bit operations: loads, stores, `compare_exchange`, `exchange`, and `fetch_{add, and, max, min, or, sub, xor}`
- block, device, and system operations
sm_70+ intrinsics are implemented in `cuda_std::atomic::intrinsics::sm_70`, and emulated intrinsics in `cuda_std::atomic::intrinsics::emulated`.
Wrappers over the sm-specific intrinsics are in `cuda_std::atomic::intrinsics`. For example:
```rust
pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32;
```
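As a sketch (assuming the hypothetical `sm_70` intrinsics module from above), such a wrapper would essentially be a Rust version of libcu++'s switch, including the SC fence for `SeqCst`:

```rust
use core::sync::atomic::Ordering;

/// Hypothetical wrapper dispatching a device-scoped f32 fetch-add to the
/// correctly-ordered sm_70+ intrinsic. `SeqCst` lowers to an SC fence
/// followed by an acquire RMW, exactly as libcu++ does it.
pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32 {
    match ordering {
        Ordering::SeqCst => {
            sm_70::fence_sc_device();
            sm_70::fetch_add_acquire_f32_device(ptr, val)
        }
        Ordering::Acquire => sm_70::fetch_add_acquire_f32_device(ptr, val),
        Ordering::AcqRel => sm_70::fetch_add_acq_rel_f32_device(ptr, val),
        Ordering::Release => sm_70::fetch_add_release_f32_device(ptr, val),
        Ordering::Relaxed => sm_70::fetch_add_relaxed_f32_device(ptr, val),
        // `Ordering` is #[non_exhaustive]; treat anything else as a bug.
        _ => unreachable!(),
    }
}
```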
## High level types
And finally, we expose high-level types in `cuda_std::atomic` such as `AtomicF32`, `AtomicF64`, etc.
Block atomics (`BlockAtomicF32`) will need to be unsafe. For device atomics, it is up to the caller of the kernel to ensure buffers and kernels do not contain data races, and the surrounding systems help prevent this; block atomics have no such safeguard, and it would be very easy to accidentally cause data races if the accesses are not intra-threadblock.
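As a rough sketch of this distinction (the intrinsics and everything other than the two type names are hypothetical):

```rust
use core::cell::UnsafeCell;
use core::sync::atomic::Ordering;

/// Device-scoped atomic float: safe methods, since race-freedom of the
/// underlying buffer is already the kernel launcher's obligation.
#[repr(transparent)]
pub struct AtomicF32(UnsafeCell<f32>);

// SAFETY: all shared access goes through atomic instructions.
unsafe impl Sync for AtomicF32 {}

impl AtomicF32 {
    pub fn fetch_add(&self, val: f32, ordering: Ordering) -> f32 {
        // `intrinsics::atomic_fetch_add_f32_device` is the wrapper
        // sketched in the previous section.
        unsafe { intrinsics::atomic_fetch_add_f32_device(self.0.get(), ordering, val) }
    }
}

/// Block-scoped atomic float: methods are unsafe, because nothing stops
/// a caller from touching it from more than one thread block.
#[repr(transparent)]
pub struct BlockAtomicF32(UnsafeCell<f32>);

unsafe impl Sync for BlockAtomicF32 {}

impl BlockAtomicF32 {
    /// # Safety
    /// All accesses to this value must come from a single thread block.
    pub unsafe fn fetch_add(&self, val: f32, ordering: Ordering) -> f32 {
        intrinsics::atomic_fetch_add_f32_block(self.0.get(), ordering, val)
    }
}
```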
Atomic types will expose only the operations they specifically allow. For example, per the ISA spec:

- Every type has `fetch_and`, `fetch_or`, `fetch_xor`, `compare_and_swap`, and `exchange`.
- Signed and unsigned integers have `fetch_add`, `fetch_inc`, `fetch_dec`, `fetch_min`, and `fetch_max`.
- Unsigned integers have `fetch_inc` and `fetch_dec` that wrap within `[0..b]`, where `b` is an operand of the instruction (so the wrap point is user-specified, not the type's `MAX`).
- Floats have `fetch_add`.
## Compatibility with core atomics
Core exposes atomics through a couple of mechanisms:
- Every target has a cfg for each atomic width saying whether it is supported on the target. I have not checked what atomic sizes core thinks nvptx has.
- The codegen backend then generates atomic instructions through hooks like this:
```rust
fn atomic_load(
    &mut self,
    ty: &'ll Type,
    ptr: &'ll Value,
    _order: AtomicOrdering,
    _size: Size,
) -> &'ll Value {
    /* ... */
}
```
That is in addition to `atomic_store`, `atomic_rmw`, `atomic_cmpxchg`, and a couple more. We currently trap in all of these functions, partly because libnvvm doesn't support atomic instructions for many types, and partly because we want to bikeshed how to implement them nicely.
However, as expected, things are not quite the same on the CPU and the GPU; there are some very important differences:
- CUDA has 32-bit and 64-bit atomics (16-bit too, if you count f16), while core also expects 8-bit atomics (unless they are cfg-gated off).
- Core expects some operations that are simply not available, such as `fetch_nand`. We could implement this as a CAS loop, but that is a bit of an opaque behavior, so I'm not too happy to do that (see the sketch at the end of this section).
- CUDA has (limited) float atomics, which are by far the most used kind of atomic, since GPU code often deals with floats. Core does not have atomic floats, so we would need a separate type, causing lots of confusion.
- CUDA, as previously mentioned, has block and system atomics, which are unique to it.
Because of these limitations, we have a few options for implementing atomics:
- Try our best to support core atomics, emulate anything that's missing with CAS loops (sketched below), and make `AtomicF32` and `AtomicF64` their own types in cuda_std. Add block and system atomics as their own types in `cuda_std::atomic`. This maintains compat with core but splits up the atomic types, which is not ideal.
- Don't try to support the core intrinsics at all: add everything in `cuda_std::atomic`, and add only the methods that CUDA natively supports, without CAS loops. Don't try to make the atomics work on the CPU. This is the easiest option and has the nicest API, but it doesn't work on the CPU.
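For reference, here is the kind of CAS loop the first option would need for something like `fetch_nand`. The `atomic_load_u32_device` and `atomic_compare_exchange_u32_device` intrinsics are the hypothetical ones proposed earlier, with the CAS assumed to return the value actually observed in memory:

```rust
use core::sync::atomic::Ordering;

/// `fetch_nand` emulated with a compare-exchange loop, since PTX has no
/// native nand RMW. Returns the previous value, like core's `fetch_nand`.
pub unsafe fn atomic_fetch_nand_u32_device(ptr: *mut u32, ordering: Ordering, val: u32) -> u32 {
    // Seed the loop with a plain atomic load of the current value.
    let mut old = atomic_load_u32_device(ptr, Ordering::Relaxed);
    loop {
        let new = !(old & val);
        // If `prev == old`, nobody raced us and our nand was applied.
        let prev = atomic_compare_exchange_u32_device(ptr, ordering, old, new);
        if prev == old {
            return prev;
        }
        // Another thread changed the value first; retry from what it wrote.
        old = prev;
    }
}
```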
## Implementation Roadmap
Atomics will likely be implemented incrementally. Most of the work is porting over the raw intrinsics; after that, the hard part is done and we can focus on the stable public API.
Device float atomics will come first, since they are by far the most used kind of atomic. After that, the order will probably be:

Integer Device Atomics -> Float System Atomics -> Integer System Atomics -> Float Block Atomics -> Integer Block Atomics -> anything that's missing
## Feedback
I'd love to hear any feedback you have! We must make sure this is implemented once and implemented correctly.