alpaka-group / alpaka Goto Github PK

View Code? Open in Web Editor NEW

340.0 340.0 69.0 1.16 GB

Abstraction Library for Parallel Kernel Acceleration :llama:

Home Page: https://alpaka.readthedocs.io

License: Mozilla Public License 2.0

C++ 88.08% CMake 5.93% Shell 3.19% Python 2.43% Ruby 0.36%

cpp cpp17 cuda gpu header-only heterogeneous-parallel-programming hip hpc openacc openmp rocm tbb

alpaka's People

Contributors

Stargazers

Watchers

alpaka's Issues

allow functor calls with (grid, block) instead of (workSize) for convenience

The call should create a workSize from (grid, block) and call the base implementation.

allow selection of default accelerator

Do not require to select an accelerator to run on. Use a default.
This default should be settable by a function like "setDefaultAccelerator()"

Singleton to store the default in?

memory management policies

add policies for:

async, sync
1D, 2D, 3D
Peer
pitched memory

streams

Add streams to:

kernel execution call:

  auto exec = buildKernelExecutor<Acc>();
  exec(workSize, streamHandle : StreamHandle<Acc> = 0)

eventEnqueue
Device:

  // cudaStreamCreate, cudaStreamDestroy
  Stream<AccCuda> Device<AccCuda>::getStream(size_t index) 

  // cudaStreamQuery
  bool streamTest(stream);

  // cudaStreamWaitEvent
  void streamWaitEvent(stream, event) {
    // Just wait for it directly. 
    eventWait(event)
  }

  // cudaStreamSynchronize == "eventEnqueue(ev);eventWait(ev);" ?

add math methods for accelerators

sqrtf, sinf, cosf, ...

events

Add events

Event ev; // cudaEventCreate && cudaEventCreateWithFlags && cudaEventDestroy
exec(ev, stream | device); // cudaEventRecord

eventWait(ev); // stream: cudaEventSynchronize
eventTest(ev); // cudaEventQuery

cudaEventElapsedTime ?

The synchronous stream could be used for debugging purposes. It can be implemented by adding cudaStreamSynchronize after each call.
export CUDA_LAUNCH_BLOCKING=1 does the same but can not be set per stream.

add extra parenthesis around workSize and stream

auto exec = createKernelExecutor<Acc, Kernel>();
exec(workSize, stream)(params);

add random number generator support

allow the user to extend EnabledAccelerators

add base class specialization to random number generators

make the rand implementations euqal to to atomic, idx, ...

add namespace end markers

namespace alpaka
{
} // alpaka

conceptify the device managers

-> they should not provide members

rename WorkSize

Current: WorkSize(getSize)
Proposals: Extent (getExtent)

support copies of buffers of different dimensions (if pitch allows)

add doxygen documentation to gh-pages

Github allows to host webpages by using a gh-pages branch.
The generated doxygen documentation could be made available there.

See: https://github.com/smartell/APIDemo/tree/master
or: http://rickfoosusa.blogspot.de/2011/10/howto-use-doxygen-with-github.html

This could be enhanced even more by automatically regenerating and pushing the documentation by Travis CI.

add allocators

Add support for allocation inside the kernel and multiple allocators for kernel external allocation.

DevCuda:

AllocatorCuda (cudaMalloc, malloc when inside kernel)
ScatterAlloc

DevCpu:

AllocatorMalloc
AllocatorNew
AllocatorAligned (Boost.Align)
AllocatorPageAligned

support for dynamic parallelism

CUDA allows to invoke kernels recursively.

extend device properties

SharedMemorySizeBytes
MaxClockFrequencyHz

Maybe also CoresPerMultiProcessor. This should be the number of cores with fast shared memory. On a two Processor node using OpenMP this should be the number of cores on one of the processors ( x const).
Do we really need this? BlockKernelsCountMax delivers the maximum block size which is more relevant.

add cuda kernel predicate voting operations

__all/__all_sync(predicate)
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them. Supported by devices of compute capability 1.2 and higher.
__any/__any_sync(predicate)
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them. Supported by devices of compute capability 1.2 and higher.
__ballot/__ballot_sync(predicate)
Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active. Supported by devices of compute capability 2.0 and higher.

For each of these warp vote operations, the result excludes threads that are inactive (e.g., due to warp divergence). Inactive threads are represented by 0 bits in the value returned by __ballot() and are not considered in the reductions performed by __all() and __any().

Align shared memory

The shared memory for the Serial, Fibers, Threads an OpenMP accelerators should be aligned at the specified types alignement.
Currently we allocate uint8_t`s via new and cast the memory to the required type.
With the alignment being wrong the performance could be influenced negatively (depending on the system).

Thrust support or custom library ?

On the way porting HASEonGPU to alpaka I need to use thrust vectors and some methods like reduce and exclusive scan. I had a discussion about a coexistence of alpaka and thrust last week with @psychocoderHPC and gave it a try.

But, the more I think about it, it turns out for me that thrust is not the right way. There is no possibility to address more than one accelerator, because you need to specify the backend CUDA, OpenMP or TBB during compilation as compile option. A better approach would be the explicit set of the accelerator and device in the code e.g.: vector<Acc, Type, Size>(dev, size, init)

Its not a big problem to remove the thrust vectors and replace them again by plain memory buffers, initialize them and copy them to the device. But by providing vectors similar to the stl ones, thrust simplified and reduced the code base a lot.

Therefore, a library that provides stl like container / data structures and algorithms with explicit accelerator and device setting would by very handy in my eyes.

make the accelerator a template depending on the block/grid dimensions

BasicWorkDiv should have a TDim template parameter.
Calling alpaka::exec::create(...) should then return an ExecXXX<TDim,...> which creates an AccXXX<TDim,...> for each invocation.
The getIdx and getWorkDiv provided by the acccelerator should then by default return a Vec<TDim> instead of a Vec3<>.

execute fibers row by row

By executing the fibers randomly we prevent memory prefetching.
Iterating X 1st, Y 2nd, Z 3rd (native C memory order) we would assist the prefetcher by using the expected default access pattern.
This could be extended to support arbitrary thread (and block) execution schemes.

Add a method that copies memory only if it is in a different memory

This allows some methods that have const memory to prevent double buffering on the host, a method copyIfDifferentMem should be implemented.

make kernel execution thread safe

Is it allowed for the host code to be multithreaded itself?
Restricting it is neither useful nor realistically enforceable.
Maybe one thread calculates something using a CUDA device while the other thread uses OpenMP?
This use case requires locking and the removal of some hidden state.

None of the KernelTasks are thread safe. If multiple threads share the same KernelTask and issue its execution (possibly with different arguments on different devices of the same accelerator type), they interfere and override their values.
-> Each operator() of a KernelTask has to be secured by a lock.
If the host code is multithreaded and issues two kernels to the same CUDA device in parallel, the argument setup functions and kernel launch functions can overlap.
-> A process global mutex lock is required.
OpenMP usage nested in other threading primitives (std::thread, pthreads) is undefined behaviour:
http://stackoverflow.com/questions/13197510/why-do-c11-threads-become-unjoinable-when-using-nested-openmp-pragmas
http://stackoverflow.com/questions/13837696/can-i-safely-use-openmp-with-c11
-> Therefore a process global mutex for the execution of OpenMP kernels is required

add special kernel methods

clock_t clock();
long long int clock64();
T __ldg(const T* address);
The read-only data cache load function is only supported by devices of compute capability 3.5 and higher.
__shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
Exchange a variable between threads within a warp.

add DeviceView concept

A view of a device could select a subset of CPU cores (and memory?) that are used for execution.
The current Device concept always models the DeviceView concept by always using all the cores.
With this it would be possible to partition a 8 core CPU into e.g. 6 cores for one DeviceView and 2 cores for another DeviceView. Kernels executed in the corresponding streams then will only use the number of cores they are allowed to. This would allow to prevent independent streams from mutually influencing each other. Furthermore this would allow to guarantee selected compute resources for time critical tasks.

Make the integral type a template parameter of the accelerator

To remove the Uint typedef the integral type used in computations should be a template parameter for the Accelerator, the Executor, etc...

abstract the memory access inside kernels

There should not be direct access to memory buffers.
This always implies knowledge about the memory layout (row or col) which is not necessarily correct on the underlying accelerator.

use hwloc to query the topology of the host

restrict kernels to fulfill std::is_trivially_copyable for all accelerators

Currently only the CUDA-Accelerator requires the user kernel to fulfill std::is_trivially_copyable.
Should all accelerators require this to prevent surprises when changing the accelerator?

add syncBlockThreadsPredicate template

to support predicates:

int __syncthreads_count(int predicate)
evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

int __syncthreads_and(int predicate)
evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

int __syncthreads_or(int predicate)
evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

add buffer type

By using a dedicated buffer type instead of raw pointers, memory operations get much easier.
By making the Buffer type dependent on a memory space (Buffer< MemSpace >) it would be much easier to reduce unnecessay double buffering and corresponding copies.

enable usage of lambdas as kernels

CUDA 7.5 allows to pass lambdas into the global kernel with the flag --expt-extended-lambda. See here.

__device__ int result;
template <typename T>
__global__ void kernel(T in) { result = in(4); }

void foo(void) {
    int x = 30;
    kernel<<<1,1>>>(
        [=] __device__ (int in) {
            return in * x; // x captured by value
        })
}

We have to make sure that ALPAKA_FN_ACC can be used in the place of __device__ especially in non-cuda code.

Add free function + trait for memory pinning

template<typename TMemSpace>
struct MemPin;

template<typename TBuf>
void pin(TBuf & memBuf)
{
    MemPin<GetMemSpaceT<TBuf>>::memPin(memBuf);
}

add an accelerator back-end that uses OpenMP 2 for block and thread parallelization

This requires OpenMP 2.5 and omp_set_nested(1).
See e.g.: here

add surface and texture access

improve automatic work division for CUDA devices

Use cudaOccupancyMaxPotentialBlockSize or cudaOccupancyMaxPotentialBlockSizeVariableSMem for CUDA automatic work division.

make accelerators extensible

Make accelerators policy based.
They inherit from all policy types given (boost::mpl::forward_sequence).
A default version of the accelerators is typedef'd but the user can replace implementations.

rename accelerator to backend

Accelerator can be confused with a device because sometimes both are used synonymously.
Backend is more clear in that it is a type of a parallel execution environment.

add Travis CI for integration tests

Test if a pull request or a commit broke the build of or the results of the test cases.

vectorization & warps

I opened this issue just to keep track of a discussion we had two weeks ago.

imho, one of the main screws we should deploy to allow fine-tuning and full utilization of hardware is to provide the possibility to actually steer vectorization (and to be warp-aware in CUDA-speak, which is actually the same as vectorization on CPUs).

Implementations on CPUs could achieve that by adding inner, constant-size loops and pragmas. (Example: loop 1 to n with n non-constant will not likely be vectorized, a loop 1 to n%8 with inner constant loops 0 to 7 might be more likely be vectorized.) Also, users could use that feature to avoid bank conflicts in "one thread per block" access schemes and it might be very useful on GPUs for synchronization and ptx/asm-specific calls.

I think we also defined the way to include this in the alpaka policy scheme, did you document that @BenjaminW3 ?

call cudaDeviceReset at application exit

for (all devices used during operation)
{
        cudaSetDevice(i);

        // cudaDeviceReset causes the driver to clean up all state. 
        // While not mandatory in normal operation, it is good practice.  
        // It is also needed to ensure correct operation when the application is being profiled. 
        // Calling cudaDeviceReset causes all profile data to be flushed before the application exits.
        cudaDeviceReset();
}

add threadFence<>

__threadfence_block()
Waits until all GM and SM accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.
__threadfence()
Waits until all GM and SM accesses made by the calling thread prior to the call are visible to:
• threads in the thread block for shared memory accesses and
• threads in the device for global memory accesses.
Prevents the compiler from optimizing by caching shared memory writes in registers.
It does not synchronize the threads and it is not necessary for all threads to actually reach this instruction.
gcc: __sync_synchronize();
intel: _mm_mfence();
MSVC: MemoryBarrier();
OpenMP: #pragma omp flush
__threadfence_system()
waits until all GM and SM accesses made by the calling thread prior to __threadfence_system() are visible to:
• threads in the thread block for shared memory accesses,
• all threads in the device for global memory accesses,
• host threads for page-locked host memory accesses.

add thread/fiber-pools

... to speedup exectuion dramatically and reduce the number of memory allocations required.

Add support for memory mapping

Memory pointers on the host allocated with cudaHostAlloc(&pHost, size, cudaHostAllocMapped) or registered with cudaHostRegister(pHost, size, cudaHostRegisterMapped) can be used with cudaHostGetDevicePointer to get a device pointer.

NOTE: cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostAllocMapped flag to have any effect.

add extended wait interface

Add a variadic alpaka::wait::wait(waited1, ...) method to let the current thread wait for multiple objects at once.

It would also be useful to have a method alpaka::wait::wait(waiter, waited1, ...) to let a waiter wait for multiple objects at once.
This would require the two methods to have different names because then they can not be distinguished by the argument count anymore. What would be useful names?

Both methods could possibly be implemented by just sequentially calling the underlying standard wait methods because the events are one shot events only and do not trigger waits in the executing threads and therefore no deadlocks.

devices

Add accelerator device selection support

template < typename TAcc >
DeviceHandle;

template < typename TDeviceHandle >
class IDeviceHandle
{
// Interfaces...
};

DeviceHandleCuda
{
int m_iDevice;
}

template<>
DeviceHandle< AccCuda > :
IDeviceHandle
{};

template < typename TAcc >
DeviceManager;

template < typename TDevices >
class IDeviceManager
{
// Interfaces...
};

class DeviceManagerCuda
{
setDevice(DeviceHandle< AccCuda > const & deviceHandle); // cudaSetDevice
DeviceHandle< AccCuda > getDevice(); // cudaGetDevice
size_t getDeviceCount() //cudaGetDeviceCount
};

template <>
DeviceManager :
IDeviceManager
{};

Add device property support
One big struct with all relevant values:

struct DeviceProperties
{
size_t BlockSizeMax;
size_t GridSizeMax;
size_t ExecutionUnitCount;
size_t GlobalMemorySizeBytes;
size_t ClockFrequencyHz;
}

parallelize fiber block execution

Enhance the fibers implementation by parallelizing the execution of the blocks.

alpaka-group / alpaka Goto Github PK

alpaka's People

Contributors

Stargazers

Watchers

Forkers

alpaka's Issues

Recommend Projects

Recommend Topics

Recommend Org