Giter VIP home page Giter VIP logo

alpaka's People

Contributors

auroraperego avatar ax3l avatar benjamin-worpitz-goto avatar benjaminw3 avatar bernhardmgruber avatar bertwesarg avatar chaever avatar chillenzer avatar erikzenker avatar felicepantaleo avatar frobnitzem avatar fwyzard avatar ichinii avatar j-stephan avatar jkelling avatar jkrude avatar kloppstock avatar mehmetyusufoglu avatar mxmlnkn avatar parsifal-2045 avatar philsquared avatar psychocoderhpc avatar q-p avatar sbastrakov avatar simeonehrig avatar sliwowitz avatar stewmh avatar theziz avatar tonydp03 avatar tu-maurice avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

alpaka's Issues

allow selection of default accelerator

Do not require to select an accelerator to run on. Use a default.
This default should be settable by a function like "setDefaultAccelerator()"

Singleton to store the default in?

streams

Add streams to:

  • kernel execution call:
  auto exec = buildKernelExecutor<Acc>();
  exec(workSize, streamHandle : StreamHandle<Acc> = 0)
  • eventEnqueue
  • Device:
  // cudaStreamCreate, cudaStreamDestroy
  Stream<AccCuda> Device<AccCuda>::getStream(size_t index) 

  // cudaStreamQuery
  bool streamTest(stream);

  // cudaStreamWaitEvent
  void streamWaitEvent(stream, event) {
    // Just wait for it directly. 
    eventWait(event)
  }

  // cudaStreamSynchronize == "eventEnqueue(ev);eventWait(ev);" ?

events

Add events

Event ev; // cudaEventCreate && cudaEventCreateWithFlags && cudaEventDestroy
exec(ev, stream | device); // cudaEventRecord

eventWait(ev); // stream: cudaEventSynchronize
eventTest(ev); // cudaEventQuery

cudaEventElapsedTime ?

add StreamCudaRtSync

The synchronous stream could be used for debugging purposes. It can be implemented by adding cudaStreamSynchronize after each call.
export CUDA_LAUNCH_BLOCKING=1 does the same but can not be set per stream.

add allocators

Add support for allocation inside the kernel and multiple allocators for kernel external allocation.

DevCuda:

  • AllocatorCuda (cudaMalloc, malloc when inside kernel)
  • ScatterAlloc

DevCpu:

  • AllocatorMalloc
  • AllocatorNew
  • AllocatorAligned (Boost.Align)
  • AllocatorPageAligned

extend device properties

SharedMemorySizeBytes
MaxClockFrequencyHz

Maybe also CoresPerMultiProcessor. This should be the number of cores with fast shared memory. On a two Processor node using OpenMP this should be the number of cores on one of the processors ( x const).
Do we really need this? BlockKernelsCountMax delivers the maximum block size which is more relevant.

add cuda kernel predicate voting operations

  • __all/__all_sync(predicate)
    Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them. Supported by devices of compute capability 1.2 and higher.
  • __any/__any_sync(predicate)
    Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them. Supported by devices of compute capability 1.2 and higher.
  • __ballot/__ballot_sync(predicate)
    Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active. Supported by devices of compute capability 2.0 and higher.

For each of these warp vote operations, the result excludes threads that are inactive (e.g., due to warp divergence). Inactive threads are represented by 0 bits in the value returned by __ballot() and are not considered in the reductions performed by __all() and __any().

Align shared memory

The shared memory for the Serial, Fibers, Threads an OpenMP accelerators should be aligned at the specified types alignement.
Currently we allocate uint8_t`s via new and cast the memory to the required type.
With the alignment being wrong the performance could be influenced negatively (depending on the system).

Thrust support or custom library ?

On the way porting HASEonGPU to alpaka I need to use thrust vectors and some methods like reduce and exclusive scan. I had a discussion about a coexistence of alpaka and thrust last week with @psychocoderHPC and gave it a try.

But, the more I think about it, it turns out for me that thrust is not the right way. There is no possibility to address more than one accelerator, because you need to specify the backend CUDA, OpenMP or TBB during compilation as compile option. A better approach would be the explicit set of the accelerator and device in the code e.g.: vector<Acc, Type, Size>(dev, size, init)

Its not a big problem to remove the thrust vectors and replace them again by plain memory buffers, initialize them and copy them to the device. But by providing vectors similar to the stl ones, thrust simplified and reduced the code base a lot.

Therefore, a library that provides stl like container / data structures and algorithms with explicit accelerator and device setting would by very handy in my eyes.

make the accelerator a template depending on the block/grid dimensions

  • BasicWorkDiv should have a TDim template parameter.
  • Calling alpaka::exec::create(...) should then return an ExecXXX<TDim,...> which creates an AccXXX<TDim,...> for each invocation.
  • The getIdx and getWorkDiv provided by the acccelerator should then by default return a Vec<TDim> instead of a Vec3<>.

execute fibers row by row

By executing the fibers randomly we prevent memory prefetching.
Iterating X 1st, Y 2nd, Z 3rd (native C memory order) we would assist the prefetcher by using the expected default access pattern.
This could be extended to support arbitrary thread (and block) execution schemes.

make kernel execution thread safe

Is it allowed for the host code to be multithreaded itself?
Restricting it is neither useful nor realistically enforceable.
Maybe one thread calculates something using a CUDA device while the other thread uses OpenMP?
This use case requires locking and the removal of some hidden state.

add special kernel methods

  • clock_t clock();
    long long int clock64();
  • T __ldg(const T* address);
    The read-only data cache load function is only supported by devices of compute capability 3.5 and higher.
  • __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync
    Exchange a variable between threads within a warp.

add DeviceView concept

A view of a device could select a subset of CPU cores (and memory?) that are used for execution.
The current Device concept always models the DeviceView concept by always using all the cores.
With this it would be possible to partition a 8 core CPU into e.g. 6 cores for one DeviceView and 2 cores for another DeviceView. Kernels executed in the corresponding streams then will only use the number of cores they are allowed to. This would allow to prevent independent streams from mutually influencing each other. Furthermore this would allow to guarantee selected compute resources for time critical tasks.

abstract the memory access inside kernels

There should not be direct access to memory buffers.
This always implies knowledge about the memory layout (row or col) which is not necessarily correct on the underlying accelerator.

add syncBlockThreadsPredicate template

to support predicates:

int __syncthreads_count(int predicate)
evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

int __syncthreads_and(int predicate)
evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

int __syncthreads_or(int predicate)
evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

add buffer type

By using a dedicated buffer type instead of raw pointers, memory operations get much easier.
By making the Buffer type dependent on a memory space (Buffer< MemSpace >) it would be much easier to reduce unnecessay double buffering and corresponding copies.

enable usage of lambdas as kernels

CUDA 7.5 allows to pass lambdas into the global kernel with the flag --expt-extended-lambda. See here.

__device__ int result;
template <typename T>
__global__ void kernel(T in) { result = in(4); }

void foo(void) {
    int x = 30;
    kernel<<<1,1>>>(
        [=] __device__ (int in) {
            return in * x; // x captured by value
        })
}

We have to make sure that ALPAKA_FN_ACC can be used in the place of __device__ especially in non-cuda code.

make accelerators extensible

Make accelerators policy based.
They inherit from all policy types given (boost::mpl::forward_sequence).
A default version of the accelerators is typedef'd but the user can replace implementations.

rename accelerator to backend

Accelerator can be confused with a device because sometimes both are used synonymously.
Backend is more clear in that it is a type of a parallel execution environment.

vectorization & warps

I opened this issue just to keep track of a discussion we had two weeks ago.

imho, one of the main screws we should deploy to allow fine-tuning and full utilization of hardware is to provide the possibility to actually steer vectorization (and to be warp-aware in CUDA-speak, which is actually the same as vectorization on CPUs).

Implementations on CPUs could achieve that by adding inner, constant-size loops and pragmas. (Example: loop 1 to n with n non-constant will not likely be vectorized, a loop 1 to n%8 with inner constant loops 0 to 7 might be more likely be vectorized.) Also, users could use that feature to avoid bank conflicts in "one thread per block" access schemes and it might be very useful on GPUs for synchronization and ptx/asm-specific calls.

I think we also defined the way to include this in the alpaka policy scheme, did you document that @BenjaminW3 ?

call cudaDeviceReset at application exit

for (all devices used during operation)
{
        cudaSetDevice(i);

        // cudaDeviceReset causes the driver to clean up all state. 
        // While not mandatory in normal operation, it is good practice.  
        // It is also needed to ensure correct operation when the application is being profiled. 
        // Calling cudaDeviceReset causes all profile data to be flushed before the application exits.
        cudaDeviceReset();
}

add threadFence<>

  • __threadfence_block()
    Waits until all GM and SM accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.
  • __threadfence()
    Waits until all GM and SM accesses made by the calling thread prior to the call are visible to:
    • threads in the thread block for shared memory accesses and
    • threads in the device for global memory accesses.
    Prevents the compiler from optimizing by caching shared memory writes in registers.
    It does not synchronize the threads and it is not necessary for all threads to actually reach this instruction.
    gcc: __sync_synchronize();
    intel: _mm_mfence();
    MSVC: MemoryBarrier();
    OpenMP: #pragma omp flush
  • __threadfence_system()
    waits until all GM and SM accesses made by the calling thread prior to __threadfence_system() are visible to:
    • threads in the thread block for shared memory accesses,
    • all threads in the device for global memory accesses,
    • host threads for page-locked host memory accesses.

add thread/fiber-pools

... to speedup exectuion dramatically and reduce the number of memory allocations required.

Add support for memory mapping

Memory pointers on the host allocated with cudaHostAlloc(&pHost, size, cudaHostAllocMapped) or registered with cudaHostRegister(pHost, size, cudaHostRegisterMapped) can be used with cudaHostGetDevicePointer to get a device pointer.

NOTE: cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostAllocMapped flag to have any effect.

add extended wait interface

Add a variadic alpaka::wait::wait(waited1, ...) method to let the current thread wait for multiple objects at once.

It would also be useful to have a method alpaka::wait::wait(waiter, waited1, ...) to let a waiter wait for multiple objects at once.
This would require the two methods to have different names because then they can not be distinguished by the argument count anymore. What would be useful names?

Both methods could possibly be implemented by just sequentially calling the underlying standard wait methods because the events are one shot events only and do not trigger waits in the executing threads and therefore no deadlocks.

devices

Add accelerator device selection support

template < typename TAcc >
DeviceHandle;

template < typename TDeviceHandle >
class IDeviceHandle
{
// Interfaces...
};

DeviceHandleCuda
{
int m_iDevice;
}

template<>
DeviceHandle< AccCuda > :
IDeviceHandle
{};

template < typename TAcc >
DeviceManager;

template < typename TDevices >
class IDeviceManager
{
// Interfaces...
};

class DeviceManagerCuda
{
setDevice(DeviceHandle< AccCuda > const & deviceHandle); // cudaSetDevice
DeviceHandle< AccCuda > getDevice(); // cudaGetDevice
size_t getDeviceCount() //cudaGetDeviceCount
};

template <>
DeviceManager :
IDeviceManager
{};

Add device property support
One big struct with all relevant values:

struct DeviceProperties
{
size_t BlockSizeMax;
size_t GridSizeMax;
size_t ExecutionUnitCount;
size_t GlobalMemorySizeBytes;
size_t ClockFrequencyHz;
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.