alpaka-group / alpaka
Abstraction Library for Parallel Kernel Acceleration :llama:
Home Page: https://alpaka.readthedocs.io
License: Mozilla Public License 2.0
The call should create a workSize from (grid, block) and call the base implementation.
Do not require the user to select an accelerator to run on; use a default instead.
This default should be settable by a function like "setDefaultAccelerator()".
A singleton to store the default in? A sketch follows.
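A minimal sketch of such a settable default, assuming a simple runtime tag enum (AccKind, defaultAccelerator and setDefaultAccelerator are hypothetical names):

enum class AccKind { Serial, Threads, Fibers, OpenMp, Cuda };

// Meyers singleton holding the process-wide default accelerator kind.
AccKind & defaultAccelerator()
{
    static AccKind s_default = AccKind::Serial;
    return s_default;
}

void setDefaultAccelerator(AccKind const kind)
{
    defaultAccelerator() = kind;
}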
Add policies for:
Add streams to:
auto exec = buildKernelExecutor<Acc>();
exec(workSize, streamHandle : StreamHandle<Acc> = 0)
// cudaStreamCreate, cudaStreamDestroy
Stream<AccCuda> Device<AccCuda>::getStream(size_t index);
// cudaStreamQuery
bool streamTest(stream);
// cudaStreamWaitEvent
void streamWaitEvent(stream, event)
{
    // Just wait for it directly.
    eventWait(event);
}
// cudaStreamSynchronize == "eventEnqueue(ev); eventWait(ev);" ?
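If so, streamSynchronize could be a thin wrapper over the event primitives above (a sketch using the eventEnqueue/eventWait names from this issue):

// Synchronize a stream via a one-shot event.
void streamSynchronize(Stream & stream)
{
    Event ev;
    eventEnqueue(stream, ev); // record the event at the current end of the stream
    eventWait(ev);            // block the calling thread until the event fires
}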
sqrtf, sinf, cosf, ...
Add events
Event ev; // cudaEventCreate && cudaEventCreateWithFlags && cudaEventDestroy
exec(ev, stream | device); // cudaEventRecord
eventWait(ev); // stream: cudaEventSynchronize
eventTest(ev); // cudaEventQuery
cudaEventElapsedTime ?
The synchronous stream could be used for debugging purposes. It can be implemented by adding cudaStreamSynchronize after each call.
export CUDA_LAUNCH_BLOCKING=1
does the same but cannot be set per stream.
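A per-stream sketch (enqueueBlocking is a hypothetical debug helper; the actual mechanism is just the cudaStreamSynchronize call):

#include <cuda_runtime.h>

// Run any enqueue operation, then drain the stream so errors surface
// at the call site instead of asynchronously later.
template<typename TEnqueue>
void enqueueBlocking(cudaStream_t stream, TEnqueue enqueue)
{
    enqueue(stream);
    cudaStreamSynchronize(stream);
}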
auto exec = createKernelExecutor<Acc, Kernel>();
exec(workSize, stream)(params);
Make the rand implementations consistent with those of atomic, idx, ...
namespace alpaka
{
} // alpaka
-> they should not provide members
Current: WorkSize (getSize)
Proposal: Extent (getExtent)
GitHub allows hosting web pages via a gh-pages branch.
The generated doxygen documentation could be made available there.
See: https://github.com/smartell/APIDemo/tree/master
or: http://rickfoosusa.blogspot.de/2011/10/howto-use-doxygen-with-github.html
This could be enhanced even more by automatically regenerating and pushing the documentation via Travis CI.
Add support for allocation inside the kernel, and for multiple allocators for kernel-external allocation.
DevCuda:
DevCpu:
CUDA allows kernels to be invoked recursively.
SharedMemorySizeBytes
MaxClockFrequencyHz
Maybe also CoresPerMultiProcessor. This should be the number of cores sharing fast shared memory. On a two-processor node using OpenMP this would be the number of cores on one of the processors.
Do we really need this? BlockKernelsCountMax delivers the maximum block size, which is more relevant.
For each of these warp vote operations, the result excludes threads that are inactive (e.g., due to warp divergence). Inactive threads are represented by 0 bits in the value returned by __ballot() and are not considered in the reductions performed by __all() and __any().
The shared memory for the Serial, Fibers, Threads and OpenMP accelerators should be aligned to the alignment of the specified type.
Currently we allocate uint8_t's via new and cast the memory to the required type.
If the alignment is wrong, performance can suffer (depending on the system).
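A minimal sketch of an alignment-correct allocation, assuming C++11 (allocAligned is a hypothetical helper):

#include <cstdint>
#include <memory>

// Over-allocate raw bytes, then align the pointer to alignof(T) with std::align.
// The storage smart pointer owns the underlying allocation.
template<typename T>
T * allocAligned(std::size_t const count, std::unique_ptr<std::uint8_t[]> & storage)
{
    std::size_t space = count * sizeof(T) + alignof(T) - 1u;
    storage.reset(new std::uint8_t[space]);
    void * ptr = storage.get();
    return static_cast<T *>(std::align(alignof(T), count * sizeof(T), ptr, space));
}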
While porting HASEonGPU to alpaka I needed to use thrust vectors and some methods like reduce and exclusive scan. I had a discussion about a coexistence of alpaka and thrust last week with @psychocoderHPC and gave it a try.
But the more I think about it, the more it seems to me that thrust is not the right way. There is no possibility to address more than one accelerator, because you need to specify the backend (CUDA, OpenMP or TBB) as a compile option. A better approach would be to set the accelerator and device explicitly in the code, e.g.: vector<Acc, Type, Size>(dev, size, init)
It's not a big problem to remove the thrust vectors and replace them again with plain memory buffers, initialize them and copy them to the device. But by providing vectors similar to the STL ones, thrust simplified and reduced the code base a lot.
Therefore, a library that provides STL-like containers / data structures and algorithms with explicit accelerator and device selection would be very handy in my eyes.
BasicWorkDiv should have a TDim template parameter.
alpaka::exec::create(...) should then return an ExecXXX<TDim,...> which creates an AccXXX<TDim,...> for each invocation.
getIdx and getWorkDiv provided by the accelerator should then by default return a Vec<TDim> instead of a Vec3<>.
By executing the fibers in random order we prevent memory prefetching.
Iterating x first, y second, z third (native C memory order) would assist the prefetcher by producing the expected default access pattern.
This could be extended to support arbitrary thread (and block) execution schemes; a reference loop nest is sketched below.
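The C-memory-order traversal described above, as a plain loop nest (buf and the extents are assumed names):

// x innermost: consecutive iterations touch consecutive addresses,
// which is the linear stream hardware prefetchers are tuned for.
for(std::size_t z(0); z < extentZ; ++z)
    for(std::size_t y(0); y < extentY; ++y)
        for(std::size_t x(0); x < extentX; ++x)
            buf[(z * extentY + y) * extentX + x] += 1.0f;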
To let methods that take const memory avoid double buffering on the host, a method copyIfDifferentMem should be implemented.
Is the host code allowed to be multithreaded itself?
Restricting this is neither useful nor realistically enforceable.
Maybe one thread computes something using a CUDA device while another thread uses OpenMP?
This use case requires locking and the removal of some hidden state.
A view of a device could select a subset of CPU cores (and memory?) to be used for execution.
The current Device concept models the DeviceView concept trivially by always using all cores.
With this it would be possible to partition an 8-core CPU into e.g. 6 cores for one DeviceView and 2 cores for another. Kernels executed in the corresponding streams would then only use the cores they are allowed to. This would prevent independent streams from mutually influencing each other and would allow guaranteeing selected compute resources for time-critical tasks. A sketch for the OpenMP back-end follows.
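A minimal sketch, assuming a view is simply a core-count budget applied to an OpenMP parallel region (DeviceViewCpu and execute are hypothetical names; real partitioning would additionally require thread pinning):

#include <omp.h>

struct DeviceViewCpu
{
    int m_coreCount; // number of cores this view is allowed to use
};

template<typename TKernel>
void execute(DeviceViewCpu const & view, TKernel kernel)
{
    // Cap the parallel region at the view's core budget so two views
    // on the same CPU do not oversubscribe each other.
    #pragma omp parallel num_threads(view.m_coreCount)
    {
        kernel(omp_get_thread_num());
    }
}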
To remove the Uint typedef, the integral type used in computations should be a template parameter of the Accelerator, the Executor, etc.
There should not be direct access to memory buffers.
Direct access always implies knowledge of the memory layout (row- or column-major), which is not necessarily correct on the underlying accelerator.
Currently only the CUDA-Accelerator requires the user kernel to fulfill std::is_trivially_copyable.
Should all accelerators require this to prevent surprises when changing the accelerator?
To support predicates:
int __syncthreads_count(int predicate)
evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
int __syncthreads_and(int predicate)
evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.
int __syncthreads_or(int predicate)
evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.
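For illustration, a plain CUDA kernel using the first of these intrinsics (the kernel and its buffers are made up for the example):

// Each block counts how many of its threads see a positive input value.
__global__ void countPositive(float const * data, int * blockCounts, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    int const pred = (i < n) && (data[i] > 0.0f);
    // All threads of the block must reach this call; inactive lanes count as 0.
    int const count = __syncthreads_count(pred);
    if(threadIdx.x == 0)
        blockCounts[blockIdx.x] = count;
}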
By using a dedicated buffer type instead of raw pointers, memory operations get much easier.
By making the Buffer type dependent on a memory space (Buffer<MemSpace>) it would be much easier to avoid unnecessary double buffering and the corresponding copies.
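A rough sketch of the idea, with made-up tag and member names:

#include <cstddef>

// Memory-space tags (hypothetical).
struct MemSpaceHost {};
struct MemSpaceCuda {};

template<typename TMemSpace, typename TElem>
struct Buffer
{
    TElem * m_ptr = nullptr;
    std::size_t m_count = 0u;
};

// A copy within the same memory space can degenerate to a pointer handoff,
// eliding the double buffering a raw-pointer interface would force.
template<typename TMemSpace, typename TElem>
void copy(Buffer<TMemSpace, TElem> & dst, Buffer<TMemSpace, TElem> const & src)
{
    dst = src; // same space: no allocation and no memcpy needed for read-only use
}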
CUDA 7.5 allows passing lambdas into global kernels with the flag --expt-extended-lambda.
__device__ int result;

template <typename T>
__global__ void kernel(T in) { result = in(4); }

void foo(void)
{
    int x = 30;
    kernel<<<1, 1>>>(
        [=] __device__ (int in) {
            return in * x; // x captured by value
        });
}
We have to make sure that ALPAKA_FN_ACC can be used in place of __device__, especially in non-CUDA code.
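A sketch of how such a macro could be defined (an assumption about the mechanism, not alpaka's actual definition):

#if defined(__CUDACC__)
    // Under nvcc the marker must expand to __device__ so the function
    // is compiled for the device.
    #define ALPAKA_FN_ACC __device__
#else
    // On pure host back-ends (Serial, Threads, Fibers, OpenMP) it must
    // expand to nothing and leave plain C++.
    #define ALPAKA_FN_ACC
#endif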
template<typename TMemSpace>
struct MemPin;

template<typename TBuf>
void pin(TBuf & memBuf)
{
    MemPin<GetMemSpaceT<TBuf>>::memPin(memBuf);
}
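A possible specialization for host memory on the CUDA back-end (MemSpaceHost, getPtr and getSizeBytes are assumed names):

#include <cuda_runtime.h>

// Pins already-allocated host memory so CUDA can DMA to/from it directly.
template<>
struct MemPin<MemSpaceHost>
{
    template<typename TBuf>
    static void memPin(TBuf & memBuf)
    {
        cudaHostRegister(getPtr(memBuf), getSizeBytes(memBuf), cudaHostRegisterDefault);
    }
};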
This requires OpenMP 2.5 and omp_set_nested(1).
Use cudaOccupancyMaxPotentialBlockSize
or cudaOccupancyMaxPotentialBlockSizeVariableSMem
for CUDA automatic work division.
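For example (myKernel is a placeholder; the occupancy API is the real CUDA runtime API):

__global__ void myKernel(int * data, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        data[i] *= 2;
}

void launchAuto(int * data, int n)
{
    int minGridSize = 0;
    int blockSize = 0;
    // Let the runtime pick the block size maximizing occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    int const gridSize = (n + blockSize - 1) / blockSize;
    myKernel<<<gridSize, blockSize>>>(data, n);
}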
Make accelerators policy based.
They inherit from all policy types given (boost::mpl::forward_sequence).
A default version of the accelerators is typedef'd but the user can replace implementations.
Accelerator can be confused with a device because the two are sometimes used synonymously.
Backend is clearer in that it names a kind of parallel execution environment.
Test whether a pull request or a commit broke the build or the results of the test cases.
I opened this issue just to keep track of a discussion we had two weeks ago.
IMHO, one of the main knobs we should provide for fine-tuning and full utilization of the hardware is the possibility to actually steer vectorization (to be warp-aware, in CUDA-speak, which is effectively the same as vectorization on CPUs).
Implementations on CPUs could achieve this by adding inner, constant-size loops and pragmas. (Example: a loop from 1 to n with non-constant n will likely not be vectorized; an outer loop over the n/8 chunks with an inner constant loop from 0 to 7 is much more likely to be, as sketched below.) Users could also use this feature to avoid bank conflicts in "one thread per block" access schemes, and it might be very useful on GPUs for synchronization and ptx/asm-specific calls.
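The described splitting, as a plain C++ sketch (the function and its names are illustrative):

// Scale n floats: the fixed-width inner loop vectorizes; a scalar
// remainder loop handles the n % 8 tail elements.
void scale(float * a, int const n, float const s)
{
    int const chunks = n / 8;
    for(int c = 0; c < chunks; ++c)
        for(int i = 0; i < 8; ++i) // constant trip count -> vectorizable
            a[c * 8 + i] *= s;
    for(int i = chunks * 8; i < n; ++i) // remainder
        a[i] *= s;
}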
I think we also defined the way to include this in the alpaka policy scheme, did you document that @BenjaminW3 ?
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
for(int i = 0; i < deviceCount; ++i) // all devices used during operation
{
    cudaSetDevice(i);
    // cudaDeviceReset causes the driver to clean up all state.
    // While not mandatory in normal operation, it is good practice.
    // It is also needed to ensure correct operation when the application is being
    // profiled: it causes all profile data to be flushed before the application exits.
    cudaDeviceReset();
}
... to speed up execution dramatically and reduce the number of memory allocations required.
Memory pointers on the host allocated with cudaHostAlloc(&pHost, size, cudaHostAllocMapped)
or registered with cudaHostRegister(pHost, size, cudaHostRegisterMapped)
can be used with cudaHostGetDevicePointer
to get a device pointer.
NOTE: cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostAllocMapped flag to have any effect.
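Putting it together (CUDA runtime API; error handling omitted):

#include <cuda_runtime.h>

void mapHostBuffer(size_t const sizeBytes, float *& pHost, float *& pDev)
{
    // Must be called before the CUDA context is created for mapping to work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc(reinterpret_cast<void **>(&pHost), sizeBytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(reinterpret_cast<void **>(&pDev), pHost, 0);
    // pDev is valid inside kernels and aliases the host allocation (zero-copy);
    // free later with cudaFreeHost(pHost).
}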
Add a variadic alpaka::wait::wait(waited1, ...)
method to let the current thread wait for multiple objects at once.
It would also be useful to have a method alpaka::wait::wait(waiter, waited1, ...)
to let a waiter wait for multiple objects at once.
This would require the two methods to have different names because they could no longer be distinguished by their argument count. What would be useful names?
Both methods could probably be implemented by sequentially calling the underlying standard wait methods: the events are one-shot events only and do not trigger waits in the executing threads, so no deadlocks can occur. A sketch follows.
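A minimal C++11 sketch of the thread-blocking variant (waitAll is a placeholder name pending the naming question above):

// Expands to one sequential alpaka::wait::wait call per argument.
template<typename... TWaited>
void waitAll(TWaited const &... waited)
{
    int const expander[] = {0, (alpaka::wait::wait(waited), 0)...};
    static_cast<void>(expander); // silence unused-variable warnings
}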
Add accelerator device selection support
template<typename TAcc>
class DeviceHandle;

template<typename TDeviceHandle>
class IDeviceHandle
{
    // Interfaces...
};

class DeviceHandleCuda
{
    int m_iDevice;
};

template<>
class DeviceHandle<AccCuda> :
    public IDeviceHandle<DeviceHandleCuda>
{};

template<typename TAcc>
class DeviceManager;

template<typename TDevices>
class IDeviceManager
{
    // Interfaces...
};

class DeviceManagerCuda
{
    void setDevice(DeviceHandle<AccCuda> const & deviceHandle); // cudaSetDevice
    DeviceHandle<AccCuda> getDevice();                          // cudaGetDevice
    size_t getDeviceCount();                                    // cudaGetDeviceCount
};

template<>
class DeviceManager<AccCuda> :
    public IDeviceManager<DeviceManagerCuda>
{};
Add device property support
One big struct with all relevant values:
struct DeviceProperties
{
    size_t BlockSizeMax;
    size_t GridSizeMax;
    size_t ExecutionUnitCount;
    size_t GlobalMemorySizeBytes;
    size_t ClockFrequencyHz;
};
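For the CUDA back-end the struct could be filled from cudaGetDeviceProperties (the field mapping is an assumption):

#include <cuda_runtime.h>

DeviceProperties getDeviceProperties(int const iDevice)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, iDevice);

    DeviceProperties props;
    props.BlockSizeMax = prop.maxThreadsPerBlock;
    props.GridSizeMax = prop.maxGridSize[0];
    props.ExecutionUnitCount = prop.multiProcessorCount;
    props.GlobalMemorySizeBytes = prop.totalGlobalMem;
    props.ClockFrequencyHz = static_cast<size_t>(prop.clockRate) * 1000u; // clockRate is in kHz
    return props;
}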
Enhance the fibers implementation by parallelizing the execution of the blocks.