eyalroz / cuda-api-wrappers Goto Github PK

View Code? Open in Web Editor NEW

733.0 28.0 79.0 2.38 MB

Thin, unified, C++-flavored wrappers for the CUDA APIs

License: BSD 3-Clause "New" or "Revised" License

CMake 0.71% C++ 99.29%

api-wrapper cuda cuda-runtime-api modern-cpp gpu gpu-memory gpu-computing gpgpu gpgpu-computing cuda-api-wrappers

cuda-api-wrappers's People

Contributors

Stargazers

Watchers

cuda-api-wrappers's Issues

Device arrays with static storage duration

First, I want to say that this is an excellent and extremely useful project as it is, so this issue is not to question the design decisions or something like that, but to point out a potential caveat and maybe start a discussion about the best solution.

Let's say there is an algorithm which requires a temporary array on the device for storing the intermediate results. One practical example is the parallel reduction, which needs an array of the length equal to the number of CUDA blocks launched. A common optimization for repeated launch of such algorithms is to avoid the allocation and deallocation by declaring the array with static storage duration and reallocating it only when a larger size is necessary. In practice that means either creating a global variable or declaring the local variable with static.

OK, let's create the problem. The vectorAdd.cu example could be modified to make the d_A array a global variable:

cuda::memory::device::unique_ptr<float[]> d_A;

int main(void)
{
        ...
	auto current_device = cuda::device::current::get();
	d_A = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_B = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_C = cuda::memory::device::make_unique<float[]>(current_device, numElements);
        ...
}

When I compile it an run the program, it fails at shutdown:

[Vector addition of 50000 elements]
CUDA kernel launch with 196 blocks of 256 threads
Test PASSED
SUCCESS
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  Freeing device memory at 0x0x0500960000: driver shutting down
Aborted (core dumped)

This can be explained with the unspecified initialization order of global objects and since the destructors are called in reversed order, something in the CUDA library is apparently destructed before cudaFree is called from the d_A's destructor.

But we can control the order of initialization by using static variables declared at the block level. This is where it gets interesting, because we don't know on which part of the CUDA library the d_A's destructor depends. So if I write just

int main(void)
{
	...
	auto current_device = cuda::device::current::get();
	static auto d_A = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_B = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_C = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	...
}

then it works as expected (at least on my system). But if I write

int main(void)
{
	static cuda::memory::device::unique_ptr<float[]> d_A;
	...
	auto current_device = cuda::device::current::get();
	d_A = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_B = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_C = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	...
}

or even

int main(void)
{
	...
	auto current_device = cuda::device::current::get();
	static cuda::memory::device::unique_ptr<float[]> d_A;
	d_A = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_B = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	auto d_C = cuda::memory::device::make_unique<float[]>(current_device, numElements);
	...
}

then it fails the same way as in the first case.

This apparently puts considerable constraints on writing nice, portable and efficient C++/CUDA code, if we'd like to use static storage variables for optimizations or something else. Compared to the C way of doing things, one would have to manually call cudaFree to deallocate the global array at the end of main() (or inside my_lib_deinit() which would have to be called at the end of main()). Not that I find it appealing, but it's a working approach and I even think that remembering to call cudaFree is easier than declaring static variables at the right place (if there is actually a right place for the practical case). So unless I'm missing something, I think this means that C++ idioms don't fully replace the whole spectrum of C and therefore C++ wrappers for C libraries may be inherently imperfect, although they may still be very useful for the 99% cases.

Support calling cuda::launch() with (type-erased) device_function_t's

At the moment, our lowest-level launch facility, cuda::launch() and cuda::enqueue_launch() (in kernel_launch.cuh) must take the actual raw, typed, kernel function pointers. Why not let them also accept device_function_t's? We know what the type needs to be, after all.

Drop type-templated allocation methods and functions

While cuda ostensibly supports templated allocation, that actually does nothing but reinterpret pointers. So let's drop the charade...

Better handling for the case of no CUDA deivces on the system

Some of the API wrapper code, or the example programs, assume(s) there's at least one CUDA device on the system; yet we can't make that assumption (especially since at times the CUDA devices may magically disappear from the host computers and not be reported).

The most seriously problem is that cuda::device::count() throws an exception when there are no devices, instead of returning 0. Beyond that, we should at least make sure we don't throw unnecessarily from the example programs when no device is found.

Grid synchronization requires relocatable device code

The sample execution_control must be compiled with nvcc flag -rdc=true. Otherwise the grid.sync() causes a linking error for -arch sm_60 or -arch sm_61. See CUDA C Programming Guide chapter C.3. Grid Synchronization.

Reproduceable by:

cd examples/by_runtime_api_module
nvcc -std=c++11 -I../../src -arch sm_61 execution_control.cu -o execution_control

Fixed by:

cd examples/by_runtime_api_module
nvcc -rdc=true -std=c++11 -I../../src -arch sm_61 execution_control.cu -o execution_control

This flag should be specified in the CMakeLists.txt. Is there a possibility to add
target specific compile time flags? I tried to add target_compile_options(execution_control PUBLIC "-rdc=true") in CMakeLists.txt without any effect. Adding -rdc=true to the general compile flags seems not to be a solution.

Cover the Occupancy and Unified Addressing modules of the runtime API

Of the uncovered modules of the runtime API (other than peer access which has a separate issue opened), I'm motivated to also add the following Occupancy and Unified Addressing; this issue will track that.

(The other modules all relate to graphics mostly, so I'm not planning on covering them unless someone wants to collaborate on that.)

Mark relevant methods and functions as noexcept

Compilers never infer infer a noexcept for a function or method. This isn't usually an issue for our wrappers, because their code gets inlined, plus, a lot of them actually do throw exceptions on failure. For some wrappers, though, this is more of a problem. Specifically, stream and event wrappers which can be owning or non-owning, so it's important to call their move rather than copy constructor in the owning case, when we're actually moving them. Now, std::vector and perhaps other standard library containers just won't use a move constructor if it's not noexcept... and subtle failures will ensure. I'm sure there are other cases this can happen, even more obscure.

So let's mark a bunch of methods noexcept, when they are noexcept, at least in our three main classes: device_t, event_t and stream_t.

cuda::memory::device::make_unique implicitly uses the current device

One of the (unstated) design goals of these wrappers are to forego the use of global variables - at least from the wrapper-using code's perspective. It should not need to be aware of what the current device is unless it specifically wants to; and it certainly should not be made to have to use the "set device; do something with that device being current; maybe set device back to something else" pattern.

Unfortunately, that's not how things stand right now with the cuda::memory::device::make_unique functions: They allocate memory on the current device.

So instead of make_unique variants taking the number of elements, or nothing for a single element - we'll ad a first parameter, which can be either a device ID (in which case we'll always set the current device) or a device proxy (device_t; in which case we'll set the current device based on whether or not the device is assumed to be current).

Cover the Peer Device Access module of the Runtime API

It's both relevant and easy to also cover the Peer Device Access module of the Runtime API; let's do it.

Use smart pointers in simpleStreams.cu

simpleStreams.cu currently has a pair of functions AllocateHostMemory() and FreeHostMemory(), which use raw pointers. Instead, let's use cuda::memory::device::unique_ptr<>s.

Support event creation using device_t's rather than device IDs

During my GTC Europe "lab" this week, I was asked why we needed to extract a device ID in order to create event. I didn't have a very good answer, and it seems that I must just be able to make this happen - now that we have the multi-container implementation file.

NVTX library not found on ubuntu-17.10

On Ubuntu-17.10 CUDA is installed directly under /usr, and libraries are installed alongside system libraries in /usr/lib/x86_64-linux-gnu.

find_library fails to find NVTX because NO_DEFAULT_PATH has been specified. Removing NO_DEFAULT_PATH makes detection work and library builds as expected.

I believe removing NO_DEFAULT_PATH should be safe because it is only used as a fallback option, but I'm not 100% sure.

Support for cooperative kernel launches

I think this can be achieved by adding

template<typename KernelFunction, typename... KernelParameters>
inline void launch_cooperative(
	const KernelFunction&       kernel_function,
	launch_configuration_t      launch_configuration,
	void**                      args)
{
	cudaLaunchCooperativeKernel(
		kernel_function.ptr(),
		launch_configuration.grid_dimensions,
		launch_configuration.block_dimensions,
		args,
		launch_configuration.dynamic_shared_memory_size
	);
}

in kernel_launch.cuh.

stream_mangement throws on enqueue_memset()

Some change from yesterday causes the enqueue_memset() in stream_management.cu to throw an exception:

terminate called after throwing an instance of 'cuda::runtime_error'
  what():  memsetting an on-device buffer: invalid argument

[TRACKING] Missing coverage by example programs

The example programs double as sort-of-unit tests - as well as the simpler tests of compilation, as msot code does not get instantiated from templated when you just build the library. It's therefore important that every single function and method of the API wrappers be covered by one of the examples.

This issue will be continuous edited as we notice wrapper code not covered by the examples, and as we add/change examples to cover it.

Add an example program using mapped memory

We absolutely must have a program using mapped memory.

Doxygenate the code!

Almost all of the code is missing doxygen comments, first and foremost being the methods and the classes. Let's write it.

Allow cuda::device::current::get() to not immediately determine its ID

When using the CUDA Runtime API directly, one can make calls regarding the current device without "knowing" its ID (i.e. without ever having called cudaGetDevice()); in our wrappers, acting on the current device usually happens by obtaining a proxy to it, using cuda::device::current::get(), which does:
return device_t<detail::assume_device_is_current>(current::get_id());
so we always get the ID, even though we don't need to. We could probably avoid that in favor of lazily getting the ID the first time it's necessary. An std::optional<cuda::device::id_t> could replace the current field we use for an id - or we could just have a boolean and handle things ourselves to avoid depndency on the non-C++11 std::optional. Of course this would require some kind of work of the device_setter set_device_for_this_scope(id_); statements we use now in the implementations of many of the methods.

Support event creation on enqueue

Events are very often created simply to be enqueued on a stream, then waited on elsewhere. Instead of forcing the user to use device::create_event() or event::create(), we can just (as an option) offer a fused event creation + enqueueing, by having stream_t::enqueue_t expose an event() method which doesn't take a pre-existing event.

Let device_t's create event_t's

We already have device_t<AssumedCurrent>::create_stream(), and there's even a event::create(device_t<AssumedCurrent>, etc, etc). So there's really no good reason not to also then have a device_t<AssumedCurrent>::create_event().

Build sometimes breaks failing to locate CUDA libraries

I recently tried to build the wrapper use examples after switching from GNU/Linux Kubuntu 16.04 to Mint 18.1 (on the same machine). Strangely enough, I encountered a linking issue. The command:

/usr/bin/c++   -Wall -std=c++11 -g -g   CMakeFiles/device_management.dir/examples/by_runtime_api_module/device_management.cpp.o -o examples/bin/device_management -rdynamic lib/libcuda-api-wrappers.a -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -ldl -lrt

fails to find the CUDA runtime library, and I get:

CMakeFiles/device_management.dir/examples/by_runtime_api_module/device_management.cpp.o: In function `cuda::device::peer_to_peer::get_attribute(cudaDeviceP2PAttr, int, int)':
/home/eyalroz/src/mine/cuda-api-wrappers/src/cuda/api/device.hpp:38: undefined reference to `cudaDeviceGetP2PAttribute'
collect2: error: ld returned 1 exit status

but if I add -L/usr/local/cuda/lib64 it builds fine. This doesn't happen on bricks02
...
Fix this.

Supposedly-asynchronous streams ignoring an event they should wait on

In the newly-committed sample program, io_compute_overlap_with_streams, we have several streams, created with flags cuda::stream::async, which should not start working on their queue items until they're done with the first item - which is waiting on an event. They somehow skip that one, and go one to carry out the rest of the queued work.

This should not happen, but it's not at all clear to me where the bug is (and whether it's even in my wrapper code - although probably it is.)

Refactor `event_t` similarly to the recent changes to `stream_t`

We've made stream_t only constructible with a stream id, and having a const id and device id - with stream::make() functions to call the creation code (which is outside of the class). Let's do the same for event_t (as well as keep their device IDs explicitly).

Providing a cuda-api-wrappersConfig.cmake?

Instead of providing a find module a talk by Daniel Pfeiffer:

https://www.youtube.com/watch?v=bsXLMQ6WgIk

summarised in:

https://pabloariasal.github.io/2018/02/19/its-time-to-do-cmake-right/

suggest to provide rather a cmake config file, if the package also uses cmake as a build system.
It can be also very handy if the package registers within the cmake user package registry at ~/.cmake/packages.

What do you think?

Wrapper for cudaMemPrefetchAsync

Will there be a wrapper for the functions named in the title?

Named status/error code troubles...

We have cuda::status_t, and we also have specific constant values for stati, e.g. invalid_host_pointer etc.

I had created the cuda::status::alias_t enum, inheriting from status_t, for the named values - so that they not take up storage on one hand and be stored in a cuda::status_t on the other. Unfortunately, C++ won't automatically cast the former the alias_t's into status_t's; and there's no other way to guarantee these don't take up extra storage. If we were allowed to take up the extra storage we could have simply defined:
constexpr const invalid_host_pointer = cudaErrorInvalidHostPointer,
With this not happening, we can't write
throw cuda::runtime_error(invalid_host_pointer, "whateer");
which we really want to be able to do, and for users of the API wrappers to be able to do.

So I guess there's no escape but to add more constructors to cuda::runtime_error; and also rename it, since it's not quite an alias.

Credit to @codecircuit who first noticed this (and even fixed it on his fork).

Use range-based copy, and fill - like in the C++ standard library

We currently have copy(destination, source, length); perhaps we should change that to the same style / calling convention as the standard libary?

Get rid of device::flags_t

Having an integral type for device flags is ugly and too much of an exposition of the implementation. People can very well make individual setting, and it's not as though we have innumerable flags which require setting all at once.

So - let's get rid of flags_t, or rather let's not expose it to client code and only use it internally.

Dependence on CUB?

Hi,

I was trying to update our version of cuda-api-wrappers to the HEAD (fe0f79b), when I stumbled on the recently-added dependence on CUB (3f16528). The dependence itself is fine (even if I didn't manage to find any actual use of that in cuda-api-wrappers), my exact problem being how to build cuda-api-wrappers now. After adding CUB as a dependence of cuda-api-wrappers in our (non-CMake) build system, I get the following error from CMake when building cuda-api-wrappers

CMake Error at CMakeLists.txt:96 (find_package):
  By not providing "FindCUB.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "CUB", but
  CMake did not find one.

  Could not find a package configuration file provided by "CUB" (requested
  version 1.5.0) with any of the following names:

    CUBConfig.cmake
    cub-config.cmake

  Add the installation prefix of "CUB" to CMAKE_PREFIX_PATH or set "CUB_DIR"
  to a directory containing one of the above files.  If "CUB" provides a
  separate development package or SDK, be sure it has been installed.

Is the problem something that should be fixed in cuda-api-wrapper build rules, or in our build rules (I'm rather newbie concerning CMake)? (or maybe the dependence on CUB could be removed for time being if it is not really needed?)

Thanks,
Matti

Actually write in the code for the IPC module example program

By mistake, the examples/ directory has a committed skeleton of a test program for the IPC CUDA features via our IPC wrappers. The program currently does nothing, but has a bunch of comments indicating what it should be able to do.

The code for this should be filled in. Of course, the problem is, "processes" are meaningless in C++. We would need to either assume a POSIX environment (perhaps using CMake to test for this? Or just leave it in the .cpp file?), or use Boost::Process. We'll probably stick to the first option for now.

Switch to using CMake's native CUDA support

CMake 3.8 introduced native support for CUDA as a language (no more FindCUDA.cmake and related constructs). I think I'll try switching to using that.

Add a CMake module for locating the wrappers

I'm going to add a CMake module - not for use when building the CUDA API wrappers, but for CMake project which need to build against the wrappers - under cmake/Modules.

Fix support for installation

Right now, make install'ing doesn't quite do what it should. Let's fix it.

Bring the API for callbacks and for kernels closer together

A callback and a kernel are in many ways the same thing: Some (compiled) code that the CUDA driver arranges to have executed; the difference is that a kernel is code which runs on a GPU, and a callback runs on the CPU (I'm not even sure it's on the enqueueing thread actually; but never mind). So - why should they not have similar API for enqueueing them? It should be something like

my_stream.enqueue.device_execution(function_name, launch_config, arguments...);
my_stream.enqueue.host_execution(function_name, arguments...);

(or even just enqueue execution and have the compiler figure out whether it's a kernel function or not - although I'm not sure that can be arranged reliably, so we'll not go that far).

`enqueue_launch` does not support functions of type `device_function_t`

#include "cuda/api_wrappers.h"

__global__ void kernel() {}

int main() {
	cuda::device_function_t dfunc = kernel;
	auto c = cuda::make_launch_config(1, 1);
/*	cuda::enqueue_launch(cuda::thread_blocks_may_cooperate,
	                                     dfunc,
	                                     cuda::default_stream_id,
	                                     c);*/ // compilation fails
	cuda::launch(dfunc, c); // compilation succeeds
}

CMakeFiles assuming a Unix-like system when determining CUDA SM value

Build type: Release
Selecting Windows SDK version 10.0.16299.0 to target Windows 10.0.17134.
The C compiler identification is MSVC 19.13.26132.0
The CXX compiler identification is MSVC 19.13.26132.0
Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Enterprise/VC/Tools/MSVC/14.13.26128/bin/Hostx86/x86/cl.exe
Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Enterprise/VC/Tools/MSVC/14.13.26128/bin/Hostx86/x86/cl.exe -- works
Detecting C compiler ABI info
Detecting C compiler ABI info - done
Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Enterprise/VC/Tools/MSVC/14.13.26128/bin/Hostx86/x86/cl.exe
Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2017/Enterprise/VC/Tools/MSVC/14.13.26128/bin/Hostx86/x86/cl.exe -- works
Detecting CXX compiler ABI info
Detecting CXX compiler ABI info - done
Detecting CXX compile features
Detecting CXX compile features - done
Found CUDA: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.2 (found suitable version "9.2", minimum required is "7.0") 
/bin/bash: C:/_Dev/_Build/cuda-api-wrappers/scripts/get_cuda_sm.sh: No such file or directory

CUDA device-side code will assume compute capability 
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_NVTX_LIBRARY (ADVANCED)
    linked by target "inlinePTX" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "simpleIPC" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "cuda-api-wrappers" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "vectorAdd" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "simpleStreams" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "version_management" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "error_handling" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "execution_control" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "stream_management" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "event_management" in directory C:/_Dev/_Build/cuda-api-wrappers
    linked by target "io_compute_overlap_with_streams" in directory C:/_Dev/_Build/cuda-api-wrappers

Configuring incomplete, errors occurred!
See also "C:/_Dev/_Build/cuda-api-wrappers/build_32/CMakeFiles/CMakeOutput.log".

CMakeOutput.log:
https://pastebin.com/wjD8a0gM (Link will expire in 6 months)

My system specs are already listed in the error msg.

IDK why it shows /bin/bash: C:/_Dev/_Build/cuda-api-wrappers/scripts/get_cuda_sm.sh: No such file or directory the file is present on my system (and is binary identical with the one on your repo).

The same error occurs when i try to compile for x64 also

Consider using <system_error> for wrapping CUDA errors with exceptions

At the moment, our CUDA error wrapper code is almost entirely custom-written, tying exceptions to CUDA error codes. C++11 has a similar mechanism for wrapping libc and other similar "system errors" involving error codes returned by API calls; that is the std::system_error class and its associated std::error_code (both from the <system_error> header). I think these might be usable for CUDA errors as well; and that would be a nice win, cutting out dozens of LOC.

Further reading on <system_error>: A series of posts on the Think Async blog,

Remove unused user-defined conversion?

When compiled with clang-5 and -Weverything, this rather magical function triggers a lot of warnings:

https://github.com/eyalroz/cuda-api-wrappers/blob/master/src/cuda/api/device.hpp#L834

warning: conversion function converting 'cuda::device_t' to itself will never be used

Can this be written in such a way as to avoid generating the device_t to device_t case? This c++ is a bit advanced for me :-(

Consider replacing the stream_t::enqueue dummy object with a tag mechanism

At the moment, we enqueue using my_stream.enqueue.copy(...), my_stream.enqueue.kernel_launch(...), etc - where stream_t::enqueue is a dummy object which holds a reference to the stream. It helps us avoid having functions named stream_t::enqueue_kernel_launch(), stream_t::enqueue_kernel_copy(), enqueue_this and enqueue_that.

Well, we can do better: Just one templated enqueue function, with an initial argument being a tag class with different values for kernel launches, copies, etc. - every kind of possible operation. This can be specialized separately for the different operations, without even needing any special dispatching code.

Consider using gsl::span's

We currently have more than a few wrappers taking a void* and a num_bytes or size_in_bytes etc. Instead, why not use spans?

And if we do that, perhaps we could/should use spans elsewhere, where we now just use pointers.

Kernel gets wrong argument values

Executing the example code in #40 results with the recent commits now into the following output:

Launching non-cooperative
a = 1, b = 2, c = 0
Launching cooperative
a = 3, b = -490680, c = 0

My environment:

Cuda compilation tools, release 9.0, V9.0.176
gcc 5.1.0 or gcc 6.3.0

I compiled with:

nvcc -std=c++11 -I cuda-api-wrappers/src test.cu -o test
nvcc -std=c++14 -I cuda-api-wrappers/src test.cu -o test

simpleStream failing, execution seems to not take enough time

the modified simpleStreams example has been failing without me noticing, since printing SUCCESS or FAILURE was disabled. It's exhibiting much too short execution times, leading me to believe that there's some problem with the stream synchronization.

Support cudaMemRangeGetAttribute, cudaMemRangeGetAttributes

We currently don't have API wrappers for obtaining shared memory region attributes.

Add dynamic compilation functionality from libNVVM?

libNVVM, not part of the runtime API, allows for dynamic compilation of... umm, I would have liked to say PTX, but rather it's "NVVM IR", whatever that means. Perhaps we should wrap this functionality as well? I wonder.

Add functionality from the CUDA driver, not just the runtime API

The CUDA driver has all sorts of functionality not accessible directly, or at all, through the runtime API. Some of it is even very useful for implementing some of the proxy classes' methods (e.g. getting a single pointer attribute rather than all of them). Perhaps we should start adding them.

... this could theoretically be within #ifndef RUNTIME_API_ONLY.

the compilation error on clang++-5.0

Hi Eyal

First thanks for the nice library, it is very useful.

Your library works fine for g++(5.4 or 6.2) compiler, but it seems to have a small compilation error on clang compiler. The beneath is the detail:

Check for working C compiler: /usr/bin/cc
Check for working C compiler: /usr/bin/cc -- works
Detecting C compiler ABI info
Detecting C compiler ABI info - done
Detecting C compile features
Detecting C compile features - done
Check for working CXX compiler: /usr/bin/clang++-5.0
Check for working CXX compiler: /usr/bin/clang++-5.0 -- works
Detecting CXX compiler ABI info
Detecting CXX compiler ABI info - done
Detecting CXX compile features
Detecting CXX compile features - done
Building for Compute Capability 5.2.
--------------------------------------------------------------
Paths Used
--------------------------------------------------------------
C compiler (for non-CUDA code):    /usr/bin/cc
C++ compiler (for non-CUDA code):  /usr/bin/clang++-5.0
CUDA C++ forward-to host compiler: /usr/bin/clang-5.0
CUB header-only library root:      
CUDA include directories:          /usr/local/cuda/include
--------------------------------------------------------------
Package and Library versions
--------------------------------------------------------------
nVIDIA CUDA toolkit:               9.0
--------------------------------------------------------------

georgeliao@dw064:~/software_projects/test_code/cuda-api-wrappers/build$ make -j8
Scanning dependencies of target cuda-api-wrappers
[ 33%] Building CXX object CMakeFiles/cuda-api-wrappers.dir/src/cuda/api/device_properties.cpp.o
[ 66%] Building CXX object CMakeFiles/cuda-api-wrappers.dir/src/cuda/api/profiling.cpp.o
/home/georgeliao/software_projects/test_code/cuda-api-wrappers/src/cuda/api/profiling.cpp:21:38: error: 
      definition or redeclaration of 'profiler_mutex' not allowed inside a function
        std::lock_guard<std::mutex>(detail::profiler_mutex);
                                    ~~~~~~~~^
/home/georgeliao/software_projects/test_code/cuda-api-wrappers/src/cuda/api/profiling.cpp:36:38: error: 
      definition or redeclaration of 'profiler_mutex' not allowed inside a function
        std::lock_guard<std::mutex>(detail::profiler_mutex);
                                    ~~~~~~~~^
2 errors generated.
CMakeFiles/cuda-api-wrappers.dir/build.make:86: recipe for target 'CMakeFiles/cuda-api-wrappers.dir/src/cuda/api/profiling.cpp.o' failed
make[2]: *** [CMakeFiles/cuda-api-wrappers.dir/src/cuda/api/profiling.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:203: recipe for target 'CMakeFiles/cuda-api-wrappers.dir/all' failed
make[1]: *** [CMakeFiles/cuda-api-wrappers.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

My linux system is
Linux dw064 4.10.0-37-generic #41~16.04.1-Ubuntu SMP Fri Oct 6 22:42:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Best regards
Jia Liao

Kernel argument order for cooperative launches is reversed

Compile and execute:

#include <iostream>

#include "cuda/api_wrappers.h"

__global__ void foo(int a, int b, int c) {
	if (threadIdx.x == 0) {
		printf("a = %d, b = %d, c = %d\n", a, b, c);
	}
	__syncthreads();
}


int main() {
	constexpr cuda::grid_block_dimension_t block_dim = 32;
	constexpr cuda::grid_dimension_t grid_dim = 1;
	int a = 1;
	int b = 2;
	int c = 3;
	std::cout << "Launching non-cooperative" << std::endl;
	auto config = cuda::make_launch_config(grid_dim, block_dim);
	cuda::enqueue_launch(cuda::thread_blocks_may_not_cooperate,
	                     foo,
	                     cuda::stream::default_stream_id,
	                     config,
	                     a, b, c);
	auto device = cuda::device::current::get();
	device.synchronize();

	std::cout << "Launching cooperative" << std::endl;
	cuda::enqueue_launch(cuda::thread_blocks_may_cooperate,
	                     foo,
	                     cuda::stream::default_stream_id,
	                     config,
	                     a, b, c);
	device.synchronize();
}

Expected output:

Launching non-cooperative
a = 1, b = 2, c = 3
Launching cooperative
a = 1, b = 2, c = 3

Actual output:

Launching non-cooperative
a = 1, b = 2, c = 3
Launching cooperative
a = 3, b = 2, c = 1

CUDA 9.0 compatibility issues

Trouble compiling with CUDA 9.0 RC:

ptx.h special register getters for 64 bits aren't compiling.
profiling.h - complaints about functions with default arguments.

Transition more of the API wrapper functions to use wrapper classes over IDs

At the moment, quite a bit of the wrappers' functionality involves ID types: Events, Streams and Devices essentially. We have more than a few methods and freestanding functions which take them or return them, even though they could have been working with event_t's, stream_t's and device_t's.

We should phase the methods and functions using the IDs and try to use the wrapper classes whenever possible (except of course in methods actually requiring their use, e.g. wrapping existing IDs or extracting the ID of an entity).

Support enqueuing memzero'ing

We have a non-class-member function for zero'ing memory; let's allow enqueueing it onto a stream_t.

eyalroz / cuda-api-wrappers Goto Github PK

cuda-api-wrappers's People

Contributors

Stargazers

Watchers

Forkers

cuda-api-wrappers's Issues

Recommend Projects

Recommend Topics

Recommend Org