
odlcuda

CUDA backend for ODL

Introduction

This is the default CUDA backend for ODL. It contains an implementation of a CUDA-based Rn space with associated methods.

By installing this package, CUDA spaces become available to rn, uniform_discr and related spaces in ODL through the impl='cuda' option.

Building

The project uses CMake for its builds; using cmake-gui is recommended.

Cmake webpage

Unix:

The binaries are typically built in a build folder. Starting at the repository's top level directory, do

mkdir build
cd build
cmake-gui ../

and set the required variables.

To build and install the package to your python installation, run (as root)

make pyinstall

To verify your installation, run (in the odlcuda root directory)

py.test

This requires the pytest package.

Windows

To build on Windows, open the CMake GUI, run Configure and Generate, and set the required variables. Then open the project in Visual Studio and build the pyinstall target.

Code guidelines

The code is written in C++11/14.

Compilers

The code is intended to be usable with all major compilers. Current status (2015-06-22):

| Platform   | Compiler | CUDA | Compute | Works |
|------------|----------|------|---------|-------|
| Windows 7  | VS2013   | 7.0  | 2.0     |       |
| Windows 7  | VS2013   | 7.0  | 5.2     |       |
| Windows 10 | VS2015   | 7.0  | 5.2     | TODO  |
| Fedora 21  | GCC 4.9  | 7.0  | 5.2     |       |
| Ubuntu ?.? | ???      | 7.0  | 5.2     |       |
| Mac OSX    | ???      | 7.0  | 5.2     | TODO  |

Formatting

The code is formatted using clang-format as provided by the LLVM project. The particular style used is defined in the formatting file.

External Dependencies

Current external dependencies are

Python

The foundation of ODL. odlcuda needs access to both the Python and NumPy header files and the compiled libraries to link against.

Boost

A general-purpose C++ library. This project specifically uses Boost.Python to handle the Python bindings.

Boost webpage. Prebuilt Boost binaries (Windows); this version uses Python 2.7.

CUDA

Used for GPU-accelerated versions of algorithms. The code uses C++11 features in device code, so CUDA 7.0 is required; CUDA 6.5 may work on some platforms (notably Windows). As of now it compiles with compute capability >= 2.0; in the future, features requiring a higher compute capability may be added.

CUDA

Troubleshooting

There are a few commonly encountered errors; here are solutions to some of them.

Installation

  • If, when compiling, you get an error like

      NumpyConfig.h not found
    

    then it is likely that the variable PYTHON_NUMPY_INCLUDE_DIR is not correctly set.

  • If compilation fails with

      error C1083: Cannot open include file: 'Eigen/Core': No such file or directory.

    then you have tried to build the default target; build the "pyinstall" target instead.

  • If compilation fails with

      [ 20%] Building NVCC (Device) object odlcuda/CMakeFiles/cuda.dir//./cuda_generated_cuda.cu.o /usr/include/c++/4.9.2/bits/alloc_traits.h(248): error: expected a ">"

    then you may be trying to compile with CUDA 6.5 and GCC 4.9; this combination is not supported by CUDA.

  • If linking fails with

      error LNK1112: module machine type 'x64' conflicts with target machine type 'X86'

    then you have a 64-bit library (Boost, for instance) on your path while trying to build a 32-bit odlcuda. Either change the library or configure a 64-bit build. On Windows, if you are compiling with Visual Studio, use the Configuration Manager to set the platform to x64; if you are compiling on the command line via CMake, ensure that the generator is, for instance, "Visual Studio 12 2013 Win64" (note the Win64 at the end).

  • If you get an error like

      Error	5	error LNK2019: unresolved external symbol "__declspec(dllimport) struct _object * __cdecl boost::python::detail::init_module(struct PyModuleDef &,void (__cdecl*)(void))" (__imp_?init_module@detail@python@boost@@YAPEAU_object@@AEAUPyModuleDef@@P6AXXZ@Z) referenced in function PyInit_utils	C:\Programming\Projects\odlcuda_bin\odlcuda\utils.obj	utils
    

    then it is likely that you are trying to build against mismatched Python header files and Boost.Python versions.
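For the PYTHON_NUMPY_INCLUDE_DIR variable mentioned in the first item, the correct path can be queried directly from the Python interpreter you intend to build against (assuming numpy is importable there):

```python
import numpy

# numpy ships its C headers (including numpyconfig.h) in the directory
# reported by get_include(); pass this path to CMake as
# PYTHON_NUMPY_INCLUDE_DIR.
include_dir = numpy.get_include()
print(include_dir)
```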

Running

  • If, when running the tests in python, you get an error like

      RuntimeError: function_attributes(): after cudaFuncGetAttributes: invalid device function
    

    It may be that the compute version used is not supported by your setup; try changing the CMake variable CUDA_COMPUTE.

  • If, when running the tests in python, you encounter an error like

      ImportError: No module named odlcuda
    

    It may be that you have not installed the package; run (as root) make pyinstall or equivalent.

License

GPL Version 3. See LICENSE file.

If you would like to get the code under a different license, please contact the developers.

Main developers

  • Jonas Adler (jonas-<ätt>-kth--se)
  • Holger Kohr (kohr-<ätt>-kth--se)


odlcuda's Issues

Change from boost to pybind11

pybind11 is an alternative to Boost.Python that does not require a compiled component (or the rest of Boost); switching would simplify compilation and distribution via PyPI.

installation with local odl

I managed to install odlcuda with odl that comes from pip. However, that version is outdated so I would like to use odlcuda with the latest odl version. Following exactly the same installation steps as with the off-the-shelf odl version, my odl cannot find "cuda" as an implementation, e.g.

NotImplementedError: no corresponding data space available for space FunctionSpace(IntervalProd([-333.8016, -333.8016, 0. ], [ 333.8016 , 333.8016 , 257.96875]), out_dtype='float32') and implementation 'cuda'

This is all strange, because it does find my odl version within the installation, and this odl version works perfectly fine without CUDA. Also, my system finds odlcuda, as it shows up in auto-complete after import odl. Any ideas what might have gone wrong here? What is the mechanism that tells odl that odlcuda is present?
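Regarding the last question: odl discovers backends such as odlcuda through setuptools entry points, which can be inspected from Python. A sketch of such an inspection follows; the group name 'odl.space' is an assumption based on the odl.space.entry_points module name, not confirmed from the odl sources.

```python
import pkg_resources

# odl scans a setuptools entry-point group to find installed backends.
# The group name below is an assumption for illustration purposes.
GROUP = 'odl.space'
backends = list(pkg_resources.iter_entry_points(GROUP))
print([ep.name for ep in backends])  # empty list if no backend is registered
```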

slow maximum function

I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?

import odl

X = odl.rn(300 * 10**6, dtype='float32')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

X = odl.rn(300 * 10**6, dtype='float32', impl='cuda')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)
CPU times: user 346 ms, sys: 200 µs, total: 346 ms
Wall time: 347 ms
CPU times: user 1.44 s, sys: 0 ns, total: 1.44 s
Wall time: 1.43 s
CPU times: user 838 ms, sys: 341 ms, total: 1.18 s
Wall time: 1.18 s
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 91.1 µs

performance of odlcuda

I was running some experiments and there seems to be a performance issue with CUDA. However, I am not sure whether odlcuda is causing it or this is just a general CUDA phenomenon. Below is the code that I ran with its output. I don't think it is very important to know what this code is doing, but if need be, I am happy to post that.

There seems to be a phenomenon where, with CUDA, the code scales linearly with the number of subsets even though in each run approximately the same number of flops is executed. The timings are pretty constant for numpy. The question is whether this is an issue with odlcuda or a general CUDA phenomenon. In all of these timings, there is no copying from or to the GPU. I am aware that launching kernels generates an overhead, but I would not have thought it is so dramatic, in particular since the smallest subset still contains 262 x 65000 = 17 million elements.

Do you think that this is related to the way ODL is written or do you think this is a general CUDA thing?

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)

impl:cuda, shape:[4200, 65000], nsubsets:1
CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
Wall time: 114 ms

impl:cuda, shape:[1050, 65000], nsubsets:4
CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
Wall time: 361 ms

impl:cuda, shape:[262, 65000], nsubsets:16
CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
Wall time: 1.49 s

impl:cuda, shape:[4200, 65000]
CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
Wall time: 116 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
Wall time: 6.88 s

impl:numpy, shape:[1050, 65000], nsubsets:4
CPU times: user 13.4 s, sys: 562 ms, total: 14 s
Wall time: 6.9 s

impl:numpy, shape:[262, 65000], nsubsets:16
CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
Wall time: 7.04 s

impl:numpy, shape:[4200, 65000]
CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
Wall time: 6.92 s

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        
        t = 0        
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)

impl:cuda, shape:[4200, 65000], nsubsets:1
time:0.607445955276, average:0.607445955276

impl:cuda, shape:[1050, 65000], nsubsets:4
time:0.405075311661, average:0.101268827915

impl:cuda, shape:[262, 65000], nsubsets:16

time:2.36511826515, average:0.147819891572

impl:cuda, shape:[4200, 65000]
CPU times: user 146 ms, sys: 127 µs, total: 146 ms
Wall time: 150 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
time:3.18624901772, average:3.18624901772

impl:numpy, shape:[1050, 65000], nsubsets:4
time:3.12681221962, average:0.781703054905

impl:numpy, shape:[262, 65000], nsubsets:16
time:3.25435972214, average:0.203397482634

impl:numpy, shape:[4200, 65000]
CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
Wall time: 3.19 s

Add optimized versions of p-norm and p-dist for p != 2

The standard p-norm can be implemented with the help of the CUDA sum and abs ufuncs, but this involves a copy. We need a C++ implementation if we want efficiency.

For p = inf I currently don't see a way of implementing it in Python. We can include that case in the same C++ function, too, but while we're at it, the max() and min() functions would be good to have.
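The ufunc-based fallback described above might look like the following numpy sketch (the temporary array produced by abs is the copy referred to); a native C++ kernel would fuse abs, power and the reduction into a single pass over the data:

```python
import numpy as np

def p_norm(x, p):
    """p-norm via elementwise ufuncs; np.abs(x) allocates a temporary."""
    if np.isinf(p):
        return np.abs(x).max()              # the p = inf case
    return (np.abs(x) ** p).sum() ** (1.0 / p)

x = np.array([3.0, -4.0])
print(p_norm(x, 2))        # 5.0
print(p_norm(x, np.inf))   # 4.0
```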

Installation issues

Hi, when I try to build I get the following error:

CMake Error at CMakeLists.txt:28 (add_dependencies):
  add_dependencies called with incorrect number of arguments


CMake Warning (dev) in CMakeLists.txt:
  No cmake_minimum_required command is present.  A line of code such as

    cmake_minimum_required(VERSION 2.8)

  should be added at the top of the file.  The version specified may be lower
  if you wish to support older CMake versions for this project.  For more
  information run "cmake --help-policy CMP0000".
This warning is for project developers.  Use -Wno-dev to suppress it.

Remove pyinstall

We should remove the custom pyinstall target and instead use the built-in install target provided by CMake. This is used in our STIR clone.

Installation without admin rights

I have been trying to install odlcuda without admin rights. I have changed the "CMAKE_INSTALL_PREFIX" folder to a folder where I do have write access. This approach worked well for installing other software packages. However, odlcuda seems to ignore this path and tries to install into the default path

/usr/local/lib/python2.7/dist-packages/

instead.

I tried two options for changing the installation path, both of which should work according to some forums, but neither works here.

  • cmake -DCMAKE_INSTALL_PREFIX=~/.local/lib/python2.7/site-packages ../
  • make DESTDIR=~/.local/lib/python2.7/site-packages pyinstall

Any ideas of what is going on here?

Error when trying to install odlcuda

We installed odl and ran its tests with no issues. Then we tried to install odlcuda, but we had a problem when we ran the command "CUDA_ROOT=/usr/local/cuda-9.0 CUDA_COMPUTE=37 conda build ./conda". The error shows "conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"odl[version='>=0.3.0']"}".
Do you know what the problem is?

Better error with wrong GCC

Currently, using mismatched GCC and CUDA versions (particularly GCC 5.x with CUDA < 8) gives weird errors. Perhaps we should check for this to help users.

Errors when installing odlcuda with GCC.

We have installed odl and ran pytest with no errors. But when we tried to install odlcuda using the command “CUDA_ROOT=/usr/local/cuda-10.2 CUDA_COMPUTE=75 conda build ./conda”, the following error occurs:
"conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"gcc[version='<5']"}".
I wonder what the problem is and how to solve it. Thanks a lot.

using odlcuda

This is kind of a continuation of issue odlgroup/odl#1074.

A few observations:

  1. I can import odlcuda. The import also works without _install_location = __file__ but in the following I left it in.

  2. The order matters. I first tried

import odlcuda
import odl

which causes odl not to know 'cuda' but

import odl
import odlcuda

works!

  3. This is not really related to CUDA but still weird: I tried to test timings similar to my application.

domain_cpu = odl.uniform_discr([0], [1], [3e+8], impl='numpy')

and failed.

Traceback (most recent call last):

File "", line 4, in
domain_cpu = odl.uniform_discr([0], [1], [3e+8], impl='numpy')

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1311, in uniform_discr
**kwargs)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1222, in uniform_discr_fromintv
**kwargs)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1136, in uniform_discr_fromspace
nodes_on_bdry)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/partition.py", line 940, in uniform_partition_fromintv
grid = uniform_grid_fromintv(intv_prod, shape, nodes_on_bdry=nodes_on_bdry)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/grid.py", line 1092, in uniform_grid_fromintv
shape = normalized_scalar_param_list(shape, intv_prod.ndim, safe_int_conv)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/util/normalize.py", line 149, in normalized_scalar_param_list
out_list.append(param_conv(p))

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/util/normalize.py", line 396, in safe_int_conv
raise ValueError('cannot safely convert {} to integer'.format(number))

ValueError: cannot safely convert 300000000.0 to integer

  4. Now the CUDA issue: Instead of 1D, I went 3D:

domain_gpu = odl.uniform_discr([0, 0, 0], [1, 1, 1], [4000, 300, 400], impl='cuda')
x_gpu = domain_gpu.one()

error:

Traceback (most recent call last):

File "", line 4, in
x_gpu = domain_gpu.one()

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/discretization.py", line 473, in one
return self.element_type(self, self.dspace.one())

File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/cu_ntuples.py", line 912, in one
return self.element_type(self, self._vector_impl(self.size, 1))

RuntimeError: function_attributes(): after cudaFuncGetAttributes: invalid device function

Any idea what is wrong here?

Build for multiple CUDA compute versions?

Currently AFAICS you can only specify a single CUDA_COMPUTE value. For conda packages it would be good to build for a number of values so the package works for different GPU architectures.
I don't know very much about this topic so I don't know if this is necessary at all or if it's fine to just set the minimum version required. The only thing I know is that the packages on the conda channel are built with 52 and fail for lower versions due to "invalid device function" errors.

nd arrays

We currently only support 1d arrays. I'll look into how we could improve this to either true nd array support, or at least 2d/3d support.
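Until then, a common workaround is to keep the data in a flat 1d space and do the shape bookkeeping on the host. The following numpy sketch illustrates the idea (it is not odlcuda API):

```python
import numpy as np

# With a backend that stores only flat 1d arrays, an (nx, ny) volume
# can still be pushed through a 1d space by flattening on the way in
# and reshaping on the way out; the caller tracks the original shape.
shape = (3, 4)
vol = np.arange(12, dtype='float32').reshape(shape)

flat = vol.ravel()              # 1d view in C order
restored = flat.reshape(shape)  # recover the 2d layout

print(np.array_equal(restored, vol))  # True
```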

Circular dependency with CUDA

We have a circular dependency wherein odlcuda imports odl, and odl.space.entry_points imports odlcuda. We need to solve this somehow.
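One common way out is to defer one side of the cycle: the entry-point side can postpone importing the backend until a CUDA space is actually requested, so neither module imports the other at load time. A generic sketch of the pattern (the function name and the default module name are stand-ins, not odl API):

```python
import importlib

_backend = None

def get_cuda_backend(module_name='odlcuda'):
    """Import the backend lazily, on first use, instead of at module load.

    Deferring the import breaks the load-time cycle: by the time this
    function first runs, both odl and the backend module can be fully
    initialized. The result is cached after the first call.
    """
    global _backend
    if _backend is None:
        _backend = importlib.import_module(module_name)
    return _backend
```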
