
odlcuda

CUDA backend for ODL

Introduction

This is the default CUDA backend for ODL. It contains an implementation of a CUDA-based Rn space with associated methods.

By installing this package, CUDA spaces become available to rn, uniform_discr and related spaces in ODL through the impl='cuda' option.

Building

The project uses CMake for its builds; using cmake-gui is recommended.

Cmake webpage

Unix:

The binaries are typically built in a build folder. Starting at the repository's top level directory, do

mkdir build
cd build
cmake-gui ../

and set the required variables.

To build and install the package to your python installation, run (as root)

make pyinstall

To verify your installation, run (in the odlcuda root directory)

py.test

This requires the pytest package.

Windows

To build on Windows, open the CMake GUI, run Configure and Generate, and set the required variables. Then open the project in Visual Studio and build the pyinstall target.

Code guidelines

The code is written in C++11/14.

Compilers

The code is intended to be usable with all major compilers. Current status (2015-06-22):

| Platform   | Compiler | CUDA | Compute | Works |
|------------|----------|------|---------|-------|
| Windows 7  | VS2013   | 7.0  | 2.0     |       |
| Windows 7  | VS2013   | 7.0  | 5.2     |       |
| Windows 10 | VS2015   | 7.0  | 5.2     | TODO  |
| Fedora 21  | GCC 4.9  | 7.0  | 5.2     |       |
| Ubuntu ?.? | ???      | 7.0  | 5.2     |       |
| Mac OSX    | ???      | 7.0  | 5.2     | TODO  |

Formatting

The code is formatted using clang-format as provided by the LLVM project. The particular style used is defined in the formatting file.

External Dependencies

Current external dependencies are

Python

The foundation of ODL. odlcuda needs access to both the Python and NumPy header files and the compiled libraries to link against.

Boost

A general-purpose C++ library. This project specifically uses Boost.Python to handle the Python bindings.

Boost webpage. Prebuilt Boost binaries (Windows); this version uses Python 2.7.

CUDA

Used for GPU-accelerated versions of algorithms. The code uses C++11 features in device code, so CUDA 7.0 is required; CUDA 6.5 may work on some platforms (notably Windows). As of now it compiles with compute capability >= 2.0; in the future, features requiring a higher compute capability may be added.

CUDA

Troubleshooting

There are a few commonly encountered errors; here are solutions to some of them.

Installation

  • If, when compiling, you get an error like

      NumpyConfig.h not found
    

    then it is likely that the variable PYTHON_NUMPY_INCLUDE_DIR is not correctly set.

  • If compilation fails with

      error C1083: Cannot open include file: 'Eigen/Core': No such file or directory.

    then you have tried to build the default target; build the "pyinstall" target instead.

  • If compilation fails with

      [ 20%] Building NVCC (Device) object odlcuda/CMakeFiles/cuda.dir//./cuda_generated_cuda.cu.o /usr/include/c++/4.9.2/bits/alloc_traits.h(248): error: expected a ">"

    then you may be trying to compile with CUDA 6.5 and GCC 4.9; this combination is not supported by CUDA.

  • If linking fails with

      error LNK1112: module machine type 'x64' conflicts with target machine type 'X86'

    then you have a 64-bit library (Boost, for instance) on your path while trying to build a 32-bit odlcuda. Either change the library or configure a 64-bit build. On Windows, if you are compiling with Visual Studio, use the Configuration Manager to set the platform to x64; if you are compiling on the command line via CMake, ensure that the generator is, for instance, "Visual Studio 12 2013 Win64" (note the Win64 at the end).

  • If you get an error like

      Error	5	error LNK2019: unresolved external symbol "__declspec(dllimport) struct _object * __cdecl boost::python::detail::init_module(struct PyModuleDef &,void (__cdecl*)(void))" (__imp_?init_module@detail@python@boost@@YAPEAU_object@@AEAUPyModuleDef@@P6AXXZ@Z) referenced in function PyInit_utils	C:\Programming\Projects\odlcuda_bin\odlcuda\utils.obj	utils
    

    then it is likely that you are trying to build against mismatched Python header files and Boost.Python versions.
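For the PYTHON_NUMPY_INCLUDE_DIR variable mentioned in the first item, the correct path can be queried directly from the Python interpreter you intend to build against (assuming numpy is importable there):

```python
import numpy

# numpy ships its C headers (including numpyconfig.h) in the directory
# reported by get_include(); pass this path to CMake as
# PYTHON_NUMPY_INCLUDE_DIR.
include_dir = numpy.get_include()
print(include_dir)
```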

Running

  • If, when running the tests in python, you get an error like

      RuntimeError: function_attributes(): after cudaFuncGetAttributes: invalid device function
    

    It may be that the compute version used is not supported by your setup; try changing the CMake variable CUDA_COMPUTE.

  • If, when running the tests in python, you encounter an error like

      ImportError: No module named odlcuda
    

    It may be that you have not installed the package; run (as root) make pyinstall or equivalent.

License

GPL Version 3. See LICENSE file.

If you would like to get the code under a different license, please contact the developers.

Main developers

  • Jonas Adler (jonas-<ätt>-kth--se)
  • Holger Kohr (kohr-<ätt>-kth--se)


odlcuda's Issues

Change from boost to pybind11

pybind11 is an alternative to Boost.Python that does not require a compiled component (or the rest of Boost); switching would simplify compilation and distribution via PyPI.

installation with local odl

I managed to install odlcuda with odl that comes from pip. However, that version is outdated so I would like to use odlcuda with the latest odl version. Following exactly the same installation steps as with the off-the-shelf odl version, my odl cannot find "cuda" as an implementation, e.g.

NotImplementedError: no corresponding data space available for space FunctionSpace(IntervalProd([-333.8016, -333.8016, 0. ], [ 333.8016 , 333.8016 , 257.96875]), out_dtype='float32') and implementation 'cuda'

This is all strange, because it does find my odl version within the installation, and this odl version works perfectly fine without CUDA. Also, my system finds odlcuda, as it shows up in auto-complete after import odl. Any ideas what might have gone wrong here? What is the mechanism that tells odl that odlcuda is present?
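Regarding the last question: odl discovers backends such as odlcuda through setuptools entry points, which can be inspected from Python. A sketch of such an inspection follows; the group name 'odl.space' is an assumption based on the odl.space.entry_points module name, not confirmed from the odl sources.

```python
import pkg_resources

# odl scans a setuptools entry-point group to find installed backends.
# The group name below is an assumption for illustration purposes.
GROUP = 'odl.space'
backends = list(pkg_resources.iter_entry_points(GROUP))
print([ep.name for ep in backends])  # empty list if no backend is registered
```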

slow maximum function

I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?

import odl

X = odl.rn(300 * 10**6, dtype='float32')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

X = odl.rn(300 * 10**6, dtype='float32', impl='cuda')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)
CPU times: user 346 ms, sys: 200 µs, total: 346 ms
Wall time: 347 ms
CPU times: user 1.44 s, sys: 0 ns, total: 1.44 s
Wall time: 1.43 s
CPU times: user 838 ms, sys: 341 ms, total: 1.18 s
Wall time: 1.18 s
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 91.1 µs

performance of odlcuda

I was running some experiments and there seems to be a performance issue with CUDA. However, I am not sure whether odlcuda is causing it or this is just a general CUDA phenomenon. Below is the code that I ran with its output. I don't think it is very important to know what this code is doing, but if need be, I am happy to post that.

There seems to be a phenomenon where, with CUDA, the code scales linearly with the number of subsets even though in each run approximately the same number of flops is executed. The timings are pretty constant for numpy. The question is whether this is an issue with odlcuda or a general CUDA phenomenon. In all of these timings, there is no copying from or to the GPU. I am aware that launching kernels generates an overhead, but I would not have thought it is so dramatic, in particular since the smallest subset still contains 262 x 65000 = 17 million elements.

Do you think that this is related to the way ODL is written or do you think this is a general CUDA thing?

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)

impl:cuda, shape:[4200, 65000], nsubsets:1
CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
Wall time: 114 ms

impl:cuda, shape:[1050, 65000], nsubsets:4
CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
Wall time: 361 ms

impl:cuda, shape:[262, 65000], nsubsets:16
CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
Wall time: 1.49 s

impl:cuda, shape:[4200, 65000]
CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
Wall time: 116 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
Wall time: 6.88 s

impl:numpy, shape:[1050, 65000], nsubsets:4
CPU times: user 13.4 s, sys: 562 ms, total: 14 s
Wall time: 6.9 s

impl:numpy, shape:[262, 65000], nsubsets:16
CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
Wall time: 7.04 s

impl:numpy, shape:[4200, 65000]
CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
Wall time: 6.92 s

for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        
        t = 0        
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)

impl:cuda, shape:[4200, 65000], nsubsets:1
time:0.607445955276, average:0.607445955276

impl:cuda, shape:[1050, 65000], nsubsets:4
time:0.405075311661, average:0.101268827915

impl:cuda, shape:[262, 65000], nsubsets:16

time:2.36511826515, average:0.147819891572

impl:cuda, shape:[4200, 65000]
CPU times: user 146 ms, sys: 127 µs, total: 146 ms
Wall time: 150 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
time:3.18624901772, average:3.18624901772

impl:numpy, shape:[1050, 65000], nsubsets:4
time:3.12681221962, average:0.781703054905

impl:numpy, shape:[262, 65000], nsubsets:16
time:3.25435972214, average:0.203397482634

impl:numpy, shape:[4200, 65000]
CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
Wall time: 3.19 s

Add optimized versions of p-norm and p-dist for p != 2

The standard p-norm can be implemented with the help of the CUDA sum and abs ufuncs, but this involves a copy. We need a C++ implementation if we want efficiency.

For p = inf I currently don't see a way of implementing it in Python. We can include that case in the same C++ function, too, but while we're at it, the max() and min() functions would be good to have.
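The ufunc-based fallback described above might look like the following numpy sketch (the temporary array produced by abs is the copy referred to); a native C++ kernel would fuse abs, power and the reduction into a single pass over the data:

```python
import numpy as np

def p_norm(x, p):
    """p-norm via elementwise ufuncs; np.abs(x) allocates a temporary."""
    if np.isinf(p):
        return np.abs(x).max()              # the p = inf case
    return (np.abs(x) ** p).sum() ** (1.0 / p)

x = np.array([3.0, -4.0])
print(p_norm(x, 2))        # 5.0
print(p_norm(x, np.inf))   # 4.0
```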

Installation issues

Hi, when I try to build I get the following error:

CMake Error at CMakeLists.txt:28 (add_dependencies):
  add_dependencies called with incorrect number of arguments


CMake Warning (dev) in CMakeLists.txt:
  No cmake_minimum_required command is present.  A line of code such as

    cmake_minimum_required(VERSION 2.8)

  should be added at the top of the file.  The version specified may be lower
  if you wish to support older CMake versions for this project.  For more
  information run "cmake --help-policy CMP0000".
This warning is for project developers.  Use -Wno-dev to suppress it.

Remove pyinstall

We should remove the custom pyinstall target and instead use the built-in install target provided by CMake. This is used in our STIR clone.

Installation without admin rights

I have been trying to install odlcuda without admin rights. I have changed the "CMAKE_INSTALL_PREFIX" folder to a folder where I do have write access. This approach worked well for installing other software packages. However, odlcuda seems to ignore this path and tries to install into the default path

/usr/local/lib/python2.7/dist-packages/

instead.

I tried two options for changing the installation path, both of which should work according to some forums, but neither works here.

  • cmake -DCMAKE_INSTALL_PREFIX=~/.local/lib/python2.7/site-packages ../
  • make DESTDIR=~/.local/lib/python2.7/site-packages pyinstall

Any ideas of what is going on here?

Error when trying to install odlcuda

We installed odl and ran its tests with no issues. Then we tried to install odlcuda, but we had a problem when we ran the command "CUDA_ROOT=/usr/local/cuda-9.0 CUDA_COMPUTE=37 conda build ./conda". The error shows "conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"odl[version='>=0.3.0']"}".
Do you know what the problem is?

Better error with wrong GCC

Currently, using mismatched GCC and CUDA versions (particularly GCC 5.x with CUDA < 8) gives weird errors. Perhaps we should check for this to help users.

Errors when installing odlcuda with GCC.

We have installed odl and ran pytest with no errors. But when we tried to install odlcuda using the command “CUDA_ROOT=/usr/local/cuda-10.2 CUDA_COMPUTE=75 conda build ./conda”, the following error occurs:
"conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"gcc[version='<5']"}".
I wonder what the problem is and how to solve it. Thanks a lot.

using odlcuda

This is kind of a continuation of issue odlgroup/odl#1074.

A few observations:

  1. I can import odlcuda. The import also works without _install_location = __file__ but in the following I left it in.

  2. The order matters. I first tried

import odlcuda
import odl

which causes odl not to know 'cuda' but

import odl
import odlcuda

works!

  3. This is not really related to CUDA but still weird: I tried to test timings similar to my application.

domain_cpu = odl.uniform_discr([0], [1], [3e+8], impl='numpy')

and failed.

Traceback (most recent call last):

File "", line 4, in
domain_cpu = odl.uniform_discr([0], [1], [3e+8], impl='numpy')

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1311, in uniform_discr
**kwargs)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1222, in uniform_discr_fromintv
**kwargs)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/lp_discr.py", line 1136, in uniform_discr_fromspace
nodes_on_bdry)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/partition.py", line 940, in uniform_partition_fromintv
grid = uniform_grid_fromintv(intv_prod, shape, nodes_on_bdry=nodes_on_bdry)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/grid.py", line 1092, in uniform_grid_fromintv
shape = normalized_scalar_param_list(shape, intv_prod.ndim, safe_int_conv)

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/util/normalize.py", line 149, in normalized_scalar_param_list
out_list.append(param_conv(p))

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/util/normalize.py", line 396, in safe_int_conv
raise ValueError('cannot safely convert {} to integer'.format(number))

ValueError: cannot safely convert 300000000.0 to integer

  4. Now the CUDA issue: Instead of 1D, I went 3D:

domain_gpu = odl.uniform_discr([0, 0, 0], [1, 1, 1], [4000, 300, 400], impl='cuda')
x_gpu = domain_gpu.one()

error:

Traceback (most recent call last):

File "", line 4, in
x_gpu = domain_gpu.one()

File "/mhome/damtp/s/me404/store/repositories/git_ODL/odl/discr/discretization.py", line 473, in one
return self.element_type(self, self.dspace.one())

File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/cu_ntuples.py", line 912, in one
return self.element_type(self, self._vector_impl(self.size, 1))

RuntimeError: function_attributes(): after cudaFuncGetAttributes: invalid device function

Any idea what is wrong here?

Build for multiple CUDA compute versions?

Currently AFAICS you can only specify a single CUDA_COMPUTE value. For conda packages it would be good to build for a number of values so the package works for different GPU architectures.
I don't know very much about this topic so I don't know if this is necessary at all or if it's fine to just set the minimum version required. The only thing I know is that the packages on the conda channel are built with 52 and fail for lower versions due to "invalid device function" errors.

nd arrays

We currently only support 1d arrays. I'll look into how we could improve this to either true nd array support, or at least 2d/3d support.
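Until then, a common workaround is to keep the data in a flat 1d space and do the shape bookkeeping on the host. The following numpy sketch illustrates the idea (it is not odlcuda API):

```python
import numpy as np

# With a backend that stores only flat 1d arrays, an (nx, ny) volume
# can still be pushed through a 1d space by flattening on the way in
# and reshaping on the way out; the caller tracks the original shape.
shape = (3, 4)
vol = np.arange(12, dtype='float32').reshape(shape)

flat = vol.ravel()              # 1d view in C order
restored = flat.reshape(shape)  # recover the 2d layout

print(np.array_equal(restored, vol))  # True
```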

Circular dependency with CUDA

We have a circular dependency wherein odlcuda imports odl, and odl.space.entry_points imports odlcuda. We need to solve this somehow.
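One common way out is to defer one side of the cycle: the entry-point side can postpone importing the backend until a CUDA space is actually requested, so neither module imports the other at load time. A generic sketch of the pattern (the function name and the default module name are stand-ins, not odl API):

```python
import importlib

_backend = None

def get_cuda_backend(module_name='odlcuda'):
    """Import the backend lazily, on first use, instead of at module load.

    Deferring the import breaks the load-time cycle: by the time this
    function first runs, both odl and the backend module can be fully
    initialized. The result is cached after the first call.
    """
    global _backend
    if _backend is None:
        _backend = importlib.import_module(module_name)
    return _backend
```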
