cuda-on-cl

Build applications written in NVIDIA® CUDA™ code for OpenCL™ 1.2 devices.

Concept

  • Compile using cocl
  • Link using -lcocl -lOpenCL
  • At runtime, the program loads libOpenCL.so

How to use, example

  • Write a CUDA source code file, or find an existing one
  • Here's a simple example: cuda_sample.cu
  • Use cocl to compile cuda_sample.cu:
$ cocl cuda_sample.cu
   ...
   ... (bunch of compily stuff) ...
   ...

    ./cuda_sample.cu compiled into ./cuda_sample

Run:

$ ./cuda_sample
Using Intel , OpenCL platform: Intel Gen OCL Driver
Using OpenCL device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444
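The contents of cuda_sample.cu are in the linked file; for orientation, a minimal CUDA source file of the kind cocl handles might look like the following sketch (hypothetical, reconstructed from the sample output above; the real cuda_sample.cu may differ):

```cuda
#include <cuda.h>
#include <stdio.h>

// Trivial kernel: thread 0 writes `value` into data[idx].
__global__ void setValue(float *data, int idx, float value) {
    if (threadIdx.x == 0) {
        data[idx] = value;
    }
}

int main() {
    float hostFloats[4];
    float *gpuFloats;
    cudaMalloc((void **)&gpuFloats, 4 * sizeof(float));

    // Launch a kernel, copy the result back to the host, and print it.
    setValue<<<1, 32>>>(gpuFloats, 2, 123.0f);
    cudaMemcpy(hostFloats, gpuFloats, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("hostFloats[2] %.0f\n", hostFloats[2]);

    cudaFree(gpuFloats);
    return 0;
}
```

cocl compiles both sides of this file: the `__global__` kernel becomes OpenCL, and the `cudaMalloc`/`cudaMemcpy`/launch calls are mapped onto the OpenCL runtime.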

Two-step compilation

If you want, you can compile in two steps:

cocl -c teststream.cu
g++ -o teststream teststream.o -lcocl -lclblast -leasycl -lclew

Result is the same:

$ ./cuda_sample
Using Intel , OpenCL platform: Intel Gen OCL Driver
Using OpenCL device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444

Options

Option                   Description
-I                       provide an include directory, e.g. -I /usr/local/eigen
-o                       output filepath, e.g. -o foo.o
-c                       compile to .o file; don't link
--devicell-opt [option]  pass [option] through to the device IR optimization phase; affects the success and quality of OpenCL generation
-fPIC                    passed through to the clang object-code compiler

The options provided to --devicell-opt are passed through to opt-3.8; see http://llvm.org/docs/Passes.html for the available passes.

opt-3.8 fits in as follows:

  • clang-3.8 -x cuda --device-only converts the incoming .cu file to LLVM IR
  • opt-3.8 optimizes the IR; the --devicell-opt options control this step
  • ir-to-opencl writes the IR out as OpenCL

Recommended generation options:

  • --devicell-opt inline --devicell-opt mem2reg --devicell-opt instcombine --devicell-opt O2

You can open the -device.cl file to look at the OpenCL generated, and compare the effects of different options.

How it works

Behind the scenes, there are a few parts:

  • Device-side, cocl converts the CUDA kernels into OpenCL kernels
  • Host-side, cocl:
    • converts the CUDA kernel launch code into OpenCL kernel launch code, and
    • bakes in the OpenCL code
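As an illustration of the host-side conversion, a CUDA triple-chevron launch corresponds roughly to an OpenCL argument-setup-plus-enqueue sequence. This is a conceptual sketch only, not the literal code cocl emits:

```cuda
// What the user writes (CUDA):
//     myKernel<<<grid, block>>>(gpuFloats, N);
//
// What the host side has to do instead (OpenCL, schematically):
//     clSetKernelArg(kernel, 0, sizeof(cl_mem), &gpuFloatsBuffer);
//     clSetKernelArg(kernel, 1, sizeof(int), &N);
//     size_t global = grid * block;   // global size = gridDim * blockDim
//     size_t local  = block;          // blockDim -> work-group size
//     clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
//                            0, NULL, NULL);
```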

More detail

New!

  • the device-IR to OpenCL step happens at runtime now
    • surprisingly, this is actually faster than doing it offline
    • that's because the GPU driver only needs to compile the small amount of OpenCL needed for a specific kernel, rather than an entire IR file's worth
  • in addition, address-space deduction becomes significantly easier

What it provides

  • compiler for host-side code, including memory allocation, copy, streams, kernel launches
  • compiler for device-side code, handling templated C++ code, converting it into bog-standard OpenCL 1.2 code
  • cuBLAS API implementations for GEMM, GEMV, SCAL, SAXPY (using Cedric Nugteren's CLBlast)
  • cudnn API implementations for:
    • convolution (using the im2col algorithm, on top of Cedric Nugteren's CLBlast)
    • pooling
    • activations: ReLU, tanh, sigmoid
    • softmax forward
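The im2col trick mentioned above unrolls each receptive field into a column, so that convolution reduces to a single GEMM. The following is a simplified single-channel, stride-1 device-side sketch of the idea (illustrative only; the actual implementation goes through CLBlast):

```cuda
// One thread per output position: copy the kH x kW patch at output
// position (row, c) into one column of the `col` matrix, so that
// convolution becomes a matrix multiply: filters x col.
__global__ void im2col_simple(const float *im, float *col,
                              int height, int width, int kH, int kW) {
    int outW = width - kW + 1;
    int outH = height - kH + 1;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= outH * outW) return;
    int row = idx / outW;
    int c = idx % outW;
    for (int i = 0; i < kH; i++) {
        for (int j = 0; j < kW; j++) {
            // Row (i * kW + j) of `col`, column idx; row stride is outH * outW.
            col[(i * kW + j) * (outH * outW) + idx] =
                im[(row + i) * width + (c + j)];
        }
    }
}
```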

How to build

Systems tested

  • Ubuntu 16.04, with:
    • NVIDIA GPU
  • Mac Sierra, with:
    • Intel HD Graphics 530
    • Radeon Pro 450

Pre-requisites

  • OpenCL-enabled GPU, and appropriate OpenCL drivers installed for the GPU

Mac OS X

cd ~
wget http://llvm.org/releases/3.8.0/clang+llvm-3.8.0-x86_64-apple-darwin.tar.xz
tar -xf clang+llvm-3.8.0-x86_64-apple-darwin.tar.xz
mv clang+llvm-3.8.0-x86_64-apple-darwin /usr/local/opt
ln -s /usr/local/opt/clang+llvm-3.8.0-x86_64-apple-darwin /usr/local/opt/llvm-3.8

Set CLANG_HOME: export CLANG_HOME=/usr/local/opt/llvm-3.8

Ubuntu 16.04

sudo apt-get install llvm-3.8 llvm-3.8-dev clang-3.8
sudo apt-get install git cmake cmake-curses-gui libc6-dev-i386 make gcc g++ zlib1g-dev

Set CLANG_HOME: export CLANG_HOME=/usr/lib/llvm-3.8

Build/installation

git clone --recursive https://github.com/hughperkins/cuda-on-cl
cd cuda-on-cl
mkdir build
cd build
cmake ..
# Note: I usually set build/release type to `Debug`, so this is what is tested
make -j 4
# on Ubuntu:
sudo make install
# or on Mac, if you have homebrew, you don't need sudo:
make install

Note that you'll need to keep the CLANG_HOME environment variable exported when using cocl.

Test

There are the following tests:

gtest tests

cd build
make -j 4
./cocl_unittests

These tests have no dependency on a graphics card etc.: they simply take some hand-crafted IR and write it to OpenCL, without ever actually trying to run the OpenCL. They validate:

  • can cocl handle the IR without choking/crashing?
  • do the hand-crafted OpenCL expected results match up with the actual cocl outputs?

Tests from python

Pre-requisites

pip install -r test/requirements.txt

Procedure

OFFSET_32BIT=1 \
COCL_OPTIONS='--devicell-opt inline --devicell-opt mem2reg --devicell-opt instcombine --devicell-opt O2' \
py.test -v
The Python tests are in the test directory.

You can modify the options in COCL_OPTIONS. These are passed to the cocl command; see the Options section above.

If you set OFFSET_32BIT to off in your cmake options, you should remove the OFFSET_32BIT=1 option when running py.test.

End-to-end tests

Run:

cd build
ccmake ..

Turn on BUILD_TESTS, and run the build.

Now you can do, from build directory:

make run-tests

You can run a test by name, e.g.:

make run-offsetkernelargs

Result:

################################
# running:
################################
LD_LIBRARY_PATH=build: build/test-cocl-offsetkernelargs
Using Intel , OpenCL platform: Intel Gen OCL Driver
Using OpenCL device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
126.456

Tests options

From ccmake .., there are various options you can choose that affect the OpenCL code produced. These options affect how well the OpenCL generation works, and how acceptable the result is to your GPU driver. If you're reading the generated OpenCL code, they affect readability too.

See the Options section above for more details.

Docker

See docker. The Docker images run OK on Beignet and NVIDIA :-)

Related projects

License

Apache 2.0

News

  • May 1:
    • dnn tests pass on Radeon Pro 450, on Mac Sierra now
    • fix crash bugs in pooling forward/backward, on Mac Sierra
    • thanks to my employer ASAPP giving me use of a nice Mac Book Pro 4th Generation, with Radeon Pro 450, unit tests now pass on said hardware :-)
  • April 29:
    • Updated to latest EasyCL. This lets you use the environment variable CL_GPUOFFSET to choose a different GPU, e.g. set it to 1 to use the second GPU, 2 to use the third GPU, etc.
  • April 14:
    • added backwards implementation for convolution, including data, filters, and bias
  • April 13:
    • added CLBlast wrappers for: sgemv, sscal, saxpy
  • April 4:
    • merged in current dnn branch, which provides forward convolutional implementation for cudnn API, using im2col over Cedric Nugteren's CLBlast
    • CUDA-on-CL got accepted for a technical presentation at this year's IWOCL conference :-) Conference sessions here: IWOCL 2017 Conference program
  • Nov 24:
    • merge from branch clwriter:
      • lots of refactoring under the hood
      • can handle determining the address space of functions returning pointers
      • OpenCL generation now happens at runtime => this facilitates determining address spaces, and, counter-intuitively, is actually faster, because the GPU driver has less OpenCL to compile
  • Nov 17:
    • merged runtime-compile branch into master branch. This brings a few changes:
      • opencl generation is now at runtime, rather than at compile time
        • this lets us build only the one specific kernel we need
        • means more information is available at generation time, facilitating the generation process
      • build on Mac OS X is more or less working, eg https://travis-ci.org/hughperkins/cuda-on-cl/builds/176580716
      • code radically refactored underneath
      • removed --run_branch_transforms and --branches_as_switch, for now
  • Nov 8:
    • exposed generation options as cocl options, eg --run_branching_transforms, --branches_as_switch, and the --devicell-opt [opt] options
  • Nov 6:
    • created dockerfiles for Beignet and NVIDIA docker
  • Older news

Contributors

hughperkins, weimzh, guoyejun
