
cl-cuda's Introduction

Cl-Cuda

Cl-cuda is a library for using NVIDIA CUDA in Common Lisp programs. It provides not only an FFI binding to the CUDA driver API but also a kernel description language with which users can define CUDA kernel functions as S-expressions. The kernel description language also provides facilities to define kernel macros and kernel symbol macros in addition to kernel functions. Cl-cuda's kernel macros and kernel symbol macros offer powerful abstraction that CUDA C itself does not have and provide an enormous advantage in resource-limited GPU programming.

Kernel functions defined with the kernel description language can be launched almost the same way as ordinary Common Lisp functions, except that they must be launched in a CUDA context and followed by grid and block sizes. Kernel functions are compiled and loaded automatically and lazily when they are launched for the first time. The process is as follows. First, cl-cuda compiles them into CUDA C code (a .cu file). The CUDA C code is then compiled into a CUDA kernel module (a .ptx file) by NVCC, the NVIDIA CUDA Compiler Driver. The obtained kernel module is automatically loaded via the CUDA driver API, and finally the kernel functions are launched with properly constructed arguments to be passed to the CUDA device. Since this process is managed autonomously by the kernel manager, users do not need to handle it themselves. About the kernel manager, see the Kernel manager section.

Memory management is also one of the most important aspects of GPU programming. Cl-cuda provides the memory block data structure, which abstracts host memory and device memory. With memory blocks, users do not need to manage host memory and device memory individually themselves. This lightens the burden of memory management, prevents bugs and keeps code simple. Besides memory blocks, which provide a high-level abstraction over host and device memory, cl-cuda also offers low-level interfaces for handling CFFI pointers and CUDA device pointers directly. With these primitive interfaces, users can gain more flexible memory control than memory blocks offer, if needed.

Cl-cuda is verified on several environments. For details, see the Verification environments section.

Example

The following code is part of a vector addition example using cl-cuda, based on the CUDA SDK's "vectorAdd" sample.

You can define the vec-add-kernel kernel function using the defkernel macro. In the definition, aref refers to values stored in an array, and set stores values into an array. block-dim-x, block-idx-x and thread-idx-x correspond to CUDA C's built-in variables and are used to compute the array index operated on by each CUDA thread.

Once the kernel function is defined, you can launch it as if it were an ordinary Common Lisp function, except that the call must occur in a CUDA context and be followed by :grid-dim and :block-dim keyword parameters which specify the dimensions of the grid and block. To keep a CUDA context, you can use the with-cuda macro, which is responsible for initializing CUDA and managing a CUDA context. with-memory-blocks manages memory blocks, which abstract the host and device memory areas, and sync-memory-block copies data stored in a memory block between host and device.

For the whole code, please see examples/vector-add.lisp.

(defkernel vec-add-kernel (void ((a float*) (b float*) (c float*) (n int)))
  (let ((i (+ (* block-dim-x block-idx-x) thread-idx-x)))
    (if (< i n)
        (set (aref c i)
             (+ (aref a i) (aref b i))))))

(defun main ()
  (let* ((dev-id 0)
         (n 1024)
         (threads-per-block 256)
         (blocks-per-grid (/ n threads-per-block)))
    (with-cuda (dev-id)
      (with-memory-blocks ((a 'float n)
                           (b 'float n)
                           (c 'float n))
        (random-init a n)
        (random-init b n)
        (sync-memory-block a :host-to-device)
        (sync-memory-block b :host-to-device)
        (vec-add-kernel a b c n
                        :grid-dim  (list blocks-per-grid 1 1)
                        :block-dim (list threads-per-block 1 1))
        (sync-memory-block c :device-to-host)
        (verify-result a b c n)))))

Installation

You can install cl-cuda via quicklisp.

> (ql:quickload :cl-cuda)

If you encounter the following error, install CFFI explicitly with (ql:quickload :cffi) before loading cl-cuda. Doing so once is enough.

Component CFFI-GROVEL not found
   [Condition of type ASDF/FIND-SYSTEM:MISSING-COMPONENT]

Requirements

Cl-cuda requires the following:

  • NVIDIA CUDA-enabled GPU
  • CUDA Toolkit, CUDA Drivers and CUDA SDK need to be installed

Verification environments

Cl-cuda is verified to work in the following environments:

Environment 1

  • Mac OS X 10.6.8 (MacBookPro)
  • GeForce 9400M
  • CUDA 4
  • SBCL 1.0.55 32-bit
  • All tests pass, all examples work

Environment 2

  • Amazon Linux x86_64 (Amazon EC2)
  • Tesla M2050
  • CUDA 4
  • SBCL 1.1.7 64-bit
  • All tests pass, all examples which are verified work (others not tried yet)
  • (setf *nvcc-options* (list "-arch=sm_20" "-m32")) needed

Environment 3 (Thanks to Viktor Cerovski)

  • Linux 3.5.0-32-generic Ubuntu SMP x86_64
  • GeForce 9800 GT
  • CUDA 5
  • SBCL 1.1.7 64-bit
  • All tests pass, all examples work

Environment 4 (Thanks to wvxvw)

  • Fedora 18 x86_64
  • GeForce GTX 560M
  • CUDA 5.5
  • SBCL 1.1.2-1.fc18
  • vector-add example works (didn't try the rest yet)

Further information:

  • (setf *nvcc-options* (list "-arch=sm_20" "-m32")) needed
  • using video drivers from rpmfusion instead of the ones in cuda package
  • see issue #1

Environment 5 (Thanks to Atabey Kaygun)

  • Linux 3.11-2-686-pae SMP Debian 3.11.8-1 (2013-11-13) i686 GNU/Linux
  • NVIDIA Corporation GK106 GeForce GTX 660
  • CUDA 5.5
  • SBCL 1.1.12
  • All tests pass, all examples work

Environment 6 (Thanks to @gos-k)

  • Ubuntu 16.04.1 LTS
  • GeForce GTX 1080
  • CUDA Version 8.0.27
  • Driver Version 367.35
  • CCL Version 1.11-r16635 (LinuxX8664)
  • All tests pass, all examples work

API

This section explains some commonly used APIs.

[Macro] with-cuda

WITH-CUDA (dev-id) &body body

Initializes CUDA and keeps a CUDA context during body. dev-id is passed to the get-cuda-device function, and the returned device handle is passed to the create-cuda-context function to create a CUDA context in the expanded form. The results of the get-cuda-device and create-cuda-context functions are bound to the *cuda-device* and *cuda-context* special variables respectively. The kernel manager unloads its kernel module before with-cuda exits.
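
As a minimal sketch, assuming a CUDA-enabled device with ID 0 is available:

(with-cuda (0)
  ;; *cuda-device* and *cuda-context* are bound within this body.
  (synchronize-context))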

[Function] synchronize-context

SYNCHRONIZE-CONTEXT

Blocks until a CUDA context has completed all preceding requested tasks.

[Function] alloc-memory-block

ALLOC-MEMORY-BLOCK type size

Allocates a memory block that holds size elements of type type and returns it. Internally, linear memory areas are allocated on both host and device memory, and the memory block holds pointers to them.

[Function] free-memory-block

FREE-MEMORY-BLOCK memory-block

Frees memory-block previously allocated by alloc-memory-block. Freeing a memory block twice should cause an error.
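
A hedged sketch of the manual pattern; the with-memory-block macro described below expands to essentially this, and process-block stands for a hypothetical user function:

(let ((blk (alloc-memory-block 'float 1024)))
  (unwind-protect
       (process-block blk)
    (free-memory-block blk)))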

[Macro] with-memory-block, with-memory-blocks

WITH-MEMORY-BLOCK (var type size) &body body
WITH-MEMORY-BLOCKS ({(var type size)}*) &body body

Binds var to a memory block allocated using alloc-memory-block applied to the given type and size during body. The memory block is freed using free-memory-block when with-memory-block exits. with-memory-blocks is a plural form of with-memory-block.

[Function] sync-memory-block

SYNC-MEMORY-BLOCK memory-block direction

Copies stored data between host memory and device memory for memory-block. direction is either :host-to-device or :device-to-host which specifies the direction of copying.

[Accessor] memory-block-aref

MEMORY-BLOCK-AREF memory-block index

Accesses memory-block's element specified by index. Note that the accessed memory area is that on host memory. Use sync-memory-block to synchronize stored data between host memory and device memory.
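
For instance, the following sketch fills a memory block on the host and then copies it to the device; a and n stand for a memory block and its size, as in the vector addition example above:

(dotimes (i n)
  (setf (memory-block-aref a i) (random 1.0)))
(sync-memory-block a :host-to-device)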

[Macro] defglobal

DEFGLOBAL name type &optional expression qualifiers

Defines a global variable. name is a symbol naming the variable. type is the type of the variable. The optional expression is an expression which initializes the variable. The optional qualifiers is one of, or a list of, the keywords :device, :constant, :shared, :managed and :restrict, which correspond to CUDA C's __device__, __constant__, __shared__, __managed__ and __restrict__ variable qualifiers. If not given, :device is used.

(defglobal pi float 3.14159 :constant)

[Accessor] global-ref

Accesses a global variable's value on the device from the host, automatically copying the value from/to the device.

(defglobal x int 0 :device)
(global-ref x)                 ; => 0
(setf (global-ref x) 42)
(global-ref x)                 ; => 42

[Special Variable] *tmp-path*

Specifies the temporary directory in which cl-cuda generates files such as .cu file and .ptx file. The default is "/tmp/".

(setf *tmp-path* "/path/to/tmp/")

[Special Variable] *nvcc-options*

Specifies additional command line options passed to the nvcc command that cl-cuda calls internally. The default is nil. If the -arch=sm_XX option is not specified here, it is inserted automatically based on the device's compute capability obtained via the cuDeviceComputeCapability driver API.

(setf *nvcc-options* (list "-arch=sm_20" "-m32"))

[Special Variable] *nvcc-binary*

Specifies the path to the nvcc command so that cl-cuda can call it internally. The default is just nvcc.

(setf *nvcc-binary* "/path/to/nvcc")

[Special Variable] *show-messages*

Specifies whether cl-cuda shows operational messages. The default is t.

(setf *show-messages* nil)

[Special Variable] *sdk-not-found*

Read-only. The value is t if cl-cuda could not find the CUDA SDK, or at least failed to load libcuda for some reason; otherwise nil.

*sdk-not-found*    ; => nil

Kernel Description Language

Types

not documented yet.

IF statement

IF test-form then-form [else-form]

if allows the execution of a form to be dependent on a single test-form. First, test-form is evaluated. If the result is true, then-form is selected; otherwise else-form is selected. Whichever form is selected is then evaluated. If else-form is not provided and the test is false, nothing is done.

Example:

(if (= a 0)
    (return 0)
    (return 1))

Compiled:

if (a == 0) {
  return 0;
} else {
  return 1;
}

LET statement

LET ({(var init-form)}*) statement*

let declares new variable bindings, sets the corresponding init-forms to them, and executes a series of statements that use these bindings. let performs the bindings in parallel; for sequential binding, use the let* kernel macro instead.

Example:

(let ((i 0))
  (return i))

Compiled:

{
  int i = 0;
  return i;
}

SYMBOL-MACROLET statement

SYMBOL-MACROLET ({(symbol expansion)}*) statement*

symbol-macrolet establishes symbol expansion rules in the variable environment and executes a series of statements that use these rules. In cl-cuda's compilation process, the symbol macros found in a form are replaced by their corresponding expansions.

Example:

(symbol-macrolet ((x 1.0))
  (return x))

Compiled:

{
  return 1.0;
}

MACROLET statement

MACROLET ({(name lambda-list local-form*)}*) statement*

macrolet establishes local macro definitions, using the same format as defkernelmacro, and executes a series of statements with these definition bindings.

Example:

(macrolet ((square (a)
             (if (numberp a)
                 (* a a)
                 `(* ,a ,a))))
  (return (square 2)))

Compiled:

{
  return 4;
}

DO statement

DO ({(var init-form step-form)}*) (test-form) statement*

do iterates over a group of statements until test-form holds. do accepts an arbitrary number of iteration vars, whose initial values are supplied by init-forms. step-forms specify how the vars should be updated on succeeding iterations through the loop.

Example:

(do ((a 0 (+ a 1))
     (b 0 (+ b 1)))
    ((> a 15))
  (do-some-statement))

Compiled:

for ( int a = 0, int b = 0; ! (a > 15); a = a + 1, b = b + 1 )
{
  do_some_statement();
}

WITH-SHARED-MEMORY statement

WITH-SHARED-MEMORY ({(var type size*)}*) statement*

with-shared-memory declares new variable bindings in shared memory by adding __shared__ variable specifiers. Array variables can be declared by providing dimensions. A series of statements is executed with these bindings.

Example:

(with-shared-memory ((a int 16)
                     (b float 16 16))
  (return))

Compiled:

{
  __shared__ int a[16];
  __shared__ float b[16][16];
  return;
}

SET statement

SET reference expression

set provides simple variable assignment. It accepts a variable, structure or array reference as reference.

Example:

(set x 1.0)
(set (float4-x y) 1.0)
(set (aref z 0) 1.0)

Compiled:

x = 1.0;
y.x = 1.0;
z[0] = 1.0;

PROGN statement

PROGN statement*

progn evaluates statements, in the order in which they are given.

Example:

(progn
  (do-some-statements)
  (do-more-statements))

Compiled:

do_some_statements();
do_more_statements();

RETURN statement

RETURN [return-form]

return returns control, with return-form if supplied, from a kernel function.

Example:

(return 0)

Compiled:

return 0;

Architecture

The following figure illustrates cl-cuda's overall architecture.

                   +---------------------------------+-----------+-----------+
                   | defkernel                       | memory    | context   |
       cl-cuda.api +---------------------------------+           |           |
                   | kernel-manager                  |           |           |
                   +---------------------------------+-----------+-----------+
                   +----------------------------+----------------------------+
      cl-cuda.lang | Kernel description lang.   | the Compiler               |
                   +----------------------------+----------------------------+
                   +---------------------------------------------------------+
cl-cuda.driver-api | driver-api                                              |
                   +---------------------------------------------------------+
                   +---------------------------------------------------------+
              CUDA | CUDA driver API                                         |
                   +---------------------------------------------------------+

Cl-cuda consists of three subpackages: api, lang and driver-api.

The driver-api subpackage is an FFI binding to the CUDA driver API. The api subpackage invokes the CUDA driver API internally via this binding.

The lang subpackage provides the kernel description language: its syntax, types, built-in functions, and the compiler to CUDA C. The api subpackage calls this compiler.

The api subpackage provides the API for cl-cuda users. It further consists of the context, memory, kernel-manager and defkernel subpackages. The context subpackage is responsible for initializing CUDA and managing CUDA contexts. The memory subpackage offers memory management, providing a high-level API for the memory block data structure and a low-level API for handling host memory and device memory directly. The kernel-manager subpackage autonomously manages the entire process from compiling the kernel description language to loading and unloading the obtained kernel module. Since it is wrapped by the defkernel subpackage, which provides the interface for defining kernel functions, cl-cuda's users usually do not need to use it themselves.

Kernel manager

The kernel manager is the module that manages defining kernel functions, compiling them into a CUDA kernel module, loading the module and unloading it. Its behavior is described here as a finite state machine.

To begin with, the kernel manager has four states.

I   initial state
II  compiled state
III module-loaded state
IV  function-loaded state

The initial state is its entry point. The compiled state is a state where kernel functions defined with the kernel description language have been compiled into a CUDA kernel module (.ptx file). In the module-loaded state, the obtained kernel module has been loaded. In the function-loaded state, each kernel function in the kernel module has been loaded.

The following illustrates the kernel manager's state transitions.

      compile-module        load-module            load-function
    =================>    =================>     =================>
  I                    II                    III                    IV
    <=================    <=================
      define-function     <========================================
      define-macro          unload
      define-symbol-macro
      define-global

kernel-manager-compile-module function compiles defined kernel functions into a CUDA kernel module. kernel-manager-load-module function loads the obtained kernel module. kernel-manager-load-function function loads each kernel function in the kernel module.

In the module-loaded and function-loaded states, the kernel-manager-unload function unloads the kernel module and turns the kernel manager's state back to the compiled state. The kernel-manager-define-function, kernel-manager-define-macro, kernel-manager-define-symbol-macro and kernel-manager-define-global functions, which are wrapped by the defkernel, defkernelmacro, defkernel-symbol-macro and defglobal macros respectively, change its state back to the initial state and make it require compilation again.

The kernel manager is stored in the *kernel-manager* special variable when cl-cuda is loaded and stays alive for the duration of the Common Lisp process. Usually, you do not need to manage it explicitly.
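
For illustration only, the transitions above could be driven by hand with the functions just described; the exact argument lists shown here are assumptions, and normally the kernel manager performs all of this automatically:

(kernel-manager-compile-module *kernel-manager*)   ; I   -> II
(kernel-manager-load-module *kernel-manager*)      ; II  -> III
(kernel-manager-unload *kernel-manager*)           ; back to II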

How cl-cuda works when CUDA SDK is not installed

This section is for cl-cuda users who develop an application or a library that has an alternative subsystem besides cl-cuda and may run on environments where the CUDA SDK is not installed.

Compile and load time: Cl-cuda compiles and loads without signaling any conditions on environments where the CUDA SDK is not installed. Since cl-cuda's API symbols are interned, user programs can use them normally.

Run time: When cl-cuda's API is called, an error telling that the CUDA SDK was not found is signaled. With the *sdk-not-found* special variable, user programs can determine whether cl-cuda found the CUDA SDK.

Cl-cuda determines whether the CUDA SDK is installed by whether it has successfully loaded the libcuda dynamic library with the cffi:use-foreign-library function.

Streams

The low-level interface works with multiple streams. With asynchronous operations it is possible to overlap copying and computation using two streams. Cl-cuda provides the *cuda-stream* special variable; the stream bound to it is used in kernel function calls.

The following is for working with streams in mgl-mat:

(defmacro with-cuda-stream ((stream) &body body)
  (alexandria:with-gensyms (stream-pointer)
    `(cffi:with-foreign-objects
         ((,stream-pointer 'cl-cuda.driver-api:cu-stream))
       (cl-cuda.driver-api:cu-stream-create ,stream-pointer 0)
       (let ((,stream (cffi:mem-ref ,stream-pointer
                                    'cl-cuda.driver-api:cu-stream)))
         (unwind-protect
              (locally ,@body)
           (cl-cuda.driver-api:cu-stream-destroy ,stream))))))

Then, call a kernel function with a stream bound to *cuda-stream*:

(with-cuda-stream (*cuda-stream*)
  (call-kernel-function))

Author

Copyright

Copyright (c) 2012 Masayuki Takagi ([email protected])

License

Licensed under the MIT License.

cl-cuda's People

Contributors

drsplinter, fare, ghollisjr, gos-k, melisgl, takagi


cl-cuda's Issues

Failing vector-add test (Linux amd64 CUDA 5)

I'm failing this test on FC17

uname -r
3.9.10-100.fc17.x86_64
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
sbcl --version
SBCL 1.0.57-1.fc17

The error I receive is: CUDA_ERROR_LAUNCH_FAILED, which is, afaik, a generic error if "something" went wrong.

WARNING: This may not be a bug, in fact, this may be a misconfiguration on my side, however, I'd appreciate if you could tell me what else to check.

This is the output from the test:

VECTOR-ADD> (main)
CU-INIT succeeded.
CU-DEVICE-GET succeeded.
CU-CTX-CREATE succeeded.
CU-MEM-ALLOC succeeded.
CU-MEM-ALLOC succeeded.
CU-MEM-ALLOC succeeded.
CU-MEMCPY-HOST-TO-DEVICE succeeded.
CU-MEMCPY-HOST-TO-DEVICE succeeded.
nvcc -arch=sm_11 -I /home/wvxvw/quicklisp/local-projects/cl-cuda/include -ptx -o /tmp/cl-cuda-sBBXlw.ptx /tmp/cl-cuda-sBBXlw.cu
CU-MODULE-LOAD succeeded.
CU-MODULE-GET-FUNCTION succeeded.
CU-LAUNCH-KERNEL succeeded.
; Evaluation aborted on #<SIMPLE-ERROR "~A failed with driver API error No. ~A.~%~A" {1003FEF573}>.

Misleading information when loading libcuda fails.

Fact

  • When cl-cuda is being loaded, loading completes even if loading libcuda fails.
  • After that, when foreign functions in libcuda are called, CL-CUDA.DRIVER-API::SDK-NOT-FOUND-ERROR is raised to show "CUDA SDK not found."

Problem

  • We cannot find out whether the reason for the raised condition is that the CUDA SDK is not found, or that it is found but loading it has some problem.

See also
#42

Infer global's type from its initial value.

Infer global's type from its initial value, removing type argument from DEFGLOBAL macro.

before

(defglobal x int 1)

after

(defglobal x 1)    ; type of x is inferred as int.

add double float support

Add double float support:

  • double
  • double3
  • double4

See melisgl/cl-cuda@ea8cf60e9c74e878973d85338f1ab727b76b68b3 and melisgl/cl-cuda@67f96a0b530808e70af7c495f7735d5ad9b29034 in #4.

Latter says about -arch=sm13 NVCC options needed for double floats.

passing structure type references to ALLOC-GL-ARRAY's TYPE argument

In the definition of cl-cuda-interop:alloc-memory-block, since the alloc-gl-array function's type argument accepts only symbols, structure type references must be passed to type in bare style, which is actually deprecated in CFFI. For example, foo must be passed instead of (:struct foo).

NG: (alloc-gl-array '(:struct foo) count)
OK: (alloc-gl-array 'foo count)

As a workaround for this problem, I define a bare-cffi-type function which converts structure type references from the form (:struct foo) to foo, and pass its return value to the alloc-gl-array function.

(alloc-gl-array (bare-cffi-type type) count)

This problem is already reported on cl-opengl's issue tracker #41.

OpenGL interoperability's performance loss after cleaning up

After cleaning up, the performance of the N-body example with OpenGL interoperability seems worse than before cleaning up. If my memory is correct, OpenGL interoperability used to give a small performance gain.

  • before cleaning up, about 5 percent performance gain with OpenGL interoperability
  • after cleaning up, about 40 percent performance loss with OpenGL interoperability

It is a disappointing result if OpenGL interoperability causes a performance loss.

Add initializer syntax for CUDA vector types.

Add initializer syntax for CUDA vector types as compiled:

__device__ float4 foo = { 0.0, 1.0, 2.0, 3.0 };

It's because __device__, __constant__ and __shared__ variables are not allowed for dynamic initialization so the following is invalid:

__device__ float4 foo = make_float4( 0.0, 1.0, 2.0, 3.0 );

to be accepted in Quicklisp distribution

Currently, cl-cuda is not available in Quicklisp distribution because of its testing policy (see #514 in quicklisp-projects).

It may be accepted if it just finishes compiling without signaling conditions on an environment where the CUDA SDK is not installed, even though it does not work there.

Approach:

  1. try to compile grovel files which include cuda.h before evaluating the defsystem form in cl-cuda.asd
  2. a condition would be signaled since the CUDA SDK is not installed
  3. handle the condition and push a flag to *features* which indicates that the CUDA SDK is not installed
  4. in the defsystem form, avoid evaluating the cffi-grovel:grovel-file form by checking *features*

Questions:

  • will warnings be signaled that some symbols are not found if cffi-grovel:grovel-file is avoided?

Support cuModuleGetGlobal driver API.

Support the cuModuleGetGlobal driver API. It is useful when using parameters which are dynamically determined in a program but do not change across kernel function launches.

  • Support cuModuleGetGlobal driver API.
  • Introduce a cl-cuda API to define CUDA C global.

warnings caused by an unused argument

The unused argument type in definition of defkernelconst macro causes a warning.

See melisgl/cl-cuda@97ea6cf7bdfc7450c033152b7d6b3d555bb5efd2 in issue #4 .

warnings caused by forward references

The variables below are forward-referenced and cause warnings:

  • +built-in-functions+
  • +built-in-macros+
  • kernel-manager

See melisgl/cl-cuda@97ea6cf7bdfc7450c033152b7d6b3d555bb5efd2 in issue #4 .

use _v2 of CUDA functions when available

Use _v2 of CUDA functions when available:

  • cuCtxCreate_v2
  • cuCtxDestroy_v2
  • cuMemAlloc_v2
  • cuMemFree_v2
  • cuMemcpyHtoD_v2
  • cuMemcpyDtoH_v2
  • cuEventDestroy_v2

Question:

  • are there any other functions having _v2?

See melisgl/cl-cuda@db464369fa42f7090fa6ec6b3ee216d0279ee320 in #4

Simple Example not working

After trying the code on the main page...

(defun main ()
  (let* ((dev-id 0)
         (n 1024)
         (threads-per-block 256)
         (blocks-per-grid (/ n threads-per-block)))
    (with-cuda (dev-id)
      (with-memory-blocks ((a 'float n)
                           (b 'float n)
                           (c 'float n))
        (random-init a n)
        (random-init b n)
        (sync-memory-block a :host-to-device)
        (sync-memory-block b :host-to-device)
        (vec-add-kernel a b c n
                        :grid-dim (list blocks-per-grid 1 1)
                        :block-dim (list threads-per-block 1 1))
        (sync-memory-block c :device-to-host)
        (verify-result a b c n)))))

I got this error

nvcc exits with code: 127
/usr/bin/env: nvcc: No such file or directory
[Condition of type SIMPLE-ERROR]

Cannot find CUDA SDK

Hello,
Running latest CCL with Version 1.11-r16635 on OS X 10.10.5
Could load and compile cl-cuda without problem.
I have a hard time referencing my version of CUDA which is NVIDIA-CUDA-7.5
When I run any cuda example, I get an error message:
e.g. (cl-cuda-examples.diffuse0:main)

Error: CUDA SDK not found.
While executing: CL-CUDA.DRIVER-API:CU-INIT, in process Listener(4).
How do I configure cl-cuda to reference the right framework ?
Should I recompile ?

Maybe a silly question - My first time using this library.

grovel size_t type

Grovel size_t type which is environment-dependent.

Question:

  • where to place a grovel specification file?

See melisgl/cl-cuda@d6e6dd94a5ca7a8243f23f7eddecbbd56aa51ceb in #4

support curand XORWOW

Support curand XORWOW:

  • curand_init
  • curand_uniform
  • curandStateXORWOW_t

Depends on #15, #19, #21 and #22.

See melisgl/cl-cuda@85c27a967e00edf6ef57ddebfacf2d4f30d76682 in #4.

support pointers and integers when launching kernels

Only MEMORY-BLOCKs were supported previously, which is fine as long as one uses ALLOC-MEMORY-BLOCK. With this change, CU-DEVICE-PTRs obtained directly from CU-MEM-ALLOC can be used.

See melisgl/cl-cuda@67f96a0b530808e70af7c495f7735d5ad9b29034 in #4.

improve compiling cl-cuda type to CUDA C type

Improve the way to compile cl-cuda type to CUDA C type.

  • int -> "int" : OK
  • curand-state-xorwow -> "curandStateXORWOW" : NG
  • curand-state-xorwow -> "curandStateXORWOW_t" : OK

Currently, a cl-cuda type is simply translated to a string.

See melisgl/cl-cuda@85c27a967e00edf6ef57ddebfacf2d4f30d76682 in #4.

Add selector macros for CUDA vector types' CL counterparts.

Add selector macros for CUDA vector types' CL counterparts: float3, float4, double3 and double4.

(defmacro with-float4 ((x y z w) value &body body)
  (once-only (value)
    `(let ((,x (float4-x ,value))
           (,y (float4-y ,value))
           (,z (float4-z ,value))
           (,w (float4-w ,value)))
       (declare (ignorable ,x ,y ,z ,w))
       ,@body)))

no class named CFFI-GROVEL::PROCESS-OP

Trying to load cl-cuda in sbcl, I get this error:

* (ql:quickload :cl-cuda)

debugger invoked on a LOAD-SYSTEM-DEFINITION-ERROR in thread #<THREAD "main thread" RUNNING {1002A8B383}>: Error while trying to load definition for system cl-cuda from pathname /home/dev/quicklisp/local-projects/cl-cuda/cl-cuda.asd: There is no class named CFFI-GROVEL::PROCESS-OP.

What am I doing wrong?

grovel CUdeviceptr type

Grovel CUdeviceptr type from cuda_kernel.h.

Question:

  • also grovel other CUDA driver API types?
  • also grovel other CUDA driver API functions, structures and enumerations?
  • where to place a grovel specification file?

See melisgl/cl-cuda@d6e6dd94a5ca7a8243f23f7eddecbbd56aa51ceb in #4

Appropriately compile single and double precision float values.

Compile single float values to be explicitly typed to avoid being compiled as double float values.

before

0.0

after

0.0f

Additionally, fix double float values, which are now compiled as (double)0.0, to 0.0, a double float literal.

before

(double)0.0

after

0.0

support unsigned long long type

Support unsigned long long type which is used in curand library.

See melisgl/cl-cuda@85c27a967e00edf6ef57ddebfacf2d4f30d76682 in #4.

can't define a __device__ kernel function that returns void type

Currently, a function specifier is determined by the return type: __global__ for void type and __device__ for non-void types.

For example,

(defkernel foo (void ())
  (return))

is compiled into:

__global__ void foo () {
  return;
}

Because of this rule, a __device__ kernel function that returns void type can't be defined.

To solve this problem, following syntaxes may be given:

(defdevicekernel foo (void ()) ...
(defkernel (foo :device) (void ()) ...
(defkernel foo :device (void ()) ...
(defkernel foo ((void :device) ()) ...
(defkernel foo (void :device ()) ...

I am thinking of choosing the second one. Function specifiers can be omitted, and the current rule is applied in that case.

:global is specified:

(defkernel (foo :global) (void ())
  (return))
;; compiled into: __global__ void foo () { ... }

:device is specified:

(defkernel (bar :device) (void ())
  (return))
;; compiled into: __device__ void bar () { ... }

__global__ is complemented because return type is void:

(defkernel foofoo (void ())
  (return))
;; compiled into: __global__ void foofoo () { ... }

__device__ is complemented because return type is int:

(defkernel baz (int ())
  (return 1))
;; compiled into: __device__ int baz () { ... }

PROGN statements and brace blocks "{ ... }" in CUDA C

Currently, the compiler emits brace blocks { ... } when compiling the following statements:

  • IF
  • LET
  • SYMBOL-MACROLET
  • DO
  • WITH-SHARED-MEMORY

On the other hand, it does not emit brace blocks when compiling PROGN statements.

Should PROGN statements correspond to brace blocks in CUDA C?

If yes, what should LET statements be compiled into?

{
  int x = 0;
  return x;
}

or

{
  int x = 0;
  {
    return x;
  }
}

I would like to adopt the former compiled code.
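
For reference, the kernel-language form in question is a LET like the following (a sketch; the surrounding defkernel is omitted), and the question is whether its body gets a nested brace block of its own in the compiled CUDA C:

```lisp
;; Sketch: LET form whose compiled brace structure is under discussion.
(let ((x 0))
  (return x))
```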

warnings caused by specifying cffi structure type

Specifying a cffi structure type without the :struct keyword causes warnings. For example, the float3 structure type should be specified as '(:struct float3), not 'float3, to avoid warnings.
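
In plain CFFI terms the difference looks like this (a sketch assuming float3 is defined with defcstruct; cl-cuda's actual definition may differ):

```lisp
;; Sketch: assumes this defcstruct; cl-cuda's real definition may differ.
(cffi:defcstruct float3
  (x :float)
  (y :float)
  (z :float))

(cffi:foreign-type-size 'float3)           ; bare struct name: signals a deprecation warning
(cffi:foreign-type-size '(:struct float3)) ; :struct keyword: no warning
```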

See melisgl/cl-cuda@97ea6cf7bdfc7450c033152b7d6b3d555bb5efd2 in issue #4.

Any way to run on Windows?

Setting up cl-cuda seems to hook into gcc to create the FFI. GCC is well and good thanks to MSYS2/MinGW64, but apparently the CUDA toolkit and MinGW don't play nice together. Is there any way to set up cl-cuda to use the Windows CUDA toolchain?

add some math functions

Add math functions:

  • exp
  • log
  • __double2int_rn

Depends on #15.

See melisgl/cl-cuda@1713af4a7a6d8cdbb3048d8f4f21ac99f6010d21 in #4.
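
Once added, these should be usable inside kernels roughly like this (a sketch; the kernel itself and the exact cl-cuda spellings of the functions are assumptions):

```lisp
;; Hypothetical sketch: the kernel and function spellings are illustrative.
(defkernel log-exp (void ((xs double*) (n int)))
  (let ((i (+ (* block-dim-x block-idx-x) thread-idx-x)))
    (if (< i n)
        (set (aref xs i) (log (exp (aref xs i))))))
  (return))
```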

don't fail if cuda library cannot be loaded

Don't fail if cuda library cannot be loaded.

Question:

  • Why ignore errors if the cuda library cannot be loaded? Without the cuda library, cl-cuda makes no sense.

See melisgl/cl-cuda@ca0bde3fe89db1192f89bf2a702990900e996c61 in #4.

Can't compile on Ubuntu 14.04 / CUDA 6.5

Hi there, I get the following error when trying to quickload cl-cuda. The error message at the end is in German; it says "fatal error: cuda.h: File or directory not found":

  • (ql:quickload :cl-cuda)
    To load "cl-cuda":
    Load 1 ASDF system:
    cl-cuda
    ; Loading "cl-cuda"
    ........; cc -m64 -I/home/mwoehrle/quicklisp/dists/quicklisp/software/cffi_0.14.0/ -o /home/mwoehrle/.cache/common-lisp/sbcl-1.1.14.debian-linux-x64/home/mwoehrle/quicklisp/local-projects/local-projects/cl-cuda/src/driver-api/type-grovel /home/mwoehrle/.cache/common-lisp/sbcl-1.1.14.debian-linux-x64/home/mwoehrle/quicklisp/local-projects/local-projects/cl-cuda/src/driver-api/type-grovel.c

debugger invoked on a CFFI-GROVEL:GROVEL-ERROR in thread #<THREAD "main thread" RUNNING {1002A8AF53}>: External process exited with code 1.
Command was: "cc" "-m64" "-I/home/mwoehrle/quicklisp/dists/quicklisp/software/cffi_0.14.0/" "-o" "/home/mwoehrle/.cache/common-lisp/sbcl-1.1.14.debian-linux-x64/home/mwoehrle/quicklisp/local-projects/local-projects/cl-cuda/src/driver-api/type-grovel" "/home/mwoehrle/.cache/common-lisp/sbcl-1.1.14.debian-linux-x64/home/mwoehrle/quicklisp/local-projects/local-projects/cl-cuda/src/driver-api/type-grovel.c"
Output was:
/home/mwoehrle/.cache/common-lisp/sbcl-1.1.14.debian-linux-x64/home/mwoehrle/quicklisp/local-projects/local-projects/cl-cuda/src/driver-api/type-grovel.c:6:18: fatal error: cuda.h: Datei oder Verzeichnis nicht gefunden
#include <cuda.h>
^
compilation terminated.
