
blaze_cuda's Introduction

Blaze CUDA · WIP

CUDA extension for Blaze.

Introduction

The library adds CUDA capability to Blaze by providing CUDA-backed vector, matrix, and tensor types.

Build requirements

The only requirement is to use clang in CUDA mode instead of nvcc. Despite being advertised as "C++14-compatible", nvcc fails to compile Blaze, whereas clang in CUDA mode succeeds. Additionally, clang produces cleaner error messages and offers a more standard command-line interface, which makes scripting and dependency management in makefiles easier.

The example folder provides a simple Makefile that can be used as a reference for projects that use Blaze CUDA.
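
For reference, a minimal sketch of such a build, assuming clang with CUDA support, a CUDA installation under /usr/local/cuda, and an sm_70 GPU (adjust flags, paths, and architecture for your setup):

    // example.cu -- plain Blaze code, compiled by clang in CUDA mode.
    // Sketch of the compile command (paths and GPU architecture are assumptions):
    //   clang++ -x cuda --cuda-gpu-arch=sm_70 -std=c++14 -O3 example.cu -o example \
    //           -L/usr/local/cuda/lib64 -lcudart
    #include <blaze/Blaze.h>
    #include <iostream>

    int main()
    {
       blaze::DynamicVector<double> a{ 1.0, 2.0, 3.0 };
       blaze::DynamicVector<double> b{ 4.0, 5.0, 6.0 };
       std::cout << ( a + b ) << '\n';   // Blaze expression templates work as usual
    }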

Installation

sudo make install

An uninstall target is available as well.

Features

  • Dense Vectors
  • Dense Matrices (no CustomMatrix yet)
  • Element-wise operations for dense matrices & vectors
  • [WIP] Partial cuBLAS implementation for more complex operations

Blaze Tensor will be supported in the future.
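
A minimal usage sketch of the CUDA containers; the <blaze_cuda/Blaze.h> header name and the ability to construct and assign CUDA containers from their host counterparts are assumptions, so check the example folder for the actual API:

    #include <blaze/Blaze.h>
    #include <blaze_cuda/Blaze.h>   // assumed convenience header

    int main()
    {
       // Host-side data
       blaze::DynamicVector<float> ha( 1024UL, 1.0f );
       blaze::DynamicVector<float> hb( 1024UL, 2.0f );

       // Device-side vectors; construction from host containers is assumed here
       blaze::CUDADynamicVector<float> a( ha );
       blaze::CUDADynamicVector<float> b( hb );

       // Element-wise expression evaluated on the GPU
       blaze::CUDADynamicVector<float> c = a + 3.0f * b;

       // Copy the result back to the host (cross-container assignment assumed)
       blaze::DynamicVector<float> hc = c;

       return hc.size() == 1024UL ? 0 : 1;
    }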

blaze_cuda's People

Contributors

jpenuchot


blaze_cuda's Issues

Partial evaluation for Matrix/Matrix multiplication: potential redesign of Blaze CUDA

Matrix/matrix multiplication is a computation that requires an evaluation, since it relies on BLAS kernels. For that reason, assign() is overloaded with special functions whenever an evaluation is required, but the overload that prevails remains the one for CUDADynamicMatrix.

This is a blocking feature for Blaze CUDA, so it has my full attention at the moment.

The issue here is that the workflow of smpAssign() is different from the one I expected. I might have to change the whole approach for Blaze CUDA; I've been thinking about introducing a separate cudaAssign() function. Klaus Iglberger suggested that I could overload DMatDMatMultExpr, but that would only solve the problem for that specific computation, and I'd like it to be solved properly for all computations.

The problem, however, is that cudaAssign() would be external to the expressions and would not have access to their private type traits, so I might need an additional type traits system if I take that direction. I'll give it a shot and see how it goes.
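
For illustration, a self-contained toy of the dispatch idea behind a separate cudaAssign() entry point; the types below are stand-ins, not Blaze's actual class hierarchy:

    #include <cstdio>

    struct CUDADynMat {};       // stand-in for CUDADynamicMatrix
    struct MatMatAddExpr {};    // element-wise expression: no evaluation needed
    struct MatMatMultExpr {};   // product expression: needs a (cu)BLAS evaluation

    // Generic fallback: element-wise expressions go through a CUDA kernel.
    template< typename Expr >
    void cudaAssign( CUDADynMat&, const Expr& )
    {
       std::puts( "element-wise CUDA kernel" );
    }

    // Dedicated overload: products are evaluated through cuBLAS. Because dispatch
    // happens on cudaAssign()'s own overload set, this overload wins regardless of
    // how the CPU-side assign() overloads are ranked.
    void cudaAssign( CUDADynMat&, const MatMatMultExpr& )
    {
       std::puts( "cuBLAS gemm" );
    }

    int main()
    {
       CUDADynMat c;
       cudaAssign( c, MatMatAddExpr{} );    // -> element-wise CUDA kernel
       cudaAssign( c, MatMatMultExpr{} );   // -> cuBLAS gemm
    }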

Add benchmarks

Performance is a core feature of Blaze, so it has to be one for Blaze CUDA as well. Benchmarking will therefore be necessary at some point to make sure we reach that goal.

Add type traits

Add type traits for better integration with the original Blaze. This would allow us to make sure the right assign functions are called for the given operand types.

This part will require a lot of attention: type traits are rather easy to implement, but they are structural elements of the library.
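
A hypothetical sketch of what such a trait and its use in assign dispatch could look like; the trait name and the commented-out specialization are assumptions, not the library's actual code:

    #include <type_traits>

    // Primary template: by default a type is not a CUDA-enabled container.
    template< typename T >
    struct IsCUDAEnabled : std::false_type {};

    // Hypothetical specialization for a CUDA container type, e.g.:
    // template< typename Type, bool TF >
    // struct IsCUDAEnabled< blaze::CUDADynamicVector<Type,TF> > : std::true_type {};

    // Assign dispatch guarded by the trait: CUDA containers take the CUDA path,
    // everything else falls back to the CPU implementation.
    template< typename VT1, typename VT2 >
    auto assignDispatch( VT1& lhs, const VT2& rhs )
       -> std::enable_if_t< IsCUDAEnabled<VT1>::value >
    {
       // ... launch a CUDA assignment kernel ...
    }

    template< typename VT1, typename VT2 >
    auto assignDispatch( VT1& lhs, const VT2& rhs )
       -> std::enable_if_t< !IsCUDAEnabled<VT1>::value >
    {
       // ... fall back to the CPU implementation ...
    }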

CUDA runtime error management

An error management macro is being worked on; most of the work will consist of making sure all CUDA runtime errors are handled.
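
For context, a common shape for such a macro; the name is hypothetical and this is a sketch of the usual pattern, not the library's actual implementation:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap every CUDA runtime call and abort with a readable message on failure.
    #define BLAZE_CUDA_CHECK( call )                                          \
       do {                                                                   \
          cudaError_t err_ = ( call );                                        \
          if( err_ != cudaSuccess ) {                                         \
             std::fprintf( stderr, "CUDA error '%s' at %s:%d\n",              \
                           cudaGetErrorString( err_ ), __FILE__, __LINE__ );  \
             std::abort();                                                    \
          }                                                                   \
       } while( 0 )

    // Usage:
    // BLAZE_CUDA_CHECK( cudaMemcpy( dst, src, bytes, cudaMemcpyHostToDevice ) );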

Add tests

We definitely need them. These should be easy to get from the original Blaze.

Add documentation

This might be done once the code gets a nice refactoring. The structure might be subject to change, so this issue will be on hold for now.

blaze::CUDAReduce - Inaccurate results for large CUDADynamicVector

blaze::CUDAReduce doesn't work for large sizes. I've been unable to find the source of the bug for days now and I'm running out of ideas.

Above a certain size threshold, the CUDA reduce kernel (the __global__ function) starts producing inaccurate values. I've been trying to pinpoint the issue and to add synchronization directives, but nothing seems to help.
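
For reference only, the classic shared-memory tree reduction that such kernels are usually built on; this is a generic sketch, not blaze_cuda's actual kernel, and it assumes a power-of-two block size plus a second pass (or atomics) over the per-block partial sums:

    #include <cstddef>

    // Each block reduces blockDim.x elements into one partial sum. Launch with
    // blockDim.x * sizeof(float) bytes of dynamic shared memory.
    __global__ void block_reduce_sum( const float* in, float* partial, std::size_t n )
    {
       extern __shared__ float sdata[];
       unsigned const    tid = threadIdx.x;
       std::size_t const i   = std::size_t( blockIdx.x ) * blockDim.x + threadIdx.x;

       sdata[tid] = ( i < n ) ? in[i] : 0.0f;
       __syncthreads();

       // Tree reduction in shared memory; every step needs a barrier, otherwise
       // results become wrong once more than one warp per block is involved.
       for( unsigned s = blockDim.x / 2u; s > 0u; s >>= 1u ) {
          if( tid < s ) sdata[tid] += sdata[tid + s];
          __syncthreads();
       }

       if( tid == 0u ) partial[blockIdx.x] = sdata[0];
    }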

Partial evaluation: Adapt more expressions

Most of the work is done now; cudaAssign() needs overloads for every expression to support partial evaluation properly, following the same implementation pattern as in DMatDMatAddExpr.h:

  • External to the original expression templates
  • Implement the same functionalities as their CPU counterparts
  • Follow the same enable condition as their CPU counterparts
  • Call cudaAssign() instead of assign()

cuBLAS will be used as much as possible to implement them.
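
As an illustration of that pattern, a sketch of the shape such an overload could take for a dense matrix addition; the enable condition is omitted and the cudaAddAssign() helper is an assumption, so treat this as a rough outline rather than actual blaze_cuda code:

    #include <blaze/Blaze.h>

    // External to the expression template, mirroring the CPU assign() kernel:
    // lhs = A, then lhs += B, both evaluated on the device. The enable condition
    // of the corresponding CPU overload would be replicated here.
    template< typename MT, bool SO, typename MT1, typename MT2 >
    inline void cudaAssign( blaze::DenseMatrix<MT,SO>& lhs,
                            const blaze::DMatDMatAddExpr<MT1,MT2,SO>& rhs )
    {
       cudaAssign   ( lhs, rhs.leftOperand()  );   // lhs  = A
       cudaAddAssign( lhs, rhs.rightOperand() );   // lhs += B (assumed helper)
    }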
