
blaze_cuda's Introduction

Blaze CUDA · WIP

CUDA extension for Blaze.

Introduction

The library adds CUDA capability to Blaze by providing CUDA-backed vector, matrix, and tensor types.

Build requirements

The only requirement is to use clang in CUDA mode instead of nvcc. Despite being advertised as "C++14-compatible", nvcc fails to compile Blaze, whereas clang in CUDA mode succeeds. Additionally, clang produces cleaner error messages and offers a more standard command-line interface, which makes scripting and dependency management in makefiles easier.

The example folder provides a simple Makefile that can be used as a reference for projects that use Blaze CUDA.
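
For reference, a minimal sketch of such a build, assuming clang with CUDA support, a CUDA installation under /usr/local/cuda, and an sm_70 GPU (adjust flags, paths, and architecture for your setup):

    // example.cu -- plain Blaze code, compiled by clang in CUDA mode.
    // Sketch of the compile command (paths and GPU architecture are assumptions):
    //   clang++ -x cuda --cuda-gpu-arch=sm_70 -std=c++14 -O3 example.cu -o example \
    //           -L/usr/local/cuda/lib64 -lcudart
    #include <blaze/Blaze.h>
    #include <iostream>

    int main()
    {
       blaze::DynamicVector<double> a{ 1.0, 2.0, 3.0 };
       blaze::DynamicVector<double> b{ 4.0, 5.0, 6.0 };
       std::cout << ( a + b ) << '\n';   // Blaze expression templates work as usual
    }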

Installation

sudo make install

An uninstall target is available as well.

Features

  • Dense Vectors
  • Dense Matrices (no CustomMatrix yet)
  • Element-wise operations for dense matrices & vectors
  • [WIP] Partial cuBLAS implementation for more complex operations

Blaze Tensor will be supported in the future.
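
A minimal usage sketch of the CUDA containers; the <blaze_cuda/Blaze.h> header name and the ability to construct and assign CUDA containers from their host counterparts are assumptions, so check the example folder for the actual API:

    #include <blaze/Blaze.h>
    #include <blaze_cuda/Blaze.h>   // assumed convenience header

    int main()
    {
       // Host-side data
       blaze::DynamicVector<float> ha( 1024UL, 1.0f );
       blaze::DynamicVector<float> hb( 1024UL, 2.0f );

       // Device-side vectors; construction from host containers is assumed here
       blaze::CUDADynamicVector<float> a( ha );
       blaze::CUDADynamicVector<float> b( hb );

       // Element-wise expression evaluated on the GPU
       blaze::CUDADynamicVector<float> c = a + 3.0f * b;

       // Copy the result back to the host (cross-container assignment assumed)
       blaze::DynamicVector<float> hc = c;

       return hc.size() == 1024UL ? 0 : 1;
    }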

blaze_cuda's People

Contributors

jpenuchot


blaze_cuda's Issues

Partial evaluation for Matrix/Matrix multiplication: potential redesign of Blaze CUDA

Matrix/matrix multiplication is a computation that requires an evaluation, since it relies on BLAS kernels. For that reason, assign() is overloaded with special functions whenever an evaluation is required, but the overload that prevails remains the one for CUDADynamicMatrix.

This is a blocking feature for Blaze CUDA, so it has my full attention at the moment.

The issue here is that the workflow of smpAssign() is different from the one I expected. I might have to change the whole approach for Blaze CUDA; I've been thinking about introducing a separate cudaAssign() function. Klaus Iglberger suggested that I could overload DMatDMatMultExpr, but that would only solve the problem for that specific computation, and I'd like it to be solved properly for all computations.

The problem, however, is that cudaAssign() would be external to the expressions and would not have access to their private type traits, so I might need an additional type traits system if I take that direction. I'll give it a shot and see how it goes.
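
For illustration, a self-contained toy of the dispatch idea behind a separate cudaAssign() entry point; the types below are stand-ins, not Blaze's actual class hierarchy:

    #include <cstdio>

    struct CUDADynMat {};       // stand-in for CUDADynamicMatrix
    struct MatMatAddExpr {};    // element-wise expression: no evaluation needed
    struct MatMatMultExpr {};   // product expression: needs a (cu)BLAS evaluation

    // Generic fallback: element-wise expressions go through a CUDA kernel.
    template< typename Expr >
    void cudaAssign( CUDADynMat&, const Expr& )
    {
       std::puts( "element-wise CUDA kernel" );
    }

    // Dedicated overload: products are evaluated through cuBLAS. Because dispatch
    // happens on cudaAssign()'s own overload set, this overload wins regardless of
    // how the CPU-side assign() overloads are ranked.
    void cudaAssign( CUDADynMat&, const MatMatMultExpr& )
    {
       std::puts( "cuBLAS gemm" );
    }

    int main()
    {
       CUDADynMat c;
       cudaAssign( c, MatMatAddExpr{} );    // -> element-wise CUDA kernel
       cudaAssign( c, MatMatMultExpr{} );   // -> cuBLAS gemm
    }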

Add benchmarks

Performance is a core feature of Blaze, so it has to be one for Blaze CUDA as well. Benchmarking will therefore be necessary at some point to make sure we reach that goal.

Add type traits

Add type traits for better integration with the original Blaze. This would allow us to make sure the right assign functions are called for the given operand types.

This part will require a lot of attention: type traits are rather easy to implement, but they are structural elements of the library.
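
A hypothetical sketch of what such a trait and its use in assign dispatch could look like; the trait name and the commented-out specialization are assumptions, not the library's actual code:

    #include <type_traits>

    // Primary template: by default a type is not a CUDA-enabled container.
    template< typename T >
    struct IsCUDAEnabled : std::false_type {};

    // Hypothetical specialization for a CUDA container type, e.g.:
    // template< typename Type, bool TF >
    // struct IsCUDAEnabled< blaze::CUDADynamicVector<Type,TF> > : std::true_type {};

    // Assign dispatch guarded by the trait: CUDA containers take the CUDA path,
    // everything else falls back to the CPU implementation.
    template< typename VT1, typename VT2 >
    auto assignDispatch( VT1& lhs, const VT2& rhs )
       -> std::enable_if_t< IsCUDAEnabled<VT1>::value >
    {
       // ... launch a CUDA assignment kernel ...
    }

    template< typename VT1, typename VT2 >
    auto assignDispatch( VT1& lhs, const VT2& rhs )
       -> std::enable_if_t< !IsCUDAEnabled<VT1>::value >
    {
       // ... fall back to the CPU implementation ...
    }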

CUDA runtime error management

An error management macro is being worked on; most of the work will consist of making sure all CUDA runtime errors are handled.
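
For context, a common shape for such a macro; the name is hypothetical and this is a sketch of the usual pattern, not the library's actual implementation:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap every CUDA runtime call and abort with a readable message on failure.
    #define BLAZE_CUDA_CHECK( call )                                          \
       do {                                                                   \
          cudaError_t err_ = ( call );                                        \
          if( err_ != cudaSuccess ) {                                         \
             std::fprintf( stderr, "CUDA error '%s' at %s:%d\n",              \
                           cudaGetErrorString( err_ ), __FILE__, __LINE__ );  \
             std::abort();                                                    \
          }                                                                   \
       } while( 0 )

    // Usage:
    // BLAZE_CUDA_CHECK( cudaMemcpy( dst, src, bytes, cudaMemcpyHostToDevice ) );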

Add tests

We definitely need them. These should be easy to get from the original Blaze.

Add documentation

This might be done once the code gets a nice refactoring. The structure might be subject to change, so this issue will be on hold for now.

blaze::CUDAReduce - Inaccurate results for large CUDADynamicVector

blaze::CUDAReduce doesn't work for large sizes. I've been unable to find the source of the bug for days now and I'm running out of ideas.

Above a certain size threshold, the CUDA reduce kernel (the __global__ function) starts producing inaccurate values. I've been trying to pinpoint the issue and to add synchronization directives, but nothing seems to help.
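
For reference only, the classic shared-memory tree reduction that such kernels are usually built on; this is a generic sketch, not blaze_cuda's actual kernel, and it assumes a power-of-two block size plus a second pass (or atomics) over the per-block partial sums:

    #include <cstddef>

    // Each block reduces blockDim.x elements into one partial sum. Launch with
    // blockDim.x * sizeof(float) bytes of dynamic shared memory.
    __global__ void block_reduce_sum( const float* in, float* partial, std::size_t n )
    {
       extern __shared__ float sdata[];
       unsigned const    tid = threadIdx.x;
       std::size_t const i   = std::size_t( blockIdx.x ) * blockDim.x + threadIdx.x;

       sdata[tid] = ( i < n ) ? in[i] : 0.0f;
       __syncthreads();

       // Tree reduction in shared memory; every step needs a barrier, otherwise
       // results become wrong once more than one warp per block is involved.
       for( unsigned s = blockDim.x / 2u; s > 0u; s >>= 1u ) {
          if( tid < s ) sdata[tid] += sdata[tid + s];
          __syncthreads();
       }

       if( tid == 0u ) partial[blockIdx.x] = sdata[0];
    }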

Partial evaluation: Adapt more expressions

Most of the work is done now; cudaAssign() needs overloads for every expression to support partial evaluation properly, following the same implementation pattern as in DMatDMatAddExpr.h:

  • External to the original expression templates
  • Implement the same functionalities as their CPU counterparts
  • Follow the same enable condition as their CPU counterparts
  • Call cudaAssign() instead of assign()

cuBLAS will be used as much as possible to implement them.
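
As an illustration of that pattern, a sketch of the shape such an overload could take for a dense matrix addition; the enable condition is omitted and the cudaAddAssign() helper is an assumption, so treat this as a rough outline rather than actual blaze_cuda code:

    #include <blaze/Blaze.h>

    // External to the expression template, mirroring the CPU assign() kernel:
    // lhs = A, then lhs += B, both evaluated on the device. The enable condition
    // of the corresponding CPU overload would be replicated here.
    template< typename MT, bool SO, typename MT1, typename MT2 >
    inline void cudaAssign( blaze::DenseMatrix<MT,SO>& lhs,
                            const blaze::DMatDMatAddExpr<MT1,MT2,SO>& rhs )
    {
       cudaAssign   ( lhs, rhs.leftOperand()  );   // lhs  = A
       cudaAddAssign( lhs, rhs.rightOperand() );   // lhs += B (assumed helper)
    }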
