CUDA-Benchmarking

Motivation

This project compares different flavours of matrix-matrix products, as used in feed-forward and backpropagation passes. The CUDA and C examples are compiled and then executed from Python via ctypes, since Python is my language of choice. The CUDA examples were borrowed from Nvidia's toolkit webpage and modified to suit my needs.
The following techniques are compared against each other:

  • NumPy (calls an optimized C library)
  • C sequential (my sequential C implementation)
  • CUDA shared-memory approach (smaller blocks are stored in fast on-chip shared memory and calculated one after the other)
  • CudaBlas (Nvidia's cuBLAS library, used by modern ML frameworks such as PyTorch and TensorFlow)
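The blocking idea behind the shared-memory approach can be illustrated on the CPU. The following is only a minimal NumPy sketch, not the CUDA kernel used in this project: `tiled_matmul` and `TILE` are hypothetical names, and the tile edge of 16 is assumed to mirror the 16*16 thread block mentioned under Results. Each output tile is accumulated from pairs of small input tiles, which is the part a CUDA kernel would stage in shared memory.

```python
import numpy as np

TILE = 16  # assumed tile edge, mirroring a 16*16 thread block


def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Multiply a (m x k) by b (k x n) one tile at a time (CPU illustration)."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):          # tile rows of the output
        for j in range(0, n, tile):      # tile columns of the output
            # Accumulator for one output tile (smaller at the matrix edges).
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=np.float32)
            for p in range(0, k, tile):  # walk along the shared dimension
                a_tile = a[i:i + tile, p:p + tile]
                b_tile = b[p:p + tile, j:j + tile]
                acc += a_tile @ b_tile   # partial product of two small tiles
            c[i:i + tile, j:j + tile] = acc
    return c
```

On the GPU the two inner tiles would be loaded once into shared memory and reused by all threads of the block; here the slicing only mimics that access pattern, so this sketch says nothing about actual GPU performance.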

Method

Two matrices are multiplied with one another: A, of increasing size (m from 2^5 to 2^16, n from 2^4 to 2^15), and B, the transposed matrix of A. If a calculation takes longer than 0.05 seconds, the method is disqualified. Random 32-bit floats were used for testing.
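The procedure above could be sketched roughly as follows. This is not the project's actual harness: `benchmark` is a hypothetical helper, the pairing m = 2^e with n = 2^(e-1) is an assumption read from the ranges given, and each size is timed with a single run and no warm-up.

```python
import time

import numpy as np

TIME_LIMIT_S = 0.05  # disqualification threshold from the Method section


def benchmark(method, exponents, time_limit=TIME_LIMIT_S, seed=0):
    """Time method(a, b) for growing sizes and stop once the limit is exceeded.

    Returns a list of (m, n, seconds) tuples, including the disqualifying run.
    """
    rng = np.random.default_rng(seed)
    results = []
    for e in exponents:
        m, n = 2 ** e, 2 ** (e - 1)               # assumed pairing of m and n
        a = rng.random((m, n), dtype=np.float32)  # random 32-bit floats
        b = np.ascontiguousarray(a.T)             # B is the transpose of A
        start = time.perf_counter()
        method(a, b)
        elapsed = time.perf_counter() - start
        results.append((m, n, elapsed))
        if elapsed > time_limit:                  # method is disqualified
            break
    return results
```

For example, `benchmark(np.matmul, range(5, 10))` would time NumPy on the five smallest sizes. A GPU method called through ctypes would have to finish its device work before returning for the timing to be meaningful.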

Results

Here are the relative differences between the methods at each step size:
As you can see, my sequential C implementation is disqualified pretty quickly, as the matrices grow exponentially in size.
NumPy is extremely optimized, with its methods calling into C, and it is disqualified at 2^12, where it is 3853x slower than the fastest method, CudaBlas.
Interestingly, the CUDA shared-memory implementation holds up pretty well against CudaBlas and is only about 6x slower (results are rounded).
Also interesting is how slow CUDA shared memory and CudaBlas are on smaller matrices. My assumption is that the latency introduced by loading the data onto the GPU, or the thread-block size of 16*16 threads, causes some issues. This needs further investigation, however.

Todo

  • why is CUDA lagging on smaller matrix sizes?
  • testing on a Jetson Nano device
  • can we improve the speed through increasing the thread-block size?
  • what role do the VRAM sizes play: Jetson Nano (2GB) vs. 1080Ti (11GB)?
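As a starting point for the VRAM question, the raw memory needed by the matrices can be estimated with simple arithmetic. The sketch below assumes A is m x n, B is n x m and the result is m x m, all stored as 32-bit floats, with m = 2^e and n = 2^(e-1). The helper names are hypothetical, and the estimate ignores memory taken by the driver, the CUDA context and any library workspace, so it is an optimistic bound.

```python
BYTES_PER_FLOAT32 = 4
GIB = 2 ** 30


def footprint_bytes(m: int, n: int) -> int:
    """Bytes needed to hold A (m x n), B (n x m) and the result C (m x m)."""
    a_bytes = m * n * BYTES_PER_FLOAT32
    b_bytes = n * m * BYTES_PER_FLOAT32
    c_bytes = m * m * BYTES_PER_FLOAT32
    return a_bytes + b_bytes + c_bytes


def largest_exponent_that_fits(vram_bytes: int, exponents=range(5, 17)):
    """Largest e (m = 2**e, n = 2**(e-1)) whose matrices fit, or None."""
    fitting = [
        e for e in exponents
        if footprint_bytes(2 ** e, 2 ** (e - 1)) <= vram_bytes
    ]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for name, vram in (("Jetson Nano", 2 * GIB), ("1080Ti", 11 * GIB)):
        print(name, largest_exponent_that_fits(vram))
```

Under these assumptions the largest configuration (m = 2^16, n = 2^15) needs 32 GiB for the three matrices alone, so it would fit on neither device without splitting the work into chunks.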

Contributors

lukasld
