Giter VIP home page Giter VIP logo

matrix-cuda's Introduction

matrix-cuda

matrix multiplication in CUDA, this is a toy program for learning CUDA, some functions are reusable for other purposes

test results

following tests were carried out on a Tesla M2075 card

[lzhengchun@clus10 liu]$ ./a.out

please type in m n and k

1024 1024 1024

Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on GPU: 13.604608 ms.

Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on CPU: 9925.121094 ms.

all results are correct!!!, speedup = 729.541138

[lzhengchun@clus10 liu]$ ./a.out

please type in m n and k

1024 1024 1023

Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on GPU: 51.141281 ms.

Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on CPU: 8964.353516 ms.

all results are correct!!!, speedup = 175.286057

#Notes

(1) function gpu_matrix_mult: A naive implementation on GPUs assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, do the inner product, and store the result back to matrix C in the global memory. In the naive implementation, the amount of computation is 2 x M x N x K flop, while the amount of global memory access is 2 x M x N x K word. The "computation-to-memory ratio" is approximately 1/4 (flop/byte). Therefore, the naive implementation is memory bandwidth bounded.

(2) function gpu_square_matrix_mult: (!!! this is only for square matrix mutiplication)

To increase the "computation-to-memory ratio", the tiled matrix multiplication can be applied. One thread block computes one tile of matrix C. Each thread in the thread block computes one element of the tile. The figure shows a 32 x 32 matrix divided into four 16 x 16 tiles. To compute this, four thread blocks, each with 16 x 16 threads can be created. The GPU kernel computes C in multiple iterations. In each iteration, each thread block loads one tile of A and one tile of B from global memory to shared memory, performs computation, and stores temporal result of C in register. After all the iteration is done, the thread block stores one tile of C into global memory. For example, a thread block can compute C0,0 in two iterations: C0,0 = A0,0 B0,0 + A0,1 B1,0. Therefore, in the tiled implementation, the amount of computation is still 2 x M x N x K flop. However, using tile size of B, the amount of global memory access is 2 x M x N x K / B word. The "computation-to-memory ratio" is approximately B/4 (flop/byte). We now can tune the "computation-to-memory" ratio by changing the tile size B. Futher explain please redirect to http://www.es.ele.tue.nl/~mwijtvliet/5KK73/?page=mmcuda (be careful with the Pseudocode, there are missing points).

As you can see from the test results, tiled version has much better speedup than gpu_matrix_mult.

#comparison with openmp

(Intel(R) Xeon(R) CPU E5645 @ 2.40GHz) X 4 = 24 Cores

[lzhengchun@clus10 liu]$ ./a.out

please type in m n and k

2300 2300 2300

Time elapsed on matrix multiplication of 2300x2300 . 2300x2300 on GPU: 166.835617 ms.

Time elapsed on matrix multiplication of 2300x2300 . 2300x2300 on CPU: 19520.644531 ms.

all results are correct!!!, speedup = 117.005257

[lzhengchun@clus10 liu]$ ./a.out

please type in m n and k

1024 1024 1024

Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on GPU: 15.479232 ms.

Time elapsed on matrix multiplication of 1024x1024 . 1024x1024 on CPU: 2045.946167 ms.

all results are correct!!!, speedup = 132.173630

[lzhengchun@clus10 liu]$ ./a.out

please type in m n and k

1024 1024 1023

Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on GPU: 53.428638 ms.

Time elapsed on matrix multiplication of 1024x1024 . 1024x1023 on CPU: 1563.460571 ms.

all results are correct!!!, speedup = 29.262594

So, the openmp version is about 5X faster than single thread version, still far from theoritical (24)

#todo

(1) further optimization, especially the "computation-to-memory ratio" for non square matrix

(2) solve shared Mem Bank conflict issue and Global Memory does not coalesced issue

matrix-cuda's People

Contributors

danielgestwa avatar lzhengchun avatar samithaherath avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.