matmul's Introduction

matmul

Fast multiplication of matrices

The purpose of this project is to compare several matrix multiplication algorithms and show how exploiting cache locality can significantly improve execution speed. Until now, four versions have been implemented :

the basic one, called basic_mult, which completely compute c[i][j] by looping over k : c[i][j] += a[i][k] * b[k][j]. Not very cache-friendly.
block_mult_inplace, which decomposes the matrix into subblocks and performs operations on each subblock before moving to the next one.
block_mult_copy, which does the same as above but also rearranges the data of one subblock so that all lines of one subblock are contiguous in memory.
block_mult_copy_sse, same principle as above but with sse intrinsics.

Try modifying the optimization level by changing the OPT_LEVEL variable in the Makefile to see how it affects performance. There should not be much difference between block_mult_copy and block_mult_copy_sse with -O3 activated as gcc automatically uses vectorization at this level of optimization. But there is a huge difference when using -02.

Here are the results with -O2 on two 4096*4096 matrices (with an Intel i3-3227U, 1.90GHz CPU)

Time taken for basic_mult: 17.8mn
Time taken for block_mult_inplace: 148.28s
Time taken for block_mult_copy: 116.52s
Time taken for block_mult_copy_sse: 37.31s

With -O3:

Time taken for basic_mult: 18mn
Time taken for block_mult_inplace: 62.92s
Time taken for block_mult_copy: 32.31s
Time taken for block_mult_copy_sse: 30.05s
Time taken for block_mult_parallel with 4 threads: 11.7s

Recommend Projects

wlefranc / matmul Goto Github PK

matmul's Introduction

matmul

matmul's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent