Comments (5)
Thanks for looking into all this. That's certainly not a graph which will impress anyone!
I'd call it disappointing but unsurprising. Tullio generates the most naive possible KernelAbstractions code, and I know from e.g. these transpose examples that there is much more that could be done. How easily and how well this can be done automatically, for arbitrary expressions, I don't know.
I also don't know whether there are operations on which this naive code is more competitive. On the CPU, large matmul is a severe test of how you handle awkward memory access, but something like
@tullio d[i] := p[i,k] * log(p[i,k]/q[i,k])
has much more work per element and a more obvious access pattern.
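To make the comparison concrete, here is a minimal sketch of what that expression computes, written as plain Python loops rather than Julia. It is only an illustration of the arithmetic, not the code Tullio generates: the index `k` appears only on the right-hand side, so it is summed over, and each output element needs one division, one `log`, and one multiply per `k`, all read along a single row. The function name `rowwise_sum` is made up for this example.

```python
import math

def rowwise_sum(p, q):
    """Return d with d[i] = sum over k of p[i][k] * log(p[i][k] / q[i][k]).

    p and q are lists of rows of equal shape with strictly positive entries.
    """
    d = []
    for prow, qrow in zip(p, q):
        total = 0.0
        for pik, qik in zip(prow, qrow):
            # One division, one log and one multiply per element,
            # reading both inputs in the same row-major order.
            total += pik * math.log(pik / qik)
        d.append(total)
    return d

p = [[0.5, 0.5], [0.25, 0.75]]
q = [[0.25, 0.75], [0.25, 0.75]]
d = rowwise_sum(p, q)
```

Unlike matmul, where each input element is reused across a whole row or column of the output, every element of `p` and `q` here is read exactly once, which is why the access pattern is described as more obvious.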
It would be nice for Tullio to do better at these things, but I don't even have a working GPU at the moment, and I have many projects. If anyone wanted to try, that would be fantastic. I think the code that generates the loops can be played with without diving into the more tangled macro code for parsing etc.
For the purposes of BLASBenchmarksGPU, there may already exist much better KernelAbstractions matmul code, so there is no need to auto-generate it. And it might be interesting to see how varying levels of cleverness affect things at different sizes.
from tullio.jl.
Here is a plot for Matrix{Float32} = Matrix{Float32} × Matrix{Float32}:
And here are the TFLOPS values:
14×4 DataFrame
 Row │  Size  Library  TFLOPS     Time
     │ Int64  Symbol   Float64    Float64
─────┼─────────────────────────────────────────
   1 │   128  CUBLAS   0.189052   22186.0
   2 │   128  Tullio   0.0552115  75968.0
   3 │   256  CUBLAS   0.93772    35783.0
   4 │   256  Tullio   0.290567   115479.0
   5 │   512  CUBLAS   3.5438     75748.0
   6 │   512  Tullio   0.633555   423697.0
   7 │  1024  CUBLAS   19.232     111662.0
   8 │  1024  Tullio   0.745863   2.8792e6
   9 │  2048  CUBLAS   41.5728    413248.0
  10 │  2048  Tullio   0.535823   3.20626e7
  11 │  4096  CUBLAS   49.3898    2.78274e6
  12 │  4096  Tullio   0.31201    4.40496e8
  13 │  8192  CUBLAS   39.9035    2.75543e7
  14 │  8192  Tullio   0.298638   3.68176e9
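The thread does not state the units of the Time column or how TFLOPS was derived. A quick check, sketched below in Python, suggests the numbers are consistent with Time being in nanoseconds and an N×N matmul being counted as 2·N³ floating-point operations. Both are assumptions inferred from the table rather than something the benchmark output confirms, and the helper names `tflops` and `slowdown` are made up for this sketch.

```python
def tflops(n, time_ns):
    """TFLOPS for an n-by-n matmul, assuming 2*n**3 flops and time in ns."""
    flops = 2 * n**3              # n**3 multiplies plus n**3 additions
    seconds = time_ns * 1e-9
    return flops / seconds / 1e12

def slowdown(time_slow, time_fast):
    """How many times longer the slower run took."""
    return time_slow / time_fast

# (size, library, time) rows copied from the table above.
rows = [
    (128, "CUBLAS", 22186.0),
    (128, "Tullio", 75968.0),
    (4096, "CUBLAS", 2.78274e6),
    (8192, "CUBLAS", 2.75543e7),
    (8192, "Tullio", 3.68176e9),
]

for n, lib, t in rows:
    print(f"{n:5d} {lib:7s} {tflops(n, t):.6g} TFLOPS")

ratio_8192 = slowdown(3.68176e9, 2.75543e7)
print(f"Tullio / CUBLAS time at 8192: {ratio_8192:.1f}x")
```

Under those assumptions the recomputed values agree with the TFLOPS column to the printed precision, and at size 8192 the Tullio kernel takes roughly 134 times as long as CUBLAS.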
Here is Matrix{Float32} = Matrix{Float32} × Matrix{Float32} at small matrix sizes:
And here are the TFLOPS values:
 Row │  Size  Library  TFLOPS       Time
     │ Int64  Symbol   Float64      Float64
─────┼──────────────────────────────────────────
   1 │     1  CUBLAS   1.07614e-7   18585.0
   2 │     1  Tullio   1.1015e-7    18157.0
   3 │     2  CUBLAS   1.12859e-6   14177.0
   4 │     2  Tullio   8.46158e-7   18909.0
   5 │     3  CUBLAS   3.05603e-6   17670.0
   6 │     3  Tullio   2.58807e-6   20865.0
   7 │     4  CUBLAS   7.21411e-6   17743.0
   8 │     4  Tullio   6.12821e-6   20887.0
   9 │     5  CUBLAS   1.40473e-5   17797.0
  10 │     5  Tullio   1.19355e-5   20946.0
  11 │     6  CUBLAS   2.91695e-5   14810.0
  12 │     6  Tullio   2.05695e-5   21002.0
  13 │     7  CUBLAS   4.50575e-5   15225.0
  14 │     7  Tullio   3.25922e-5   21048.0
  15 │     8  CUBLAS   6.45772e-5   15857.0
  16 │     8  Tullio   4.75682e-5   21527.0
  17 │     9  CUBLAS   8.25875e-5   17654.0
  18 │     9  Tullio   6.71889e-5   21700.0
  19 │    10  CUBLAS   0.000111744  17898.0
  20 │    10  Tullio   9.05305e-5   22092.0
  21 │    30  CUBLAS   0.00295421   18279.0
  22 │    30  Tullio   0.00176655   30568.0
  23 │    50  CUBLAS   0.011872     21058.0
  24 │    50  Tullio   0.00448053   55797.0
  25 │    70  CUBLAS   0.0324304    21153.0
  26 │    70  Tullio   0.0105823    64825.0
  27 │    90  CUBLAS   0.0624679    23340.0
  28 │    90  Tullio   0.023251     62707.0
  29 │   110  CUBLAS   0.107313     24806.0
  30 │   110  Tullio   0.0386391    68894.0
  31 │   130  CUBLAS   0.194994     22534.0
  32 │   130  Tullio   0.0501592    87601.0
  33 │   150  CUBLAS   0.238196     28338.0
  34 │   150  Tullio   0.0735534    91770.0
  35 │   170  CUBLAS   0.395524     24843.0
  36 │   170  Tullio   0.104512     94018.0
  37 │   190  CUBLAS   0.437074     31386.0
  38 │   190  Tullio   0.140838     97403.0
  39 │   210  CUBLAS   0.548053     33796.0
  40 │   210  Tullio   0.177458     104374.0
  41 │   230  CUBLAS   0.76275      31903.0
  42 │   230  Tullio   0.223804     108729.0
  43 │   250  CUBLAS   1.06841      29249.0
  44 │   250  Tullio   0.273533     114246.0
Re the Float32 graph: in round numbers, my CPU gets to about half a teraflop from sizes in the low hundreds, which appears to be faster than Tullio's GPU matmul in the range shown. My graph is here:
https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.11-Float32-1.6.0-dev.png
And xref also #30 about GPU vs CPU speed.
I would be curious to know whether #82 helps here. I am not expecting a huge difference, but it might help a little.