Comments (5)
Rethinking this, I can understand that it is probably interpreted by moving both sums (over k and over m) to the very outside.
Yes, this is exactly right. Unlike Einstein it does not know that + is special; it sees y[j,i] := f(g(a[i,k], b[j,k]), g(a[m,j], b[m,i])) as something to sum over k and m.
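A minimal sketch in plain Julia (the 4×4 random arrays are made up for illustration) of what that joint summation means: because the single loop nest sums each term over the other index as well, both pieces pick up a factor of n compared with two independent sums:

```julia
n = 4
a, b = rand(n, n), rand(n, n)

# One loop nest with both sums moved to the very outside, as the macro writes it:
# y[j,i] = sum over k AND m of the whole right-hand side
y_joint = [sum(a[i,k]*b[j,k] + a[m,j]*b[m,i] for k in 1:n, m in 1:n)
           for j in 1:n, i in 1:n]

# Two independent sums, which is what a*bᵀ + aᵀ*b style code computes:
y_wanted = transpose(a*transpose(b)) .+ transpose(a)*b

# Each term was also summed over the "other" index, hence the factor of n:
@assert y_joint ≈ n .* y_wanted
```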
The macro always generates just one loop nest. All that you can do with |> is apply an operation later in the nest. The canonical example is @tullio y[i] := x[i,k]^2 |> sqrt, which makes:
for i in axes(y,1)        # outer loops for LHS
    tmp = 0
    for k in axes(x,2)    # inner loops are the sum
        tmp += x[i,k]^2
    end
    y[i] = sqrt(tmp)      # sqrt moved later
end
Maybe "finaliser" is the wrong word, but that's all it does. I see what you're hoping for, but that requires a more complicated set of loops which Tullio doesn't understand. I think the macro has not noticed the |> at all (since it's not at top level) and hence calls Base's version, which does nothing here. Probably it should throw an error instead.
from tullio.jl.
Thanks for the explanation. This makes sense. I found a way to write it, but I guess this is still doing pretty much the same as the first example above.
julia> @btime $c .= (@tullio $y[j,i] := $a[i,k] * $b[j,k]) .+ (@tullio $y[j,i] := $a[m,j] * $b[m,i]);
673.596 ms (4 allocations: 15.26 MiB)
The timings and memory are for 1k × 1k arrays. The memory consumption got me a little worried, which is why I also tried a simple matrix multiplication:
julia> @btime c .= $a * transpose($b);
9.269 ms (3 allocations: 7.63 MiB)
julia> @btime @tullio c[j,i] = $a[i,k] * $b[j,k];
549.171 ms (9 allocations: 176 bytes)
Memory is no problem here, but speed is (probably cache usage?).
Is this (Tullio being 5x slower) a known issue?
The non-Tullio way of writing this also uses 15 MiB, but is faster:
julia> @btime $c .= $a*transpose($b) .+ transpose($a)*$b;
19.053 ms (4 allocations: 15.26 MiB)
I guess this may have to do with the magic of efficient (<O(N²)) implementations of matrix multiplication.
If you have c already, then this will avoid the allocations:
mul!(c, a, transpose(b))
mul!(c, transpose(a), b, true, true)
So will things like @tullio y[j,i] += a[m,j] * b[m,i], with = or += but not :=.
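For reference, a small self-checking sketch of the five-argument form (array sizes made up): mul!(C, A, B, α, β) from the LinearAlgebra stdlib computes C = α*A*B + β*C in place, so passing true, true accumulates into the existing contents of C:

```julia
using LinearAlgebra

n = 4
a, b = rand(n, n), rand(n, n)
c = similar(a)

mul!(c, a, transpose(b))              # c = a * bᵀ, written in place
mul!(c, transpose(a), b, true, true)  # c = aᵀ * b + c, accumulated in place

@assert c ≈ a*transpose(b) .+ transpose(a)*b
```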
For straight matrix multiplication, Tullio will usually lose to more specialised routines. See e.g. this graph: https://github.com/JuliaLinearAlgebra/Octavian.jl Around size 100, it suffers from the overhead of using Base's threads. Around size 3000, it suffers from not knowing about some optimisations. (I don't think <N^3 algorithms like Strassen are actually used in BLAS, but I'm not very sure.) Tullio's main purpose in life is handling weird contractions which aren't served at all by such libraries, or which would require expensive permutedims operations before/after. These are where it can sometimes be 5x faster.
Thanks. Of course <O(N^3) is what I meant. Interesting to know that Strassen and the like are not actually used in BLAS, as discussed here.
I was using Tullio for exactly this use case, a weird contraction. See this code.
Yet it seems to be hard to avoid allocations or storing intermediate results in large arrays.