mcabbott / Tullio.jl
License: MIT License
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
I think something is not being communicated correctly to LoopVectorization.jl on version 0.2.9:
julia> using Tullio, LoopVectorization
[ Info: Precompiling Tullio [bc48ee85-29a4-5162-ae0b-a64e1601d4bc]
julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)
julia> foreach((2, 10, 50, 100)) do N
A, B = rand(N, N + 1), rand(N + 1, N + 2)
@show N
@btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
@btime mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
end
N = 2
51.804 ns (0 allocations: 0 bytes)
123.709 ns (0 allocations: 0 bytes)
N = 10
210.035 ns (0 allocations: 0 bytes)
371.549 ns (0 allocations: 0 bytes)
N = 50
10.550 μs (0 allocations: 0 bytes)
13.939 μs (0 allocations: 0 bytes)
N = 100
25.340 μs (49 allocations: 3.19 KiB)
39.860 μs (0 allocations: 0 bytes)
(@v1.5) pkg> st Tullio LoopVectorization
Status `~/.julia/environments/v1.5/Project.toml`
[bdcacae8] LoopVectorization v0.8.26
[bc48ee85] Tullio v0.2.8
Now, restarting julia,
(@v1.5) pkg> add Tullio@v0.2.9
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Updating `~/.julia/environments/v1.5/Project.toml`
[bc48ee85] ↑ Tullio v0.2.8 ⇒ v0.2.9
Updating `~/.julia/environments/v1.5/Manifest.toml`
[bc48ee85] ↑ Tullio v0.2.8 ⇒ v0.2.9
julia> using Tullio, LoopVectorization
[ Info: Precompiling Tullio [bc48ee85-29a4-5162-ae0b-a64e1601d4bc]
julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)
julia> foreach((2, 10, 50, 100)) do N
A, B = rand(N, N + 1), rand(N + 1, N + 2)
@show N
@btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
@btime mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
end
N = 2
51.125 ns (0 allocations: 0 bytes)
129.749 ns (0 allocations: 0 bytes)
N = 10
847.338 ns (0 allocations: 0 bytes)
372.568 ns (0 allocations: 0 bytes)
N = 50
111.719 μs (0 allocations: 0 bytes)
13.920 μs (0 allocations: 0 bytes)
N = 100
261.787 μs (49 allocations: 3.19 KiB)
38.220 μs (0 allocations: 0 bytes)
(@v1.5) pkg> st Tullio LoopVectorization
Status `~/.julia/environments/v1.5/Project.toml`
[bdcacae8] LoopVectorization v0.8.26
[bc48ee85] Tullio v0.2.9
I have the slightly unusual use-case dense matrix = dense matrix * banded matrix (actually more complicated than this, but this shows the issue). With the help of @mcabbott I managed to implement the following:
using LinearAlgebra
using BandedMatrices
using Tullio
using LoopVectorization
function clear_off_bands!(A::BandedMatrix{T}) where T
m,n = size(A.data)
l,u = bandwidths(A)
for j = 1:u
A.data[1:u-j+1,j] .= zero(T)
end
mpn = m+n
for j = n-l+1:n
A.data[mpn-l-j+1:end,j] .= zero(T)
end
A
end
function my_mul!(w, v, A::BandedMatrix)
# This implementation assumes that the off-band values, i.e. the
# upper-left and lower-right corners, are zero, cf. clear_off_bands!
l,u = bandwidths(A)
m,n = size(A)
@tullio verbose=2 w[i,j] = begin
k = min(max(h + j - u - 1, 1), m)
v[i,k] * A.data[h,j]
end
w
end
n = 4000
v = Matrix((1.0+0im)*I, n, n) # Complex identity matrix
w = similar(v)
w2 = similar(v)
w3 = similar(v)
A = clear_off_bands!(brand(ComplexF64, n, n, 5, 5))
At = BandedMatrix(transpose(A))
@time mul!(w, v, A) # This is exceedingly slow, since this uses generic dense multiplication
@time mul!(transpose(w2), At, transpose(v)) # This will BLAS
@time my_mul!(w3, v, A)
@show norm(w-w2), norm(w-w3), norm(A-w3)
which gives a warning
┌ Warning: LoopVectorization failed
│ err =
│ LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element
│ in expression starting at /Users/jagot/.julia/packages/Tullio/bgqFi/src/macro.jl:1092
└ @ Tullio ~/.julia/packages/Tullio/bgqFi/src/macro.jl:1118
but other than that seems to work well. On my laptop I get the following timings:
cc @chriselrod
This doesn't look thread-safe in the gradient, as one thread's i+1 may be another's i:
julia> using Tullio
julia> reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])) verbose=2
...
┌ Info: symbolic gradients
│ inbody =
│ 3-element Array{Any,1}:
│ :(𝛥x[i, j] = 𝛥x[i, j] + 𝛥ℛ[1] * conj((((x[i, j] - x[i + 1, j]) + (x[i, j] - x[i + 1, j])) + ((x[i, j] - x[i, j + 1]) + (x[i, j] - x[i, j + 1]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
│ :(𝛥x[i + 1, j] = 𝛥x[i + 1, j] + 𝛥ℛ[1] * conj(-(((x[i, j] - x[i + 1, j]) + (x[i, j] - x[i + 1, j]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
└ :(𝛥x[i, j + 1] = 𝛥x[i, j + 1] + 𝛥ℛ[1] * conj(-(((x[i, j] - x[i, j + 1]) + (x[i, j] - x[i, j + 1]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
┌ store.
│ arrays = [:x]
…
│ sharedind = [:i, :j]
│ shiftedind = [:i, :j]
Example from https://discourse.julialang.org/t/fast-gpu-kernels-differentiable-with-zygote/56756 (mostly about scalar reductions on GPU)
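To illustrate the hazard, here is a minimal sketch (not Tullio's actual kernel) of why threading the gradient loop over i races: adjacent iterations write the same slot of Δx.

using Base.Threads
Δx = zeros(4)
@threads for i in 1:3
    Δx[i]   += 1.0  # iteration i writes Δx[i] ...
    Δx[i+1] += 1.0  # ... and Δx[i+1], which iteration i+1 also writes as its Δx[i]: a data race
end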
I wasn't sure whether to open this issue here or on KernelAbstractions, so I figured I'd open it here first for discussion, and then we can move it to KernelAbstractions later if need be.
Right now, the GPU performance doesn't seem to scale very well on large matrices.
Here's an example plot for Matrix{Float32}=Matrix{Float16}×Matrix{Float16}, generated on a Titan V.
Here's the code used to generate it:
using BLASBenchmarksGPU
import CUDA
bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
import PyPlot
BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
CUDA.versioninfo()
And here's the output, including all of the TFLOPS values:
julia> using BLASBenchmarksGPU
julia> import CUDA
julia> bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
Progress: 100%|███████████████████████████████████████████████████| Time: 0:08:24
Size: 16384
CUBLAS: 62.9 TFLOPS
GemmKernels: 66.19 TFLOPS
Tullio: 0.29 TFLOPS
Benchmark Result of Matrix{Float32}=Matrix{Float16}×Matrix{Float16}
24×4 DataFrame
Row │ Size Library TFLOPS Time
│ Int64 Symbol Float64 Float64
─────┼─────────────────────────────────────────────────
1 │ 128 CUBLAS 0.148099 28321.0
2 │ 128 GemmKernels 0.0332214 126253.0
3 │ 128 Tullio 0.0540789 77559.0
4 │ 256 CUBLAS 0.642928 52190.0
5 │ 256 GemmKernels 0.268756 124851.0
6 │ 256 Tullio 0.310169 108181.0
7 │ 512 CUBLAS 4.36686 61471.0
8 │ 512 GemmKernels 2.02769 132385.0
9 │ 512 Tullio 0.677953 395950.0
10 │ 1024 CUBLAS 25.168 85326.0
11 │ 1024 GemmKernels 14.2662 150529.0
12 │ 1024 Tullio 0.850125 2.52608e6
13 │ 2048 CUBLAS 59.5005 288735.0
14 │ 2048 GemmKernels 36.4345 471527.0
15 │ 2048 Tullio 0.76592 2.24304e7
16 │ 4096 CUBLAS 84.8186 1.62039e6
17 │ 4096 GemmKernels 57.559 2.38779e6
18 │ 4096 Tullio 0.393097 3.49631e8
19 │ 8192 CUBLAS 90.1617 1.21949e7
20 │ 8192 GemmKernels 59.9127 1.83519e7
21 │ 8192 Tullio 0.324503 3.38829e9
22 │ 16384 CUBLAS 62.9003 1.39842e8
23 │ 16384 GemmKernels 66.1909 1.3289e8
24 │ 16384 Tullio 0.294184 2.98999e10
julia> import PyPlot
julia> BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4
Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)
Toolchain:
- Julia: 1.6.0-beta1
- LLVM: 11.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_VERBOSE: true
1 device:
0: TITAN V (sm_70, 658.500 MiB / 11.784 GiB available)
Hey,
I discovered a strange behaviour:
using Zygote
using Tullio
x = ones((5,5))
function f_working(arr)
@tullio res = sqrt(arr[i, j])
return res
end
function f_not_working(arr)
@tullio res = identity(arr[i, j])
return res
end
Calling the second function with Zygote:
julia> gradient(f_not_working, ones((3,3)))
ERROR: no gradient definition here!
Stacktrace:
[1] (::Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#9"{var"#𝒜𝒸𝓉!#8"},Nothing},Tuple{Array{Float64,2}},Array{Float64,1}})(::FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}) at <path>/.julia/packages/Tullio/bnAxO/src/grad/zygote.jl:7
[2] (::Tullio.var"#195#back#219"{Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#9"{var"#𝒜𝒸𝓉!#8"},Nothing},Tuple{Array{Float64,2}},Array{Float64,1}}})(::FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}) at <path>/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
[3] f_not_working at ./REPL[5]:3 [inlined]
[4] (::typeof(∂(f_not_working)))(::Float64) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface2.jl:0
[5] (::Zygote.var"#37#38"{typeof(∂(f_not_working))})(::Float64) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface.jl:45
[6] gradient(::Function, ::Array{Float64,2}) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface.jl:54
[7] top-level scope at REPL[6]:1
Pure Zygote is able to take the derivative of identity, so it must somehow be due to the interplay of Tullio and Zygote.
Thanks,
Felix
Edit:
I think it also occurs in a more general case:
foo = sqrt
function f_not_working(arr)
@tullio res = foo(arr[i, j])
return res
end
using Tullio
using LoopVectorization
J = [1,3,1,1]
A = [i^2 for i in 1:10]
@tullio B[J[i],k] := A[k]
@tullio B[J[i],k] += A[k]
# 3×10 Array{Int64,2}:
# 1 4 9 16 25 36 49 64 81 100
# 0 0 0 0 0 0 0 0 0 0
# 1 4 9 16 25 36 49 64 81 100
@tullio B[J[i],k] := A[k]
@tullio avx=false B[J[i],k] += A[k]
# 3×10 Array{Int64,2}:
# 4 16 36 64 100 144 196 256 324 400
# 0 0 0 0 0 0 0 0 0 0
# 2 8 18 32 50 72 98 128 162 200
As @avx can't warn the user about every such misuse, maybe the default should be avx = false.
I think this is still blocked by JuliaDiff/ReverseDiff.jl#135, here's the error:
julia> Base.gensym(ex::Expr) = gensym(string(ex)); # without this, LoadError: MethodError: no method matching gensym(::Expr) instead
julia> using ReverseDiff, Tullio
┌ Warning: Error requiring ReverseDiff from Tullio:
│ LoadError: UndefVarError: ev not defined
│ Stacktrace:
│ [1] top-level scope at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/macros.jl:190
│ [2] include(::Function, ::Module, ::String) at /Applications/Julia-1.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
│ [3] include at ./Base.jl:368 [inlined]
│ [4] include(::String) at /Users/me/.julia/packages/Tullio/HGzih/src/Tullio.jl:1
│ [5] top-level scope at none:1
│ [6] eval at ./boot.jl:331 [inlined]
│ [7] eval at /Users/me/.julia/packages/Tullio/HGzih/src/Tullio.jl:1 [inlined]
│ [8] (::Tullio.var"#150#154")() at /Users/me/.julia/packages/Requires/qy6zC/src/require.jl:85
...
julia> ReverseDiff.gradient(x -> sum(@tullio y[i] := x[i]), rand(3))
ERROR: MethodError: no method matching track(::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#7"},var"#20#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#8"}}, ::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}})
Stacktrace:
[1] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#7"},var"#20#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#8"}})(::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) at /Users/me/.julia/packages/Tullio/HGzih/src/grad/reverse.jl:4
[2] (::var"#5#6")(::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) at ./REPL[4]:1
[3] ReverseDiff.GradientTape(::var"#5#6", ::Array{Float64,1}, ::ReverseDiff.GradientConfig{ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}) at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/api/tape.jl:199
[4] gradient(::Function, ::Array{Float64,1}, ::ReverseDiff.GradientConfig{ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}) at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/api/gradients.jl:22 (repeats 2 times)
What the macro does:
using TensorCast, ReverseDiff
id(x) = x;
@pretty ReverseDiff.@grad function id(x)
ReverseDiff.value(x), Δ -> (Δ,)
end
julia> f1(x) = sum(@tullio res[i, j] := begin
x[i+2, j] - 2 * x[i+1, j] + x[i, j]
end)
f1 (generic function with 1 method)
julia> gradient(f1, x)
ERROR: no gradient definition here!
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] (::Tullio.var"#215#216"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#11"{var"#𝒜𝒸𝓉!#10"}, Nothing}, Tuple{Matrix{Float64}}, Matrix{Float64}})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
@ Tullio ~/.julia/packages/Tullio/bgqFi/src/grad/zygote.jl:7
[3] (::Tullio.var"#85#back#217"{Tullio.var"#215#216"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#11"{var"#𝒜𝒸𝓉!#10"}, Nothing}, Tuple{Matrix{Float64}}, Matrix{Float64}}})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
@ Tullio ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59
[4] Pullback
@ ./REPL[39]:1 [inlined]
[5] (::typeof(∂(f1)))(Δ::Float64)
@ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
[6] (::Zygote.var"#41#42"{typeof(∂(f1))})(Δ::Float64)
@ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:40
[7] gradient(f::Function, args::Matrix{Float64})
@ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:49
[8] top-level scope
@ REPL[40]:1
julia> f1(x) = sum(@tullio res[i, j] := x[i+2, j] - 2 * x[i+1, j] + x[i, j])
f1 (generic function with 1 method)
julia> gradient(f1, x)
([1.0 1.0 … 1.0 1.0; -1.0 -1.0 … -1.0 -1.0; … ; -1.0 -1.0 … -1.0 -1.0; 1.0 1.0 … 1.0 1.0],)
I encountered this because my equation got quite lengthy.
This is the minimal example of the error.
a, b = randn(ComplexF64, 10, 10), randn(ComplexF64, 10, 10)
@tullio # (verbose = false, fastmath = true, threads = true, grad = :Base, avx = true, cuda = 256, tensor = true)
@tullio out[a, b] := a[a, c] * b[c, b] # works
@tullio out[A, B] := a[A, C] * conj(b[C, B]) # works
@tullio out[a, b] := a[a, c] * conj(b[c, b]) # syntax error
When @tullio is replaced by @tensor from TensorOperations.jl, the last expression works fine. Is this a bug in the @tullio macro?
On the most recent commit to master (58c1fbb), CI passed on Julia 1.4 and on Julia 1.5, but failed (timed out) on Julia nightly. Links to the logs:
GitHub Actions will kill a job if it runs for more than six hours.
We need to figure out how to make the CI jobs run for less than six hours.
Here is one option: we separate the test suite into "groups". E.g. if currently the test suite looks like this:
using Tullio
using Test
@test f() == "a"
@test g() == "b"
@test h() == "c"
We instead divide the test suite into three groups as such:
using Tullio
using Test
test_group = get(ENV, "TULLIO_TEST_GROUP", "all")
if test_group in ["all", "1"]
@test f() == "a"
end
if test_group in ["all", "2"]
@test g() == "b"
end
if test_group in ["all", "3"]
@test h() == "c"
end
And then we set up the matrix in the .github/workflows/ci.yml file as such:
strategy:
fail-fast: false
matrix:
JULIA_NUM_THREADS:
- '1'
- '6'
TULLIO_TEST_GROUP:
- '1'
- '2'
- '3'
version:
- '1.4'
- '1' # automatically expands to the latest stable 1.x release of Julia
os:
- ubuntu-latest
arch:
- x64
@mcabbott @chriselrod Any thoughts?
Trying to implement a slightly weird "half-convolution" as follows:
using NNlib, Tullio
_tullio_conv(u, v) = @tullio y[i+_] := u[i+a] * v[a]
function tullio_conv(u::AbstractVector, v::AbstractVector)
# Left pad
upad = NNlib.pad_zeros(u, (length(v) - 1, 0))
v_rev = @view v[end:-1:1]
return _tullio_conv(upad, v_rev)
end
which results in "KernelAbstractions can't handle OffsetArrays here" when running tullio_conv(a, b) with a and b being CuArrays of the same size. It works fine on the CPU though.
Not sure if this is Tullio.jl's or KernelAbstractions.jl's "fault".
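A possible workaround (an untested sketch; the helper name _tullio_conv2 is made up here): absorb the shift into the index expression, so that the inferred range of i starts at 1 and no OffsetArray is created.

_tullio_conv2(u, v) = @tullio y[i] := u[i+a-1] * v[a]  # i inferred as 1:(length(u)-length(v)+1), so y is a plain 1-based Array

This computes the same values as y[i+_] := u[i+a] * v[a], just without the offset output.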
I noticed that this repo still uses travis-ci.org, at least based on the badge in the README. travis-ci.org will be shut down at the end of the year, cf. https://mailchi.mp/3d439eeb1098/travis-ciorg-is-moving-to-travis-cicom. Hence, everything should be migrated to travis-ci.com.
Hey,
I noticed the following today. The only difference between the correct one and the wrong one is the signs.
julia> using Tullio
julia> arr = [1 2; 1 2; 3 4; 3 4]
4×2 Array{Int64,2}:
1 2
1 2
3 4
3 4
julia> @tullio x2[i1, i2] := (arr[2i1 + 1, 2i2 + 1] + arr[2i1 + 0, 2i2 + 1] + arr[2i1 + 1, 2i2 + 0] + arr[2i1 + 0, 2i2 + 0])
1×0 Array{Int64,2}
julia> @tullio x2[i1, i2] := (arr[2i1 - 1, 2i2 - 1] + arr[2i1 + 0, 2i2 - 1] + arr[2i1 - 1, 2i2 + 0] + arr[2i1 + 0, 2i2 + 0])
2×1 Array{Int64,2}:
6
14
I didn't expect this because your downsample example is also formulated in a "+" way.
Thanks,
Felix
Edit: I guess I understand it. It simply starts at i1=1 and then sums up from there (2*1 + 1 and 2*1 + 0).
So that's intended behaviour?
Hey,
I'm trying to figure out how we could use Tullio to generalize some n-dimensional functions.
Let's say I've got an n-dimensional array and I want to use Tullio for it. At the moment I need to treat each dimensionality differently.
using Tullio
using Random
Random.seed!(42)
function spatial_grad_square_1D(arr)
resi = let
@tullio resi = abs2(arr[i + 1] - arr[i - 1])
end
return resi
end
function spatial_grad_square_2D(arr)
resi = let
@tullio resi = abs2(arr[i + 1, j] - arr[i - 1, j])
end
resj = let
@tullio resj = abs2(arr[i, j + 1] - arr[i, j - 1])
end
return resi + resj
end
arr_1D = randn((10))
arr_2D = randn((10, 10))
spatial_grad_square_1D(arr_1D)
spatial_grad_square_2D(arr_2D)
Do you have any hints on how to achieve this in a more elegant way? Using the Julia metaprogramming documentation I couldn't figure it out 😞
Thanks,
Felix
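For what it's worth, here is a dimension-agnostic sketch without Tullio, using CartesianIndices (the function name spatial_grad_square is made up here); it matches the 1D and 2D versions above.

# Sum abs2 of the centered difference along every dimension d,
# restricting the interior only in dimension d (as in the 2D version above).
function spatial_grad_square(arr::AbstractArray{T,N}) where {T,N}
    total = zero(T)
    R = CartesianIndices(arr)
    for d in 1:N
        unit = CartesianIndex(ntuple(i -> i == d ? 1 : 0, N))
        for I in first(R)+unit:last(R)-unit
            total += abs2(arr[I+unit] - arr[I-unit])
        end
    end
    return total
end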
Could we add a "Quick Start" section to the README, to help beginners that want to quickly get started playing around with Tullio? The section would include a list of recommended packages for use with Tullio.
E.g. maybe something like this?
julia> using Pkg
julia> Pkg.add(["Tullio", "LoopVectorization", "TensorOperations", "KernelAbstractions"])
julia> using Tullio, LoopVectorization, TensorOperations, KernelAbstractions
julia> matmul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]
julia> X = [1.0 2.0 3.0; 4.0 5.0 6.0]
julia> Y = [7.0 8.0; 9.0 10.0; 11.0 12.0]
julia> matmul(X, Y)
Are these the right packages for someone to start off with?
Hey,
the following code leads (sometimes, roughly 30-50% of the time) to a segmentation fault.
using Tullio
using Zygote
arr = [1 2; 3 4]
function f(arr)
@tullio res1 = arr[i, k] - arr[i - 1, k]
@tullio res2 = arr[i, k] - arr[i, k + 1]
return res1 + res2
end
f(arr)
print(gradient(f, arr))
It seems to be caused by the addition of the two statements. Furthermore, this only happens once Zygote is loaded and gradient is called.
The output is then:
julia> include("tullio_bug.jl")
([1 -1; 2 -1],)
julia> include("tullio_bug.jl")
signal (11): Segmentation fault
in expression starting at <path>/tullio_bug.jl:12
jl_gc_pool_alloc at /usr/bin/../lib/libjulia.so.1 (unknown line)
jl_gc_alloc at /usr/bin/../lib/libjulia.so.1 (unknown line)
jl_alloc_array_1d at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff849fc672)
unknown function (ip: 0x7eff84a11a38)
unknown function (ip: 0x7eff84a1234a)
jl_uncompress_ast at /usr/bin/../lib/libjulia.so.1 (unknown line)
_uncompressed_ast at ./reflection.jl:950 [inlined]
uncompressed_ast at ./reflection.jl:946
#meta#2 at /home/fxw/.julia/packages/IRTools/BpoqK/src/reflection/reflection.jl:112
meta at /home/fxw/.julia/packages/IRTools/BpoqK/src/reflection/reflection.jl:101 [inlined]
_lookup_grad at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/emit.jl:99
#s3320#1912 at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:21 [inlined]
#s3320#1912 at ./none:0
unknown function (ip: 0x7eff776bd06e)
unknown function (ip: 0x7eff84a3edab)
jl_code_for_staged at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff7761066a)
unknown function (ip: 0x7eff77615cca)
unknown function (ip: 0x7eff776aaad7)
unknown function (ip: 0x7eff776ab3d4)
unknown function (ip: 0x7eff776ab57f)
unknown function (ip: 0x7eff849ed555)
unknown function (ip: 0x7eff849edf04)
jl_apply_generic at /usr/bin/../lib/libjulia.so.1 (unknown line)
macro expansion at /home/fxw/.julia/packages/Tullio/xESL5/src/macro.jl:617 [inlined]
f at /home/fxw/Documents/Uni/FSU/2semester/Internship/DeconvOptim.jl/tullio_bug.jl:6 [inlined]
Pullback at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
#36 at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:46
unknown function (ip: 0x7eff0cb171b2)
gradient at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:55
unknown function (ip: 0x7eff84a02d85)
unknown function (ip: 0x7eff84a02a0e)
unknown function (ip: 0x7eff84a04670)
unknown function (ip: 0x7eff84a050e0)
unknown function (ip: 0x7eff84a210a1)
unknown function (ip: 0x7eff849f76e6)
jl_load at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff77955c1f)
unknown function (ip: 0x7eff84a02d85)
unknown function (ip: 0x7eff84a02a0e)
unknown function (ip: 0x7eff84a04670)
unknown function (ip: 0x7eff84a050e0)
unknown function (ip: 0x7eff84a210a1)
unknown function (ip: 0x7eff84a21748)
jl_toplevel_eval_in at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff77707df4)
unknown function (ip: 0x7eff77935179)
unknown function (ip: 0x7eff77935474)
unknown function (ip: 0x7eff84a08cc9)
unknown function (ip: (nil))
Allocations: 80635502 (Pool: 80625319; Big: 10183); GC: 66
[1] 395611 segmentation fault (core dumped) julia
Could this be related to fastmath?
Thanks,
Felix
Edit: The following code seems to work. Notice the minor change in res2:
using Tullio
using Zygote
arr = [1 2; 3 4]
function f(arr)
@tullio res1 = arr[i, k] - arr[i - 1, k]
@tullio res2 = arr[i, k] - arr[i, k]
return res1 + res2
end
f(arr)
print(gradient(f, arr))
We now have Buildkite working 🎉
On CPU, the test suite is quite long - on Julia 1.5 the CPU tests take e.g. 2 hours and 12 minutes (https://github.com/mcabbott/Tullio.jl/runs/1557475138).
There's no need to run all the CPU-only tests on Buildkite, since we will already be running them on GitHub Actions.
We can e.g. set an environment variable on Buildkite, e.g. BUILDKITE="true". And then in the test suite, we will do:
is_buildkite = parse(Bool, get(ENV, "BUILDKITE", "false"))
And then if is_buildkite is true, we only run the GPU-related tests.
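A minimal sketch of that gating in test/runtests.jl (the file names here are hypothetical):

is_buildkite = parse(Bool, get(ENV, "BUILDKITE", "false"))
if is_buildkite
    include("gpu_tests.jl")   # only the GPU-related tests on Buildkite
else
    include("cpu_tests.jl")   # the full CPU suite, already run on GitHub Actions
end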
I'm experiencing slower performance when taking gradients on CuArrays than on normal CPU Arrays, using Zygote. Here's an MWE (the Tullio operations implement a mixture linear regression):
using Tullio
using Zygote
using CUDA, KernelAbstractions
CUDA.allowscalar(false)
function prediction(V, W, r)
@tullio mixture[a, b, d] := V[a, b, c] * W[d, c]
@tullio r̂[a, d] := mixture[a, b, d] * r[b, d]
return r̂
end
### CuArray version
V = CuArray(randn(15, 15, 3))
W = CuArray(randn(150, 3))
@time begin
grad = gradient(Params([V])) do
l = 0
for t = 1:5
r = CuArray(randn(15, 150))
r̂ = prediction(V, W, r)
l += sum(r - r̂)
end
l
end
end
# output: 0.304601 seconds (516.80 k allocations: 15.125 MiB)
If I replace all the arrays with CPU ones, I get 0.151464 seconds (209.86 k allocations: 13.321 MiB).
I'm wondering if this is expected? Or am I forgetting something when using CUDA arrays that slows down the performance?
As pointed out by @mforets, there is a surprisingly large overhead for small scalar reductions:
using LinearAlgebra, Tullio, BenchmarkTools # BenchmarkTools for @btime
v = rand(10);
td(v,w) = @tullio d := v[i] * w[i]
@btime td($v, $v) # 2 μs
@btime dot($v, $v) # 20 ns
ts(v) = @tullio s := v[i]
@btime ts($v) # 2 μs
@btime sum($v) # 4 ns
Perhaps this is mostly from threading (or rather, thinking whether to multi-thread). Disabling that brings it closer to the array-output case:
td1(v,w) = @tullio d := v[i] * w[i] threads=false
@btime td1($v, $v) # 60 ns
tt(v) = @tullio vv[i] := 2*v[i];
@btime tt($v) # 50 ns
@btime 2 .* $v # 40 ns
tt1(v) = @tullio vv[i] := 2*v[i] threads=false;
@btime tt1($v) # 36 ns
The threaded reduction may also use init a number of times, and ideally it would check that this is compatible with the reduction, as suggested in #24 (comment).
Finally, I think that @tullio d := sqrt <| v[i] * v[i] should be disallowed, as there's no reason to have the macro handle this.
Right now, Tullio.jl uses the GitLab runners for running GPU CI in the JuliaGPU group.
If I understand correctly, JuliaGPU is transitioning from GitLab to Buildkite, with the plan of eventually completely discontinuing the GitLab runners.
It would be great to migrate the GPU CI for Tullio.jl from the GitLab runners to the Buildkite runners.
@maleadt Can you provide instructions for migrating to the new Buildkite JuliaGPU setup?
julia> using Tracker, Zygote, Tullio, LinearAlgebra
julia> function gg(x)
res = 0.0
n = [norm(x)]
for i in 1:100
res += @tullio res_ := sin(exp(x[j]) - n[1])
end
atan(res)
end;
julia> Zygote.gradient(gg, rand(10))
([-0.3105827909434481, -1.435298638568034, -0.22031585984529584, -0.03951256409987808, 0.14018859477924406, -0.49862889658929765, -0.15284449100831132, -0.8243820635907274, -0.06310467526562058, -0.1548104899931962],)
julia> Tracker.gradient(gg, rand(10))
ERROR: MethodError: no method matching Float64(::Tracker.TrackedReal{Float64})
Closest candidates are:
Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
Float64(::T) where T<:Number at boot.jl:716
Float64(::BigInt) at gmp.jl:451
...
Stacktrace:
[1] convert(::Type{Float64}, ::Tracker.TrackedReal{Float64}) at ./number.jl:7
[2] setindex!(::Array{Float64,1}, ::Tracker.TrackedReal{Float64}, ::Int64) at ./array.jl:847
[3] macro expansion at /Users/me/.julia/dev/Tullio/src/symbolic.jl:53 [inlined]
[4] ∇𝒜𝒸𝓉! at /Users/me/.julia/dev/Tullio/src/macro.jl:1207 [inlined]
In #72, we disabled some tests on Julia nightly because they were causing CI to time out.
Once everything is working on Julia nightly, we should reenable ALL tests.
Hey,
the following one:
julia> using Tullio
julia> y = (1:10) .* (1:10)'
10×10 Matrix{Int64}:
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
6 12 18 24 30 36 42 48 54 60
7 14 21 28 35 42 49 56 63 70
8 16 24 32 40 48 56 64 72 80
9 18 27 36 45 54 63 72 81 90
10 20 30 40 50 60 70 80 90 100
julia> f(x) = (@tullio res[i, j] := x[i, j+1] (i in 1:5, j in 1:5))
f (generic function with 1 method)
julia> f(y)
ERROR: "range of index i must agree"
Stacktrace:
[1] ℳ𝒶𝓀ℯ
@ ~/.julia/packages/Tullio/bgqFi/src/macro.jl:968 [inlined]
[2] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#180"{var"#𝒜𝒸𝓉!#177"}, var"#735#∇ℳ𝒶𝓀ℯ#179"{var"#∇𝒜𝒸𝓉!#178"}})(args::Matrix{Int64})
@ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
[3] f(x::Matrix{Int64})
@ Main ./REPL[119]:1
[4] top-level scope
@ REPL[120]:1
julia> f(x) = (@tullio res[i, j] := x[i+0, j+1] (i in 1:5, j in 1:5))
f (generic function with 1 method)
julia> f(y)
5×5 Matrix{Int64}:
2 3 4 5 6
4 6 8 10 12
6 9 12 15 18
8 12 16 20 24
10 15 20 25 30
What is the reason that we need the artificial +0? My goal was to calculate only a subset of the initial x.
Thanks,
Felix
I get the following warning on Tullio v0.2.10 when loading LoopVectorization v0.9.1:
┌ Warning: Error requiring LoopVectorization from Tullio:
│ UndefVarError: SVec not defined
...
I believe it is due to the renaming of an internal struct, SVec in previous versions of VectorizationBase, to Vec in current versions.
I'm not familiar enough with your internals to make a PR to fix this, but naively it looks like a simple find-and-replace.
The speed gap seems large. The benchmark is shown below:
julia> a = randn(4000,4000);
julia> b = randn(4000);
julia> @btime func($a,$b);
175.183 ms (0 allocations: 0 bytes)
julia> @btime func_tullio($a,$b);
266.097 ms (0 allocations: 0 bytes)
julia> @btime func_avx($a,$b);
45.472 ms (0 allocations: 0 bytes)
julia> @btime func_tullio_avx($a,$b);
150.239 ms (0 allocations: 0 bytes)
where the functions are defined as follows (with using Tullio, LoopVectorization, BenchmarkTools loaded):
func_tullio(a,b) = @tullio threads = false avx = false b[i] = sin(a[i,j]);
func_tullio_avx(a,b) = @tullio threads = false b[i] = sin(a[i,j]);
func(a,b) = begin
ndims(a) == 2 || error("~~")
ndims(b) == 1 || error("~~")
size(b,1) == size(a,1) || error("~~")
b .= 0
for j in 1:size(a,2)
for i in 1:size(a,1) # no @inbounds @simd
b[i] += sin(a[i,j])
end
end
end
func_avx(a,b) = begin
ndims(a) == 2 || error("~~")
ndims(b) == 1 || error("~~")
size(b,1) == size(a,1) || error("~~")
b .= 0
for j in 1:size(a,2)
@avx for i in 1:size(a,1) # add @avx
b[i] += sin(a[i,j])
end
end
end
I tried to run the following code:
@tullio dP[x,y,z] -= -Diff[a+2, b+2] * Diff[c+2, d+2] * P[mod(x+a+c), mod(y+b+d), z] * P[mod(x+a),mod(y+b),z] (a in -1:1, b in -1:1, c in -1:1, d in -1:1, z in 1:3, x in 1:n, y in 1:n)
I could decompose it into:
@tullio V[x,y,z] := Diff[c+2,d+2] * P[mod(x+c), mod(y+d), z] (c in -1:1, d in -1:1, x in 1:n, y in 1:n, z in 1:3)
@tullio dP[x,y,z] -= Diff[a+2, b+2] * V[mod(x+a), mod(y+b), z] * P[mod(x+a), mod(y+b), z] (a in -1:1, b in -1:1, x in 1:n, y in 1:n, z in 1:3)
Is there a good reason why mod(x+a+c) is not supported?
julia> using Tullio
julia> let v = [-1, 0, 1, 2]
@tullio out := abs(v[i])
end
4
julia> using LoopVectorization
julia> let v = [-1, 0, 1, 2]
@tullio out := abs(v[i])
end
DiffRules._abs_deriv
4
julia> typeof(ans)
Int64
julia> let v = [-1, 0, 1, 2]
@tullio out := abs(v[i])
end
DiffRules._abs_deriv
4
Is this maybe some println debugging that didn't get removed?
julia> using LoopVectorization
julia> using Tullio
julia> a=randn(10,10);
julia> b=randn(10,10);
julia> c=randn(10,10);
julia> @tullio c[i,j]=a[i,k]b[k,j]
ERROR: Module IR does not contain specified entry function
Stacktrace:
[1] assume
@ ~/.julia/packages/SIMDPirates/EVSvY/src/llvm_utils.jl:308 [inlined]
[2] macro expansion
@ ~/.julia/packages/LoopVectorization/pHMnJ/src/reconstruct_loopset.jl:503 [inlined]
[3] _avx_!(::Val{(0, 0, -1, 4)}, ::Type{Tuple{:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), Symbol(""), :isnothing, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000001, LoopVectorization.compute, 0x00, 0x02), :numericconstant, Symbol("##zero#257"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03), :LoopVectorization, :conditionalload, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000005, LoopVectorization.memload, 0x01, 0x04), :LoopVectorization, :~, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000002, LoopVectorization.compute, 0x00, 0x05), :LoopVectorization, :vifelse, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000020304, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000023, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x07), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000031, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x08), :numericconstant, Symbol("##reductzero#266"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x09), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000231, 0x0000000000000003, 0x0000000000000000, 0x0000000000070809, LoopVectorization.compute, 0x00, 0x09), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000a06, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x000000000000000b, LoopVectorization.memstore, 0x01, 0x0a)}}, ::Type{Tuple{LoopVectorization.ArrayRefStruct{:ℛ, Symbol("##vptr##_ℛ")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:a, Symbol("##vptr##_a")}(0x0000000000000101, 0x0000000000000203, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:b, Symbol("##vptr##_b")}(0x0000000000000101, 0x0000000000000301, 0x0000000000000000)}}, ::Type{Tuple{0, Tuple{}, Tuple{1}, Tuple{}, Tuple{}, Tuple{(3, LoopVectorization.IntOrFloat), (9, LoopVectorization.IntOrFloat)}, Tuple{}}}, ::Type{Tuple{:j, :i, :k}}, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::Nothing, ::typeof(isnothing))
@ LoopVectorization ~/.julia/packages/LoopVectorization/pHMnJ/src/reconstruct_loopset.jl:503
[4] 𝒜𝒸𝓉!
@ ~/.julia/packages/Tullio/oeP9x/src/macro.jl:932 [inlined]
[5] tile_halves(fun!::var"#𝒜𝒸𝓉!#1", ::Type{Matrix{Float64}}, As::Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, Is::Tuple{UnitRange{Int64}, UnitRange{Int64}}, Js::Tuple{UnitRange{Int64}}, keep::Nothing, final::Bool)
@ Tullio ~/.julia/packages/Tullio/oeP9x/src/threads.jl:142
[6] tile_halves
@ ~/.julia/packages/Tullio/oeP9x/src/threads.jl:139 [inlined]
[7] threader(fun!::var"#𝒜𝒸𝓉!#1", ::Type{Matrix{Float64}}, Z::Matrix{Float64}, As::Tuple{Matrix{Float64}, Matrix{Float64}}, I0s::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, J0s::Tuple{Base.OneTo{Int64}}, redfun::Function, block::Int64, keep::Nothing)
@ Tullio ~/.julia/packages/Tullio/oeP9x/src/threads.jl:68
[8] top-level scope
@ ~/.julia/packages/Tullio/oeP9x/src/macro.jl:802
julia> versioninfo()
Julia Version 1.6.0-DEV.1093
Commit 28330a2fef (2020-09-30 14:46 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-10.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
Running this code gives an error; setting CUDA.allowscalar(false) shows that scalar operations are being performed.
On the CPU everything works as expected.
using CUDA, KernelAbstractions, Tullio, Flux
f(x) = @tullio out[m, n, batch] := W[m, i] * x[i, n, batch] + b[m]
a = cu(rand(3, 5, 6)); W = cu(rand(10, 3)); b = cu(rand(10))
Flux.gradient(Flux.Params(W)) do
sum(f(a))
end
AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
[1] partition at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\nditeration.jl:103 [inlined]
[2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#421#156"}, ::Tuple{}, ::Nothing) at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\KernelAbstractions.jl:368
[3] (::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#421#156"})(::CuArray{Float32,2,Nothing}, ::Vararg{Any,N} where N; ndrange::Tuple{}, dependencies::Nothing, workgroupsize::Nothing, progress::Function) at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\backends\cuda.jl:196
[4] ∇𝒜𝒸𝓉! at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:814 [inlined]
[5] (::var"#∇𝒜𝒸𝓉!#154")(::Type{CuArray{Float32,N,Nothing} where N}, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,1,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,1,Nothing}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}) at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:811
[6] ∇threader(::Function, ::Type{CuArray{Float32,N,Nothing} where N}, ::Tuple{CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing}}, ::Tuple{}, ::NTuple{4,Base.OneTo{Int64}}; block::Int64) at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\eval.jl:74
[7] #750#∇ℳ𝒶𝓀ℯ at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:908 [inlined]
[8] #217 at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\grad\zygote.jl:8 [inlined]
[9] (::Tullio.var"#632#back#219"{Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#159"{var"#𝒜𝒸𝓉!#150"},var"#750#∇ℳ𝒶𝓀ℯ#158"{var"#∇𝒜𝒸𝓉!#154"}},Tuple{CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing}},CuArray{Float32,3,Nothing}}})(::CuArray{Float32,3,Nothing}) at C:\Users\jules\.julia\packages\ZygoteRules\6nssF\src\adjoint.jl:49
[10] Affine at .\In[23]:11 [inlined]
[11] (::typeof(∂(λ)))(::CuArray{Float32,3,Nothing}) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface2.jl:0
[12] #237 at .\In[88]:4 [inlined]
[13] (::typeof(∂(#237)))(::Float32) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface2.jl:0
[14] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(#237))})(::Float32) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface.jl:177
[15] gradient(::Function, ::Zygote.Params) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface.jl:54
[16] top-level scope at In[88]:3
Upon removing the b[m] term, the code doesn't error anymore, but scalar iteration seems to take place.
@JuliaRegistrator register
Executing using ReverseDiff, Tullio in the REPL gives the warning:
julia> using ReverseDiff, Tullio
┌ Warning: Error requiring ReverseDiff from Tullio:
│ LoadError: LoadError: MethodError: no method matching gensym(::Expr)
│ Closest candidates are:
│ gensym(!Matched::Symbol) at expr.jl:15
│ gensym(!Matched::String) at expr.jl:12
│ gensym() at expr.jl:10
│ ...
│ Stacktrace:
│ [1] @grad(::LineNumberNode, ::Module, ::Any) at /home/ritesh/.julia/packages/ReverseDiff/jFRo1/src/macros.jl:182
│ [2] include(::Function, ::Module, ::String) at ./Base.jl:380
│ [3] include at ./Base.jl:368 [inlined]
│ [4] include(::String) at /home/ritesh/.julia/packages/Tullio/oeP9x/src/Tullio.jl:1
│ [5] top-level scope at none:1
│ [6] eval at ./boot.jl:331 [inlined]
│ [7] eval at /home/ritesh/.julia/packages/Tullio/oeP9x/src/Tullio.jl:1 [inlined]
│ [8] (::Tullio.var"#156#160")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:85
│ [9] err(::Any, ::Module, ::String) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:42
│ [10] (::Tullio.var"#155#159")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:84
│ [11] withpath(::Any, ::String) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:32
│ [12] (::Tullio.var"#154#158")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:83
│ [13] listenpkg(::Any, ::Base.PkgId) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:15
│ [14] macro expansion at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:81 [inlined]
│ [15] (::Tullio.var"#153#157")() at /home/ritesh/.julia/packages/Requires/jmibq/src/init.jl:11
│ [16] __init__() at /home/ritesh/.julia/packages/Requires/jmibq/src/init.jl:18
│ [17] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
│ [18] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
│ [19] _require(::Base.PkgId) at ./loading.jl:1007
│ [20] require(::Base.PkgId) at ./loading.jl:928
│ [21] require(::Module, ::Symbol) at ./loading.jl:923
│ [22] eval(::Module, ::Any) at ./boot.jl:331
│ [23] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:134
│ [24] repl_backend_loop(::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:195
│ [25] start_repl_backend(::REPL.REPLBackend, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:180
│ [26] run_repl(::REPL.AbstractREPL, ::Any; backend_on_current_task::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:292
│ [27] run_repl(::REPL.AbstractREPL, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
│ [28] (::Base.var"#806#808"{Bool,Bool,Bool,Bool})(::Module) at ./client.jl:399
│ [29] #invokelatest#1 at ./essentials.jl:710 [inlined]
│ [30] invokelatest at ./essentials.jl:709 [inlined]
│ [31] run_main_repl(::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./client.jl:383
│ [32] exec_options(::Base.JLOptions) at ./client.jl:313
│ [33] _start() at ./client.jl:506
│ in expression starting at /home/ritesh/.julia/packages/Tullio/oeP9x/src/grad/reverse.jl:8
│ in expression starting at /home/ritesh/.julia/packages/Tullio/oeP9x/src/grad/reverse.jl:8
└ @ Requires ~/.julia/packages/Requires/jmibq/src/require.jl:44
My Config:
ReverseDiff v1.4.3
Tullio v0.2.7
Looks like the warning is related to ReverseDiff.@grad; Tracker.@grad doesn't give any such warning.
I'm trying to build a sysimage for my project (which uses ReverseDiff and Tullio), and I guess this warning wouldn't allow Tullio to be added to the precompile trace (I'm not sure if it would be added).
Is there a way to get rid of this warning?
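For reference, the workaround quoted in an earlier issue above (type piracy, so use with caution) makes the Requires block load cleanly:

Base.gensym(ex::Expr) = gensym(string(ex))  # piracy: gives ReverseDiff.@grad the gensym method it expects
using ReverseDiff, Tullio                   # define it before loading, so the Requires block succeeds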
Since CuArrays.jl is being replaced with CUDA.jl, it'd be good if Tullio could move its support to CUDA.
This step will need to be done by @mcabbott
Here is the link to the GitHub App: https://github.com/settings/connections/applications/Iv1.112bf4be3e5ecdeb
Some bugs, I think, with range inference and ÷, exposed by trying to solve this: https://discourse.julialang.org/t/efficient-reshaping-of-2d-array-by-pairs-of-columns/52953
julia> using Tullio, OffsetArrays, TensorCast
julia> A = [ 1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18 ]; # input
julia> B = [ 1 2 7 8 13 14
3 4 9 10 15 16
5 6 11 12 17 18 ]; # desired output
julia> @cast B2[a,(c,b)] := A[(a,b),c] a:3 # easy!
3×6 reshape(PermutedDimsArray(::Array{Int64, 3}, (1, 3, 2)), 3, 6) with eltype Int64:
1 2 7 8 13 14
3 4 9 10 15 16
5 6 11 12 17 18
julia> @tullio B3[r, c] := A[3r+c÷3, 3r+c÷3+1] r in 1:3 # wrong formula!
3×7 OffsetArray(::Matrix{Int64}, 1:3, -6:0) with eltype Int64 with indices 1:3×-6:0:
2 5512452232 5512452232 5512452232 5372075888 5372075888 5372075888
0 11 11 11 4618052352 4618052352 4618052352
0 11 11 11 4618334208 4618334208 4618334208
julia> @tullio B4[r, c] := A[r + 3((c-1)÷2), mod(c)] r in 1:3
3×5 OffsetArray(::Matrix{Int64}, 1:3, 1:5) with eltype Int64 with indices 1:3×1:5:
1 2 7 8 13
3 4 9 10 15
5 6 11 12 17
MWE:
julia> using Tullio, CUDA, KernelAbstractions, Zygote, FiniteDifferences
julia> g(μ) = ifelse(μ == 4, 1, -1)
g (generic function with 1 method)
julia> push!(Tullio.BASE_NOGRAD, :g);
julia> f2(w, k) = @tullio min f[m] := w[j, m] * g(μ) * (k[j, μ] - k[m, μ])^2
f2 (generic function with 1 method)
julia> using Random; Random.seed!(1);
julia> k = rand(Float32, 2, 4)
2×4 Array{Float32,2}:
0.547994 0.567737 0.27934 0.8135
0.819285 0.557336 0.777828 0.00389743
julia> w = rand(Float32, 2, 2)
2×2 Array{Float32,2}:
0.699683 0.903188
0.568097 0.416541
julia> Δ = rand(Float32, 2)
2-element Array{Float32,1}:
0.51531243
0.13594759
julia> Zygote.pullback(f2, w, k)[2](Δ)[2]
2×4 Array{Float32,2}:
0.0 0.0 0.414277 0.0
0.0 0.0 -0.414277 0.0
julia> Zygote.pullback(f2, cu(w), cu(k))[2](cu(Δ))[2]
2×4 CuArray{Float32,2}:
0.0 0.0 0.414277 0.0
0.0 0.0 -0.122415 0.0
julia> j′vp(central_fdm(5, 1), f2, Δ, w, k)[2]
2×4 Array{Float32,2}:
5.59755f-9 5.59755f-9 0.414277 5.59755f-9
5.59755f-9 5.59755f-9 -0.414277 5.59755f-9
Seems fine if I use + instead of min as the reducer.
Hey,
I'm on a 4-core machine. I started Julia with 4 threads and observe (relatively) poor performance with the default Tullio settings:
using Zygote, Tullio
x = randn((2000, 2000));
f(x) = (@tullio y = (x[i+1, j] - x[i-1, j])^2 + (x[i, j+1] - x[i, j-1])^2)
ft(x) = (@tullio threads=false y = (x[i+1, j] - x[i-1, j])^2 + (x[i, j+1] - x[i, j-1])^2)
@btime Zygote.gradient(f, x)[1];
@btime Zygote.gradient(ft, x)[1]
producing:
100.746 ms (173 allocations: 30.53 MiB)
11.788 ms (26 allocations: 30.52 MiB)
What's actually the reason for that? I observed that none of the 4 Julia threads achieved 100% CPU usage: rather 50-80% in the main thread, and around 20% in the other three.
Thanks,
Felix
Observe this initially baffling bit of code:
julia> using Tullio
julia> let s = [1,2,3,4]
@tullio x[i] := s[i]
end
4-element Array{Int64,1}:
1
2
3
4
julia> let s = [1,2,3,4], ndims=maximum(s)
@tullio x[i] := s[i]
end
ERROR: MethodError: objects of type Int64 are not callable
Stacktrace:
[1] top-level scope at /home/mason/.julia/packages/Tullio/Q2RDS/src/macro.jl:970
[2] top-level scope at REPL[3]:2
I think somewhere in the macro expansion you need to interpolate the Tullio.ndims function; otherwise a user having a variable called ndims causes trouble.
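A minimal illustration of the hygiene problem outside Tullio (the macro @nd is made up here): interpolating the function object, rather than inserting the bare symbol ndims, makes local shadowing harmless.

macro nd(A)
    # :( ndims($(esc(A))) ) would break if the caller shadows `ndims`;
    # interpolating the function object itself is immune to that:
    :( $(ndims)($(esc(A))) )
end

let s = [1,2,3,4], ndims = maximum(s)
    @nd(s)   # returns 1, despite the local variable `ndims`
end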
The following minimal example complains about the indices of a being out of range, even though 1 <= a[1, j] <= 3. I wonder if bounds checking could be modified to handle scenarios like this?
using Tullio
a = [1 2 3
4 5 6]
b = [1, 2, 3]
@tullio x := b[a[1, j]]
# "extrema of index a[1, j] must fit within b"
From here: https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694/5
Tullio was detecting that i is unsafe to multi-thread, but wasn't passing this to KernelAbstractions:
using CUDA, Tullio, KernelAbstractions
let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
@tullio A[I[i]] += v[i]
end
4-element CuArray{Float32,1}:
 1.0
 0.0
 2.0
 0.0
After my attempt at a fix in 4d30a1f (which runs on the CPU):
julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
@tullio A[I[i]] += v[i]
end
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
[1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
[2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385
Besides not running, this is also too cautious, as it would apply to @tullio A[I[i]] = v[i], where it would be more justifiable to over-write -- there is no guaranteed order, but accumulation ought to be right.
@dfdx points out that ScatterNNlib.jl has an approach using atomic_add which may be worth trying to copy:
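(Not ScatterNNlib's actual code, but a hedged sketch of the atomic idea in CUDA.jl terms: every thread accumulates via an atomic add, so duplicate indices in I are safe regardless of scheduling.)

using CUDA

function scatter_add_kernel!(A, I, v)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(I)
        CUDA.@atomic A[I[i]] += v[i]   # atomic, so repeated I[i] values don't race
    end
    return nothing
end

A = CUDA.zeros(4); I = cu([1, 3, 1]); v = cu(Float32[1, 2, 3])
@cuda threads=length(I) scatter_add_kernel!(A, I, v)   # A == [4, 0, 2, 0]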
I hope I'm not doing it wrong but I've been playing around with Tullio.jl (super neat package!) and I can't seem to get this simple expression to work on the GPU:
julia> using CUDA, KernelAbstractions, Tullio
julia> A = rand(5, 5) |> CuArray;
julia> B = rand(5, 5) |> CuArray;
julia> @tullio C = A[i, j] / 2 + B[i, j] / 3
produces this stacktrace
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float64
Closest candidates are:
convert(::Type{T}, ::T) where T<:Number at number.jl:6
convert(::Type{T}, ::Number) where T<:Number at number.jl:7
convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
...
Stacktrace:
[1] thread_scalar at /home/alir/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
[2] (::var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"})(::CuArray{Float64,2}, ::CuArray{Float64,2}) at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:805
[3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"},var"#38#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#5"}})(::CuArray{Float64,2}, ::Vararg{CuArray{Float64,2},N} where N) at /home/alir/.julia/packages/Tullio/bgqFi/src/eval.jl:20
[4] top-level scope at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:975
with
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
and
Status `~/.julia/environments/v1.5/Project.toml`
[052768ef] CUDA v2.4.1
[63c18a36] KernelAbstractions v0.5.3
[bc48ee85] Tullio v0.2.12
I didn't see a warning about this anywhere, so I thought to open an issue. When I have multiple variables in the inner loop, it takes much longer to finish; I'm assuming this is because it isn't contracting indices intelligently (though I could be wrong). Here is a test case where I multiply 3 matrices; maybe there is a better way to write it within the Tullio framework:
MWE:
using BenchmarkTools
using Tullio
function mul3(A, B, C)
A * (B * C)
end
function mul3_tullio(A, B, C)
@tullio Y[i, j] := A[i, k] * B[k, l] * C[l, j]
end
A, B, C = rand(100, 100), rand(100, 100), rand(100, 100)
@btime mul3($A, $B, $C)
# 89.030 μs (4 allocations: 156.41 KiB)
@btime mul3_tullio($A, $B, $C)
# 111.093 ms (3 allocations: 78.25 KiB)
mul3(A, B, C) ≈ mul3_tullio(A, B, C)
# true
This is on Julia 1.4.1, Tullio version 0.2.7
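A sketch of a pairwise rewrite (the name mul3_tullio_pairwise is made up here): contracting one index at a time restores O(n³) work, instead of the O(n⁴) quadruple loop that the single expression implies.

function mul3_tullio_pairwise(A, B, C)
    @tullio T[k, j] := B[k, l] * C[l, j]   # first contract l
    @tullio Y[i, j] := A[i, k] * T[k, j]   # then contract k
end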
The following leads to:
a = fill(100,4,4)
@tullio avx = false (&) b[i] := sin(a[i,j])+2 >= 0
4-element Array{Bool,1}:
0 # should be 1
0 # should be 1
0 # should be 1
0 # should be 1
Also, sometimes the user wants to keep the initial value for a given reduction function, as with += and *=.
Maybe a syntax like the one below would be fine:
Minimal example:
using Tullio, LoopVectorization
@tullio out[i] := 0.1*data[i+o] o in 0:9
Part of the expansion is:
local 𝒶𝓍o = 0:9
local 𝒶𝓍i = (Tullio.subranges)((eachindex)(data), 𝒶𝓍o)
local 𝓇𝒽𝓈(data, i, o) = begin
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:788 =#
identity(0.1 * data[i + o])
end
begin
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:797 =#
local 𝒯1 = Core.Compiler.return_type(𝓇𝒽𝓈, (typeof)((data, (first)(𝒶𝓍i), (first)(𝒶𝓍o))))
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:798 =#
local 𝒯2 = if Base.isconcretetype(𝒯1)
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:799 =#
𝒯1
else
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:801 =#
nothing
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:802 =#
(typeof)(𝓇𝒽𝓈(data, (first)(𝒶𝓍i), (first)(𝒶𝓍o)))
end
end
local 𝒯3 = 𝒯2
local 𝒯 = 𝒯3
(first)(𝒶𝓍i) == 1 || (throw)("to allow indices not starting at 1, OffsetArrays must be visible in the caller's module. Otherwise write `A[i+_] := ...` to remove offset")
local out = similar(parent(data), 𝒯, tuple(Base.OneTo(𝒶𝓍i)))
begin
(Tullio.threader)(𝒜𝒸𝓉!, (Tullio.storage_type)(out, data), out, tuple(data), tuple(𝒶𝓍i), tuple(𝒶𝓍o), +, 262144, nothing)
#= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:956 =#
out
end
Making the ranges statically sized, e.g.
using ArrayInterface
StaticInt(0):StaticInt(9)
would let LoopVectorization specialize on the 0:9.
using Tullio, LoopVectorization, BenchmarkTools
data = randn(1000);
out1 = similar(data, length(data) - 9);
out2 = similar(data, length(data) - 9);
function rollingmean10_lv!(out, data)
@avx for i ∈ eachindex(out)
outᵢ = zero(eltype(out))
for j ∈ 0:9
outᵢ += data[i+j]
end
out[i] = 0.1 * outᵢ
end
out
end
rollingmean10_tullio!(out, data) = @tullio out[i] = 0.1*data[i+o] o in 0:9 threads=false
rollingmean10_lv!(out1, data) ≈ rollingmean10_tullio!(out2, data)
@benchmark rollingmean10_lv!($out1, $data)
@benchmark rollingmean10_tullio!($out2, $data)
With a 10980xe, yields:
julia> rollingmean10_lv!(out1, data) ≈ rollingmean10_tullio!(out2, data)
true
julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 280.072 ns (0.00% GC)
median time: 285.438 ns (0.00% GC)
mean time: 285.265 ns (0.00% GC)
maximum time: 465.455 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 292
julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 297.058 ns (0.00% GC)
median time: 297.394 ns (0.00% GC)
mean time: 297.760 ns (0.00% GC)
maximum time: 448.108 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 259
The performance difference is a little larger on my laptop (tiger lake):
julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 280.783 ns (0.00% GC)
median time: 295.303 ns (0.00% GC)
mean time: 301.026 ns (0.00% GC)
maximum time: 778.703 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 290
julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 313.283 ns (0.00% GC)
median time: 344.108 ns (0.00% GC)
mean time: 345.672 ns (0.00% GC)
maximum time: 522.862 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 240
As I found out by looking at the asm, the difference is that Tullio uses fma instructions, which are a little slower on Tiger Lake.
With the static 10, it'll completely unroll the loop:
L80:
vmovupd zmm1, zmmword ptr [rdi + 8*rcx - 136]
vmovupd zmm2, zmmword ptr [rdi + 8*rcx - 64]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 128]
vaddpd zmm2, zmm2, zmmword ptr [rdi + 8*rcx - 72]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 120]
vaddpd zmm3, zmm2, zmmword ptr [rdi + 8*rcx - 56]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 112]
vaddpd zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 48]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 104]
vaddpd zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 40]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 96]
vaddpd zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 32]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 88]
vaddpd zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 24]
vaddpd zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 80]
vaddpd zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 16]
vaddpd zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 8]
vaddpd zmm1, zmm1, zmm2
vaddpd zmm2, zmm3, zmmword ptr [rdi + 8*rcx]
vmulpd zmm1, zmm1, zmm0
vmovupd zmmword ptr [rdx], zmm1
vmulpd zmm1, zmm2, zmm0
vmovupd zmmword ptr [rdx + 64], zmm1
add rcx, 16
sub rdx, -128
cmp rcx, r11
jl L80
Versus without it:
L176:
vfmadd231pd zmm1, zmm0, zmmword ptr [rbx + 8*rdi - 192] # zmm1 = (zmm0 * mem) + zmm1
vfmadd231pd zmm2, zmm0, zmmword ptr [rbx + 8*rdi - 128] # zmm2 = (zmm0 * mem) + zmm2
vfmadd231pd zmm3, zmm0, zmmword ptr [rbx + 8*rdi - 64] # zmm3 = (zmm0 * mem) + zmm3
vfmadd231pd zmm4, zmm0, zmmword ptr [rbx + 8*rdi] # zmm4 = (zmm0 * mem) + zmm4
inc rdi
cmp rsi, rdi
jne L176
This version unrolls the i loop by 4 and doesn't unroll the o loop at all.
It also uses fmas, instead of adds with a multiply at the end. Which is faster is architecture-dependent.
On some computers, adds and fmas are equally fast, meaning you wouldn't want the multiply at the end. Bizarrely, on Haswell, fma is actually faster than add, but more commonly fma has lower throughput.
That's an optimization LoopVectorization (or LLVM, given associativity) should thus be making. Just observing.
In general, LoopVectorization should do better with more static size info. It can also sometimes help compile times, e.g. it can make a big difference with 2d convolutions.
Dear Michael,
I'm using @einsum from Tullio and I get unexpected behavior when debugging a Julia program in VSCode (https://www.julia-vscode.org/). The debugger always stops at the same line in macro.jl.
Say, I have a program like
using Tullio: @einsum
v = [1; 2; 3];
w = [4; 5; 6]; # <== 1st breakpoint
@einsum a[i,j] := v[i] * w[j];
println("a=", a);
b = 4 * a; # <== 2nd breakpoint
println("b=", b);
After I start the debugger with F5, VSCode shows correct behavior at the 1st breakpoint, see the attached file 2_debugger_at_1st_BP.PNG. When I continue with F5, the debugger stops in line 826 of macro.jl, see the attached file 3_debugger_stops_in_macro_jl.PNG. Another continue with F5 leads me to the 2nd breakpoint, see file 4_debugger_at_2n_BP.PNG.
In case I have several @einsum statements, the debugger stops several times, always in macro.jl at line 826. That can be disturbing if I want to debug code after many @einsum statements.
This behavior may also be related to VSCode, but maybe you have a hint?
Thank you
Ralf
[Screenshot: after starting the debugger with F5, at the 1st breakpoint (as expected)]
[Screenshot: continue with F5, and the debugger stops in macro.jl (unexpected behavior)]
[Screenshot: another continue with F5, and the debugger stops at the 2nd breakpoint (as expected)]
I ran into an InexactError when using Tullio to perform a reduction and then abs2 over complex vectors.
The MWE might explain the issue most easily:
using FFTW, Tullio
x = rand(120) .- .5
y = rand(120) .- .5
u = rand(500) .- .5
v = rand(500) .- .5
a = rand(120)
k = 2π/.01
function tfunc(u,v,a,k,x,y)
@tullio vals[i] := exp(im * (k * (x[j]*u[i] + y[j]*v[i]) + a[j])) |> abs2
end
tfunc(u,v,a,k,x,y)
Which results in
ERROR: LoadError: InexactError: Float64(-13.863000778992978 + 2.707549484306247im)
In my example the issue could easily be resolved by performing the abs2 directly on the vals array returned by the @tullio macro.
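That is, something like this sketch (tfunc2 is a made-up name):

function tfunc2(u, v, a, k, x, y)
    @tullio vals[i] := exp(im * (k * (x[j]*u[i] + y[j]*v[i]) + a[j]))  # keep the reduction complex
    return abs2.(vals)                                                 # apply abs2 after the macro
end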
I briefly discussed this with @mcabbott on slack and this appears to only happen when threading is enabled in @tullio.
This thread provides an example: https://discourse.julialang.org/t/sequential-multiple-hcat-s-of-a-range-of-rows/49852
julia> A = reshape(0:9, :,2) .+ 0
5×2 Matrix{Int64}:
0 5
1 6
2 7
3 8
4 9
julia> multihcat(A, steps) = reduce(hcat, A[i+1:end-(steps-i), :] for i in 0:steps);
julia> multihcat(A, 2) # desired output
3×6 Matrix{Int64}:
0 5 1 6 2 7
1 6 2 7 3 8
2 7 3 8 4 9
julia> @tullio B[r,c] := begin
i=r+(c-1)÷2
A[i, mod(c)]
end (r in 1:3, c in 1:6)
3×6 Matrix{Int64}:
0 5 1 6 2 7
1 6 2 7 3 8
2 7 3 8 4 9
julia> @tullio B[r,c] := A[r+(c-1)÷2, mod(c)] r in 1:3
ERROR: LoadError: "don't know how to handle (c - 1) ÷ 2, sorry"
Hey,
the code below produces different results when using Tullio and plain for loops. Once I use Float64, the results are very close to each other. Is this due to @fastmath? If yes, can we disable it?
Thanks,
Felix
using TestImages
using Tullio
img = convert(Array{Float32}, testimage("fabio_gray_512"))
x = reshape(img, 512, 512)
@tullio res_t = (x[i + 1, j] + x[i - 1, j])^2 (i in 2:size(x)[1] - 1)
print(res_t, "\n")
@tullio res_t = (x[i + 1, j] + x[i - 1, j])^2
print(res_t, "\n")
res_f = 0
for i = 2:size(x)[1] -1
for j = 1:size(x)[2]
res_f += (x[i+1, j] + x[i - 1, j])^2
end
end
print(res_f)
output:
219891.22
219891.22
219914.31
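For what it's worth, fastmath appears among @tullio's keyword options quoted earlier in these issues (default true), so a hedged suggestion is to disable it per call and check whether the result then matches the plain loop:

@tullio fastmath=false res_t = (x[i + 1, j] + x[i - 1, j])^2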
The following is valid Tullio code:
@tullio C[i ÷ 2, k] += A[i, j] * B[j, k]
but a naive implementation could incur a race condition (as different values of i end up writing to the same slot in C). Following a conversation on Slack, the default should be "safe but slow", i.e. avoid multithreading on unsafe indices.
Code that used to work on 1.5.3, and which still works with e.g. sin:
using Tullio, LoopVectorization, SpecialFunctions  # imports implied by the stacktrace below; Np defined elsewhere
k = LinRange(.16,37.8,Np) |> collect
rv = LinRange(0,15,1000) |> collect
gvec = zero(rv)
@tullio gvec[i] = rv[i]*sin(k[j]*rv[i])
and used to work with besselj0, but now fails:
@tullio gvec[i] = rv[i]*besselj0(k[j]*rv[i])
ERROR: LoadError: StackOverflowError:
Stacktrace:
[1] besselj0(x::VectorizationBase.Vec{4, Float64})
@ SpecialFunctions ~/.julia/packages/SpecialFunctions/24S26/src/bessel.jl:204
in expression starting at /Users/abradley/Dropbox/Research/Projects/GPESpectra/scripts/2d_trap_correlation_spectra.jl:760