mcabbott / tullio.jl Goto Github PK

View Code? Open in Web Editor NEW

585.0 10.0 26.0 989 KB

⅀

License: MIT License

Julia 100.00%

tensor notation julia einsum automatic-differentiation

tullio.jl's People

Contributors

Stargazers

Watchers

tullio.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

Massive performance degredation between v0.2.8 and v0.2.9 with LoopVectorization.jl

I think something is not being communicated correctly to LoopVectorization.jl on version 0.2.9:

julia> using Tullio, LoopVectorization
[ Info: Precompiling Tullio [bc48ee85-29a4-5162-ae0b-a64e1601d4bc]

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  51.804 ns (0 allocations: 0 bytes)
  123.709 ns (0 allocations: 0 bytes)
N = 10
  210.035 ns (0 allocations: 0 bytes)
  371.549 ns (0 allocations: 0 bytes)
N = 50
  10.550 μs (0 allocations: 0 bytes)
  13.939 μs (0 allocations: 0 bytes)
N = 100
  25.340 μs (49 allocations: 3.19 KiB)
  39.860 μs (0 allocations: 0 bytes)

(@v1.5) pkg> st Tullio LoopVectorization
Status `~/.julia/environments/v1.5/Project.toml`
  [bdcacae8] LoopVectorization v0.8.26
  [bc48ee85] Tullio v0.2.8

Now, restarting julia,

(@v1.5) pkg> add Tullio@v0.2.9
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
Updating `~/.julia/environments/v1.5/Project.toml`
  [bc48ee85] ↑ Tullio v0.2.8 ⇒ v0.2.9
Updating `~/.julia/environments/v1.5/Manifest.toml`
  [bc48ee85] ↑ Tullio v0.2.8 ⇒ v0.2.9

julia> using Tullio, LoopVectorization
[ Info: Precompiling Tullio [bc48ee85-29a4-5162-ae0b-a64e1601d4bc]

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  51.125 ns (0 allocations: 0 bytes)
  129.749 ns (0 allocations: 0 bytes)
N = 10
  847.338 ns (0 allocations: 0 bytes)
  372.568 ns (0 allocations: 0 bytes)
N = 50
  111.719 μs (0 allocations: 0 bytes)
  13.920 μs (0 allocations: 0 bytes)
N = 100
  261.787 μs (49 allocations: 3.19 KiB)
  38.220 μs (0 allocations: 0 bytes)

(@v1.5) pkg> st Tullio LoopVectorization
Status `~/.julia/environments/v1.5/Project.toml`
  [bdcacae8] LoopVectorization v0.8.26
  [bc48ee85] Tullio v0.2.9

Banded matrix multiplication does not LoopVectorize

I have the slightly unusual use-case dense matrix = dense matrix * banded matrix (actually more complicated than this, but this shows the issue). With the help of @mcabbott I managed to implement the following:

using LinearAlgebra
using BandedMatrices
using Tullio
using LoopVectorization

function clear_off_bands!(A::BandedMatrix{T}) where T
    m,n = size(A.data)
    l,u = bandwidths(A)

    for j = 1:u
        A.data[1:u-j+1,j] .= zero(T)
    end
    mpn = m+n
    for j = n-l+1:n
        A.data[mpn-l-j+1:end,j] .= zero(T)
    end
    A
end

function my_mul!(w, v, A::BandedMatrix)
    # This implementation assumes that the off-band values, i.e. the
    # upper-left and lower-right corners, are zero, cf. clear_off_bands!
    l,u = bandwidths(A)
    m,n = size(A)
    @tullio verbose=2 w[i,j] = begin
        k = min(max(h + j - u - 1, 1), m)
        v[i,k] * A.data[h,j]
    end
    w
end

n = 4000
v = Matrix((1.0+0im)*I, n, n) # Complex identity matrix
w = similar(v)
w2 = similar(v)
w3 = similar(v)

A = clear_off_bands!(brand(ComplexF64, n, n, 5, 5))
At = BandedMatrix(transpose(A))

@time mul!(w, v, A) # This is exceedingly slow, since this uses generic dense multiplication
@time mul!(transpose(w2), At, transpose(v)) # This will BLAS
@time my_mul!(w3, v, A)

@show norm(w-w2), norm(w-w3), norm(A-w3)

which gives a warning

┌ Warning: LoopVectorization failed
│   err =
│    LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element
│    in expression starting at /Users/jagot/.julia/packages/Tullio/bgqFi/src/macro.jl:1092
└ @ Tullio ~/.julia/packages/Tullio/bgqFi/src/macro.jl:1118

but other than that seems to work well. On my laptop I get the following timings:

Naïve dense: 76.651751 s
BLAS with transposes: 1.152771 seconds
Tullio w/o AVX: 0.529101 seconds 🎉

cc @chriselrod

Potential for race conditions with shifted indices

This doesn’t look thread safe in the gradient, as one thread’s i+1 may be another’s i:

julia> using Tullio

julia> reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])) verbose=2
...
┌ Info: symbolic gradients
│   inbody =
│    3-element Array{Any,1}:
│     :(𝛥x[i, j] = 𝛥x[i, j] + 𝛥ℛ[1] * conj((((x[i, j] - x[i + 1, j]) + (x[i, j] - x[i + 1, j])) + ((x[i, j] - x[i, j + 1]) + (x[i, j] - x[i, j + 1]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
│     :(𝛥x[i + 1, j] = 𝛥x[i + 1, j] + 𝛥ℛ[1] * conj(-(((x[i, j] - x[i + 1, j]) + (x[i, j] - x[i + 1, j]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
└     :(𝛥x[i, j + 1] = 𝛥x[i, j + 1] + 𝛥ℛ[1] * conj(-(((x[i, j] - x[i, j + 1]) + (x[i, j] - x[i, j + 1]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
┌ store.
│   arrays = [:x]
…
│   sharedind = [:i, :j]
│   shiftedind = [:i, :j]

Example from https://discourse.julialang.org/t/fast-gpu-kernels-differentiable-with-zygote/56756 (mostly about scalar reductions on GPU)

GPU performance

I wasn't sure whether to open this issue here or on KernelAbstractions, so I figured I'd open it here first for discussion, and then we can move it to KernelAbstractions later if need be.

Right now, the GPU performance doesn't seem to scale very well on large matrices.

Here's an example plot for Matrix{Float32}=Matrix{Float16}×Matrix{Float16}, generated on a Titan V.

Here's the code used to generate it:

using BLASBenchmarksGPU
import CUDA
bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
import PyPlot
BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
CUDA.versioninfo()

And here's the output, including all of the TFLOPS values:

julia> using BLASBenchmarksGPU

julia> import CUDA

julia> bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
Progress: 100%|███████████████████████████████████████████████████| Time: 0:08:24
  Size:         16384
  CUBLAS:       62.9 TFLOPS
  GemmKernels:  66.19 TFLOPS
  Tullio:       0.29 TFLOPS
Bennchmark Result of Matrix{Float32}=Matrix{Float16}×Matrix{Float16}
24×4 DataFrame
 Row │ Size   Library      TFLOPS      Time
     │ Int64  Symbol       Float64     Float64
─────┼─────────────────────────────────────────────────
   1 │   128  CUBLAS        0.148099    28321.0
   2 │   128  GemmKernels   0.0332214  126253.0
   3 │   128  Tullio        0.0540789   77559.0
   4 │   256  CUBLAS        0.642928    52190.0
   5 │   256  GemmKernels   0.268756   124851.0
   6 │   256  Tullio        0.310169   108181.0
   7 │   512  CUBLAS        4.36686     61471.0
   8 │   512  GemmKernels   2.02769    132385.0
   9 │   512  Tullio        0.677953   395950.0
  10 │  1024  CUBLAS       25.168       85326.0
  11 │  1024  GemmKernels  14.2662     150529.0
  12 │  1024  Tullio        0.850125        2.52608e6
  13 │  2048  CUBLAS       59.5005     288735.0
  14 │  2048  GemmKernels  36.4345     471527.0
  15 │  2048  Tullio        0.76592         2.24304e7
  16 │  4096  CUBLAS       84.8186          1.62039e6
  17 │  4096  GemmKernels  57.559           2.38779e6
  18 │  4096  Tullio        0.393097        3.49631e8
  19 │  8192  CUBLAS       90.1617          1.21949e7
  20 │  8192  GemmKernels  59.9127          1.83519e7
  21 │  8192  Tullio        0.324503        3.38829e9
  22 │ 16384  CUBLAS       62.9003          1.39842e8
  23 │ 16384  GemmKernels  66.1909          1.3289e8
  24 │ 16384  Tullio        0.294184        2.98999e10


julia> import PyPlot

julia> BLASBenchmarksGPU.plotbench(bench_result, "plot.png")

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-beta1
- LLVM: 11.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_VERBOSE: true

1 device:
  0: TITAN V (sm_70, 658.500 MiB / 11.784 GiB available)

Gradient with Zygote and Identity

Hey,

I discovered a strange behaviour:

using Zygote
using Tullio

x = ones((5,5))

function f_working(arr)
    @tullio res = sqrt(arr[i, j])
    return res
end

function f_not_working(arr)
    @tullio res = identity(arr[i, j])
    return res
end

Calling the second function with Zygote:

julia> gradient(f_not_working, ones((3,3)))
ERROR: no gradient definition here!
Stacktrace:
 [1] (::Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#9"{var"#𝒜𝒸𝓉!#8"},Nothing},Tuple{Array{Float64,2}},Array{Float64,1}})(::FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}) at <path>/.julia/packages/Tullio/bnAxO/src/grad/zygote.jl:7
 [2] (::Tullio.var"#195#back#219"{Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#9"{var"#𝒜𝒸𝓉!#8"},Nothing},Tuple{Array{Float64,2}},Array{Float64,1}}})(::FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}) at <path>/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
 [3] f_not_working at ./REPL[5]:3 [inlined]
 [4] (::typeof(∂(f_not_working)))(::Float64) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface2.jl:0
 [5] (::Zygote.var"#37#38"{typeof(∂(f_not_working))})(::Float64) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface.jl:45
 [6] gradient(::Function, ::Array{Float64,2}) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface.jl:54
 [7] top-level scope at REPL[6]:1

Pure Zygote is able to make the derivative of identity, so it must be somehow due to the interplay of Tullio and Zygote.

Thanks,

Felix

Edit:
I think it also occurs in a more general case:

foo = sqrt

function f_not_working(arr)
    @tullio res = foo(arr[i, j])
    return res
end

Force avx to "avx = false" when use LHS scattering, unless the index is checked to be unique.

using Tullio
using LoopVectorization
J = [1,3,1,1]
A = [i^2 for i in 1:10]

Example 1(wrong):

@tullio B[J[i],k] := A[k]
@tullio B[J[i],k] += A[k]
# 3×10 Array{Int64,2}:
#  1  4  9  16  25  36  49  64  81  100
#  0  0  0   0   0   0   0   0   0    0
#  1  4  9  16  25  36  49  64  81  100

Example 2(right):

@tullio B[J[i],k] := A[k]
@tullio avx=false B[J[i],k] += A[k]
# 3×10 Array{Int64,2}:
#  4  16  36  64  100  144  196  256  324  400
#  0   0   0   0    0    0    0    0    0    0
#  2   8  18  32   50   72   98  128  162  200

As @avx can't tell the user all the misusing, maybe the default set should be avx = false.

Use with ReverseDiff

I think this is still blocked by JuliaDiff/ReverseDiff.jl#135, here's the error:

julia> Base.gensym(ex::Expr) = gensym(string(ex)); # without this, LoadError: MethodError: no method matching gensym(::Expr) instead

julia> using ReverseDiff, Tullio
┌ Warning: Error requiring ReverseDiff from Tullio:
│ LoadError: UndefVarError: ev not defined
│ Stacktrace:
│  [1] top-level scope at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/macros.jl:190
│  [2] include(::Function, ::Module, ::String) at /Applications/Julia-1.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
│  [3] include at ./Base.jl:368 [inlined]
│  [4] include(::String) at /Users/me/.julia/packages/Tullio/HGzih/src/Tullio.jl:1
│  [5] top-level scope at none:1
│  [6] eval at ./boot.jl:331 [inlined]
│  [7] eval at /Users/me/.julia/packages/Tullio/HGzih/src/Tullio.jl:1 [inlined]
│  [8] (::Tullio.var"#150#154")() at /Users/me/.julia/packages/Requires/qy6zC/src/require.jl:85
...

julia> ReverseDiff.gradient(x -> sum(@tullio y[i] := x[i]), rand(3))
ERROR: MethodError: no method matching track(::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#7"},var"#20#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#8"}}, ::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}})
Stacktrace:
 [1] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#7"},var"#20#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#8"}})(::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) at /Users/me/.julia/packages/Tullio/HGzih/src/grad/reverse.jl:4
 [2] (::var"#5#6")(::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) at ./REPL[4]:1
 [3] ReverseDiff.GradientTape(::var"#5#6", ::Array{Float64,1}, ::ReverseDiff.GradientConfig{ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}) at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/api/tape.jl:199
 [4] gradient(::Function, ::Array{Float64,1}, ::ReverseDiff.GradientConfig{ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}) at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/api/gradients.jl:22 (repeats 2 times)

What the macro does:

using TensorCast, ReverseDiff
id(x) = x;
@pretty ReverseDiff.@grad function id(x)
           ReverseDiff.value(x), Δ -> (Δ,)
       end

Gradient missing for `begin ... end`

julia> f1(x) = sum(@tullio res[i, j] := begin           
                       x[i+2, j] - 2 * x[i+1, j] + x[i, j]
                   end)
f1 (generic function with 1 method)

julia> gradient(f1, x)
ERROR: no gradient definition here!
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] (::Tullio.var"#215#216"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#11"{var"#𝒜𝒸𝓉!#10"}, Nothing}, Tuple{Matrix{Float64}}, Matrix{Float64}})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/grad/zygote.jl:7
 [3] (::Tullio.var"#85#back#217"{Tullio.var"#215#216"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#11"{var"#𝒜𝒸𝓉!#10"}, Nothing}, Tuple{Matrix{Float64}}, Matrix{Float64}}})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
   @ Tullio ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59
 [4] Pullback
   @ ./REPL[39]:1 [inlined]
 [5] (::typeof(∂(f1)))(Δ::Float64)
   @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [6] (::Zygote.var"#41#42"{typeof(∂(f1))})(Δ::Float64)
   @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:40
 [7] gradient(f::Function, args::Matrix{Float64})
   @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:49
 [8] top-level scope
   @ REPL[40]:1

julia> f1(x) = sum(@tullio res[i, j] := x[i+2, j] - 2 * x[i+1, j] + x[i, j])
f1 (generic function with 1 method)

julia> gradient(f1, x)
([1.0 1.0 … 1.0 1.0; -1.0 -1.0 … -1.0 -1.0; … ; -1.0 -1.0 … -1.0 -1.0; 1.0 1.0 … 1.0 1.0],)

I encountered this because my equation got quite lengthy.

syntax error when array and index shares the same symbol

This is the minimal example of the error.

a, b = randn(ComplexF64, 10, 10), randn(ComplexF64, 10, 10)
@tullio # (verbose = false, fastmath = true, threads = true, grad = :Base, avx = true, cuda = 256, tensor = true)
@tullio out[a, b] := a[a, c] * b[c, b] # works
@tullio out[A, B] := a[A, C] * conj(b[C, B]) # works
@tullio out[a, b] := a[a, c] * conj(b[c, b]) # syntax error

When @tullio replaced by @tensor in TensorOperations.jl, the last expression works well. Is this a bug of the @tullio macro?

CI is failing (timing out) on Julia nightly

On the most recent commit to master (58c1fbb), CI passed on Julia 1.4 and passed on Julia 1.5 but failed (timed out) on Julia nightly. Links to the logs:

Julia 1.4.2 (passed in 1h 21m 42s): https://github.com/mcabbott/Tullio.jl/runs/1505716118
Julia 1.5.3 (passed in 52m 41s ): https://github.com/mcabbott/Tullio.jl/runs/1505716124
Julia nightly (failed - killed for timing out at 6h 0m 13s): https://github.com/mcabbott/Tullio.jl/runs/1505716131

GitHub Actions will kill a job if it runs for more than six hours.

We need to figure out how to make the CI jobs run for less than six hours.

Here is one option: we separate the test suite into "groups". E.g. if currently the test suite looks like this:

using Tullio
using Test

@test f() == "a"

@test g() == "b"

@test h() == "c"

We instead divide the test suite into three groups as such:

using Tullio
using Test

test_group = get(ENV, "TULLIO_TEST_GROUP", "all")

if test_group in ["all", "1"]
    @test f() == "a"
end

if test_group in ["all", "2"]
    @test g() == "b"
end

if test_group in ["all", "3"]
    @test h() == "c"
end

And then we set up the matrix in .github/workflows/ci.yml file as such:

strategy:
  fail-fast: false
  matrix:
    JULIA_NUM_THREADS:
      - '1'
      - '6'
    TULLIO_TEST_GROUP:
      - '1'
      - '2'
      - '3'
    version:
      - '1.4'
      - '1' # automatically expands to the latest stable 1.x release of Julia
    os:
      - ubuntu-latest
    arch:
      - x64

@mcabbott @chriselrod Any thoughts?

Issues with offset indices and GPU

Trying to implement a slightly weird "half-convolution" as follows:

using NNlib, Tullio

_tullio_conv(u, v) = @tullio y[i+_] := u[i+a] * v[a]


function tullio_conv(u::AbstractVector, v::AbstractVector)
    # Left pad
    upad = NNlib.pad_zeros(u, (length(v) - 1, 0))
    v_rev = @view v[end:-1:1]

    return _tullio_conv(upad, v_rev)
end

which results in "KernelAbstractions can't handle OffsetArrays here" when running tullio_conv(a, b) with a and b being CuArray of same size. Works fine on CPU though.

Not sure if this is Tullio.jl's or KernelAbstractions.jl's "fault".

Migrate to travis-ci.com

I noticed that this repo still uses travic-ci.org, at least based on the badge in the README. travis-ci.org will be shut down at the end of the year, cf. https://mailchi.mp/3d439eeb1098/travis-ciorg-is-moving-to-travis-cicom. Hence, everything should be migrated to travis-ci.com.

Unexpected behaviour when summing with different sign

Hey,

I recognized the following today. The only difference between correct one and wrong one are the signs.

julia> using Tullio
julia> arr = [1 2; 1 2; 3 4; 3 4]
4×2 Array{Int64,2}:
 1  2
 1  2
 3  4
 3  4
julia> @tullio x2[i1, i2] := (arr[2i1 + 1, 2i2 + 1] + arr[2i1 + 0, 2i2 + 1] + arr[2i1 + 1, 2i2 + 0] + arr[2i1 + 0, 2i2 + 0])
1×0 Array{Int64,2}
julia> @tullio x2[i1, i2] := (arr[2i1 - 1, 2i2 - 1] + arr[2i1 + 0, 2i2 - 1] + arr[2i1 - 1, 2i2 + 0] + arr[2i1 + 0, 2i2 + 0])
2×1 Array{Int64,2}:
  6
 14

I didn't expect this because your downsample example is also formulated in a "+" way.

Thanks,

Felix

Edit: I guess I understand it. It simply starts at i1=1 and then sums up from there (2* 1 + 1 and 2* 1 + 0).
So that's wanted behaviour?

How to implement n-dimensional functions?

Hey,

I'm trying to figure out how we could use Tullio to generalize some n-dimensional functions.
Let's say I've got a n-dimensional array and I want to use Tullio for that. At the moment I need to treat dimensions differently.

using Tullio
using Random

Random.seed!(42)

function spatial_grad_square_1D(arr)
    resi = let 
         @tullio resi = abs2(arr[i + 1] - arr[i - 1])
    end
    return resi
end

function spatial_grad_square_2D(arr)
    resi = let 
         @tullio resi = abs2(arr[i + 1, j] - arr[i - 1, j])
    end
    resj = let 
         @tullio resj = abs2(arr[i, j + 1] - arr[i, j - 1])
    end
    return resi + resj
end

arr_1D = randn((10))
arr_2D = randn((10, 10))
spatial_grad_square_1D(arr_1D) 
spatial_grad_square_2D(arr_2D)

Do you have any hints on how to achieve this in a more elegant way? Using the Julia metaprogramming documentation I couldn't figure it out 😞

Thanks,

Felix

Adding a "Quick Start" section to the README, with a list of recommended packages to install?

Could we add a "Quick Start" section to the README, to help beginners that want to quickly get started playing around with Tullio? The section would include a list of recommended packages for use with Tullio.

E.g. maybe something like this?

Quick Start

julia> using Pkg

julia> Pkg.add(["Tullio", "LoopVectorization", "TensorOperations", "KernelAbstractions"])

julia> using Tullio, LoopVectorization, TensorOperations, KernelAbstractions

julia> matmul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

julia> X = [1.0 2.0 3.0; 4.0 5.0 6.0]

julia> Y = [7.0 8.0; 9.0 10.0; 11.0 12.0]

julia> matmul(X, Y)

Are these the right packages for someone to start off with?

Tullio
LoopVectorization
TensorOperations
KernelAbstractions

Segmentation fault for consecutive Tullio statements

Hey,

the following code leads (sometimes, 30 - 50 % roughly) to a segmentation fault.

using Tullio
using Zygote

arr = [1 2; 3 4]
function f(arr)
    @tullio res1 = arr[i, k] - arr[i - 1, k]
    @tullio res2 = arr[i, k] - arr[i, k + 1]
    return res1 + res2
end

f(arr)
print(gradient(f, arr))

It seems to be caused by the addition of the two statements. Furthermore this only happens once Zygote is loaded and gradient is called.
The output is then:

julia> include("tullio_bug.jl")
([1 -1; 2 -1],)
julia> include("tullio_bug.jl")

signal (11): Segmentation fault
in expression starting at <path>/tullio_bug.jl:12
jl_gc_pool_alloc at /usr/bin/../lib/libjulia.so.1 (unknown line)
jl_gc_alloc at /usr/bin/../lib/libjulia.so.1 (unknown line)
jl_alloc_array_1d at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff849fc672)
unknown function (ip: 0x7eff84a11a38)
unknown function (ip: 0x7eff84a1234a)
jl_uncompress_ast at /usr/bin/../lib/libjulia.so.1 (unknown line)
_uncompressed_ast at ./reflection.jl:950 [inlined]
uncompressed_ast at ./reflection.jl:946
#meta#2 at /home/fxw/.julia/packages/IRTools/BpoqK/src/reflection/reflection.jl:112
meta at /home/fxw/.julia/packages/IRTools/BpoqK/src/reflection/reflection.jl:101 [inlined]
_lookup_grad at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/emit.jl:99
#s3320#1912 at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:21 [inlined]
#s3320#1912 at ./none:0
unknown function (ip: 0x7eff776bd06e)
unknown function (ip: 0x7eff84a3edab)
jl_code_for_staged at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff7761066a)
unknown function (ip: 0x7eff77615cca)
unknown function (ip: 0x7eff776aaad7)
unknown function (ip: 0x7eff776ab3d4)
unknown function (ip: 0x7eff776ab57f)
unknown function (ip: 0x7eff849ed555)
unknown function (ip: 0x7eff849edf04)
jl_apply_generic at /usr/bin/../lib/libjulia.so.1 (unknown line)
macro expansion at /home/fxw/.julia/packages/Tullio/xESL5/src/macro.jl:617 [inlined]
f at /home/fxw/Documents/Uni/FSU/2semester/Internship/DeconvOptim.jl/tullio_bug.jl:6 [inlined]
Pullback at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
#36 at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:46
unknown function (ip: 0x7eff0cb171b2)
gradient at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:55
unknown function (ip: 0x7eff84a02d85)
unknown function (ip: 0x7eff84a02a0e)
unknown function (ip: 0x7eff84a04670)
unknown function (ip: 0x7eff84a050e0)
unknown function (ip: 0x7eff84a210a1)
unknown function (ip: 0x7eff849f76e6)
jl_load at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff77955c1f)
unknown function (ip: 0x7eff84a02d85)
unknown function (ip: 0x7eff84a02a0e)
unknown function (ip: 0x7eff84a04670)
unknown function (ip: 0x7eff84a050e0)
unknown function (ip: 0x7eff84a210a1)
unknown function (ip: 0x7eff84a21748)
jl_toplevel_eval_in at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff77707df4)
unknown function (ip: 0x7eff77935179)
unknown function (ip: 0x7eff77935474)
unknown function (ip: 0x7eff84a08cc9)
unknown function (ip: (nil))
Allocations: 80635502 (Pool: 80625319; Big: 10183); GC: 66
[1]    395611 segmentation fault (core dumped)  julia

Could this be related to fastmath?

Thanks,

Felix

Edit: The following code seems to work. Notice the minor change in res2:

using Tullio
using Zygote

arr = [1 2; 3 4]
function f(arr)
    @tullio res1 = arr[i, k] - arr[i - 1, k]
    @tullio res2 = arr[i, k] - arr[i, k]
    return res1 + res2
end

f(arr)
print(gradient(f, arr))

Idea: on Buildkite, only run the GPU-related tests?

We now have Buildkite working 🎉

On CPU, the test suite is quite long - on Julia 1.5 the CPU tests take e.g. 2 hours and 12 minutes (https://github.com/mcabbott/Tullio.jl/runs/1557475138).

There's no need to run all the CPU-only tests on Buildkite, since we will already be running them on GitHub Actions.

We can e.g. set an environment variable on Buildkite, e.g. BUILDKITE="true". And then in the test suite, we will do:

is_buildkite = parse(Bool, get(ENV, "BUILDKITE", "false"))

And then if is_buildkite is true, we only run the GPU-related tests.

Slower performance on GPU than CPU with Zygote

I'm experiencing some slower performances when taking gradients on CuArrays, than normal CPU Arrays, using Zygote. Here's an MWE (the Tullio operations implement a mixture linear regression):

using Tullio
using Zygote
using CUDA, KernelAbstractions

CUDA.allowscalar(false)

function prediction(V, W, r)
    @tullio mixture[a, b, d] := V[a, b, c] * W[d, c] 
    @tullio r̂[a, d] := mixture[a, b, d] * r[b, d] 
    return r̂
end

### CuArray version
V = CuArray(randn(15, 15, 3))
W = CuArray(randn(150, 3))


@time begin 
    grad = gradient(Params([V])) do 
        l = 0
        for t = 1:5
            r = CuArray(randn(15, 150))
            r̂ = prediction(V, W, r) 
            l += sum(r - r̂)
        end
        l
    end
end
# output:   0.304601 seconds (516.80 k allocations: 15.125 MiB)

If I replace all the arrays to CPU, I get 0.151464 seconds (209.86 k allocations: 13.321 MiB).

I'm wondering if this is expected? Or am I forgetting something when using Cuda arrays that slows down the performance?

Scalar reductions

As pointed out by @mforets, there is a surprisingly large overhead for small scalar reductions:

using LinearAlgebra, Tullio
v = rand(10);

td(v,w) = @tullio d := v[i] * w[i]
@btime td($v, $v)  # 2 μs
@btime dot($v, $v) # 20 ns

ts(v) = @tullio s := v[i]
@btime ts($v)  # 2 μs
@btime sum($v) # 4 ns

Perhaps this is mostly from threading (or rather, thinking whether to multi-thread). Disabling that brings it closer to the array-output case:

td1(v,w) = @tullio d := v[i] * w[i]  threads=false
@btime td1($v, $v)  # 60 ns

tt(v) = @tullio vv[i] := 2*v[i];
@btime tt($v)  # 50 ns
@btime 2 .* $v # 40 ns

tt1(v) = @tullio vv[i] := 2*v[i] threads=false;
@btime tt1($v) # 36 ns

The threaded reduction may also use init a number of times, and ideally it would check that this is compatible with the reduction, as suggested #24 (comment).

Finally, I think that @tullio d := sqrt <| v[i] * v[i] should be disallowed, as there's no reason to have the macro handle this.

Migrate GPU CI to Buildkite?

Right now, Tullio.jl uses the GitLab runners for running GPU CI in the JuliaGPU group.

If I understand correctly, JuliaGPU is transitioning from GitLab to Buildkite, with the plan of eventually completely discontinuing the GitLab runners.

It would be great to migrate the GPU CI for Tullio.jl from the GitLab runners to the Buildkite runners.

@maleadt Can you provide instructions for migrating to the new Buildkite JuliaGPU setup?

A case where TrackedReal escapes?

julia> using Tracker, Zygote, Tullio, LinearAlgebra

julia> function gg(x)
         res = 0.0
         n = [norm(x)]
         for i in 1:100
            res += @tullio res_ := sin(exp(x[j]) - n[1])
         end
         atan(res)
       end;

julia> Zygote.gradient(gg, rand(10))
([-0.3105827909434481, -1.435298638568034, -0.22031585984529584, -0.03951256409987808, 0.14018859477924406, -0.49862889658929765, -0.15284449100831132, -0.8243820635907274, -0.06310467526562058, -0.1548104899931962],)

julia> Tracker.gradient(gg, rand(10))
ERROR: MethodError: no method matching Float64(::Tracker.TrackedReal{Float64})
Closest candidates are:
  Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  Float64(::T) where T<:Number at boot.jl:716
  Float64(::BigInt) at gmp.jl:451
  ...
Stacktrace:
 [1] convert(::Type{Float64}, ::Tracker.TrackedReal{Float64}) at ./number.jl:7
 [2] setindex!(::Array{Float64,1}, ::Tracker.TrackedReal{Float64}, ::Int64) at ./array.jl:847
 [3] macro expansion at /Users/me/.julia/dev/Tullio/src/symbolic.jl:53 [inlined]
 [4] ∇𝒜𝒸𝓉! at /Users/me/.julia/dev/Tullio/src/macro.jl:1207 [inlined]

TODO: reenable ALL tests on Julia 1.6 and Julia nightly

In #72, we disabled some tests on Julia nightly because they were causing CI to time out.

Once everything is working on Julia nightly, we should reenable ALL tests.

When supplied range conflicts with un-shifted index...

Hey,

the following one:

julia> using Tullio

julia> y = ^C

julia> y = (1:10) .* (1:10)'
10×10 Matrix{Int64}:
  1   2   3   4   5   6   7   8   9   10
  2   4   6   8  10  12  14  16  18   20
  3   6   9  12  15  18  21  24  27   30
  4   8  12  16  20  24  28  32  36   40
  5  10  15  20  25  30  35  40  45   50
  6  12  18  24  30  36  42  48  54   60
  7  14  21  28  35  42  49  56  63   70
  8  16  24  32  40  48  56  64  72   80
  9  18  27  36  45  54  63  72  81   90
 10  20  30  40  50  60  70  80  90  100

julia> f(x) = (@tullio res[i, j] := x[i, j+1] (i in 1:5, j in 1:5))
f (generic function with 1 method)

julia> f(y)
ERROR: "range of index i must agree"
Stacktrace:
 [1] ℳ𝒶𝓀ℯ
   @ ~/.julia/packages/Tullio/bgqFi/src/macro.jl:968 [inlined]
 [2] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#180"{var"#𝒜𝒸𝓉!#177"}, var"#735#∇ℳ𝒶𝓀ℯ#179"{var"#∇𝒜𝒸𝓉!#178"}})(args::Matrix{Int64})
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [3] f(x::Matrix{Int64})
   @ Main ./REPL[119]:1
 [4] top-level scope
   @ REPL[120]:1

julia> f(x) = (@tullio res[i, j] := x[i+0, j+1] (i in 1:5, j in 1:5))
f (generic function with 1 method)

julia> f(y)
5×5 Matrix{Int64}:
  2   3   4   5   6
  4   6   8  10  12
  6   9  12  15  18
  8  12  16  20  24
 10  15  20  25  30

What is the reason that we need the artificial +0? My goal was to calculate only a subset of the initial x

Thanks,

Felix

Error requiring LoopVectorization from Tullio

I get the following warning on Tullio v0.2.10 when loading LoopVectorization v0.9.1:

┌ Warning: Error requiring LoopVectorization from Tullio:
│ UndefVarError: SVec not defined
...

I believe it is due to the renaming of an internal struct SVec in previous versions of VectorizationBase to Vec in current versions.

I'm not familiar enough with your internals to make a PR to fix this, but naively it looks like a simple find and replace.

@tullio seems slower than equivalent for loop?

The speed gap seems large. The benchmark is shown below:

julia> a = randn(4000,4000);

julia> b = randn(4000);

julia> @btime func($a,$b);
  175.183 ms (0 allocations: 0 bytes)

julia> @btime func_tullio($a,$b);
  266.097 ms (0 allocations: 0 bytes)

julia> @btime func_avx($a,$b);
  45.472 ms (0 allocations: 0 bytes)

julia> @btime func_tullio_avx($a,$b);
  150.239 ms (0 allocations: 0 bytes)

where the functions' definition are:

func_tullio(a,b) = @tullio threads = false avx = false b[i] = sin(a[i,j]);

func_tullio_avx(a,b) = @tullio threads = false b[i] = sin(a[i,j]);

func(a,b) = begin
    ndims(a) == 2 || error("~~")
    ndims(b) == 1 || error("~~")
    size(b,1) == size(a,1) || error("~~")
    b .= 0
    for j in 1:size(a,2)  
        for i in 1:size(a,1) # no @inbounds @simd
            b[i] += sin(a[i,j])
        end
    end
end

func_avx(a,b) = begin
    ndims(a) == 2 || error("~~")
    ndims(b) == 1 || error("~~")
    size(b,1) == size(a,1) || error("~~")
    b .= 0
    for j in 1:size(a,2)
        @avx for i in 1:size(a,1)  # add @avx
            b[i] += sin(a[i,j])
        end
    end
end

Not sure what to do with x + a + c

I tried to run the following code:

@tullio dP[x,y,z] -= -Diff[a+2, b+2] * Diff[c+2, d+2] * P[mod(x+a+c), mod(y+b+d), z] * P[mod(x+a),mod(y+b),z] (a in -1:1, b in -1:1, c in -1:1, d in -1:1, z in 1:3, x in 1:n, y in 1:n)

I could decompose it into:

V[x,y,z] = Diff[c+2,d+2] * P[mod(x+c), mod(y+d), z] (c in -1:1, d in -1:1, x in 1:n, y in 1:n, z in 1:3)

dP[x,y,z] -= Diff[a+2, b+2] * V[mod(x+a), mod(y+b), z] * P[mod(x+a), mod(y+b), z] (a in -1:1, b in -1:1, x in 1:n, y in 1:n, z in 1:3)

Is there good reason why mod(x+a+c) not supported?

Spurious printing of "DiffRules._abs_deriv" when LoopVectorization.jl loaded

julia> using Tullio

julia> let v = [-1, 0, 1, 2]
           @tullio out := abs(v[i])
       end
4

julia> using LoopVectorization

julia> let v = [-1, 0, 1, 2]
           @tullio out := abs(v[i])
       end
DiffRules._abs_deriv
4

julia> typeof(ans)
Int64

julia> let v = [-1, 0, 1, 2]
           @tullio out := abs(v[i])
       end
DiffRules._abs_deriv
4

Is this maybe some println debugging that didn't get removed?

error on julia 1.6 with LoopVectorization on MacOS

julia> using LoopVectorization

julia> using Tullio

julia> a=randn(10,10);

julia> b=randn(10,10);

julia> c=randn(10,10);

julia> @tullio c[i,j]=a[i,k]b[k,j]
ERROR: Module IR does not contain specified entry function
Stacktrace:
 [1] assume
   @ ~/.julia/packages/SIMDPirates/EVSvY/src/llvm_utils.jl:308 [inlined]
 [2] macro expansion
   @ ~/.julia/packages/LoopVectorization/pHMnJ/src/reconstruct_loopset.jl:503 [inlined]
 [3] _avx_!(::Val{(0, 0, -1, 4)}, ::Type{Tuple{:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), Symbol(""), :isnothing, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000001, LoopVectorization.compute, 0x00, 0x02), :numericconstant, Symbol("##zero#257"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03), :LoopVectorization, :conditionalload, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000005, LoopVectorization.memload, 0x01, 0x04), :LoopVectorization, :~, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000002, LoopVectorization.compute, 0x00, 0x05), :LoopVectorization, :vifelse, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000020304, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000023, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x07), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000031, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x08), :numericconstant, Symbol("##reductzero#266"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x09), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000231, 0x0000000000000003, 0x0000000000000000, 0x0000000000070809, LoopVectorization.compute, 0x00, 0x09), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000a06, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x000000000000000b, LoopVectorization.memstore, 0x01, 0x0a)}}, ::Type{Tuple{LoopVectorization.ArrayRefStruct{:ℛ, Symbol("##vptr##_ℛ")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:a, Symbol("##vptr##_a")}(0x0000000000000101, 0x0000000000000203, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:b, Symbol("##vptr##_b")}(0x0000000000000101, 0x0000000000000301, 0x0000000000000000)}}, ::Type{Tuple{0, Tuple{}, Tuple{1}, Tuple{}, Tuple{}, Tuple{(3, LoopVectorization.IntOrFloat), (9, LoopVectorization.IntOrFloat)}, Tuple{}}}, ::Type{Tuple{:j, :i, :k}}, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::Nothing, ::typeof(isnothing))
   @ LoopVectorization ~/.julia/packages/LoopVectorization/pHMnJ/src/reconstruct_loopset.jl:503
 [4] 𝒜𝒸𝓉!
   @ ~/.julia/packages/Tullio/oeP9x/src/macro.jl:932 [inlined]
 [5] tile_halves(fun!::var"#𝒜𝒸𝓉!#1", ::Type{Matrix{Float64}}, As::Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, Is::Tuple{UnitRange{Int64}, UnitRange{Int64}}, Js::Tuple{UnitRange{Int64}}, keep::Nothing, final::Bool)
   @ Tullio ~/.julia/packages/Tullio/oeP9x/src/threads.jl:142
 [6] tile_halves
   @ ~/.julia/packages/Tullio/oeP9x/src/threads.jl:139 [inlined]
 [7] threader(fun!::var"#𝒜𝒸𝓉!#1", ::Type{Matrix{Float64}}, Z::Matrix{Float64}, As::Tuple{Matrix{Float64}, Matrix{Float64}}, I0s::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, J0s::Tuple{Base.OneTo{Int64}}, redfun::Function, block::Int64, keep::Nothing)
   @ Tullio ~/.julia/packages/Tullio/oeP9x/src/threads.jl:68
 [8] top-level scope
   @ ~/.julia/packages/Tullio/oeP9x/src/macro.jl:802

julia> versioninfo()
Julia Version 1.6.0-DEV.1093
Commit 28330a2fef (2020-09-30 14:46 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

Error when taking gradient of gpu expression

Running this code gives an error, when setting CUDA.allowscalar(false) it does show scalar operations are being done.
On cpu everything works as expected.

using CUDA, KernelAbstractions, Tullio, Flux

f(x) = @tullio out[m, n, batch] := W[m, i] * x[i, n, batch] + b[m]
a = cu(rand(3, 5, 6)); W = cu(rand(10, 3)); b = cu(rand(10))

Flux.gradient(Flux.Params(W)) do
    sum(f(a))
end

AssertionError: length(__workgroupsize) <= length(ndrange)

Stacktrace:
 [1] partition at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\nditeration.jl:103 [inlined]
 [2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#421#156"}, ::Tuple{}, ::Nothing) at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\KernelAbstractions.jl:368
 [3] (::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#421#156"})(::CuArray{Float32,2,Nothing}, ::Vararg{Any,N} where N; ndrange::Tuple{}, dependencies::Nothing, workgroupsize::Nothing, progress::Function) at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\backends\cuda.jl:196
 [4] ∇𝒜𝒸𝓉! at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:814 [inlined]
 [5] (::var"#∇𝒜𝒸𝓉!#154")(::Type{CuArray{Float32,N,Nothing} where N}, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,1,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,1,Nothing}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}) at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:811
 [6] ∇threader(::Function, ::Type{CuArray{Float32,N,Nothing} where N}, ::Tuple{CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing}}, ::Tuple{}, ::NTuple{4,Base.OneTo{Int64}}; block::Int64) at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\eval.jl:74
 [7] #750#∇ℳ𝒶𝓀ℯ at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:908 [inlined]
 [8] #217 at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\grad\zygote.jl:8 [inlined]
 [9] (::Tullio.var"#632#back#219"{Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#159"{var"#𝒜𝒸𝓉!#150"},var"#750#∇ℳ𝒶𝓀ℯ#158"{var"#∇𝒜𝒸𝓉!#154"}},Tuple{CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing}},CuArray{Float32,3,Nothing}}})(::CuArray{Float32,3,Nothing}) at C:\Users\jules\.julia\packages\ZygoteRules\6nssF\src\adjoint.jl:49
 [10] Affine at .\In[23]:11 [inlined]
 [11] (::typeof(∂(λ)))(::CuArray{Float32,3,Nothing}) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface2.jl:0
 [12] #237 at .\In[88]:4 [inlined]
 [13] (::typeof(∂(#237)))(::Float32) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface2.jl:0
 [14] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(#237))})(::Float32) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface.jl:177
 [15] gradient(::Function, ::Zygote.Params) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface.jl:54
 [16] top-level scope at In[88]:3

Upon removing the b[m] term, the code doesn't error anymore but scalar iteration seems to take place.

Register

@JuliaRegistrator register

Warning when loaded with Reversediff

Executing using ReverseDiff, Tullio in REPL gives the warning

julia> using ReverseDiff, Tullio
┌ Warning: Error requiring ReverseDiff from Tullio:
│ LoadError: LoadError: MethodError: no method matching gensym(::Expr)
│ Closest candidates are:
│   gensym(!Matched::Symbol) at expr.jl:15
│   gensym(!Matched::String) at expr.jl:12
│   gensym() at expr.jl:10
│   ...
│ Stacktrace:
│  [1] @grad(::LineNumberNode, ::Module, ::Any) at /home/ritesh/.julia/packages/ReverseDiff/jFRo1/src/macros.jl:182
│  [2] include(::Function, ::Module, ::String) at ./Base.jl:380
│  [3] include at ./Base.jl:368 [inlined]
│  [4] include(::String) at /home/ritesh/.julia/packages/Tullio/oeP9x/src/Tullio.jl:1
│  [5] top-level scope at none:1
│  [6] eval at ./boot.jl:331 [inlined]
│  [7] eval at /home/ritesh/.julia/packages/Tullio/oeP9x/src/Tullio.jl:1 [inlined]
│  [8] (::Tullio.var"#156#160")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:85
│  [9] err(::Any, ::Module, ::String) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:42
│  [10] (::Tullio.var"#155#159")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:84
│  [11] withpath(::Any, ::String) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:32
│  [12] (::Tullio.var"#154#158")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:83
│  [13] listenpkg(::Any, ::Base.PkgId) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:15
│  [14] macro expansion at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:81 [inlined]
│  [15] (::Tullio.var"#153#157")() at /home/ritesh/.julia/packages/Requires/jmibq/src/init.jl:11
│  [16] __init__() at /home/ritesh/.julia/packages/Requires/jmibq/src/init.jl:18
│  [17] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
│  [18] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
│  [19] _require(::Base.PkgId) at ./loading.jl:1007
│  [20] require(::Base.PkgId) at ./loading.jl:928
│  [21] require(::Module, ::Symbol) at ./loading.jl:923
│  [22] eval(::Module, ::Any) at ./boot.jl:331
│  [23] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:134
│  [24] repl_backend_loop(::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:195
│  [25] start_repl_backend(::REPL.REPLBackend, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:180
│  [26] run_repl(::REPL.AbstractREPL, ::Any; backend_on_current_task::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:292
│  [27] run_repl(::REPL.AbstractREPL, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
│  [28] (::Base.var"#806#808"{Bool,Bool,Bool,Bool})(::Module) at ./client.jl:399
│  [29] #invokelatest#1 at ./essentials.jl:710 [inlined]
│  [30] invokelatest at ./essentials.jl:709 [inlined]
│  [31] run_main_repl(::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./client.jl:383
│  [32] exec_options(::Base.JLOptions) at ./client.jl:313
│  [33] _start() at ./client.jl:506
│ in expression starting at /home/ritesh/.julia/packages/Tullio/oeP9x/src/grad/reverse.jl:8
│ in expression starting at /home/ritesh/.julia/packages/Tullio/oeP9x/src/grad/reverse.jl:8
└ @ Requires ~/.julia/packages/Requires/jmibq/src/require.jl:44

My Config:

ReverseDiff v1.4.3
Tullio v0.2.7

Looks like the warning is related to ReverseDiff.@grad but Tracker.@grad doesn't give any such warning.

I'm trying to build a sysimg for my project (which uses ReverseDiff and Tullio) and I guess this warning wouldn't allow Tullio to be added to precompile trace (I'm not sure if it would be added).

Is there a way to get rid of this warning?

Support CUDA.jl

Since CuArrays.jl is being replaced with CUDA.jl, it'd be good if Tullio could move it's support to CUDA.

TODO: Install the Buildkite GitHub App on this repo

This step will need to be done by @mcabbott

Here is the link to the GitHub App: https://github.com/settings/connections/applications/Iv1.112bf4be3e5ecdeb

Bugs in range inference with `div`

Some bugs I think with range inference and ÷, exposed by trying to solve this: https://discourse.julialang.org/t/efficient-reshaping-of-2d-array-by-pairs-of-columns/52953

julia> using Tullio, OffsetArrays, TensorCast

julia> A = [ 1  2
             3  4
             5  6
             7  8
             9 10
            11 12
            13 14
            15 16
            17 18 ];  # input

julia> B = [ 1  2   7   8  13  14
             3  4   9  10  15  16
             5  6  11  12  17  18 ];  # desired output

julia> @cast B2[a,(c,b)] := A[(a,b),c]  a:3  # easy!
3×6 reshape(PermutedDimsArray(::Array{Int64, 3}, (1, 3, 2)), 3, 6) with eltype Int64:
 1  2   7   8  13  14
 3  4   9  10  15  16
 5  6  11  12  17  18

julia> @tullio B3[r, c] := A[3r+c÷3, 3r+c÷3+1]  r in 1:3  # wrong formula!
3×7 OffsetArray(::Matrix{Int64}, 1:3, -6:0) with eltype Int64 with indices 1:3×-6:0:
 2  5512452232  5512452232  5512452232  5372075888  5372075888  5372075888
 0          11          11          11  4618052352  4618052352  4618052352
 0          11          11          11  4618334208  4618334208  4618334208

julia> @tullio B4[r, c] := A[r + 3((c-1)÷2), mod(c)]  r in 1:3
3×5 OffsetArray(::Matrix{Int64}, 1:3, 1:5) with eltype Int64 with indices 1:3×1:5:
 1  2   7   8  13
 3  4   9  10  15
 5  6  11  12  17

Wrong derivative reducing over `min`, when using KernelAbstractions

MWE:

julia> using Tullio, CUDA, KernelAbstractions, Zygote, FiniteDifferences

julia> g(μ) = ifelse(μ == 4, 1, -1)
g (generic function with 1 method)

julia> push!(Tullio.BASE_NOGRAD, :g);

julia> f2(w, k) = @tullio min f[m] := w[j, m] * g(μ) * (k[j, μ] - k[m, μ])^2
f2 (generic function with 1 method)

julia> using Random; Random.seed!(1);

julia> k = rand(Float32, 2, 4)
2×4 Array{Float32,2}:
 0.547994  0.567737  0.27934   0.8135
 0.819285  0.557336  0.777828  0.00389743

julia> w = rand(Float32, 2, 2)
2×2 Array{Float32,2}:
 0.699683  0.903188
 0.568097  0.416541

julia> Δ = rand(Float32, 2)
2-element Array{Float32,1}:
 0.51531243
 0.13594759

julia> Zygote.pullback(f2, w, k)[2](Δ)[2]
2×4 Array{Float32,2}:
 0.0  0.0   0.414277  0.0
 0.0  0.0  -0.414277  0.0

julia> Zygote.pullback(f2, cu(w), cu(k))[2](cu(Δ))[2]
2×4 CuArray{Float32,2}:
 0.0  0.0   0.414277  0.0
 0.0  0.0  -0.122415  0.0

julia> j′vp(central_fdm(5, 1), f2, Δ, w, k)[2]
2×4 Array{Float32,2}:
 5.59755f-9  5.59755f-9   0.414277  5.59755f-9
 5.59755f-9  5.59755f-9  -0.414277  5.59755f-9

Seems fine if I use + instead of min as the reducer.

Default Thread settings (threads=true) cause bad performance on CPU

Hey,

I'm on a 4 core machine. I started Julia with 4 threads and observe (relatively) poor performance with the default Tullio settings

using Zygote, Tullio
x = randn((2000, 2000));
f(x) = (@tullio y =  (x[i+1, j] - x[i-1, j])^2 + (x[i, j+1] - x[i, j-1])^2)
ft(x) = (@tullio threads=false y = (x[i+1, j] - x[i-1, j])^2 + (x[i, j+1] - x[i, j-1])^2)

@btime Zygote.gradient(f, x)[1];
@btime Zygote.gradient(ft, x)[1]

producing:

  100.746 ms (173 allocations: 30.53 MiB)
  11.788 ms (26 allocations: 30.52 MiB)

What's actually the reason for that? I could observe that none of the 4 Julia Threads achieved 100% CPU Usage. Rather 50 - 80% in the main thread, and around 20% in the other three.

Thanks,

Felix

Macro hygiene issue `ndims`

Observe this initially baffling bit of code:

julia> using Tullio

julia> let s = [1,2,3,4]
           @tullio x[i] := s[i]
       end
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> let s = [1,2,3,4], ndims=maximum(s)
           @tullio x[i] := s[i]
       end
ERROR: MethodError: objects of type Int64 are not callable
Stacktrace:
 [1] top-level scope at /home/mason/.julia/packages/Tullio/Q2RDS/src/macro.jl:970
 [2] top-level scope at REPL[3]:2

I think somewhere in the macroexpansion, you need to interpolate the Tullio.ndims function otherwise a user having a variable called ndims causes trouble.

Bounds checking too conservative on sub-arrays

The following minimal example complains about the indices of a being out of range even though 1 <= a[1, j] <= 3. I wonder if bounds checking could be modified to handle scenarios like this?

using Tullio

a = [1 2 3
     4 5 6]
b = [1, 2, 3]
@tullio x := b[a[1, j]]

# "extrema of index a[1, j] must fit within b"

Unsafe scatter with GPU arrays

From here: https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694/5

Tullio was detecting that i is unsafe to multi-thread, but wasn't passing this to KernelAbstractions:

using CUDA, Tullio, KernelAbstractions

let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
    @tullio A[I[i]] += v[i]
end

#+RESULTS:
: 4-element CuArray{Float32,1}:
:  1.0
:  0.0
:  2.0
:  0.0

After my attempt at a fix in 4d30a1f (which runs on the CPU):

julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
           @tullio A[I[i]] += v[i]
       end
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
 [1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
 [2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385

Besides not running, this is also too cautious, as it would apply to @tullio A[I[i]] = v[i] where it would be more justifiable to over-write -- there is no guaranteed order, but accumulation ought to be right.

@dfdx points out that ScatterNNlib.jl has an approach using atomic_add which may be worth trying to copy:

https://github.com/yuehhua/ScatterNNlib.jl/blob/74622bf40437f16e10468e8a35dae824527c83cf/src/cuda/cuarray.jl#L6-L17

Error in scalar reduction on the GPU

I hope I'm not doing it wrong but I've been playing around with Tullio.jl (super neat package!) and I can't seem to get this simple expression to work on the GPU:

julia> using CUDA, KernelAbstractions, Tullio

julia> A = rand(5, 5) |> CuArray;

julia> B = rand(5, 5) |> CuArray;

julia> @tullio C = A[i, j] / 2 + B[i, j] / 3

produces this stacktrace

ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...
Stacktrace:
 [1] thread_scalar at /home/alir/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
 [2] (::var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"})(::CuArray{Float64,2}, ::CuArray{Float64,2}) at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:805
 [3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"},var"#38#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#5"}})(::CuArray{Float64,2}, ::Vararg{CuArray{Float64,2},N} where N) at /home/alir/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [4] top-level scope at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:975

with

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)

and

Status `~/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v2.4.1
  [63c18a36] KernelAbstractions v0.5.3
  [bc48ee85] Tullio v0.2.12

Slowdown when multiple variables looped

I didn't see a warning about this anywhere so I thought to open an issue. When I have multiple variables in the inner loop it takes much longer to finish, I'm assuming because it isn't contracting indices intelligently (though I could be wrong). Here is a test case where I multiply 3 matrices, maybe there is a better way to have written it within the Tullio framework:

MWE:

using BenchmarkTools
using Tullio

function mul3(A, B, C) 
    A * (B * C)
end

function mul3_tullio(A, B, C) 
    @tullio Y[i, j] := A[i, k] * B[k, l] * C[l, j]
end

A, B, C = rand(100, 100); rand(100, 100); rand(100, 100)
@btime mul3($A, $B, $C)
#   89.030 μs (4 allocations: 156.41 KiB)

@btime mul3_tullio($A, $B, $C)
#   111.093 ms (3 allocations: 78.25 KiB)

mul3(A, B, C) ≈ mul3_tullio(A, B, C)
# true

This is on Julia 1.4.1, Tullio version 0.2.7

The initialization for user defined reduction function is always zero(T)

Lead to

a = fill(100,4,4)
@tullio avx = false (&) b[i] := sin(a[i,j])+2 >= 0 

4-element Array{Bool,1}:
 0 # should be 1
 0 # should be 1
 0 # should be 1
 0 # should be 1

Also some time user want keep the initial value for defined reduction function like += and *=
Maybe a synthesis like below is fine.

@tullio (fun_for_reduction,fun_for_initialization,keep) b[i] = sin(a[i,j])+2 >= 0

Pass LoopVectorization static size information

Minimal example:

using Tullio, LoopVectorization
@tullio out[i] := 0.1*data[i+o]  o in 0:9

Part of the expansion is:

local 𝒶𝓍o = 0:9
local 𝒶𝓍i = (Tullio.subranges)((eachindex)(data), 𝒶𝓍o)
local 𝓇𝒽𝓈(data, i, o) = begin
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:788 =#
    identity(0.1 * data[i + o])
end
begin
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:797 =#
    local 𝒯1 = Core.Compiler.return_type(𝓇𝒽𝓈, (typeof)((data, (first)(𝒶𝓍i), (first)(𝒶𝓍o))))
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:798 =#
    local 𝒯2 = if Base.isconcretetype(𝒯1)
        #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:799 =#
        𝒯1
    else
        #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:801 =#
        nothing
        #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:802 =#
        (typeof)(𝓇𝒽𝓈(data, (first)(𝒶𝓍i), (first)(𝒶𝓍o)))
    end
end
local 𝒯3 = 𝒯2
local 𝒯 = 𝒯3
(first)(𝒶𝓍i) == 1 || (throw)("to allow indices not starting at 1, OffsetArrays must be visible in the caller's module. Otherwise write `A[i+_] := ...` to remove offset")
local out = similar(parent(data), 𝒯, tuple(Base.OneTo(𝒶𝓍i)))
begin
    (Tullio.threader)(𝒜𝒸𝓉!, (Tullio.storage_type)(out, data), out, tuple(data), tuple(𝒶𝓍i), tuple(𝒶𝓍o), +, 262144, nothing)
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:956 =#
    out
end

Making the ranges statically sized, e.g.

using ArrayInterface
StaticInt(0):StaticInt(9)

would let LoopVectorization specialize on the 0:9.

using Tullio, LoopVectorization, BenchmarkTools

data = randn(1000);
out1 = similar(data, length(data) - 9);
out2 = similar(data, length(data) - 9);

function rollingmean10_lv!(out, data)
    @avx for i ∈ eachindex(out)
        outᵢ = zero(eltype(out))
        for j ∈ 0:9
            outᵢ += data[i+j]
        end
        out[i] = 0.1 * outᵢ
    end
    out
end
rollingmean10_tullio!(out, data) = @tullio out[i] = 0.1*data[i+o]  o in 0:9 threads=false

rollingmean10_lv!(out1, data) ≈ rollingmean10_tullio!(out2, data)

@benchmark rollingmean10_lv!($out1, $data)
@benchmark rollingmean10_tullio!($out2, $data)

With a 10980xe, yields:

julia> rollingmean10_lv!(out1, data) ≈ rollingmean10_tullio!(out2, data)
true

julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     280.072 ns (0.00% GC)
  median time:      285.438 ns (0.00% GC)
  mean time:        285.265 ns (0.00% GC)
  maximum time:     465.455 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     292

julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     297.058 ns (0.00% GC)
  median time:      297.394 ns (0.00% GC)
  mean time:        297.760 ns (0.00% GC)
  maximum time:     448.108 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     259

The performance difference is a little larger on my laptop (tiger lake):

julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     280.783 ns (0.00% GC)
  median time:      295.303 ns (0.00% GC)
  mean time:        301.026 ns (0.00% GC)
  maximum time:     778.703 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     290

julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     313.283 ns (0.00% GC)
  median time:      344.108 ns (0.00% GC)
  mean time:        345.672 ns (0.00% GC)
  maximum time:     522.862 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     240

As I found out by looking at the asm, the difference is that Tullio uses fma instructions, which are a little slowe on Tiger Lake.

With the static 10, it'll completely unroll the loop:

L80:
        vmovupd zmm1, zmmword ptr [rdi + 8*rcx - 136]
        vmovupd zmm2, zmmword ptr [rdi + 8*rcx - 64]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 128]
        vaddpd  zmm2, zmm2, zmmword ptr [rdi + 8*rcx - 72]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 120]
        vaddpd  zmm3, zmm2, zmmword ptr [rdi + 8*rcx - 56]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 112]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 48]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 104]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 40]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 96]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 32]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 88]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 24]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 80]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 16]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 8]
        vaddpd  zmm1, zmm1, zmm2
        vaddpd  zmm2, zmm3, zmmword ptr [rdi + 8*rcx]
        vmulpd  zmm1, zmm1, zmm0
        vmovupd zmmword ptr [rdx], zmm1
        vmulpd  zmm1, zmm2, zmm0
        vmovupd zmmword ptr [rdx + 64], zmm1
        add     rcx, 16
        sub     rdx, -128
        cmp     rcx, r11
        jl      L80

Versus without it:

L176:
        vfmadd231pd     zmm1, zmm0, zmmword ptr [rbx + 8*rdi - 192] # zmm1 = (zmm0 * mem) + zmm1
        vfmadd231pd     zmm2, zmm0, zmmword ptr [rbx + 8*rdi - 128] # zmm2 = (zmm0 * mem) + zmm2
        vfmadd231pd     zmm3, zmm0, zmmword ptr [rbx + 8*rdi - 64] # zmm3 = (zmm0 * mem) + zmm3
        vfmadd231pd     zmm4, zmm0, zmmword ptr [rbx + 8*rdi] # zmm4 = (zmm0 * mem) + zmm4
        inc     rdi
        cmp     rsi, rdi
        jne     L176

This version unrolls the i loop by 4 and doesn't unroll the o loop at all.
It also uses fmas, instead of adds with a multiply at the end. Which is faster is architecture dependent.
On some computers, adds and fmas are the same fast, meaning you wouldn't want the multiply at the end. Bizarely, on Haswell, fma is actually faster than add, but more commonly fma has a lower throughput.
That's an optimiation LoopVectoriation (or LLVM, given associativity) should thus be making. Just observing.

In general, LoopVectorization should do better with more static size info. In can also sometimes help compile times, e.g. it can make a big difference with 2d concolutions.

VSCode debug with @einsum stops in macro.jl

Dear Michael,

I'm using @einsum from Tullio and I get an unexpected behavior when debugging a julia program in VSCode, https://www.julia-vscode.org/. The debugger always stops at the same line in macro.jl.
Say, I have a program like

using Tullio: @einsum

v = [1; 2; 3];
w = [4; 5; 6];       # <== 1st breakpoint
@einsum a[i,j] := v[i] * w[j]; 
println("a=", a); 
b  = 4 * a;          # <== 2nd breakpoint 
println("b=", b);

After I start the debugger with F5 VSCode shows correct behavior at the 1st breakpoint, see the attached file 2_debugger_at_1st_BP.PNG. When I continue with F5 the debugger stops in line 826 of macro.jl. See attached file 3_debugger_stops_in_macro_jl.PNG. Another continue with F5 leads me to the 2nd BP, see file 4_debugger_at_2n_BP.PNG.

In case I have several @einsum statements the debugger stops several times always in macro.jl at line 826. That can be disturbing if I want to debug code after many @einsum statements.

This behavior may be also related to VSCode but maybe you have a hint?

Thank you

Ralf

After starting the debugger with F5 at the 1st breakpoint (as expected)

Continue F5 and the debugger stops in macro.jl (unexpected behavior)

Another Continue F5 and the debugger stops at the 2nd break point (as expected)

InexactError when the intermediate reduction type differs from the final type

I came into an InexactError error when using Tullio to perform reduction and then abs2 over complex vectors.

The MWE might be easier to explain the issue:

using FFTW, Tullio
x = rand(120) .- .5
y = rand(120) .- .5
u = rand(500) .- .5
v = rand(500) .- .5
a = rand(120)
k = 2π/.01

function tfunc(u,v,a,k,x,y)
    @tullio vals[i] := exp(im * (k * (x[j]*u[i] + y[j]*v[i]) + a[j])) |> abs2
end

tfunc(u,v,a,k,x,y)

Which results in

ERROR: LoadError: InexactError: Float64(-13.863000778992978 + 2.707549484306247im)

In my example the issue could easily be resolved by performing the abs2 directly on the vals array returned by the @tullio macro.
I briefly discussed this with @mcabbott on slack and this appears to only happen when threading is enabled in @tullio.

Allow `div` in range inference?

This thread provides an example: https://discourse.julialang.org/t/sequential-multiple-hcat-s-of-a-range-of-rows/49852

julia> A = reshape(0:9, :,2) .+ 0
5×2 Matrix{Int64}:
 0  5
 1  6
 2  7
 3  8
 4  9

julia> multihcat(A, steps) = reduce(hcat, A[i+1:end-(steps-i), :] for i in 0:steps);

julia> multihcat(A, 2)  # desired output
3×6 Matrix{Int64}:
 0  5  1  6  2  7
 1  6  2  7  3  8
 2  7  3  8  4  9

julia> @tullio B[r,c] := begin
           i=r+(c-1)÷2
           A[i, mod(c)]
       end  (r in 1:3, c in 1:6)
3×6 Matrix{Int64}:
 0  5  1  6  2  7
 1  6  2  7  3  8
 2  7  3  8  4  9

julia> @tullio B[r,c] := A[r+(c-1)÷2, mod(c)]  r in 1:3
ERROR: LoadError: "don't know how to handle (c - 1) ÷ 2, sorry"

Different results: for loop vs Tullio

Hey,

the code below produces different results when using Tullio and for loops. Once I'm using Float64 the results are very close to each other. Is this due to @fastmath? If yes, can we disable it?

Thanks,

Felix

using TestImages
using Tullio
img = convert(Array{Float32}, testimage("fabio_gray_512"))
x = reshape(img, 512, 512)

@tullio res_t = (x[i + 1, j] + x[i - 1, j])^2 (i in 2:size(x)[1] - 1)
print(res_t, "\n")

@tullio res_t = (x[i + 1, j] + x[i - 1, j])^2
print(res_t, "\n")

res_f = 0
for i = 2:size(x)[1] -1
    for j = 1:size(x)[2]
        res_f += (x[i+1, j] + x[i - 1, j])^2
    end
end
print(res_f)

output:

219891.22
219891.22
219914.31

Make sure to avoid race conditions with nontrivial LHS indexing

The following is valid Tullio code:

@tullio C[i ÷ 2, k] += A[i, j] * B[j, k]

but a naive implementation could incur in a race condition (as different values of i end up writing in the same slot in C). Following a conversation on slack, the default should be "safe but slow", so avoid multithreading on unsafe indices.

besselj0 error

Code that used to work on 1.5.3 and still works with e.g. sin

k = LinRange(.16,37.8,Np) |> collect
rv = LinRange(0,15,1000) |> collect
gvec = zero(rv)

@tullio gvec[i] = rv[i]*sin(k[j]*rv[i])

and used to work with besselj0 now fails

@tullio gvec[i] = rv[i]*besselj0(k[j]*rv[i])
ERROR: LoadError: StackOverflowError:
Stacktrace:
 [1] besselj0(x::VectorizationBase.Vec{4, Float64})
   @ SpecialFunctions ~/.julia/packages/SpecialFunctions/24S26/src/bessel.jl:204
in expression starting at /Users/abradley/Dropbox/Research/Projects/GPESpectra/scripts/2d_trap_correlation_spectra.jl:760

mcabbott / tullio.jl Goto Github PK

tullio.jl's People

Contributors

Stargazers

Watchers

Forkers

tullio.jl's Issues

Example 1(wrong):

Example 2(right):

Quick Start

Recommend Projects

Recommend Topics

Recommend Org