Comments (6)
Interesting, this deserves some thought.
At the moment, threader is happy to break index i among threads, and o until it thinks the blocks are small enough. Of course it shouldn't here, since its range is so short, but the macro doesn't do anything with this knowledge. Perhaps it needs some threshold, so that literal ranges shorter than 32 never get broken: they bypass threader entirely, and either get pasted in literally or get wrapped with Static.
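The threshold idea could look something like this minimal sketch (maybe_thread and run_threaded are illustrative names under stated assumptions, not Tullio's actual API):

```julia
# Hypothetical sketch of the proposed threshold: literal ranges shorter
# than 32 bypass the threading machinery entirely.
const THREAD_THRESHOLD = 32

# Stand-in for threader's work-splitting; here it just runs serially.
run_threaded(f, range) = foreach(f, range)

function maybe_thread(f, range)
    if length(range) < THREAD_THRESHOLD
        foreach(f, range)        # short range: run inline, no threading overhead
    else
        run_threaded(f, range)   # long range: hand off to the threaded path
    end
end
```

The point is just that the short-range branch involves no task spawning or block-size heuristics at all.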
It's possible that some ranges defined dynamically should also become static; maybe something like out[i,j] := x[i-a, j-b] * $k[a,b] should mark k as fixed & small, and call Act(..., Val(size(k)), ...)?
from tullio.jl.
At the moment, threader is happy to break index i among threads, and o until it thinks the blocks are small enough. Of course it shouldn't here, since its range is so short, but the macro doesn't do anything with this knowledge. Perhaps it needs some threshold, so that literal ranges shorter than 32 never get broken,
That sounds reasonable. It's also not likely to make much of a difference if the loop is >= 32 iterations anyway. E.g., in matmul:
https://github.com/chriselrod/PaddedMatrices.jl/blob/master/docs/src/assets/sizedarraybenchmarks_cascadelake_AVX512.svg
On the AVX2 CPUs I tested on, the difference disappears more quickly.
But the behavior is problem dependent. For example, testing dynamic vs static 2d convolutions:
using PaddedMatrices, VectorizationBase, LoopVectorization, OffsetArrays
sleep_to_reduce_throttling = true # added sleeps because my laptop would thermally throttle when running benchmarks
function conv2d!(out, A, kern)
@avx for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
out
end
A = rand(100,100);
# StrideArrays are 1-indexed by default. There's no convenient API for changing the indexing yet.
# Manually constructing one with -1-based indexing:
function set_indexing(A::PtrArray{S,D,T,N,C,B,R}, O::Tuple{Vararg{StaticInt,N}}) where {S,D,T,N,C,B,R}
PtrArray(
VectorizationBase.StridedPointer{T, N, C, B, R}(
pointer(A), VectorizationBase.bytestrides(A), O
),
PaddedMatrices.size(A), PaddedMatrices.dense_dims(A)
)
end
set_indexing(A::StrideArray, O::Tuple) = StrideArray(set_indexing(PtrArray(A), O), A.data)
kern_base1 = @StrideArray rand(3,3);
kern3x3 = set_indexing(kern_base1, (StaticInt(-1), StaticInt(-1)));
kern3x3dynamic = OffsetArray(Array(kern_base1), -1:1, -1:1);
out1 = OffsetArray(similar(A, size(A) .- 2), 1, 1);
out2 = similar(out1);
kern_base2 = @StrideArray rand(5,5);
kern5x5 = set_indexing(kern_base2, (StaticInt(-2), StaticInt(-2)));
kern5x5dynamic = OffsetArray(Array(kern_base2), -2:2, -2:2);
out3 = OffsetArray(similar(A, size(A) .- 4), 2, 2);
out4 = similar(out3);
kern_base3 = @StrideArray rand(7,7);
kern7x7 = set_indexing(kern_base3, (StaticInt(-3), StaticInt(-3)));
kern7x7dynamic = OffsetArray(Array(kern_base3), -3:3, -3:3);
out5 = OffsetArray(similar(A, size(A) .- 6), 3, 3);
out6 = similar(out5);
@time conv2d!(out1, A, kern3x3);
@time conv2d!(out3, A, kern5x5);
@time conv2d!(out5, A, kern7x7);
@time conv2d!(out2, A, kern3x3dynamic);
@time conv2d!(out4, A, kern5x5dynamic);
@time conv2d!(out6, A, kern7x7dynamic);
out1 ≈ out2
out3 ≈ out4
out5 ≈ out6
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out1, $A, $kern3x3)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out2, $A, $kern3x3dynamic)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out3, $A, $kern5x5)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out4, $A, $kern5x5dynamic)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out5, $A, $kern7x7)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out6, $A, $kern7x7dynamic)
Dynamic vs static makes very little difference in runtime on the laptop, but a massive difference in compilation time.
Starting a fresh session on my cascadelake desktop (so the first @time is also the first @avx loop getting compiled):
julia> @time conv2d!(out1, A, kern3x3);
7.970366 seconds (13.60 M allocations: 810.243 MiB, 2.70% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
2.379231 seconds (3.21 M allocations: 243.932 MiB, 2.81% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
1.604241 seconds (2.51 M allocations: 170.270 MiB, 5.69% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
16.502299 seconds (8.42 M allocations: 908.643 MiB, 3.18% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000022 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000040 seconds
julia> out1 ≈ out2
true
julia> out3 ≈ out4
true
julia> out5 ≈ out6
true
julia> @benchmark conv2d!($out1, $A, $kern3x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.287 μs (0.00% GC)
median time: 3.332 μs (0.00% GC)
mean time: 3.336 μs (0.00% GC)
maximum time: 6.570 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark conv2d!($out2, $A, $kern3x3dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.317 μs (0.00% GC)
median time: 3.365 μs (0.00% GC)
mean time: 3.369 μs (0.00% GC)
maximum time: 8.738 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark conv2d!($out3, $A, $kern5x5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.799 μs (0.00% GC)
median time: 4.819 μs (0.00% GC)
mean time: 4.826 μs (0.00% GC)
maximum time: 11.534 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark conv2d!($out4, $A, $kern5x5dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.015 μs (0.00% GC)
median time: 5.037 μs (0.00% GC)
mean time: 5.044 μs (0.00% GC)
maximum time: 11.246 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.837 μs (0.00% GC)
median time: 9.894 μs (0.00% GC)
mean time: 9.911 μs (0.00% GC)
maximum time: 37.181 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.412 μs (0.00% GC)
median time: 10.566 μs (0.00% GC)
mean time: 10.582 μs (0.00% GC)
maximum time: 36.509 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
It'd take a fair number of different kernel sizes before being dynamically sized actually results in faster compilation, although the performance difference is pretty small. (That's because in the dynamic case, it generates a ton of code -- hurting compile times -- to try to handle different cases well, seemingly doing a rather good job of it.)
This difference was more extreme on my laptop (also AVX512):
julia> @time conv2d!(out1, A, kern3x3);
7.329627 seconds (13.59 M allocations: 810.053 MiB, 3.12% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
1.926970 seconds (3.21 M allocations: 243.925 MiB, 12.00% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
1.180640 seconds (2.51 M allocations: 170.332 MiB, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
23.222628 seconds (8.42 M allocations: 908.639 MiB, 0.82% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000030 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000040 seconds
julia> out1 ≈ out2
true
julia> out3 ≈ out4
true
julia> out5 ≈ out6
true
julia> sleep(5)
julia> @benchmark conv2d!($out1, $A, $kern3x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.765 μs (0.00% GC)
median time: 4.065 μs (0.00% GC)
mean time: 4.127 μs (0.00% GC)
maximum time: 10.190 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> sleep(5)
julia> @benchmark conv2d!($out2, $A, $kern3x3dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.784 μs (0.00% GC)
median time: 4.168 μs (0.00% GC)
mean time: 4.216 μs (0.00% GC)
maximum time: 10.834 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> sleep(5)
julia> @benchmark conv2d!($out3, $A, $kern5x5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.463 μs (0.00% GC)
median time: 7.098 μs (0.00% GC)
mean time: 7.060 μs (0.00% GC)
maximum time: 30.257 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
julia> sleep(5)
julia> @benchmark conv2d!($out4, $A, $kern5x5dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.467 μs (0.00% GC)
median time: 6.939 μs (0.00% GC)
mean time: 7.054 μs (0.00% GC)
maximum time: 27.534 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
julia> sleep(5)
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.405 μs (0.00% GC)
median time: 13.347 μs (0.00% GC)
mean time: 13.537 μs (0.00% GC)
maximum time: 104.795 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep(5)
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.413 μs (0.00% GC)
median time: 14.342 μs (0.00% GC)
mean time: 14.644 μs (0.00% GC)
maximum time: 63.376 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
So I think convenient syntax for Act(..., Val(size(k)), ...) could be pretty nice. A single dynamic dispatch in otherwise type-stable code is cheap, and it could actually (markedly) improve compile times in some cases.
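As a minimal sketch of that function-barrier pattern (apply_kernel and _apply_kernel are made-up names, not Tullio's API), a single dynamic dispatch on Val(length(k)) makes the kernel size a compile-time constant for the inner method:

```julia
# Hypothetical sketch: the outer function pays one dynamic dispatch to
# lift the kernel length into the type domain; the inner method is then
# fully type-stable with K known at compile time.
apply_kernel(x, k) = _apply_kernel(x, k, Val(length(k)))  # the single dynamic dispatch

function _apply_kernel(x, k, ::Val{K}) where {K}
    out = similar(x, length(x) - K + 1)
    @inbounds for i in eachindex(out)
        s = zero(eltype(out))
        for j in 1:K              # K is a compile-time constant, so this loop can unroll
            s += x[i + j - 1] * k[j]
        end
        out[i] = s
    end
    out
end
```

Here a fresh specialization compiles per kernel length, which is exactly the compile-time-vs-runtime trade-off discussed above.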
FWIW, I see an even bigger difference on my desktop, although none on my laptop:
julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 349.402 ns (0.00% GC)
median time: 358.785 ns (0.00% GC)
mean time: 357.761 ns (0.00% GC)
maximum time: 693.308 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 214
julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 419.482 ns (0.00% GC)
median time: 430.020 ns (0.00% GC)
mean time: 431.951 ns (0.00% GC)
maximum time: 823.864 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 199
julia> 419.482 / 349.402
1.2005712617558
julia> 313.283 / 280.783
1.1157477482611127
julia> versioninfo()
Julia Version 1.7.0-DEV.77
Commit 80ace52b03* (2020-12-15 02:48 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin17.7.0)
CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 6
but more surprisingly, on a fresh start:
julia> @time conv2d!(out1, A, kern3x3);
6.694171 seconds (12.28 M allocations: 717.005 MiB, 3.71% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
1.119579 seconds (1.92 M allocations: 117.363 MiB, 4.69% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
2.276177 seconds (3.08 M allocations: 194.546 MiB, 3.49% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
0.847391 seconds (1.51 M allocations: 93.533 MiB, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000019 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000036 seconds
julia> out1 ≈ out2
true
with a fairly similar pattern of run-times, e.g.
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.348 μs (0.00% GC)
median time: 20.834 μs (0.00% GC)
mean time: 20.944 μs (0.00% GC)
maximum time: 53.954 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.576 μs (0.00% GC)
median time: 20.816 μs (0.00% GC)
mean time: 21.320 μs (0.00% GC)
maximum time: 67.207 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Interesting. I also have an old AVX2 laptop, and on it I see that same pattern in compile times:
julia> @time conv2d!(out1, A, kern3x3);
20.462194 seconds (12.28 M allocations: 717.382 MiB, 2.90% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
3.902242 seconds (1.92 M allocations: 117.260 MiB, 2.88% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
9.182271 seconds (3.08 M allocations: 194.384 MiB, 1.87% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
3.568283 seconds (1.51 M allocations: 93.453 MiB, 5.07% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000061 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000099 seconds
julia> out1 ≈ out2
true
julia> out3 ≈ out4
true
julia> out5 ≈ out6
true
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out1, $A, $kern3x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.718 μs (0.00% GC)
median time: 11.872 μs (0.00% GC)
mean time: 12.009 μs (0.00% GC)
maximum time: 37.180 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out2, $A, $kern3x3dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 14.384 μs (0.00% GC)
median time: 14.749 μs (0.00% GC)
mean time: 15.468 μs (0.00% GC)
maximum time: 68.761 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out3, $A, $kern5x5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 34.142 μs (0.00% GC)
median time: 34.248 μs (0.00% GC)
mean time: 34.413 μs (0.00% GC)
maximum time: 118.955 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out4, $A, $kern5x5dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 37.028 μs (0.00% GC)
median time: 37.144 μs (0.00% GC)
mean time: 37.459 μs (0.00% GC)
maximum time: 65.043 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 71.995 μs (0.00% GC)
median time: 72.250 μs (0.00% GC)
mean time: 75.938 μs (0.00% GC)
maximum time: 229.936 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 74.693 μs (0.00% GC)
median time: 75.132 μs (0.00% GC)
mean time: 75.854 μs (0.00% GC)
maximum time: 120.941 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
But it did at least get slightly better performance.
That the performance was otherwise similar here, I guess, makes sense. To see what LoopVectorization is doing:
lsdynamic = let kern = kern7x7dynamic, out = out5
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
ls3static = let kern = kern3x3, out = out1
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
ls5static = let kern = kern5x5, out = out3
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
ls7static = let kern = kern7x7, out = out5
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
LoopVectorization.choose_order(lsdynamic) |> Base.tail
LoopVectorization.choose_order(ls3static) |> Base.tail
LoopVectorization.choose_order(ls5static) |> Base.tail
LoopVectorization.choose_order(ls7static) |> Base.tail
On the computer with AVX(2):
julia> LoopVectorization.choose_order(lsdynamic) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
julia> LoopVectorization.choose_order(ls3static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
julia> LoopVectorization.choose_order(ls5static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
julia> LoopVectorization.choose_order(ls7static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
It's always doing the exact same thing, although it won't be generating clean-up loops for the statically sized case, so it should still be compiling faster. Not sure why that doesn't seem to be the case.
The first two of the three symbols indicate which loops are unrolled, i.e. the second J and I loops, by 3x and 5x respectively.
The third symbol indicates that the first I loop was vectorized (SIMD).
Versus on a computer with AVX512:
julia> LoopVectorization.choose_order(lsdynamic) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 5, 9)
julia> LoopVectorization.choose_order(ls3static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 14)
julia> LoopVectorization.choose_order(ls5static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 5, 9)
julia> LoopVectorization.choose_order(ls7static) |> Base.tail
(Symbol("I#2#"), Symbol("J#2#"), Symbol("I#1#"), 6, 7)
Here, it's actually doing something different. For one thing, it always fully unrolls the J_2 loop (note that the order of I_2 and J_2 switched for the 7x7 static case, so the 7 does correspond to J_2).
Additionally, it just does way more unrolling. With the 5 x 9 unrolling it does in the dynamic case (vs 3 x 5 with AVX2), it generates way more code for all the clean-up loops, and I think some part of codegen is blowing up compile times because of it (I suspect it scales worse than O(N) with function size). The statically sized arrays don't need clean-up loops, and hence generate little enough code to coast by with passable compile times.
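The clean-up-loop cost can be seen in miniature with plain Julia (sum_unrolled4 is just an illustration, unrelated to LoopVectorization's actual codegen): unrolling a dynamic-length loop by 4 requires a scalar remainder loop, while a static length that is a multiple of the unroll factor would need none.

```julia
# Illustration of why dynamic trip counts force clean-up code: the main
# body handles blocks of 4, and a second loop mops up the remainder.
function sum_unrolled4(x)
    n = length(x)
    s = 0.0
    i = 1
    while i + 3 <= n              # main unrolled-by-4 body
        s += x[i] + x[i+1] + x[i+2] + x[i+3]
        i += 4
    end
    while i <= n                  # clean-up loop for the dynamic remainder
        s += x[i]
        i += 1
    end
    s
end
```

With a statically known n divisible by 4, a compiler can prove the second loop dead and emit only the unrolled body; every extra unroll factor in the dynamic case multiplies the clean-up code it must emit instead.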
The reason the statically sized variants take longer to compile with AVX2 is that it will heuristically unroll statically sized loops automatically, if the cumulative unrolling is below a certain threshold. That means with AVX2, it'll unroll the J_1 loop too.
Given that this doesn't help runtime performance, but has a marked negative impact on compile time, I'm going to have to think about how to change this behavior.
LoopVectorization's master branch on the AVX2 laptop:
julia> @time conv2d!(out1, A, kern3x3);
19.889589 seconds (12.30 M allocations: 717.991 MiB, 1.79% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
1.859886 seconds (1.26 M allocations: 78.106 MiB, 3.53% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
2.917579 seconds (1.61 M allocations: 103.400 MiB, 7.48% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
3.146461 seconds (1.51 M allocations: 93.459 MiB, 4.89% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000059 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000101 seconds
Not great, but better.