Comments (6)
Interesting, this deserves some thought.
At the moment, threader is happy to break index i among threads, and o until it thinks the blocks are small enough. Of course it shouldn't here, since its range is so short, but the macro doesn't do anything with this knowledge. Perhaps it needs some threshold, so that literal ranges shorter than 32 never get broken: they bypass threader entirely, and either get pasted in literally or get wrapped with Static.
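The threshold idea could look something like this minimal sketch (maybe_thread and run_threaded are illustrative names under stated assumptions, not Tullio's actual API):

```julia
# Hypothetical sketch of the proposed threshold: literal ranges shorter
# than 32 bypass the threading machinery entirely.
const THREAD_THRESHOLD = 32

# Stand-in for threader's work-splitting; here it just runs serially.
run_threaded(f, range) = foreach(f, range)

function maybe_thread(f, range)
    if length(range) < THREAD_THRESHOLD
        foreach(f, range)        # short range: run inline, no threading overhead
    else
        run_threaded(f, range)   # long range: hand off to the threaded path
    end
end
```

The point is just that the short-range branch involves no task spawning or block-size heuristics at all.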
It's possible that some ranges defined dynamically should also become static; maybe something like out[i,j] := x[i-a, j-b] * $k[a,b] should mark k as fixed & small, and call Act(..., Val(size(k)), ...)?
from tullio.jl.
At the moment, threader is happy to break index i among threads, and o until it thinks the blocks are small enough. Of course it shouldn't here, since its range is so short, but the macro doesn't do anything with this knowledge. Perhaps it needs some threshold, so that literal ranges shorter than 32 never get broken,
That sounds reasonable. It's also not likely to make much of a difference if the loop is >= 32 iterations anyway. E.g., in matmul:
https://github.com/chriselrod/PaddedMatrices.jl/blob/master/docs/src/assets/sizedarraybenchmarks_cascadelake_AVX512.svg
On the AVX2 CPUs I tested on, the difference disappears more quickly.
But the behavior is problem dependent. For example, testing dynamic vs static 2d convolutions:
using PaddedMatrices, VectorizationBase, LoopVectorization, OffsetArrays
sleep_to_reduce_throttling = true # added sleeps because my laptop would thermally throttle when running benchmarks
function conv2d!(out, A, kern)
@avx for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
out
end
A = rand(100,100);
# StrideArrays are 1-indexed by default. There's no convenient API for changing the indexing yet.
# Manually constructing one with -1-based indexing:
function set_indexing(A::PtrArray{S,D,T,N,C,B,R}, O::Tuple{Vararg{StaticInt,N}}) where {S,D,T,N,C,B,R}
PtrArray(
VectorizationBase.StridedPointer{T, N, C, B, R}(
pointer(A), VectorizationBase.bytestrides(A), O
),
PaddedMatrices.size(A), PaddedMatrices.dense_dims(A)
)
end
set_indexing(A::StrideArray, O::Tuple) = StrideArray(set_indexing(PtrArray(A), O), A.data)
kern_base1 = @StrideArray rand(3,3);
kern3x3 = set_indexing(kern_base1, (StaticInt(-1), StaticInt(-1)));
kern3x3dynamic = OffsetArray(Array(kern_base1), -1:1, -1:1);
out1 = OffsetArray(similar(A, size(A) .- 2), 1, 1);
out2 = similar(out1);
kern_base2 = @StrideArray rand(5,5);
kern5x5 = set_indexing(kern_base2, (StaticInt(-2), StaticInt(-2)));
kern5x5dynamic = OffsetArray(Array(kern_base2), -2:2, -2:2);
out3 = OffsetArray(similar(A, size(A) .- 4), 2, 2);
out4 = similar(out3);
kern_base3 = @StrideArray rand(7,7);
kern7x7 = set_indexing(kern_base3, (StaticInt(-3), StaticInt(-3)));
kern7x7dynamic = OffsetArray(Array(kern_base3), -3:3, -3:3);
out5 = OffsetArray(similar(A, size(A) .- 6), 3, 3);
out6 = similar(out5);
@time conv2d!(out1, A, kern3x3);
@time conv2d!(out3, A, kern5x5);
@time conv2d!(out5, A, kern7x7);
@time conv2d!(out2, A, kern3x3dynamic);
@time conv2d!(out4, A, kern5x5dynamic);
@time conv2d!(out6, A, kern7x7dynamic);
out1 ≈ out2
out3 ≈ out4
out5 ≈ out6
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out1, $A, $kern3x3)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out2, $A, $kern3x3dynamic)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out3, $A, $kern5x5)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out4, $A, $kern5x5dynamic)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out5, $A, $kern7x7)
sleep_to_reduce_throttling && sleep(5)
@benchmark conv2d!($out6, $A, $kern7x7dynamic)
Dynamic vs static makes very little difference in runtime on the laptop, but a massive difference in compilation time.
Starting a fresh session on my cascadelake desktop (so the first @time is also the first @avx loop getting compiled):
julia> @time conv2d!(out1, A, kern3x3);
7.970366 seconds (13.60 M allocations: 810.243 MiB, 2.70% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
2.379231 seconds (3.21 M allocations: 243.932 MiB, 2.81% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
1.604241 seconds (2.51 M allocations: 170.270 MiB, 5.69% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
16.502299 seconds (8.42 M allocations: 908.643 MiB, 3.18% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000022 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000040 seconds
julia> out1 ≈ out2
true
julia> out3 ≈ out4
true
julia> out5 ≈ out6
true
julia> @benchmark conv2d!($out1, $A, $kern3x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.287 μs (0.00% GC)
median time: 3.332 μs (0.00% GC)
mean time: 3.336 μs (0.00% GC)
maximum time: 6.570 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark conv2d!($out2, $A, $kern3x3dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.317 μs (0.00% GC)
median time: 3.365 μs (0.00% GC)
mean time: 3.369 μs (0.00% GC)
maximum time: 8.738 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark conv2d!($out3, $A, $kern5x5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.799 μs (0.00% GC)
median time: 4.819 μs (0.00% GC)
mean time: 4.826 μs (0.00% GC)
maximum time: 11.534 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark conv2d!($out4, $A, $kern5x5dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.015 μs (0.00% GC)
median time: 5.037 μs (0.00% GC)
mean time: 5.044 μs (0.00% GC)
maximum time: 11.246 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.837 μs (0.00% GC)
median time: 9.894 μs (0.00% GC)
mean time: 9.911 μs (0.00% GC)
maximum time: 37.181 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.412 μs (0.00% GC)
median time: 10.566 μs (0.00% GC)
mean time: 10.582 μs (0.00% GC)
maximum time: 36.509 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
It'd take a fair number of different kernel sizes before being dynamically sized actually results in faster compilation, although the performance difference is pretty small. (That's because in the dynamic case, it generates a ton of code -- hurting compile times -- to try to handle different cases well, seemingly doing a rather good job of it.)
This difference was more extreme on my laptop (also AVX512):
julia> @time conv2d!(out1, A, kern3x3);
7.329627 seconds (13.59 M allocations: 810.053 MiB, 3.12% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
1.926970 seconds (3.21 M allocations: 243.925 MiB, 12.00% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
1.180640 seconds (2.51 M allocations: 170.332 MiB, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
23.222628 seconds (8.42 M allocations: 908.639 MiB, 0.82% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000030 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000040 seconds
julia> out1 ≈ out2
true
julia> out3 ≈ out4
true
julia> out5 ≈ out6
true
julia> sleep(5)
julia> @benchmark conv2d!($out1, $A, $kern3x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.765 μs (0.00% GC)
median time: 4.065 μs (0.00% GC)
mean time: 4.127 μs (0.00% GC)
maximum time: 10.190 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> sleep(5)
julia> @benchmark conv2d!($out2, $A, $kern3x3dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.784 μs (0.00% GC)
median time: 4.168 μs (0.00% GC)
mean time: 4.216 μs (0.00% GC)
maximum time: 10.834 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> sleep(5)
julia> @benchmark conv2d!($out3, $A, $kern5x5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.463 μs (0.00% GC)
median time: 7.098 μs (0.00% GC)
mean time: 7.060 μs (0.00% GC)
maximum time: 30.257 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
julia> sleep(5)
julia> @benchmark conv2d!($out4, $A, $kern5x5dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.467 μs (0.00% GC)
median time: 6.939 μs (0.00% GC)
mean time: 7.054 μs (0.00% GC)
maximum time: 27.534 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
julia> sleep(5)
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.405 μs (0.00% GC)
median time: 13.347 μs (0.00% GC)
mean time: 13.537 μs (0.00% GC)
maximum time: 104.795 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep(5)
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.413 μs (0.00% GC)
median time: 14.342 μs (0.00% GC)
mean time: 14.644 μs (0.00% GC)
maximum time: 63.376 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
So I think convenient syntax for Act(..., Val(size(k)), ...) could be pretty nice. A single dynamic dispatch in otherwise type-stable code is cheap, and it could actually (markedly) improve compile times in some cases.
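As a minimal sketch of that function-barrier pattern (apply_kernel and _apply_kernel are made-up names, not Tullio's API), a single dynamic dispatch on Val(length(k)) makes the kernel size a compile-time constant for the inner method:

```julia
# Hypothetical sketch: the outer function pays one dynamic dispatch to
# lift the kernel length into the type domain; the inner method is then
# fully type-stable with K known at compile time.
apply_kernel(x, k) = _apply_kernel(x, k, Val(length(k)))  # the single dynamic dispatch

function _apply_kernel(x, k, ::Val{K}) where {K}
    out = similar(x, length(x) - K + 1)
    @inbounds for i in eachindex(out)
        s = zero(eltype(out))
        for j in 1:K              # K is a compile-time constant, so this loop can unroll
            s += x[i + j - 1] * k[j]
        end
        out[i] = s
    end
    out
end
```

Here a fresh specialization compiles per kernel length, which is exactly the compile-time-vs-runtime trade-off discussed above.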
FWIW, I see an even bigger difference on my desktop, although none on my laptop:
julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 349.402 ns (0.00% GC)
median time: 358.785 ns (0.00% GC)
mean time: 357.761 ns (0.00% GC)
maximum time: 693.308 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 214
julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 419.482 ns (0.00% GC)
median time: 430.020 ns (0.00% GC)
mean time: 431.951 ns (0.00% GC)
maximum time: 823.864 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 199
julia> 419.482 / 349.402
1.2005712617558
julia> 313.283 / 280.783
1.1157477482611127
julia> versioninfo()
Julia Version 1.7.0-DEV.77
Commit 80ace52b03* (2020-12-15 02:48 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin17.7.0)
CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 6
but more surprisingly, on a fresh start:
julia> @time conv2d!(out1, A, kern3x3);
6.694171 seconds (12.28 M allocations: 717.005 MiB, 3.71% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
1.119579 seconds (1.92 M allocations: 117.363 MiB, 4.69% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
2.276177 seconds (3.08 M allocations: 194.546 MiB, 3.49% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
0.847391 seconds (1.51 M allocations: 93.533 MiB, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000019 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000036 seconds
julia> out1 ≈ out2
true
with a fairly similar pattern of run-times, e.g.
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.348 μs (0.00% GC)
median time: 20.834 μs (0.00% GC)
mean time: 20.944 μs (0.00% GC)
maximum time: 53.954 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.576 μs (0.00% GC)
median time: 20.816 μs (0.00% GC)
mean time: 21.320 μs (0.00% GC)
maximum time: 67.207 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Interesting. I also have an old AVX2 laptop, and on it I see that same pattern in compile times:
julia> @time conv2d!(out1, A, kern3x3);
20.462194 seconds (12.28 M allocations: 717.382 MiB, 2.90% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
3.902242 seconds (1.92 M allocations: 117.260 MiB, 2.88% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
9.182271 seconds (3.08 M allocations: 194.384 MiB, 1.87% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
3.568283 seconds (1.51 M allocations: 93.453 MiB, 5.07% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000061 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000099 seconds
julia> out1 ≈ out2
true
julia> out3 ≈ out4
true
julia> out5 ≈ out6
true
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out1, $A, $kern3x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.718 μs (0.00% GC)
median time: 11.872 μs (0.00% GC)
mean time: 12.009 μs (0.00% GC)
maximum time: 37.180 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out2, $A, $kern3x3dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 14.384 μs (0.00% GC)
median time: 14.749 μs (0.00% GC)
mean time: 15.468 μs (0.00% GC)
maximum time: 68.761 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out3, $A, $kern5x5)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 34.142 μs (0.00% GC)
median time: 34.248 μs (0.00% GC)
mean time: 34.413 μs (0.00% GC)
maximum time: 118.955 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out4, $A, $kern5x5dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 37.028 μs (0.00% GC)
median time: 37.144 μs (0.00% GC)
mean time: 37.459 μs (0.00% GC)
maximum time: 65.043 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out5, $A, $kern7x7)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 71.995 μs (0.00% GC)
median time: 72.250 μs (0.00% GC)
mean time: 75.938 μs (0.00% GC)
maximum time: 229.936 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> sleep_to_reduce_throttling && sleep(5)
julia> @benchmark conv2d!($out6, $A, $kern7x7dynamic)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 74.693 μs (0.00% GC)
median time: 75.132 μs (0.00% GC)
mean time: 75.854 μs (0.00% GC)
maximum time: 120.941 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
But it did at least get slightly better performance.
That the performance was otherwise similar here, I guess, makes sense. To see what LoopVectorization is doing:
lsdynamic = let kern = kern7x7dynamic, out = out5
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
ls3static = let kern = kern3x3, out = out1
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
ls5static = let kern = kern5x5, out = out3
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
ls7static = let kern = kern7x7, out = out5
LoopVectorization.@avx_debug for I in CartesianIndices(out)
tmp = zero(eltype(out))
for J in CartesianIndices(kern)
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
end;
LoopVectorization.choose_order(lsdynamic) |> Base.tail
LoopVectorization.choose_order(ls3static) |> Base.tail
LoopVectorization.choose_order(ls5static) |> Base.tail
LoopVectorization.choose_order(ls7static) |> Base.tail
On the computer with AVX(2):
julia> LoopVectorization.choose_order(lsdynamic) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
julia> LoopVectorization.choose_order(ls3static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
julia> LoopVectorization.choose_order(ls5static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
julia> LoopVectorization.choose_order(ls7static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 5)
It's always doing the exact same thing, although it won't be generating clean-up loops for the statically sized case, so it should still be compiling faster. Not sure why that doesn't seem to be the case.
The first two of the three symbols indicate which loops are unrolled, i.e. the second J and I loops, by 3x and 5x respectively.
The third symbol indicates that the first I loop was vectorized (SIMD).
Versus on a computer with AVX512:
julia> LoopVectorization.choose_order(lsdynamic) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 5, 9)
julia> LoopVectorization.choose_order(ls3static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 14)
julia> LoopVectorization.choose_order(ls5static) |> Base.tail
(Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 5, 9)
julia> LoopVectorization.choose_order(ls7static) |> Base.tail
(Symbol("I#2#"), Symbol("J#2#"), Symbol("I#1#"), 6, 7)
Here, it's actually doing something different. For one thing, it always fully unrolls the J_2 loop (note that the order of I_2 and J_2 switched for the 7x7 static case, so the 7 does correspond to J_2).
Additionally, it just does way more unrolling. With the 5 x 9 unrolling it does in the dynamic case (vs 3 x 5 with AVX2), it generates way more code for all the clean-up loops, and I think some part of codegen is blowing up compile times because of it (I suspect it scales worse than O(N) with function size). The statically sized arrays don't need clean-up loops, and hence generate little enough code to coast by with passable compile times.
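The clean-up-loop cost can be seen in miniature with plain Julia (sum_unrolled4 is just an illustration, unrelated to LoopVectorization's actual codegen): unrolling a dynamic-length loop by 4 requires a scalar remainder loop, while a static length that is a multiple of the unroll factor would need none.

```julia
# Illustration of why dynamic trip counts force clean-up code: the main
# body handles blocks of 4, and a second loop mops up the remainder.
function sum_unrolled4(x)
    n = length(x)
    s = 0.0
    i = 1
    while i + 3 <= n              # main unrolled-by-4 body
        s += x[i] + x[i+1] + x[i+2] + x[i+3]
        i += 4
    end
    while i <= n                  # clean-up loop for the dynamic remainder
        s += x[i]
        i += 1
    end
    s
end
```

With a statically known n divisible by 4, a compiler can prove the second loop dead and emit only the unrolled body; every extra unroll factor in the dynamic case multiplies the clean-up code it must emit instead.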
The reason the statically sized variants take longer to compile with AVX2 is that it will heuristically unroll statically sized loops automatically, if the cumulative unrolling is below a certain threshold. That means with AVX2, it'll unroll the J_1 loop too.
Given that this doesn't help runtime performance, but has a marked negative impact on compile time, I'm going to have to think about how to change this behavior.
LoopVectorization's master branch on the AVX2 laptop:
julia> @time conv2d!(out1, A, kern3x3);
19.889589 seconds (12.30 M allocations: 717.991 MiB, 1.79% gc time, 100.00% compilation time)
julia> @time conv2d!(out3, A, kern5x5);
1.859886 seconds (1.26 M allocations: 78.106 MiB, 3.53% gc time, 100.00% compilation time)
julia> @time conv2d!(out5, A, kern7x7);
2.917579 seconds (1.61 M allocations: 103.400 MiB, 7.48% gc time, 100.00% compilation time)
julia> @time conv2d!(out2, A, kern3x3dynamic);
3.146461 seconds (1.51 M allocations: 93.459 MiB, 4.89% gc time, 100.00% compilation time)
julia> @time conv2d!(out4, A, kern5x5dynamic);
0.000059 seconds
julia> @time conv2d!(out6, A, kern7x7dynamic);
0.000101 seconds
Not great, but better.