Comments (16)
Did some more experimentation, and with JuliaGPU/GPUArrays.jl#454 the kernel is pretty fast now, getting 120 GB/s out of 180 GB/s (previously it only reached 8 GB/s). I guess that getting rid of the keeps branches would further improve performance, but this is a very good start already.
from metal.jl.
I suspect this is because @cartesianidx(dest, i) in the broadcast kernel, which calls CartesianIndices(x)[i], ends up doing some division?
; @ /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:55 within `broadcast_kernel`
; ┌ @ /Users/kichi/.julia/dev/GPUArrays/src/device/indexing.jl:81 within `macro expansion`
; │┌ @ multidimensional.jl:262 within `CartesianIndices`
; ││┌ @ abstractarray.jl:95 within `axes`
; │││┌ @ tuple.jl:222 within `map`
; ││││┌ @ range.jl:455 within `oneto`
; │││││┌ @ range.jl:453 within `OneTo` @ range.jl:440
; ││││││┌ @ promotion.jl:488 within `max`
; │││││││┌ @ essentials.jl:489 within `ifelse`
%8 = icmp sgt i64 %.unpack5.unpack, 0
; │└└└└└└└
; │┌ @ abstractarray.jl:1241 within `getindex`
; ││┌ @ abstractarray.jl:1286 within `_getindex`
; │││┌ @ abstractarray.jl:1293 within `_to_subscript_indices`
; ││││┌ @ abstractarray.jl:1315 within `_unsafe_ind2sub`
; │││││┌ @ abstractarray.jl:2639 within `_ind2sub` @ abstractarray.jl:2677
; ││││││┌ @ int.jl:86 within `-`
%9 = add nsw i64 %6, -1
; ││││││└
; ││││││┌ @ abstractarray.jl:2690 within `_ind2sub_recurse`
; │││││││┌ @ abstractarray.jl:2697 within `_div`
; ││││││││┌ @ int.jl:288 within `div`
br i1 %8, label %pass, label %fail
If I change the broadcast kernel to use @linearidx instead of @cartesianidx, it runs almost as fast as the handwritten kernel and the result seems to be correct. But surely there must be a reason @cartesianidx was used?
Yes, it does an integer division in order to go from a linear index (which @cartesianidx uses) to, well, a Cartesian (N-dimensional) one. That's generally pretty slow, but there's no easy way around it. Some platforms offer multi-dimensional (ND) hardware indices (e.g., CUDA's threadIdx.xyz, or OpenCL's get_global_id(dim)), but those typically aren't fully general (either not supporting arbitrary rank, or being limited in size), so there's no straightforward path to using them without special-casing kernels per platform and input.
And the reason we use Cartesian indices and not linear ones is that you can pass non-contiguous GPU inputs to the broadcast kernel (say, a non-contiguous view), in which case there's no linear index that would work.
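To make the cost concrete, here's a minimal sketch of what converting a linear index into a Cartesian one entails (lin2cart is a hypothetical helper, not the actual GPUArrays code): one integer rem and div per dimension, which is exactly the division showing up in the IR above.

```julia
# Hypothetical sketch of what CartesianIndices(x)[i] does under the hood:
# peeling off one subscript per dimension costs an integer rem and div each.
function lin2cart(dims::NTuple{N,Int}, i::Int) where {N}
    i0 = i - 1                      # zero-based linear offset
    sub = Vector{Int}(undef, N)
    for d in 1:N
        sub[d] = i0 % dims[d] + 1   # rem: subscript within dimension d
        i0 = i0 ÷ dims[d]           # div: carry into the next dimension
    end
    return Tuple(sub)
end

lin2cart((10, 10), 23)  # (3, 3), matching CartesianIndices((10, 10))[23]
```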
I assumed the non-contiguous view would still take care of translating linear indices to whatever indices underneath, or is that not true in general?
julia> a = MtlArray(Float32.(reshape(1:100, 10, 10)))
10×10 MtlMatrix{Float32}:
1.0 11.0 21.0 31.0 41.0 51.0 61.0 71.0 81.0 91.0
2.0 12.0 22.0 32.0 42.0 52.0 62.0 72.0 82.0 92.0
3.0 13.0 23.0 33.0 43.0 53.0 63.0 73.0 83.0 93.0
4.0 14.0 24.0 34.0 44.0 54.0 64.0 74.0 84.0 94.0
5.0 15.0 25.0 35.0 45.0 55.0 65.0 75.0 85.0 95.0
6.0 16.0 26.0 36.0 46.0 56.0 66.0 76.0 86.0 96.0
7.0 17.0 27.0 37.0 47.0 57.0 67.0 77.0 87.0 97.0
8.0 18.0 28.0 38.0 48.0 58.0 68.0 78.0 88.0 98.0
9.0 19.0 29.0 39.0 49.0 59.0 69.0 79.0 89.0 99.0
10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0
julia> a1 = @view a[[1,3,5], [2,4,6]]
3×3 view(::MtlMatrix{Float32}, [1, 3, 5], [2, 4, 6]) with eltype Float32:
11.0 31.0 51.0
13.0 33.0 53.0
15.0 35.0 55.0
julia> a1[1:9]'
1×9 adjoint(::MtlVector{Float32}) with eltype Float32:
11.0 13.0 15.0 31.0 33.0 35.0 51.0 53.0 55.0
Or can we specialize the kernel if the Broadcasted is one of the known styles? (Not sure if this works; I'm not super familiar with the broadcasting machinery.)
Just curious, does CUDA have this slowdown as well?
CUDA suffers from this as well, but recent generations of the hardware have been much much better at dealing with more complex operations like this. That's also probably why we haven't reimplemented broadcast ourselves yet.
IIRC, another potential slowdown is how the keeps tuple argument is passed (it's something that could be specialized on). Again, it used to be a problem on older NVIDIA GPUs, but I don't think it matters that much anymore.
Can we do something like "if dest and bc are contiguous, use linearidx"?
For dest I'm thinking of checking lastindex(dest)-firstindex(dest)+1 == length(dest), and for bc checking if the axes are unit ranges? I'm not confident about this, however; it'd be good to add some tests covering different types of inputs. Do you have an example of a non-contiguous view that breaks linear indexing?
For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?
Can we do something like "if dest and bc are contiguous, use linearidx"?
That would make for a good fast path, yes.
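A sketch of what such a fast path could look like (pick_index is a hypothetical helper; the real dispatch would live inside the broadcast kernel launcher), leaning on Base's IndexStyle trait, which answers exactly this question:

```julia
# Hypothetical fast-path selection: use cheap linear indexing when both the
# destination and the Broadcasted support it, otherwise fall back to the
# general Cartesian path (which pays a div/rem per dimension).
function pick_index(dest, bc)
    if IndexStyle(dest) isa IndexLinear && IndexStyle(bc) isa IndexLinear
        return :linear
    else
        return :cartesian
    end
end

bc1 = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(identity, zeros(4)))
bc2 = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(identity, zeros(2, 2)))
pick_index(zeros(4), bc1)     # :linear — a 1-d Broadcasted is IndexLinear
pick_index(zeros(2, 2), bc2)  # :cartesian — an N-d Broadcasted is IndexCartesian
```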
Alternatively, or in addition, we have the power of a JIT, of course. Hacker's Delight has two chapters on integer division, including one on integer division by known constants. We could probably use that to generate a kernel which performs the indexing without any actual division. We'd need to generate a new kernel for each size of input, but maybe that isn't too bad for GPU applications.
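For illustration, here's the multiply-and-shift trick for a divisor known ahead of time (here d = 10 with 64-bit operands; the magic constant is cld(2^67, 10)). This is the kind of code LLVM emits for constant denominators, and what a size-specialized kernel would get for free:

```julia
# Division by the known constant 10, with no division instruction:
# n ÷ 10 == (n * m) >> 67 for all UInt64 n, where m = cld(UInt128(1) << 67, 10).
const MAGIC10 = 0xCCCCCCCCCCCCCCCD  # == cld(UInt128(1) << 67, 10)

div10(n::UInt64) = UInt64((UInt128(n) * MAGIC10) >> 67)

div10(UInt64(1234))  # 123, same as 1234 ÷ 10
```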
Do you have an example of making a non-contiguous view that breaks linear indexing?
julia> A = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.427609 0.264936
0.841486 0.732059
# contiguous
julia> view(A, :, 1)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.42760944
0.84148616
# strided
julia> view(A, 1, :)
2-element view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1, :) with eltype Float32:
0.42760944
0.2649361
# arbitrary
julia> view(A, cu([1, 2, 4]))
3-element view(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, [1, 2, 4]) with eltype Float32:
0.42760944
0.84148616
0.73205924
For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?
Similar to the above: IIRC keeps is used to determine which part of the multidimensional indices to keep. Since keeps is a parameter to the kernel, that results in code with branches that depend on a parameter load. We should consider specializing the kernel instead, as for GPU execution the compilation overhead is probably worth it.
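A sketch of the kind of specialization meant here (keep_index is a hypothetical helper): lifting keeps from a runtime tuple argument into the type domain via Val, so the per-dimension branches constant-fold at compile time instead of loading and branching on a kernel parameter.

```julia
# Runtime version: each `keeps[d]` is a branch on a value loaded from a
# kernel parameter.
keep_index(I::NTuple{N,Int}, keeps::NTuple{N,Bool}) where {N} =
    ntuple(d -> keeps[d] ? I[d] : 1, Val(N))

# Specialized version: `keeps` is part of the type, so the compiler sees the
# booleans as constants and can fold the branches away entirely.
keep_index(I::NTuple{N,Int}, ::Val{keeps}) where {N,keeps} =
    ntuple(d -> keeps[d] ? I[d] : 1, Val(N))

keep_index((3, 5), Val((true, false)))  # (3, 1)
```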
@vchuravy mentioned that LLVM is actually smart enough to optimize the CartesianIndices getindex (which causes the integer division) away, as long as the CartesianIndices is constant. KernelAbstractions does this by making the group size getters (blockDim etc.) constant and constructing a CartesianIndices around those; however, I'm not sure that's what we want (I'm not sure how that deals with ND input sizes, and IIUC it causes idle threads when the shape of the inputs doesn't match the shape of the launch). However, that also means it might just be as simple as making sure that the input iterators are passed to the kernel as a constant, i.e., as a Val. And if we're specializing the kernel anyway, it might make sense to make the size queries constant as well, e.g., by wrapping inputs in StaticArrays.SizedArray.
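As a sketch of the Val idea (cartesian_from_linear is a hypothetical helper): once the dimensions are part of the type, the denominators of the divisions inside CartesianIndices(dims)[i] are compile-time constants, which LLVM can strength-reduce to multiplies and shifts.

```julia
# The tuple of sizes is a type parameter, not a runtime argument, so every
# division inside the CartesianIndices getindex has a constant denominator.
cartesian_from_linear(::Val{dims}, i::Int) where {dims} =
    CartesianIndices(dims)[i]

cartesian_from_linear(Val((10, 10)), 23)  # CartesianIndex(3, 3)
```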
Lots of things to try here 🙂
Thanks Tim for the pointers! I'll experiment a bit.
I tried passing the iterator dimensions as a Val; it does eliminate the CartesianIndices getindex, but the division is simply moved to the point where it indexes into the 2D array, so it's still slow. Then I tried wrapping the arrays in a SizedArray, but Metal complained that they are not isbits. Actually, even if we can make the array sizes static, would LLVM be able to turn the divisions into bit-twiddling code? If not, we'll still have a division to deal with.
Here are my logs: https://gist.github.com/maxwindiff/8d530dd43b3d626bf0aac3fe4f943ee9
I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:
22:21:25 GPUArrays (master) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 9a44fb8..1cba162 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -55,7 +55,7 @@ end
i = 0
while i < nelem
i += 1
- I = @cartesianidx(dest, i)
+ I = @linearidx(dest, i)
@inbounds dest[I] = bc′[I]
end
return
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, :, 1) .*= 10; A
3×3 MtlMatrix{Int64}:
10 2 3
40 5 6
70 8 9
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, 1, :) .*= 10; A
3×3 MtlMatrix{Int64}:
10 20 30
4 5 6
7 8 9
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, [1,3,4,5]) .*= 10; A
3×3 MtlMatrix{Int64}:
10 20 3
4 50 6
70 8 9
And broadcasting on contiguous arrays is now fast (the @cartesianidx version took 110 ms):
julia> @btime @Metal.sync begin
global a .= b
end
11.805 ms (196 allocations: 5.95 KiB)
So it seems this lets us pay for the expensive calculations only when needed. Is there any reason we can't do this?
the division is simply moved to the time it indexes into the 2D array,
Yeah, the sizes probably need to be constant too. And yes, it might be a little tricky to pass the necessary information to the kernel, but in principle I think it should work (or at least if LLVM knows how to optimize given constant denominators, which @vchuravy suggested it does).
I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:
Huh, yes that appears to be the case. If we're using non-contiguous arrays, the SubArray can still be indexed linearly, so I guess we shouldn't be using cartesian indices here. That would make the solution much easier :-) I'll create a PR to run CI on this.
Actually, arbitrary broadcastable objects don't appear to be linearly indexable:
julia> bc = Base.Broadcast.broadcasted(identity, rand(2,2))
Base.Broadcast.Broadcasted(identity, ([0.7110622323877549 0.5733414791299075; 0.620167447659585 0.017047464877951835],))
julia> IndexStyle(bc)
IndexCartesian()
julia> bc[4]
ERROR: BoundsError: attempt to access Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}} at index [4]
Stacktrace:
[1] throw_boundserror(A::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Tuple{Int64})
@ Base ./abstractarray.jl:744
[2] checkbounds
@ ./broadcast.jl:621 [inlined]
[3] getindex(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Int64)
@ Base.Broadcast ./broadcast.jl:609
[4] top-level scope
@ REPL[17]:1
JuliaGPU/GPUArrays.jl#454 should help a bit, but most broadcast objects apparently have IndexStyle == IndexCartesian().
I'm curious, what makes the bc′ in _copyto! support linear indexing when forced? At first I thought this is because the index is passed as-is to the MtlArray inside the Broadcasted object, but then I tried something like a .= 10 and it still worked. Or is this all undefined behavior and I just happened to get lucky?
0:18:34 GPUArrays (tb/avoid_cartesian) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 5a8da77..64b46f3 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -58,15 +58,14 @@ end
# HACK: cartesian iteration is slow, so avoid it if possible
# TODO: generalize this, much like how `eachindex` picks an appropriate iterator
# (::AnyGPUArray methods shouldn't be hardcoded to use linear indexing)
- I = if isa(IndexStyle(dest), IndexLinear) && isa(IndexStyle(bc′), IndexLinear)
- @linearidx(dest, i)
- else
- @cartesianidx(dest, i)
- end
+ I = @linearidx(dest, i)
@inbounds dest[I] = bc′[I]
end
return
end
+ @info "IndexStyle" IndexStyle(dest) IndexStyle(bc) IndexStyle(bc′)
+ @info "typeof" typeof(dest) typeof(bc) typeof(bc′)
+ @info "bc′" bc′
+ @show @inbounds bc′[10]
elements = length(dest)
elements_per_thread = typemax(Int)
heuristic = launch_heuristic(backend(dest), broadcast_kernel, dest, bc′, 1;
julia> a .= b;
┌ Info: IndexStyle
│ IndexStyle(dest) = IndexLinear()
│ IndexStyle(bc) = IndexCartesian()
└ IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│ typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│ typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{MtlMatrix{Float32}}}
└ typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}
┌ Info: bc′
└ bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}(Float32[-0.45997995 -0.9323782 … -0.78047013 -0.10763502; -0.41987753 -0.958211 … -0.11566621 -0.01383841; … ; -0.21164703 -0.6332897 … -0.13726139 -0.76514477; -0.71918046 -0.14217287 … -0.42983288 -0.99601007], (true, true), (1, 1)),))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = -0.624596f0
julia> a .= 10;
┌ Info: IndexStyle
│ IndexStyle(dest) = IndexLinear()
│ IndexStyle(bc) = IndexCartesian()
└ IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│ typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│ typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
└ typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
┌ Info: bc′
└ bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (10,))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = 10
I think it may start to matter when adding singleton dimensions into the mix.
Just a note here: the GPUArrays.jl change got reverted, JuliaGPU/GPUArrays.jl#463, so performance is bad again.