Comments (16)
Did some more experimentation, and with JuliaGPU/GPUArrays.jl#454 the kernel is pretty fast now, getting 120 GB/s out of 180 GB/s (previously it only reached 8 GB/s). I guess that getting rid of the keeps branches would further improve performance, but this is a very good start already.
from metal.jl.
I suspect this is because @cartesianidx(dest, i) in the broadcast kernel, which calls CartesianIndices(x)[i], ends up doing some division?
; @ /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:55 within `broadcast_kernel`
; ┌ @ /Users/kichi/.julia/dev/GPUArrays/src/device/indexing.jl:81 within `macro expansion`
; │┌ @ multidimensional.jl:262 within `CartesianIndices`
; ││┌ @ abstractarray.jl:95 within `axes`
; │││┌ @ tuple.jl:222 within `map`
; ││││┌ @ range.jl:455 within `oneto`
; │││││┌ @ range.jl:453 within `OneTo` @ range.jl:440
; ││││││┌ @ promotion.jl:488 within `max`
; │││││││┌ @ essentials.jl:489 within `ifelse`
%8 = icmp sgt i64 %.unpack5.unpack, 0
; │└└└└└└└
; │┌ @ abstractarray.jl:1241 within `getindex`
; ││┌ @ abstractarray.jl:1286 within `_getindex`
; │││┌ @ abstractarray.jl:1293 within `_to_subscript_indices`
; ││││┌ @ abstractarray.jl:1315 within `_unsafe_ind2sub`
; │││││┌ @ abstractarray.jl:2639 within `_ind2sub` @ abstractarray.jl:2677
; ││││││┌ @ int.jl:86 within `-`
%9 = add nsw i64 %6, -1
; ││││││└
; ││││││┌ @ abstractarray.jl:2690 within `_ind2sub_recurse`
; │││││││┌ @ abstractarray.jl:2697 within `_div`
; ││││││││┌ @ int.jl:288 within `div`
br i1 %8, label %pass, label %fail
If I change the broadcast kernel to use @linearidx instead of @cartesianidx, it runs almost as fast as the handwritten kernel and the result seems to be correct. But surely there must be a reason @cartesianidx was used?
Yes, it does an integer division in order to go from a linear index (which @cartesianidx uses) to, well, a Cartesian (N-dimensional) one. That's generally pretty slow, but there's no easy way around it. Some platforms offer multi-dimensional (ND) hardware indices (e.g., CUDA's threadIdx.xyz, or OpenCL's get_global_id(dim)), but those typically aren't fully general (either not supporting arbitrary rank, or being limited in size), so there's no straightforward path to using them without special-casing kernels per platform and input.
And the reason we use Cartesian indices and not linear ones is that you can pass non-contiguous GPU inputs to the broadcast kernel (say, a non-contiguous view), in which case there's no linear index that would work.
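To make the cost concrete, here's a minimal sketch of what converting a linear index into a Cartesian one entails (lin2cart is a hypothetical helper, not the actual GPUArrays code): one integer rem and div per dimension, which is exactly the division showing up in the IR above.

```julia
# Hypothetical sketch of what CartesianIndices(x)[i] does under the hood:
# peeling off one subscript per dimension costs an integer rem and div each.
function lin2cart(dims::NTuple{N,Int}, i::Int) where {N}
    i0 = i - 1                      # zero-based linear offset
    sub = Vector{Int}(undef, N)
    for d in 1:N
        sub[d] = i0 % dims[d] + 1   # rem: subscript within dimension d
        i0 = i0 ÷ dims[d]           # div: carry into the next dimension
    end
    return Tuple(sub)
end

lin2cart((10, 10), 23)  # (3, 3), matching CartesianIndices((10, 10))[23]
```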
I assumed the non-contiguous view would still take care of translating linear indices to whatever indices underneath, or is that not true in general?
julia> a = MtlArray(Float32.(reshape(1:100, 10, 10)))
10×10 MtlMatrix{Float32}:
1.0 11.0 21.0 31.0 41.0 51.0 61.0 71.0 81.0 91.0
2.0 12.0 22.0 32.0 42.0 52.0 62.0 72.0 82.0 92.0
3.0 13.0 23.0 33.0 43.0 53.0 63.0 73.0 83.0 93.0
4.0 14.0 24.0 34.0 44.0 54.0 64.0 74.0 84.0 94.0
5.0 15.0 25.0 35.0 45.0 55.0 65.0 75.0 85.0 95.0
6.0 16.0 26.0 36.0 46.0 56.0 66.0 76.0 86.0 96.0
7.0 17.0 27.0 37.0 47.0 57.0 67.0 77.0 87.0 97.0
8.0 18.0 28.0 38.0 48.0 58.0 68.0 78.0 88.0 98.0
9.0 19.0 29.0 39.0 49.0 59.0 69.0 79.0 89.0 99.0
10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0
julia> a1 = @view a[[1,3,5], [2,4,6]]
3×3 view(::MtlMatrix{Float32}, [1, 3, 5], [2, 4, 6]) with eltype Float32:
11.0 31.0 51.0
13.0 33.0 53.0
15.0 35.0 55.0
julia> a1[1:9]'
1×9 adjoint(::MtlVector{Float32}) with eltype Float32:
11.0 13.0 15.0 31.0 33.0 35.0 51.0 53.0 55.0
Or can we specialize the kernel if the Broadcasted is one of the known styles? (Not sure if this works; I'm not super familiar with the broadcasting machinery.)
Just curious, does CUDA have this slowdown as well?
CUDA suffers from this as well, but recent generations of the hardware have been much much better at dealing with more complex operations like this. That's also probably why we haven't reimplemented broadcast ourselves yet.
IIRC, another potential slowdown is how the keeps tuple argument is passed (it's something that could be specialized on). Again, it used to be a problem on older NVIDIA GPUs, but I don't think it matters that much anymore.
Can we do something like "if dest and bc are contiguous, use linearidx"?
For dest I'm thinking of checking lastindex(dest)-firstindex(dest)+1 == length(dest), and for bc checking if the axes are unit ranges? I'm not confident about this, however; it'd be good to add some tests covering different types of inputs. Do you have an example of a non-contiguous view that breaks linear indexing?
For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?
Can we do something like "if dest and bc are contiguous, use linearidx"?
That would make for a good fast path, yes.
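A sketch of what such a fast path could look like (pick_index is a hypothetical helper; the real dispatch would live inside the broadcast kernel launcher), leaning on Base's IndexStyle trait, which answers exactly this question:

```julia
# Hypothetical fast-path selection: use cheap linear indexing when both the
# destination and the Broadcasted support it, otherwise fall back to the
# general Cartesian path (which pays a div/rem per dimension).
function pick_index(dest, bc)
    if IndexStyle(dest) isa IndexLinear && IndexStyle(bc) isa IndexLinear
        return :linear
    else
        return :cartesian
    end
end

bc1 = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(identity, zeros(4)))
bc2 = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(identity, zeros(2, 2)))
pick_index(zeros(4), bc1)     # :linear — a 1-d Broadcasted is IndexLinear
pick_index(zeros(2, 2), bc2)  # :cartesian — an N-d Broadcasted is IndexCartesian
```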
Alternatively, or in addition, we have the power of a JIT, of course. Hacker's Delight has two chapters on integer division, including one on integer division by known constants. We could probably use that to generate a kernel which performs the indexing without any actual division. We'd need to generate a new kernel for each size of input, but maybe that isn't too bad for GPU applications.
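For illustration, here's the multiply-and-shift trick for a divisor known ahead of time (here d = 10 with 64-bit operands; the magic constant is cld(2^67, 10)). This is the kind of code LLVM emits for constant denominators, and what a size-specialized kernel would get for free:

```julia
# Division by the known constant 10, with no division instruction:
# n ÷ 10 == (n * m) >> 67 for all UInt64 n, where m = cld(UInt128(1) << 67, 10).
const MAGIC10 = 0xCCCCCCCCCCCCCCCD  # == cld(UInt128(1) << 67, 10)

div10(n::UInt64) = UInt64((UInt128(n) * MAGIC10) >> 67)

div10(UInt64(1234))  # 123, same as 1234 ÷ 10
```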
Do you have an example of making a non-contiguous view that breaks linear indexing?
julia> A = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.427609 0.264936
0.841486 0.732059
# contiguous
julia> view(A, :, 1)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.42760944
0.84148616
# strided
julia> view(A, 1, :)
2-element view(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 1, :) with eltype Float32:
0.42760944
0.2649361
# arbitrary
julia> view(A, cu([1, 2, 4]))
3-element view(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, [1, 2, 4]) with eltype Float32:
0.42760944
0.84148616
0.73205924
For keeps (I assume that's Extruded.keeps) what kind of specialization are you thinking about?
Similar to the above: IIRC keeps is used to determine which part of the multidimensional indices to keep. Since keeps is a parameter to the kernel, that results in code with branches that depend on a parameter load. We should consider specializing the kernel instead, as for GPU execution the compilation overhead is probably worth it.
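A sketch of the kind of specialization meant here (keep_index is a hypothetical helper): lifting keeps from a runtime tuple argument into the type domain via Val, so the per-dimension branches constant-fold at compile time instead of loading and branching on a kernel parameter.

```julia
# Runtime version: each `keeps[d]` is a branch on a value loaded from a
# kernel parameter.
keep_index(I::NTuple{N,Int}, keeps::NTuple{N,Bool}) where {N} =
    ntuple(d -> keeps[d] ? I[d] : 1, Val(N))

# Specialized version: `keeps` is part of the type, so the compiler sees the
# booleans as constants and can fold the branches away entirely.
keep_index(I::NTuple{N,Int}, ::Val{keeps}) where {N,keeps} =
    ntuple(d -> keeps[d] ? I[d] : 1, Val(N))

keep_index((3, 5), Val((true, false)))  # (3, 1)
```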
@vchuravy mentioned that LLVM is actually smart enough to optimize the CartesianIndices getindex (which causes the integer division) away, as long as the CartesianIndices is constant. KernelAbstractions does this by making the group size getters (blockDim etc.) constant and constructing a CartesianIndices around those; however, I'm not sure that's what we want (I'm not sure how that deals with ND input sizes, and IIUC it causes idle threads when the shape of the inputs doesn't match the shape of the launch). However, that also means it might just be as simple as making sure that the input iterators are passed to the kernel as a constant, i.e., as a Val. And if we're specializing the kernel anyway, it might make sense to make the size queries constant as well, e.g., by wrapping inputs in StaticArrays.SizedArray.
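As a sketch of the Val idea (cartesian_from_linear is a hypothetical helper): once the dimensions are part of the type, the denominators of the divisions inside CartesianIndices(dims)[i] are compile-time constants, which LLVM can strength-reduce to multiplies and shifts.

```julia
# The tuple of sizes is a type parameter, not a runtime argument, so every
# division inside the CartesianIndices getindex has a constant denominator.
cartesian_from_linear(::Val{dims}, i::Int) where {dims} =
    CartesianIndices(dims)[i]

cartesian_from_linear(Val((10, 10)), 23)  # CartesianIndex(3, 3)
```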
Lots of things to try here 🙂
Thanks Tim for the pointers! I'll experiment a bit.
I tried passing the iterator dimensions as a Val; it does eliminate the CartesianIndices getindex, but the division is simply moved to the point where it indexes into the 2D array, so it's still slow. Then I tried wrapping the arrays in a SizedArray, but Metal complained that they are not isbits. Actually, even if we can make the array sizes static, would LLVM be able to turn the divisions into bit-twiddling code? If not, we'll still have a division to deal with.
Here are my logs: https://gist.github.com/maxwindiff/8d530dd43b3d626bf0aac3fe4f943ee9
I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:
22:21:25 GPUArrays (master) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 9a44fb8..1cba162 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -55,7 +55,7 @@ end
i = 0
while i < nelem
i += 1
- I = @cartesianidx(dest, i)
+ I = @linearidx(dest, i)
@inbounds dest[I] = bc′[I]
end
return
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, :, 1) .*= 10; A
3×3 MtlMatrix{Int64}:
10 2 3
40 5 6
70 8 9
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, 1, :) .*= 10; A
3×3 MtlMatrix{Int64}:
10 20 30
4 5 6
7 8 9
julia> A = MtlArray([1 2 3; 4 5 6; 7 8 9]); view(A, [1,3,4,5]) .*= 10; A
3×3 MtlMatrix{Int64}:
10 20 3
4 50 6
70 8 9
And broadcasting on contiguous arrays is now fast (the @cartesianidx version took 110 ms):
julia> @btime @Metal.sync begin
global a .= b
end
11.805 ms (196 allocations: 5.95 KiB)
So it seems this lets us pay for the expensive calculations only when needed. Is there any reason we can't do this?
the division is simply moved to the time it indexes into the 2D array,
Yeah, the sizes probably need to be constant too. And yes, it might be a little tricky to pass the necessary information to the kernel, but in principle I think it should work (or at least if LLVM knows how to optimize given constant denominators, which @vchuravy suggested it does).
I also have another question about indexing non-contiguous views... if I change @cartesianidx to @linearidx, it appears broadcasting on non-contiguous views still works correctly:
Huh, yes that appears to be the case. If we're using non-contiguous arrays, the SubArray can still be indexed linearly, so I guess we shouldn't be using cartesian indices here. That would make the solution much easier :-) I'll create a PR to run CI on this.
Actually, arbitrary broadcastable objects don't appear to be linearly indexable:
julia> bc = Base.Broadcast.broadcasted(identity, rand(2,2))
Base.Broadcast.Broadcasted(identity, ([0.7110622323877549 0.5733414791299075; 0.620167447659585 0.017047464877951835],))
julia> IndexStyle(bc)
IndexCartesian()
julia> bc[4]
ERROR: BoundsError: attempt to access Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}} at index [4]
Stacktrace:
[1] throw_boundserror(A::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Tuple{Int64})
@ Base ./abstractarray.jl:744
[2] checkbounds
@ ./broadcast.jl:621 [inlined]
[3] getindex(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Nothing, typeof(identity), Tuple{Matrix{Float64}}}, I::Int64)
@ Base.Broadcast ./broadcast.jl:609
[4] top-level scope
@ REPL[17]:1
JuliaGPU/GPUArrays.jl#454 should help a bit, but most broadcast objects apparently have IndexStyle == IndexCartesian().
I'm curious, what makes the bc′ in _copyto! support linear indexing when forced? At first I thought this is because the index is passed as-is to the MtlArray inside the Broadcasted object, but then I tried something like a .= 10 and it still worked. Or is this all undefined behavior and I just happened to get lucky?
0:18:34 GPUArrays (tb/avoid_cartesian) % git diff
diff --git a/src/host/broadcast.jl b/src/host/broadcast.jl
index 5a8da77..64b46f3 100644
--- a/src/host/broadcast.jl
+++ b/src/host/broadcast.jl
@@ -58,15 +58,14 @@ end
# HACK: cartesian iteration is slow, so avoid it if possible
# TODO: generalize this, much like how `eachindex` picks an appropriate iterator
# (::AnyGPUArray methods shouldn't be hardcoded to use linear indexing)
- I = if isa(IndexStyle(dest), IndexLinear) && isa(IndexStyle(bc′), IndexLinear)
- @linearidx(dest, i)
- else
- @cartesianidx(dest, i)
- end
+ I = @linearidx(dest, i)
@inbounds dest[I] = bc′[I]
end
return
end
+ @info "IndexStyle" IndexStyle(dest) IndexStyle(bc) IndexStyle(bc′)
+ @info "typeof" typeof(dest) typeof(bc) typeof(bc′)
+ @info "bc′" bc′
+ @show @inbounds bc′[10]
elements = length(dest)
elements_per_thread = typemax(Int)
heuristic = launch_heuristic(backend(dest), broadcast_kernel, dest, bc′, 1;
julia> a .= b;
┌ Info: IndexStyle
│ IndexStyle(dest) = IndexLinear()
│ IndexStyle(bc) = IndexCartesian()
└ IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│ typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│ typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{MtlMatrix{Float32}}}
└ typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}
┌ Info: bc′
└ bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (Base.Broadcast.Extruded{MtlMatrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}(Float32[-0.45997995 -0.9323782 … -0.78047013 -0.10763502; -0.41987753 -0.958211 … -0.11566621 -0.01383841; … ; -0.21164703 -0.6332897 … -0.13726139 -0.76514477; -0.71918046 -0.14217287 … -0.42983288 -0.99601007], (true, true), (1, 1)),))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = -0.624596f0
julia> a .= 10;
┌ Info: IndexStyle
│ IndexStyle(dest) = IndexLinear()
│ IndexStyle(bc) = IndexCartesian()
└ IndexStyle(bc′) = IndexCartesian()
┌ Info: typeof
│ typeof(dest) = MtlMatrix{Float32} (alias for MtlArray{Float32, 2})
│ typeof(bc) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
└ typeof(bc′) = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}
┌ Info: bc′
└ bc′ = Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{2}}(identity, (10,))
#= /Users/kichi/.julia/dev/GPUArrays/src/host/broadcast.jl:69 =# @inbounds(bc′[10]) = 10
I think it may start to matter when adding singleton dimensions into the mix.
Just a note here: the GPUArrays.jl change got reverted, JuliaGPU/GPUArrays.jl#463, so performance is bad again.