Comments (8)
In-place is slow because it's hitting the init === nothing
code path: https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl#L230-L237
If GPUArrays.neutral_element() returned nothing by default, we might be able to do something like:
-Base.mapreducedim!(f, op, R::AnyGPUArray, A::AbstractArray) = mapreducedim!(f, op, R, A)
+Base.mapreducedim!(f, op, R::AnyGPUArray{T}, A::AbstractArray) where {T} =
+ mapreducedim!(f, op, R, A; init=neutral_element(op, T))
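For reference, the neutral element of op over T is a value v with op(v, x) == x for every x. A minimal sketch of what such a lookup looks like (a hypothetical neutral helper for illustration, not GPUArrays' actual neutral_element implementation):

```julia
# Hypothetical sketch: map a reduction operator to its neutral element.
# op(neutral, x) == x must hold for every x of type T.
neutral(::typeof(+), ::Type{T}) where {T} = zero(T)
neutral(::typeof(*), ::Type{T}) where {T} = one(T)
neutral(::typeof(max), ::Type{T}) where {T<:Real} = typemin(T)
neutral(::typeof(min), ::Type{T}) where {T<:Real} = typemax(T)
```

With something along these lines, the init= keyword in the diff above could pick the right starting value per operator and avoid falling into the slow init === nothing path.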
With my limited knowledge of Julia fundamentals, I don't know how to extend neutral_element without breaking compatibility. Let me try other ways of initializing the partial reduction array...
from metal.jl.
I tried writing a reduction kernel that only supports 1-D arrays, and it's about 4x as fast as the current implementation. I'll see whether the generic implementation can be improved further.
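For context, a 1-D kernel like this typically follows the standard two-pass pattern: each threadgroup reduces a strided slice of the input to one partial value, then the partials are reduced in a second pass. A CPU sketch of the access pattern (illustrative only, not the actual Metal kernel):

```julia
# CPU sketch of a two-pass 1-D reduction.
# Pass 1: each "group" g walks the input with stride ngroups (a grid-stride
# loop) and folds its elements into a single partial result.
function partial_reduce(op, a, ngroups)
    partials = similar(a, ngroups)
    for g in 1:ngroups
        acc = a[g]                 # first element owned by this group
        i = g + ngroups
        while i <= length(a)
            acc = op(acc, a[i])
            i += ngroups
        end
        partials[g] = acc
    end
    return partials
end

# Pass 2: reduce the per-group partials (on the GPU this would be a
# single-threadgroup tree reduction; here plain reduce suffices).
reduce_1d(op, a; ngroups=256) =
    reduce(op, partial_reduce(op, a, min(ngroups, length(a))))
```

The grid-stride access keeps neighbouring "threads" reading neighbouring elements, which is the coalescing-friendly layout a real GPU kernel wants.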
from metal.jl.
Reductions are generally faster now; however, in-place reduction is still very slow:
julia> @btime sum($a)
760.000 μs (0 allocations: 0 bytes)
5.001241f6
julia> @btime sum($Ma)
708.083 μs (1197 allocations: 27.76 KiB)
5.001241f6
julia> @btime Metal.@sync sum!($r, $Ma)
376.325 ms (101199 allocations: 2.00 MiB)
1-element MtlVector{Float32}:
5.001241f6
from metal.jl.
Similar results on Ventura as well, so that's not the cause.
from metal.jl.
On my computer:
julia> a = fill(Float32(1.0), 10*1024*1024);
julia> da = MtlArray(a);
julia> @btime sum(a)
844.500 μs (1 allocation: 16 bytes)
1.048576f7
julia> @btime sum(da)
2.707 ms (857 allocations: 23.66 KiB)
1.048576f7
Now, if we do this:
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 1d84d78..900f21d 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -123,7 +123,7 @@ function partial_mapreduce_device(f, op, neutral, maxthreads, Rreduce, Rother, s
ireduce += localDim_reduce * groupDim_reduce
end
- val = reduce_group(op, val, neutral, shuffle, maxthreads)
+ val = 1 # reduce_group(op, val, neutral, shuffle, maxthreads)
# write back to memory
if localIdx_reduce == 1
It still takes 2 ms simply to loop over the input/output arrays!
julia> @btime sum(da)
2.015 ms (857 allocations: 23.66 KiB)
1.0f0
My guess is that the slowdown comes from all the indexing calculations (same as #41). But the Cartesian indexing is even harder to eliminate here, because the reduction process itself can add additional dimensions...
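To illustrate why the index math adds up: with Cartesian indexing, every element access has to split a linear thread index into per-dimension coordinates via div/rem chains, and integer division is comparatively expensive on GPUs. A toy sketch for a column-major (m, n) array (illustrative, not the generated kernel code):

```julia
# Converting a linear index into a Cartesian (row, col) pair for an m×n
# column-major array costs one rem and one div per access; with more
# dimensions (which the reduction machinery can introduce), the chain grows.
lin_to_cart(i, m) = ((i - 1) % m + 1, (i - 1) ÷ m + 1)
```

For contiguous arrays a purely linear loop avoids this arithmetic entirely, which is presumably part of why the 1-D kernel above is so much faster.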
from metal.jl.
Good article: https://betterprogramming.pub/optimizing-parallel-reduction-in-metal-for-apple-m1-8e8677b49b01
from metal.jl.
It would be good to write a similar blog post using Metal.jl.
from metal.jl.
Another good source: https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/reduce.metal + https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/reduce.cpp
from metal.jl.