Although Metal.jl works fine on Julia 1.9 locally, it for some reason fails on CI. May

I found where the test hangs for my machine. <a href="https://github.com/JuliaGPU/

Julia 1.9.0-beta4 output of <code class="notranslate"

Kernel hang related to noreturn function attributes,about juliagpu/metal.jl

Comments (28)

maleadt commented on May 25, 2024 1

So it doesn't hang if you run that operation in isolation? Did you try with --check-bounds=yes, to mimic the test runner?

from metal.jl.

maleadt commented on May 25, 2024 1

Thanks! This gets us much closer to something debuggable.

FYI, the debug layer needs both MTL_DEBUG_LAYER and MTL_SHADER_VALIDATION set, the first one enables debug mode, the second one toggles a specific kind of debug validation.

from metal.jl.

maleadt commented on May 25, 2024 1

Yeah there's still an issue.

from metal.jl.

christiangnrd commented on May 25, 2024

I'm on 13.2.1 but I can reproduce the hanging behaviour during tests locally. Except for my particular machine, instead of hanging during gpuarrays/random, it consistently hangs during gpuarrays/broadcasting.

I compared activity monitor behaviour between julia --project test/runstests gpuarrays/broadcasting (1.8.5) and julia +beta --project test/runstests gpuarrays/broadcasting (1.9.0-beta4) and in 1.8, for the julia process, cpu is at 97% the whole time and gpu is at ~20% until the test completes, while in 1.9 beta, the cpu is at 100% for a few seconds at the start while gpu is at ~4% for the julia process, then it very quickly drops to 0% for both and the process never finishes.

Another thing I've noticed is that when I stop the test with ctrl+c, I get Distributed.jl warnings telling me that the process was not removed, and indeed looking at activity monitor, I have 5 Julia processes that shouldn't be there.

from metal.jl.

maleadt commented on May 25, 2024

Another thing I've noticed is that when I stop the test with ctrl+c, I get Distributed.jl warnings telling me that the process was not removed, and indeed looking at activity monitor, I have 5 Julia processes that shouldn't be there.

That's just the test suite runner capturing your interrupt and exiting. One way to run tests in isolation is, from the Metal.jl repository, do something like julia --project -e 'using Pkg; Pkg.test(; julia_args=gpuarrays/broadcast), where the positional args passed as julia_args indicate the tests you want to run (try passing --help for more information). Hopefully the backtrace you get then from interrupting the process is more informative? Ideally we'd isolate this down to the single operation that makes Metal.jl hang on 1.9.

from metal.jl.

christiangnrd commented on May 25, 2024

I'm having trouble getting the test to run outside of the test suite runner because it comes from GPUArrays and I can't figure out how to call it from Metal.

I also did a bisect of Julia, and it seems like JuliaLang/julia@a12c2f0 is the commit that caused the issue (or at least caused it to surface).

from metal.jl.

maleadt commented on May 25, 2024

To run code from the GPUArrays test suite, you can do something like:

julia --project=test -L test/setup.jl
julia> AT=MtlArray
julia> eltypes = [Int16, Int32, Int64, Complex{Int16}, Complex{Int32}, Complex{Int64}, ComplexF16, ComplexF32]

That should set-up the required environment. Note that you also might have to start Julia with --check-bounds=yes, like the test runner does.

from metal.jl.

christiangnrd commented on May 25, 2024

I found where the test hangs for my machine.
This line, more specifically, here when it tries to run getindex.(Ref(x), 1) on the gpu array. I looked at the generated llvm and I'll be posting each version in their own comment. The one difference of note between the two is that the llvm code from 1.9.0-beta 4 has store atomic {}* %0, {}** %14 release, align 8 that is missing in the 1.8.5 version.

The code I'm referring to is right above the line that looks like

; └└

from metal.jl.

christiangnrd commented on May 25, 2024

Julia 1.8.5 output of @code_llvm f(gpu_in...) where f is x->getindex.(Ref(x), 1), and gpu_in is (Int16[0],):

;  @ none within `#315`
define i16 @"julia_#315_1053"({}* nonnull align 8 dereferenceable(32) %0) #0 {
top:
  %gcframe4 = alloca [3 x {}*], align 16
  %gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe4 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
  %3 = alloca { { {}*, i64 } }, align 8
  %4 = call {}*** inttoptr (i64 7226199708 to {}*** (i64)*)(i64 260) #4
; ┌ @ refpointer.jl:134 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %5 = bitcast [3 x {}*]* %gcframe4 to i64*
    store i64 4, i64* %5, align 16
    %6 = load {}**, {}*** %4, align 8
    %7 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
    %8 = bitcast {}** %7 to {}***
    store {}** %6, {}*** %8, align 8
    %9 = bitcast {}*** %4 to {}***
    store {}** %gcframe4.sub, {}*** %9, align 8
    %ptls_field5 = getelementptr inbounds {}**, {}*** %4, i64 2
    %10 = bitcast {}*** %ptls_field5 to i8**
    %ptls_load67 = load i8*, i8** %10, align 8
    %11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #5
    %12 = bitcast {}* %11 to i64*
    %13 = getelementptr inbounds i64, i64* %12, i64 -1
    store atomic i64 4547519648, i64* %13 unordered, align 8
    %14 = bitcast {}* %11 to {}**
    store {}* %0, {}** %14, align 8                                       ; HERE
; └└
; ┌ @ broadcast.jl:860 within `materialize`
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 0
   store {}* %11, {}** %2, align 16
   store {}* %11, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 1
   store i64 1, i64* %.fca.0.1.gep, align 8
   %15 = call i16 @j_copy_1055({ { {}*, i64 } }* nocapture readonly %3) #0
   %16 = load {}*, {}** %7, align 8
   %17 = bitcast {}*** %4 to {}**
   store {}* %16, {}** %17, align 8
; └
  ret i16 %15
}

from metal.jl.

christiangnrd commented on May 25, 2024

Julia 1.9.0-beta4 output of @code_llvm f(gpu_in...) where f is x->getindex.(Ref(x), 1), and gpu_in is (Int16[0],):

;  @ none within `#315`
define i16 @"julia_#315_1018"({}* noundef nonnull align 8 dereferenceable(32) %0) #0 {
top:
  %gcframe4 = alloca [3 x {}*], align 16
  %gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe4 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
  %3 = alloca { { {}*, i64 } }, align 8
  %4 = call {}*** inttoptr (i64 7226199708 to {}*** (i64)*)(i64 261) #3
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %5 = bitcast [3 x {}*]* %gcframe4 to i64*
    store i64 4, i64* %5, align 16
    %6 = load {}**, {}*** %4, align 8
    %7 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
    %8 = bitcast {}** %7 to {}***
    store {}** %6, {}*** %8, align 8
    %9 = bitcast {}*** %4 to {}***
    store {}** %gcframe4.sub, {}*** %9, align 8
    %ptls_field5 = getelementptr inbounds {}**, {}*** %4, i64 2
    %10 = bitcast {}*** %ptls_field5 to i8**
    %ptls_load67 = load i8*, i8** %10, align 8
    %11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #4
    %12 = bitcast {}* %11 to i64*
    %13 = getelementptr inbounds i64, i64* %12, i64 -1
    store atomic i64 6245911824, i64* %13 unordered, align 8
    %14 = bitcast {}* %11 to {}**
    store {}* null, {}** %14, align 8                                     ; HERE
    store atomic {}* %0, {}** %14 release, align 8                        ; AND ALSO HERE
; └└
; ┌ @ broadcast.jl:873 within `materialize`
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 0
   store {}* %11, {}** %2, align 16
   store {}* %11, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 1
   store i64 1, i64* %.fca.0.1.gep, align 8
   %15 = call i16 @j_copy_1020({ { {}*, i64 } }* nocapture readonly %3) #0
   %16 = load {}*, {}** %7, align 8
   %17 = bitcast {}*** %4 to {}**
   store {}* %16, {}** %17, align 8
; └
  ret i16 %15
}

from metal.jl.

christiangnrd commented on May 25, 2024

I was hoping the removal of libcmt would magically fix the 1.9 ci issues but unfortunately the tests still hang.

from metal.jl.

maleadt commented on May 25, 2024

Sadly this looks codegen related; the atomic store is probably a good clue. I hope I'll have time to investigate next week, but it's problematic that I can't reproduce this.

Which hardware do you have exactly?

from metal.jl.

christiangnrd commented on May 25, 2024

I have a 30-core M2 Max Macbook Pro and all of my comments have been using that computer.

I also have access to a base model M1 Mac mini in my lab where I was able to reproduce the hang.

from metal.jl.

christiangnrd commented on May 25, 2024

'I ran test Metal many times today on the lab M1 and I have some gist of test outputs. All were run using the currently released version so I don’t know how useful they’ll be, but I got some really weird results on 1.8.5 and 1.9.0-rc1.

Of the three 1.8.5 tests I ran and saved, the first one had a failure in gpuarrays/reductions/== isequal (and gpuarrays/math/power had no tests??), but the other 2 I ran right after passed. I tried to reproduce this failure on my M2 Max but all my 1.8.5 runs passed.
Gist for these tests

Then, I ran the Metal 0.2.0 tests on 1.9.0-rc1, and for the first time since seeing this issue, a test pass completed on 1.9.0, although with some errors. Gist

Hopeful, I reran it on 1.9.0-rc1 twice and unfortunately, both times the test hanged, with some errors in previous tests. The second time had a new error in the unified memory example (hint?) Gist 1, Gist 2

All of the gists are from M1 runs. On my M2 Max, all tests consistently pass on 1.8.5, and I never get any errors on 1.9.0-rc1 (other than broadcasting hanging).

I don’t know how useful these gists will be, but I figure since you can’t reproduce, I might as well give you as much as you can. Last time I had a bug like this that was very inconsistent, it ended being that I wasn’t initializing some values but they weren’t caught in debug mode because all the memory gets 0 initialized when running the debugger.

I’ll run the tests a few more times in the background this weekend from the Master branch to see if anything has changed since there’s been quite a few changes.

from metal.jl.

maleadt commented on May 25, 2024

I haven't had the time to reproduce (probably only next week), but since you have a system on which the tests consistently hang: can you post the MWE that makes it hang in a clean session, and could you try running with julia -g2 or after setting MTL_DEBUG_LAYER=1 and MTL_SHADER_VALIDATION=1 in your environment before loading Metal.jl?

from metal.jl.

christiangnrd commented on May 25, 2024

MWE (running from the Metal folder):

# Only run the failing test
$ julia --project=test -e'using Pkg; Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="metal-mwe-hang")'

# 1.8.5: Passes
$ julia --project=test test/runtests.jl 'gpuarrays/broadcasting'

# 1.8.5: Passes
$ julia -g2 --project=test test/runtests.jl 'gpuarrays/broadcasting'

# 1.9.0-rc1: Fails (hang)
$ Julia +beta --project=test test/runtests.jl 'gpuarrays/broadcasting'

# 1.9.0-rc1: Passes
$ Julia +beta -g2 --project=test test/runtests.jl 'gpuarrays/broadcasting'

from metal.jl.

maleadt commented on May 25, 2024

That's not really a minimal example, can you reduce it to any of the tests that make the GPU hang? You mentioned x->getindex.(Ref(x), 1) above?

EDIT: OK, I have access to a system on which this hangs as well. I'll try reducing next week.

from metal.jl.

christiangnrd commented on May 25, 2024

The metal-mwe-hang branch has all broadcasting tests but the x->getindex.(Ref(x), 1) one commented out so it should pass or hang within a minute. I'm trying to reduce it but at the moment my code works.

Faster to run code of the same as above

using Pkg;
Pkg.activate(temp=true);
Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="metal-mwe-hang")
Pkg.add(url="https://github.com/JuliaGPU/Metal.jl", rev="main")
Pkg.test("Metal", test_args=["gpuarrays/broadcasting"])

from metal.jl.

christiangnrd commented on May 25, 2024

I forgot about --check-bounds=yes. Here's a real MWE for you:

using Pkg;
Pkg.activate(temp=true);
Pkg.add(url="https://github.com/JuliaGPU/Metal.jl", rev="main")
using Metal

getindex.(Ref(MtlArray([0])), 1)

Pasting the above code into the REPL after starting julia in the following ways did not hang:

julia --check-bounds=yes (1.8.5)
MTL_SHADER_VALIDATION=1 julia --check-bounds=yes (1.8.5)
MTL_SHADER_VALIDATION=1 julia +beta --check-bounds=yes (1.9.0-rc1)

However, when julia was started with j +beta --check-bounds=yes, it hangs.

I dumped the generated llvm code for each version. Both 1.8.5 versions were identical, I'm pretty sure both 1.9.0-rc1 versions are identical (some function names different), but I put both in in case I'm wrong. The difference between 1.8.5 and 1.9.0-rc1 is the noundef in the function definition.

Setting MTL_DEBUG_LAYER=1 in my environment before launching julia made no difference in the results.

@code_llvm for 1.8.5 (both)

;  @ none within `##dotfunction#312#1`
define i64 @"julia_##dotfunction#312#1_626"({}* nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  %4 = alloca { { {}*, i64 } }, align 8
  %5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 260) #3
; ┌ @ broadcast.jl:860 within `materialize`
   %6 = bitcast [3 x {}*]* %gcframe2 to i64*
   store i64 4, i64* %6, align 16
   %7 = load {}**, {}*** %5, align 8
   %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
   %9 = bitcast {}** %8 to {}***
   store {}** %7, {}*** %9, align 8
   %10 = bitcast {}*** %5 to {}***
   store {}** %gcframe2.sub, {}*** %10, align 8
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
   store {}* %0, {}** %3, align 16
   store {}* %0, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
   store i64 %1, i64* %.fca.0.1.gep, align 8
   %11 = call i64 @j_copy_628({ { {}*, i64 } }* nocapture readonly %4) #0
   %12 = load {}*, {}** %8, align 8
   %13 = bitcast {}*** %5 to {}**
   store {}* %12, {}** %13, align 8
; └
  ret i64 %11
}

@code_llvm for 1.9.0-rc1 with shader validation (no hang)

2023-03-15 18:43:39.747 julia[55303:1772369] Metal GPU Validation Enabled
;  @ none within `##dotfunction#292#3`
define i64 @"julia_##dotfunction#292#3_554"({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  %4 = alloca { { {}*, i64 } }, align 8
  %5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 261) #2
; ┌ @ broadcast.jl:873 within `materialize`
   %6 = bitcast [3 x {}*]* %gcframe2 to i64*
   store i64 4, i64* %6, align 16
   %7 = load {}**, {}*** %5, align 8
   %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
   %9 = bitcast {}** %8 to {}***
   store {}** %7, {}*** %9, align 8
   %10 = bitcast {}*** %5 to {}***
   store {}** %gcframe2.sub, {}*** %10, align 8
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
   store {}* %0, {}** %3, align 16
   store {}* %0, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
   store i64 %1, i64* %.fca.0.1.gep, align 8
   %11 = call i64 @j_copy_556({ { {}*, i64 } }* nocapture readonly %4) #0
   %12 = load {}*, {}** %8, align 8
   %13 = bitcast {}*** %5 to {}**
   store {}* %12, {}** %13, align 8
; └
  ret i64 %11
}

@code_llvm for 1.9.0-rc1 with no shader validation (hang)

;  @ none within `##dotfunction#292#3`
define i64 @"julia_##dotfunction#292#3_528"({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  %4 = alloca { { {}*, i64 } }, align 8
  %5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 261) #2
; ┌ @ broadcast.jl:873 within `materialize`
   %6 = bitcast [3 x {}*]* %gcframe2 to i64*
   store i64 4, i64* %6, align 16
   %7 = load {}**, {}*** %5, align 8
   %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
   %9 = bitcast {}** %8 to {}***
   store {}** %7, {}*** %9, align 8
   %10 = bitcast {}*** %5 to {}***
   store {}** %gcframe2.sub, {}*** %10, align 8
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
   store {}* %0, {}** %3, align 16
   store {}* %0, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
   store i64 %1, i64* %.fca.0.1.gep, align 8
   %11 = call i64 @j_copy_530({ { {}*, i64 } }* nocapture readonly %4) #0
   %12 = load {}*, {}** %8, align 8
   %13 = bitcast {}*** %5 to {}**
   store {}* %12, {}** %13, align 8
; └
  ret i64 %11
}

from metal.jl.

maleadt commented on May 25, 2024

Looks like that specific failure doesn't reproduce anymore after JuliaGPU/GPUArrays.jl#454, so let's try bumping GPUArrays to at least work around the immediate issue.

from metal.jl.

maleadt commented on May 25, 2024

MWE:

using Metal

function kernel(dest, nelem)
    j = 0
    while j < nelem
        j += 1
        i = Metal.thread_position_in_grid_1d() + (j-1) * Metal.threads_per_grid_1d()
        i > length(dest) && return
        I = @inbounds CartesianIndices(dest)[i]
        @inbounds dest[I] = 42
    end
    return
end

arr = MtlArray{Int64}(undef)
Metal.@sync @metal kernel(arr, 1)

from metal.jl.

maleadt commented on May 25, 2024

I've spent some time debugging this, and I don't notice significant differences between the --check-bounds=yes IR on 1.8 and 1.9. Specifically, there were two differences:

some additional noreturn function attributes
a missing SDK Version module flag (well, it's there but with uninitialized values)

The former doesn't seem to be the culprit, I think (after manually stripping those attributes and still reproducing the hang). The latter may be related, but I wonder why our LLVM back-end messes up here. We have this code, https://github.com/JuliaGPU/llvm-metal/blob/llvm_release_14/llvm/lib/Target/Metal/Metal.cpp#L285-L324, and strangely if I compile our back-end it just sets the metadata correctly. I wonder if something's up with the Yggdrasil build.

Instead of debugging this, I'm going to try to set this flag from Julia, see maleadt/LLVM.jl#329. Can't test this right now though as the machine where I could reproduce this has died 🤦

EDIT: Setting the SDK version didn't help.

from metal.jl.

christiangnrd commented on May 25, 2024

Bumping GPUArrays seems to have fixed the hanging for me. If I understand correctly this gets around the issue by not calling the problematic code but the problem still exists?

from metal.jl.

maleadt commented on May 25, 2024

Reduced the hang to the following IR:

; ModuleID = 'kernel.ll'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.2.1"

; Function Attrs: cold noreturn nounwind
declare void @llvm.trap() #0

; Function Attrs: noinline
define internal void @throw() #1 {
top:
  tail call void @llvm.trap()
  unreachable
}

define cc103 void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
  %2 = load i64, i64 addrspace(1)* %1, align 8
  %.not2 = icmp sgt i64 %2, 0
  br i1 %.not2, label %oob, label %exit

oob:                                              ; preds = %entry
  tail call void @throw()
  unreachable

exit:                                             ; preds = %entry
  ret void
}

attributes #0 = { cold noreturn nounwind }
attributes #1 = { noinline }

!air.kernel = !{!0}
!air.version = !{!5}
!llvm.module.flags = !{!6}

!0 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !1, !2}
!1 = !{}
!2 = !{!3, !4}
!3 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!4 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 2, i32 4, i32 0}
!6 = !{i32 2, !"SDK Version", [2 x i32] [i32 13, i32 2]}

After compiling this IR with our Metal back-end:

using Metal

function main(path)
    metallib = read(path)

    dev = current_device()
    lib = MTLLibraryFromData(dev, metallib)
    fun = MTLFunction(lib, "kernel")
    pipeline = MTLComputePipelineState(dev, fun)

    f = identity
    ft = typeof(f)
    tt = Tuple{ft, Tuple{MtlDeviceArray{Int64, 0, 1}, Int64}}
    kernel = Metal.HostKernel{ft, tt}(f, pipeline)

    arr = MtlArray{Int64}(undef)
    println("Waiting...")
    Metal.@sync kernel(arr, 1)
end

isinteractive() || main(ARGS...)

This hangs when the metallib was generated by our back-end based on LLVM 14, but not when using the LLVM 13 version. The difference:

 ; ModuleID = 'bc_module'
 source_filename = "text"
@@ -38,7 +38,8 @@
 ; Function Attrs: cold noreturn nounwind
 declare void @llvm.trap() #0

-define internal void @throw() {
+; Function Attrs: noinline
+define internal void @throw() #1 {
 top:
   tail call void @llvm.trap()
   unreachable
@@ -59,6 +60,7 @@
 }

 attributes #0 = { cold noreturn nounwind }
+attributes #1 = { noinline }

 !air.kernel = !{!0}
 !air.version = !{!5}

i.e. on LLVM 13 we drop the noinline attr which causes the unreachable to get inlined. When outlined, it hangs. This does seem like a bug in the Metal compiler, however, it does once again trace back to divergent control flow (like JuliaGPU/CUDAnative.jl#4, or now JuliaGPU/CUDA.jl#1746, which has been plaguing us for years).

kernels.zip

from metal.jl.

maleadt commented on May 25, 2024

ObjC loader:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        if (argc != 2) {
            NSLog(@"Usage: %s [Metal Library Filename]", argv[0]);
            return 1;
        }

        NSString *libraryFilePath = [NSString stringWithUTF8String:argv[1]];
        NSError *error = nil;

        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        if (!device) {
            NSLog(@"Metal is not supported on this device");
            return 1;
        }

        NSURL *libraryFileURL = [NSURL fileURLWithPath:libraryFilePath];
        id<MTLLibrary> library = [device newLibraryWithURL:libraryFileURL error:&error];
        if (!library) {
            NSLog(@"Failed to create Metal library: %@", error);
            return 1;
        }

        id<MTLFunction> kernelFunction = [library newFunctionWithName:@"kernel"];
        if (!kernelFunction) {
            NSLog(@"Failed to find the 'kernel' function");
            return 1;
        }

        id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:kernelFunction error:&error];
        if (!pipeline) {
            NSLog(@"Failed to create compute pipeline state: %@", error);
            return 1;
        }

        id<MTLCommandQueue> commandQueue = [device newCommandQueue];
        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
        id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];

        [computeEncoder setComputePipelineState:pipeline];

        NSUInteger bufferSize = sizeof(int64_t);
        id<MTLBuffer> buffer1 = [device newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
        id<MTLBuffer> buffer2 = [device newBufferWithBytes:&(int64_t){1} length:sizeof(int64_t) options:MTLResourceStorageModeShared];

        [computeEncoder setBuffer:buffer1 offset:0 atIndex:0];
        [computeEncoder setBuffer:buffer2 offset:0 atIndex:1];

        MTLSize gridSize = MTLSizeMake(1, 1, 1);
        MTLSize threadgroupSize = MTLSizeMake(1, 1, 1);
        [computeEncoder dispatchThreadgroups:gridSize threadsPerThreadgroup:threadgroupSize];

        [computeEncoder endEncoding];

        MTLCommandBufferHandler completionHandler = ^(id<MTLCommandBuffer> cb) {
            NSLog(@"Kernel execution completed");
        };

        [commandBuffer addCompletedHandler:completionHandler];
        [commandBuffer commit];
        NSLog(@"Waiting...");
        [commandBuffer waitUntilCompleted];
    }
    return 0;
}

from metal.jl.

maleadt commented on May 25, 2024

Hmm, I can actually reconstruct this IR using a Metal kernel:

#include <metal_stdlib>
using namespace metal;

struct Array {
    device int8_t *data;
};

__attribute__((noinline)) void perform_throw() {
    __builtin_trap();
}

kernel void kernel_fun(device Array *a, device int64_t *b [[ buffer(0) ]]) {
    if (*b > 0)
        perform_throw();
}

... but that one executes correctly. Trying to narrow down the differences, it looks like a metadata-related issue.

from metal.jl.

maleadt commented on May 25, 2024

So with the following base IR:

; ModuleID = 'bc_module'
source_filename = "kernel"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.0.0"

declare void @llvm.trap()

; Function Attrs: noinline
define internal fastcc void @throw() #0 {
  tail call void @llvm.trap()
  unreachable
}

define void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
  %2 = load i64, i64 addrspace(1)* %1, align 8
  %.not2 = icmp sgt i64 %2, 0
  br i1 %.not2, label %oob, label %exit

oob:                                              ; preds = %entry
  tail call void @throw()
  unreachable

exit:                                             ; preds = %entry
  ret void
}

attributes #0 = { noinline }

!air.version = !{!0}
!0 = !{i32 2, i32 4, i32 0}

... it works with the following metadata:

!air.kernel = !{!14}
!14 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !15, !16}
!15 = !{}
!16 = !{!17, !20}

!17 = !{i32 0, !"air.indirect_buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.struct_type_info", !18, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!18 = !{i32 0, i32 8, i32 0, !"char", !"data", !"air.indirect_argument", !19}
!19 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 1, !"air.arg_type_align_size", i32 1}

!20 = !{i32 1, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}

... but fails with what we emit:

!air.kernel = !{!1}
!1 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !2, !3}
!2 = !{}
!3 = !{!4, !5}
!4 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}

So this is the original metadata issue again (where we emit a simple buffer, for bindless oepration, while Metal apparently expects a fleshed out metadata tree).

from metal.jl.

maleadt commented on May 25, 2024

Let's narrow this issue down to the kernel hang seen with noreturn function attributes. I have put a workaround in place in GPUCompiler (unreleased as of now, so use the master branch if you want to test this), but will keep this issue open so that we don't forget about it.

Disabling the workaround and running the MWE above on --check-bounds=yes still reproduces the hang.

from metal.jl.

Kernel hang related to noreturn function attributes about metal.jl HOT 28 OPEN

Comments (28)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent