Comments (28)
So it doesn't hang if you run that operation in isolation? Did you try with --check-bounds=yes, to mimic the test runner?
from metal.jl.
Thanks! This gets us much closer to something debuggable.
FYI, the debug layer needs both MTL_DEBUG_LAYER and MTL_SHADER_VALIDATION set; the first one enables debug mode, the second one toggles a specific kind of debug validation.
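For example, enabling both in the shell before starting Julia (a sketch; the value 1 matches how the variables are set later in this thread):

```shell
# Both variables are needed: MTL_DEBUG_LAYER enables the Metal debug layer,
# MTL_SHADER_VALIDATION additionally turns on shader validation.
export MTL_DEBUG_LAYER=1
export MTL_SHADER_VALIDATION=1
julia --project   # start Julia with the debug layer active, then load Metal.jl
```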
from metal.jl.
Yeah there's still an issue.
from metal.jl.
I'm on 13.2.1 but I can reproduce the hanging behaviour during tests locally. Except for my particular machine, instead of hanging during gpuarrays/random, it consistently hangs during gpuarrays/broadcasting.
I compared Activity Monitor behaviour between julia --project test/runtests.jl gpuarrays/broadcasting (1.8.5) and julia +beta --project test/runtests.jl gpuarrays/broadcasting (1.9.0-beta4). On 1.8, for the julia process, CPU is at 97% the whole time and GPU is at ~20% until the test completes, while on 1.9 beta, CPU is at 100% for a few seconds at the start with GPU at ~4%, then both very quickly drop to 0% and the process never finishes.
Another thing I've noticed is that when I stop the test with ctrl+c, I get Distributed.jl warnings telling me that the process was not removed, and indeed looking at activity monitor, I have 5 Julia processes that shouldn't be there.
from metal.jl.
Another thing I've noticed is that when I stop the test with ctrl+c, I get Distributed.jl warnings telling me that the process was not removed, and indeed looking at activity monitor, I have 5 Julia processes that shouldn't be there.
That's just the test suite runner capturing your interrupt and exiting. One way to run tests in isolation is, from the Metal.jl repository, to do something like julia --project -e 'using Pkg; Pkg.test(; test_args=["gpuarrays/broadcast"])', where the positional args passed as test_args indicate the tests you want to run (try passing --help for more information). Hopefully the backtrace you then get from interrupting the process is more informative? Ideally we'd isolate this down to the single operation that makes Metal.jl hang on 1.9.
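A sketch of that invocation from the package manager (this matches the Pkg.test pattern used later in this thread; the exact test name is an assumption):

```julia
using Pkg

# Run only the selected GPUArrays tests from the Metal.jl test suite;
# test_args is forwarded to test/runtests.jl as ARGS.
Pkg.test("Metal"; test_args=["gpuarrays/broadcast"])
```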
from metal.jl.
I'm having trouble getting the test to run outside of the test suite runner because it comes from GPUArrays
and I can't figure out how to call it from Metal.
I also did a bisect of Julia, and it seems like JuliaLang/julia@a12c2f0 is the commit that caused the issue (or at least caused it to surface).
from metal.jl.
To run code from the GPUArrays test suite, you can do something like:
julia --project=test -L test/setup.jl
julia> AT=MtlArray
julia> eltypes = [Int16, Int32, Int64, Complex{Int16}, Complex{Int32}, Complex{Int64}, ComplexF16, ComplexF32]
That should set up the required environment. Note that you also might have to start Julia with --check-bounds=yes, like the test runner does.
from metal.jl.
I found where the test hangs for my machine.
This line, more specifically here, when it tries to run getindex.(Ref(x), 1) on the GPU array. I looked at the generated LLVM IR and I'll be posting each version in its own comment. The one difference of note between the two is that the LLVM code from 1.9.0-beta4 has store atomic {}* %0, {}** %14 release, align 8, which is missing in the 1.8.5 version.
The code I'm referring to is right above the line that looks like
; └└
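The dumps in the following comments were produced with something like this sketch (assuming gpu_in wraps the input in an MtlArray; the f and gpu_in definitions are restated with each dump):

```julia
using Metal

# The broadcast operation the test hangs on; f and gpu_in as described
# alongside the IR dumps below.
f = x -> getindex.(Ref(x), 1)
gpu_in = (MtlArray(Int16[0]),)

# Dump the host-side LLVM IR for the broadcast entry point.
@code_llvm f(gpu_in...)
```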
from metal.jl.
Julia 1.8.5
Output of @code_llvm f(gpu_in...), where f is x->getindex.(Ref(x), 1) and gpu_in is (Int16[0],):
; @ none within `#315`
define i16 @"julia_#315_1053"({}* nonnull align 8 dereferenceable(32) %0) #0 {
top:
%gcframe4 = alloca [3 x {}*], align 16
%gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
%1 = bitcast [3 x {}*]* %gcframe4 to i8*
call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
%2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
%3 = alloca { { {}*, i64 } }, align 8
%4 = call {}*** inttoptr (i64 7226199708 to {}*** (i64)*)(i64 260) #4
; ┌ @ refpointer.jl:134 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
%5 = bitcast [3 x {}*]* %gcframe4 to i64*
store i64 4, i64* %5, align 16
%6 = load {}**, {}*** %4, align 8
%7 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
%8 = bitcast {}** %7 to {}***
store {}** %6, {}*** %8, align 8
%9 = bitcast {}*** %4 to {}***
store {}** %gcframe4.sub, {}*** %9, align 8
%ptls_field5 = getelementptr inbounds {}**, {}*** %4, i64 2
%10 = bitcast {}*** %ptls_field5 to i8**
%ptls_load67 = load i8*, i8** %10, align 8
%11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #5
%12 = bitcast {}* %11 to i64*
%13 = getelementptr inbounds i64, i64* %12, i64 -1
store atomic i64 4547519648, i64* %13 unordered, align 8
%14 = bitcast {}* %11 to {}**
store {}* %0, {}** %14, align 8 ; HERE
; └└
; ┌ @ broadcast.jl:860 within `materialize`
%.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 0
store {}* %11, {}** %2, align 16
store {}* %11, {}** %.fca.0.0.gep, align 8
%.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 1
store i64 1, i64* %.fca.0.1.gep, align 8
%15 = call i16 @j_copy_1055({ { {}*, i64 } }* nocapture readonly %3) #0
%16 = load {}*, {}** %7, align 8
%17 = bitcast {}*** %4 to {}**
store {}* %16, {}** %17, align 8
; └
ret i16 %15
}
from metal.jl.
Julia 1.9.0-beta4
Output of @code_llvm f(gpu_in...), where f is x->getindex.(Ref(x), 1) and gpu_in is (Int16[0],):
; @ none within `#315`
define i16 @"julia_#315_1018"({}* noundef nonnull align 8 dereferenceable(32) %0) #0 {
top:
%gcframe4 = alloca [3 x {}*], align 16
%gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
%1 = bitcast [3 x {}*]* %gcframe4 to i8*
call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
%2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
%3 = alloca { { {}*, i64 } }, align 8
%4 = call {}*** inttoptr (i64 7226199708 to {}*** (i64)*)(i64 261) #3
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
%5 = bitcast [3 x {}*]* %gcframe4 to i64*
store i64 4, i64* %5, align 16
%6 = load {}**, {}*** %4, align 8
%7 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
%8 = bitcast {}** %7 to {}***
store {}** %6, {}*** %8, align 8
%9 = bitcast {}*** %4 to {}***
store {}** %gcframe4.sub, {}*** %9, align 8
%ptls_field5 = getelementptr inbounds {}**, {}*** %4, i64 2
%10 = bitcast {}*** %ptls_field5 to i8**
%ptls_load67 = load i8*, i8** %10, align 8
%11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #4
%12 = bitcast {}* %11 to i64*
%13 = getelementptr inbounds i64, i64* %12, i64 -1
store atomic i64 6245911824, i64* %13 unordered, align 8
%14 = bitcast {}* %11 to {}**
store {}* null, {}** %14, align 8 ; HERE
store atomic {}* %0, {}** %14 release, align 8 ; AND ALSO HERE
; └└
; ┌ @ broadcast.jl:873 within `materialize`
%.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 0
store {}* %11, {}** %2, align 16
store {}* %11, {}** %.fca.0.0.gep, align 8
%.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 1
store i64 1, i64* %.fca.0.1.gep, align 8
%15 = call i16 @j_copy_1020({ { {}*, i64 } }* nocapture readonly %3) #0
%16 = load {}*, {}** %7, align 8
%17 = bitcast {}*** %4 to {}**
store {}* %16, {}** %17, align 8
; └
ret i16 %15
}
from metal.jl.
I was hoping the removal of libcmt would magically fix the 1.9 CI issues, but unfortunately the tests still hang.
from metal.jl.
Sadly this looks codegen related; the atomic store is probably a good clue. I hope I'll have time to investigate next week, but it's problematic that I can't reproduce this.
Which hardware do you have exactly?
from metal.jl.
I have a 30-core M2 Max MacBook Pro and all of my comments have been made using that computer.
I also have access to a base model M1 Mac mini in my lab where I was able to reproduce the hang.
from metal.jl.
I ran test Metal many times today on the lab M1 and I have some gists of test outputs. All were run using the currently released version so I don’t know how useful they’ll be, but I got some really weird results on 1.8.5 and 1.9.0-rc1.
Of the three 1.8.5 tests I ran and saved, the first one had a failure in gpuarrays/reductions/== isequal
(and gpuarrays/math/power
had no tests??), but the other 2 I ran right after passed. I tried to reproduce this failure on my M2 Max but all my 1.8.5 runs passed.
Gist for these tests
Then, I ran the Metal 0.2.0 tests on 1.9.0-rc1, and for the first time since seeing this issue, a test pass completed on 1.9.0, although with some errors. Gist
Hopeful, I reran it on 1.9.0-rc1 twice and unfortunately, both times the test hung, with some errors in previous tests. The second time had a new error in the unified memory example (hint?) Gist 1, Gist 2
All of the gists are from M1 runs. On my M2 Max, all tests consistently pass on 1.8.5, and I never get any errors on 1.9.0-rc1 (other than broadcasting hanging).
I don’t know how useful these gists will be, but I figure since you can’t reproduce, I might as well give you as much as I can. Last time I had a bug like this that was very inconsistent, it ended up being that I wasn’t initializing some values, but they weren’t caught in debug mode because all the memory gets 0-initialized when running the debugger.
I’ll run the tests a few more times in the background this weekend from the Master branch to see if anything has changed since there’s been quite a few changes.
from metal.jl.
I haven't had the time to reproduce (probably only next week), but since you have a system on which the tests consistently hang: can you post the MWE that makes it hang in a clean session, and could you try running with julia -g2
or after setting MTL_DEBUG_LAYER=1
and MTL_SHADER_VALIDATION=1
in your environment before loading Metal.jl?
from metal.jl.
MWE (running from the Metal folder):
# Only run the failing test
$ julia --project=test -e 'using Pkg; Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="metal-mwe-hang")'
# 1.8.5: Passes
$ julia --project=test test/runtests.jl 'gpuarrays/broadcasting'
# 1.8.5: Passes
$ julia -g2 --project=test test/runtests.jl 'gpuarrays/broadcasting'
# 1.9.0-rc1: Fails (hang)
$ julia +beta --project=test test/runtests.jl 'gpuarrays/broadcasting'
# 1.9.0-rc1: Passes
$ julia +beta -g2 --project=test test/runtests.jl 'gpuarrays/broadcasting'
from metal.jl.
That's not really a minimal example; can you reduce it to any of the tests that make the GPU hang? You mentioned x->getindex.(Ref(x), 1) above?
EDIT: OK, I have access to a system on which this hangs as well. I'll try reducing next week.
from metal.jl.
The metal-mwe-hang branch has all broadcasting tests but the x->getindex.(Ref(x), 1) one commented out, so it should pass or hang within a minute. I'm trying to reduce it further, but at the moment my reduced code works.
Faster-to-run code doing the same as above:
using Pkg;
Pkg.activate(temp=true);
Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="metal-mwe-hang")
Pkg.add(url="https://github.com/JuliaGPU/Metal.jl", rev="main")
Pkg.test("Metal", test_args=["gpuarrays/broadcasting"])
from metal.jl.
I forgot about --check-bounds=yes. Here's a real MWE for you:
using Pkg;
Pkg.activate(temp=true);
Pkg.add(url="https://github.com/JuliaGPU/Metal.jl", rev="main")
using Metal
getindex.(Ref(MtlArray([0])), 1)
Pasting the above code into the REPL after starting julia in the following ways did not hang:
- julia --check-bounds=yes (1.8.5)
- MTL_SHADER_VALIDATION=1 julia --check-bounds=yes (1.8.5)
- MTL_SHADER_VALIDATION=1 julia +beta --check-bounds=yes (1.9.0-rc1)
However, when julia was started with julia +beta --check-bounds=yes, it hangs.
I dumped the generated LLVM code for each version. Both 1.8.5 versions were identical; I'm pretty sure both 1.9.0-rc1 versions are identical too (some function names differ), but I put both in, in case I'm wrong. The difference between 1.8.5 and 1.9.0-rc1 is the noundef in the function definition.
Setting MTL_DEBUG_LAYER=1
in my environment before launching julia made no difference in the results.
@code_llvm for 1.8.5 (both)
; @ none within `##dotfunction#312#1`
define i64 @"julia_##dotfunction#312#1_626"({}* nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
%gcframe2 = alloca [3 x {}*], align 16
%gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
%2 = bitcast [3 x {}*]* %gcframe2 to i8*
call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
%3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
%4 = alloca { { {}*, i64 } }, align 8
%5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 260) #3
; ┌ @ broadcast.jl:860 within `materialize`
%6 = bitcast [3 x {}*]* %gcframe2 to i64*
store i64 4, i64* %6, align 16
%7 = load {}**, {}*** %5, align 8
%8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
%9 = bitcast {}** %8 to {}***
store {}** %7, {}*** %9, align 8
%10 = bitcast {}*** %5 to {}***
store {}** %gcframe2.sub, {}*** %10, align 8
%.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
store {}* %0, {}** %3, align 16
store {}* %0, {}** %.fca.0.0.gep, align 8
%.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
store i64 %1, i64* %.fca.0.1.gep, align 8
%11 = call i64 @j_copy_628({ { {}*, i64 } }* nocapture readonly %4) #0
%12 = load {}*, {}** %8, align 8
%13 = bitcast {}*** %5 to {}**
store {}* %12, {}** %13, align 8
; └
ret i64 %11
}
@code_llvm for 1.9.0-rc1 with shader validation (no hang)
2023-03-15 18:43:39.747 julia[55303:1772369] Metal GPU Validation Enabled
; @ none within `##dotfunction#292#3`
define i64 @"julia_##dotfunction#292#3_554"({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
%gcframe2 = alloca [3 x {}*], align 16
%gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
%2 = bitcast [3 x {}*]* %gcframe2 to i8*
call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
%3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
%4 = alloca { { {}*, i64 } }, align 8
%5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 261) #2
; ┌ @ broadcast.jl:873 within `materialize`
%6 = bitcast [3 x {}*]* %gcframe2 to i64*
store i64 4, i64* %6, align 16
%7 = load {}**, {}*** %5, align 8
%8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
%9 = bitcast {}** %8 to {}***
store {}** %7, {}*** %9, align 8
%10 = bitcast {}*** %5 to {}***
store {}** %gcframe2.sub, {}*** %10, align 8
%.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
store {}* %0, {}** %3, align 16
store {}* %0, {}** %.fca.0.0.gep, align 8
%.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
store i64 %1, i64* %.fca.0.1.gep, align 8
%11 = call i64 @j_copy_556({ { {}*, i64 } }* nocapture readonly %4) #0
%12 = load {}*, {}** %8, align 8
%13 = bitcast {}*** %5 to {}**
store {}* %12, {}** %13, align 8
; └
ret i64 %11
}
@code_llvm for 1.9.0-rc1 with no shader validation (hang)
; @ none within `##dotfunction#292#3`
define i64 @"julia_##dotfunction#292#3_528"({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
%gcframe2 = alloca [3 x {}*], align 16
%gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
%2 = bitcast [3 x {}*]* %gcframe2 to i8*
call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
%3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
%4 = alloca { { {}*, i64 } }, align 8
%5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 261) #2
; ┌ @ broadcast.jl:873 within `materialize`
%6 = bitcast [3 x {}*]* %gcframe2 to i64*
store i64 4, i64* %6, align 16
%7 = load {}**, {}*** %5, align 8
%8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
%9 = bitcast {}** %8 to {}***
store {}** %7, {}*** %9, align 8
%10 = bitcast {}*** %5 to {}***
store {}** %gcframe2.sub, {}*** %10, align 8
%.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
store {}* %0, {}** %3, align 16
store {}* %0, {}** %.fca.0.0.gep, align 8
%.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
store i64 %1, i64* %.fca.0.1.gep, align 8
%11 = call i64 @j_copy_530({ { {}*, i64 } }* nocapture readonly %4) #0
%12 = load {}*, {}** %8, align 8
%13 = bitcast {}*** %5 to {}**
store {}* %12, {}** %13, align 8
; └
ret i64 %11
}
from metal.jl.
Looks like that specific failure doesn't reproduce anymore after JuliaGPU/GPUArrays.jl#454, so let's try bumping GPUArrays to at least work around the immediate issue.
from metal.jl.
MWE:
using Metal
function kernel(dest, nelem)
j = 0
while j < nelem
j += 1
i = Metal.thread_position_in_grid_1d() + (j-1) * Metal.threads_per_grid_1d()
i > length(dest) && return
I = @inbounds CartesianIndices(dest)[i]
@inbounds dest[I] = 42
end
return
end
arr = MtlArray{Int64}(undef)
Metal.@sync @metal kernel(arr, 1)
from metal.jl.
I've spent some time debugging this, and I don't notice significant differences between the --check-bounds=yes IR on 1.8 and 1.9. Specifically, there were two differences:
- some additional noreturn function attributes
- a missing SDK Version module flag (well, it's there but with uninitialized values)
The former doesn't seem to be the culprit, I think (after manually stripping those attributes and still reproducing the hang). The latter may be related, but I wonder why our LLVM back-end messes up here. We have this code, https://github.com/JuliaGPU/llvm-metal/blob/llvm_release_14/llvm/lib/Target/Metal/Metal.cpp#L285-L324, and strangely if I compile our back-end it just sets the metadata correctly. I wonder if something's up with the Yggdrasil build.
Instead of debugging this, I'm going to try to set this flag from Julia, see maleadt/LLVM.jl#329. Can't test this right now though as the machine where I could reproduce this has died 🤦
EDIT: Setting the SDK version didn't help.
from metal.jl.
Bumping GPUArrays seems to have fixed the hanging for me. If I understand correctly this gets around the issue by not calling the problematic code but the problem still exists?
from metal.jl.
Reduced the hang to the following IR:
; ModuleID = 'kernel.ll'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.2.1"
; Function Attrs: cold noreturn nounwind
declare void @llvm.trap() #0
; Function Attrs: noinline
define internal void @throw() #1 {
top:
tail call void @llvm.trap()
unreachable
}
define cc103 void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
%2 = load i64, i64 addrspace(1)* %1, align 8
%.not2 = icmp sgt i64 %2, 0
br i1 %.not2, label %oob, label %exit
oob: ; preds = %entry
tail call void @throw()
unreachable
exit: ; preds = %entry
ret void
}
attributes #0 = { cold noreturn nounwind }
attributes #1 = { noinline }
!air.kernel = !{!0}
!air.version = !{!5}
!llvm.module.flags = !{!6}
!0 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !1, !2}
!1 = !{}
!2 = !{!3, !4}
!3 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!4 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 2, i32 4, i32 0}
!6 = !{i32 2, !"SDK Version", [2 x i32] [i32 13, i32 2]}
After compiling this IR with our Metal back-end:
using Metal
function main(path)
metallib = read(path)
dev = current_device()
lib = MTLLibraryFromData(dev, metallib)
fun = MTLFunction(lib, "kernel")
pipeline = MTLComputePipelineState(dev, fun)
f = identity
ft = typeof(f)
tt = Tuple{ft, Tuple{MtlDeviceArray{Int64, 0, 1}, Int64}}
kernel = Metal.HostKernel{ft, tt}(f, pipeline)
arr = MtlArray{Int64}(undef)
println("Waiting...")
Metal.@sync kernel(arr, 1)
end
isinteractive() || main(ARGS...)
This hangs when the metallib was generated by our back-end based on LLVM 14, but not when using the LLVM 13 version. The difference:
; ModuleID = 'bc_module'
source_filename = "text"
@@ -38,7 +38,8 @@
; Function Attrs: cold noreturn nounwind
declare void @llvm.trap() #0
-define internal void @throw() {
+; Function Attrs: noinline
+define internal void @throw() #1 {
top:
tail call void @llvm.trap()
unreachable
@@ -59,6 +60,7 @@
}
attributes #0 = { cold noreturn nounwind }
+attributes #1 = { noinline }
!air.kernel = !{!0}
!air.version = !{!5}
i.e. on LLVM 13 we drop the noinline attr, which causes the unreachable to get inlined; when outlined, it hangs. This does seem like a bug in the Metal compiler; however, it does once again trace back to divergent control flow (like JuliaGPU/CUDAnative.jl#4, or now JuliaGPU/CUDA.jl#1746, which has been plaguing us for years).
from metal.jl.
ObjC loader:
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
int main(int argc, const char * argv[]) {
@autoreleasepool {
if (argc != 2) {
NSLog(@"Usage: %s [Metal Library Filename]", argv[0]);
return 1;
}
NSString *libraryFilePath = [NSString stringWithUTF8String:argv[1]];
NSError *error = nil;
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
if (!device) {
NSLog(@"Metal is not supported on this device");
return 1;
}
NSURL *libraryFileURL = [NSURL fileURLWithPath:libraryFilePath];
id<MTLLibrary> library = [device newLibraryWithURL:libraryFileURL error:&error];
if (!library) {
NSLog(@"Failed to create Metal library: %@", error);
return 1;
}
id<MTLFunction> kernelFunction = [library newFunctionWithName:@"kernel"];
if (!kernelFunction) {
NSLog(@"Failed to find the 'kernel' function");
return 1;
}
id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:kernelFunction error:&error];
if (!pipeline) {
NSLog(@"Failed to create compute pipeline state: %@", error);
return 1;
}
id<MTLCommandQueue> commandQueue = [device newCommandQueue];
id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setComputePipelineState:pipeline];
NSUInteger bufferSize = sizeof(int64_t);
id<MTLBuffer> buffer1 = [device newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
id<MTLBuffer> buffer2 = [device newBufferWithBytes:&(int64_t){1} length:sizeof(int64_t) options:MTLResourceStorageModeShared];
[computeEncoder setBuffer:buffer1 offset:0 atIndex:0];
[computeEncoder setBuffer:buffer2 offset:0 atIndex:1];
MTLSize gridSize = MTLSizeMake(1, 1, 1);
MTLSize threadgroupSize = MTLSizeMake(1, 1, 1);
[computeEncoder dispatchThreadgroups:gridSize threadsPerThreadgroup:threadgroupSize];
[computeEncoder endEncoding];
MTLCommandBufferHandler completionHandler = ^(id<MTLCommandBuffer> cb) {
NSLog(@"Kernel execution completed");
};
[commandBuffer addCompletedHandler:completionHandler];
[commandBuffer commit];
NSLog(@"Waiting...");
[commandBuffer waitUntilCompleted];
}
return 0;
}
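To build and run this loader (a sketch; the file names loader.m and kernel.metallib are assumptions):

```shell
# Compile the Objective-C loader against the Foundation and Metal frameworks (macOS)
clang -fobjc-arc -framework Foundation -framework Metal loader.m -o loader
# Point it at the metallib produced by the back-end; it logs "Waiting..."
# and never logs "Kernel execution completed" when the hang triggers.
./loader kernel.metallib
```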
from metal.jl.
Hmm, I can actually reconstruct this IR using a Metal kernel:
#include <metal_stdlib>
using namespace metal;
struct Array {
device int8_t *data;
};
__attribute__((noinline)) void perform_throw() {
__builtin_trap();
}
kernel void kernel_fun(device Array *a, device int64_t *b [[ buffer(0) ]]) {
if (*b > 0)
perform_throw();
}
... but that one executes correctly. Trying to narrow down the differences, it looks like a metadata-related issue.
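For comparison, the Metal source above can be compiled to a metallib with Apple's shader toolchain (a sketch, assuming Xcode command-line tools and the file name kernel_fun.metal):

```shell
# Compile Metal Shading Language to AIR, then package it as a metallib
xcrun -sdk macosx metal -c kernel_fun.metal -o kernel_fun.air
xcrun -sdk macosx metallib kernel_fun.air -o kernel_fun.metallib
```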
from metal.jl.
So with the following base IR:
; ModuleID = 'bc_module'
source_filename = "kernel"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.0.0"
declare void @llvm.trap()
; Function Attrs: noinline
define internal fastcc void @throw() #0 {
tail call void @llvm.trap()
unreachable
}
define void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
%2 = load i64, i64 addrspace(1)* %1, align 8
%.not2 = icmp sgt i64 %2, 0
br i1 %.not2, label %oob, label %exit
oob: ; preds = %entry
tail call void @throw()
unreachable
exit: ; preds = %entry
ret void
}
attributes #0 = { noinline }
!air.version = !{!0}
!0 = !{i32 2, i32 4, i32 0}
... it works with the following metadata:
!air.kernel = !{!14}
!14 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !15, !16}
!15 = !{}
!16 = !{!17, !20}
!17 = !{i32 0, !"air.indirect_buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.struct_type_info", !18, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!18 = !{i32 0, i32 8, i32 0, !"char", !"data", !"air.indirect_argument", !19}
!19 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 1, !"air.arg_type_align_size", i32 1}
!20 = !{i32 1, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
... but fails with what we emit:
!air.kernel = !{!1}
!1 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !2, !3}
!2 = !{}
!3 = !{!4, !5}
!4 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
So this is the original metadata issue again (where we emit a simple buffer, for bindless operation, while Metal apparently expects a fleshed-out metadata tree).
from metal.jl.
Let's narrow this issue down to the kernel hang seen with noreturn function attributes. I have put a workaround in place in GPUCompiler (unreleased as of now, so use the master branch if you want to test this), but will keep this issue open so that we don't forget about it. Disabling the workaround and running the MWE above with --check-bounds=yes still reproduces the hang.
from metal.jl.