
metal.jl's Introduction

Metal.jl

Metal programming in Julia

With Metal.jl it's possible to program GPUs on macOS using the Metal programming framework.

The package is a work-in-progress. There are bugs, functionality is missing, and performance hasn't been optimized. Expect to have to make changes to this package if you want to use it. PRs are very welcome!

Requirements

  • Mac device with M-series chip
  • Julia 1.8-1.10
  • macOS 13 (Ventura) or 14 (Sonoma)

These requirements are fairly strict, and are due to our limited development resources (manpower, hardware). Technically, they can be relaxed. If you are interested in contributing to this, see this issue for more details. In practice, Metal.jl will probably work on any macOS 10.15+, and other GPUs that are supported by Metal might also function (if only partially), but such combinations are unsupported for now.

Quick start

Metal.jl can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add Metal

Or, equivalently, via the Pkg API:

julia> import Pkg; Pkg.add("Metal")

For an overview of the toolchain in use, you can run the following command after importing the package:

julia> using Metal

julia> Metal.versioninfo()
macOS 13.5.0, Darwin 22.6.0

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6

Julia packages:
- Metal.jl: 0.5.0
- LLVMDowngrader_jll: 0.1.0+0

1 device:
- Apple M2 Max (64.000 KiB allocated)

Array abstraction

The easiest way to work with Metal.jl is through its array abstraction. The MtlArray type is meant both as a convenient container for device memory and as a data-parallel abstraction for using the GPU without writing your own kernels:

julia> a = MtlArray([1])
1-element MtlArray{Int64, 1}:
 1

julia> a .+ 1
1-element MtlArray{Int64, 1}:
 2

Kernel programming

The above array abstractions are all implemented using Metal kernels written in Julia. These kernels follow a similar programming style to Julia's other GPU back-ends, and as such deviate from how kernels are implemented in Metal C (e.g., indexing intrinsics are functions rather than arguments, arbitrary aggregate arguments are supported, etc.):

julia> function vadd(a, b, c)
           i = thread_position_in_grid_1d()
           c[i] = a[i] + b[i]
           return
       end
vadd (generic function with 1 method)

julia> a = MtlArray([1,1,1,1]); b = MtlArray([2,2,2,2]); c = similar(a);

julia> @metal threads=2 groups=2 vadd(a, b, c)

julia> Array(c)
4-element Vector{Int64}:
 3
 3
 3
 3

Metal API wrapper

Finally, all of the above functionality is made possible by interfacing with the Metal libraries through ObjectiveC.jl. We provide low-level objects and functions that map onto the Metal APIs. These low-level API wrappers, along with some slightly higher-level Julia wrappers, are available in the MTL submodule exported by Metal.jl:

julia> dev = MTLDevice(1)
<AGXG13XDevice: 0x14c17f200>
    name = Apple M1 Pro

julia> dev.name
NSString("Apple M1 Pro")

Acknowledgements

This package builds upon the experience of several Julia contributors to CUDA.jl, AMDGPU.jl and oneAPI.jl.

metal.jl's People

Contributors

amontoison, asinghvi17, christiangnrd, dependabot[bot], dkarrasch, fjebaker, github-actions[bot], habemus-papadum, maleadt, max-hawkins, maxwindiff, mtfishman, philipvinc, pitmonticone, ranocha, sotlampr, tgymnich, vchuravy, viralbshah


metal.jl's Issues

Support for multiple devices

Hello,

Thanks for your work.
I have a question: how do I change the current device to the AMD Radeon?

julia> using Metal

julia> devices()
2-element Vector{MtlDevice}:
 MtlDevice(Intel(R) UHD Graphics 630)
 MtlDevice(AMD Radeon Pro 5500M)

julia> current_device()
MtlDevice:
 name:             Intel(R) UHD Graphics 630
 lowpower:         false
 headless:         true
 removable:        false
 unified memory:   true
 registry id:      4294968893
 transfer rate:    0

Thank you.

installation issue (libz.1.dylib not found) [+workaround]

I have this setup:

ProductName:		macOS
ProductVersion:		13.0
BuildVersion:		22A5342f
julia version 1.8.1

I did the following:

brew update && brew install julia
import Pkg; Pkg.add("Metal")
using Metal
a = MtlArray([1])
a .+ 1

and got this:

dyld[6982]: Library not loaded: @rpath/libz.1.dylib
  Referenced from: <C4FFAA73-76B7-3379-A4AE-0675493DFA4C> /Users/ds/.julia/artifacts/0ec307395e9dd683100b6ecbb14963c6ddd232e6/bin/metallib-as
  Reason: tried: '/Users/ds/.julia/artifacts/0ec307395e9dd683100b6ecbb14963c6ddd232e6/bin/../lib/libz.1.dylib' (no such file), '/Users/ds/.julia/artifacts/0ec307395e9dd683100b6ecbb14963c6ddd232e6/bin/../lib/libz.1.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS@rpath/libz.1.dylib' (no such file), '/Users/ds/.julia/artifacts/0ec307395e9dd683100b6ecbb14963c6ddd232e6/bin/../lib/libz.1.dylib' (no such file), '/Users/ds/.julia/artifacts/0ec307395e9dd683100b6ecbb14963c6ddd232e6/bin/../lib/libz.1.dylib' (no such file), '/opt/homebrew/Cellar/julia/1.8.1/bin/../lib/julia/libz.1.dylib' (no such file), '/opt/homebrew/Cellar/julia/1.8.1/bin/../lib/libz.1.dylib' (no such file)
ERROR: Failed to translate LLVM code to MetalLib.
If you think this is a bug, please file an issue and attach /var/folders/02/9dvkb0fj5x12f3936tdjkjs40000gn/T/jl_ZELOpYd4Xc.bc.

This fixes the issue:

brew install zlib
ln -s /opt/homebrew/Cellar/zlib/1.2.12_1/lib/libz.1.dylib /opt/homebrew/Cellar/julia/1.8.1/lib/libz.1.dylib

Not sure where to report it.

Use Autoreleasepools with Metal

Objective-C methods in the Metal.framework should be executed within Autorelease Pools. This is also true for most other Cocoa frameworks, and not invoking methods within Autorelease pools leads to leaked memory.
todo: Make sure there is consensus on this conclusion / provide a compelling and concise argument if there are doubts.

We can easily expose autoreleasepools in libcmt with a single new function e.g.:
MtAutoreleasePool* mtNewAutoreleasePool() {return [[NSAutoreleasePool alloc] init];}

Weeds:

  • NSAutoreleasePool can only be explicitly created if Automatic Reference Counting is off (otherwise the compiler will force the use of @autoreleasepool blocks instead)
  • libcmt is currently compiled with arc off (the default)
  • Still we should mark that explicitly in CMake e.g. target_compile_options(cmt PRIVATE -fno-objc-arc)

Autorelease pools are useful in the macOS/iOS world, and care should be taken not to hide them from the end user. That said, they probably need to be auto-inserted by Metal.jl in the array programming interface, and maybe also in @Metal.sync (see the sketch after the list below).

Autorelease Pools do have some complications:

  • need to consider how they will interact with Julia's GC
  • they create a qualitatively different experience than, say, using CUDA.jl, and probably make adding Metal.jl support to KernelAbstractions.jl more challenging
  • the underlying implementation uses thread-local storage, which will need some consideration about how best to use it with Julia tasks.
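For illustration, here is a minimal sketch of how a Julia-side helper could wrap such libcmt functions. mtNewAutoreleasePool is the function proposed above; mtDrainAutoreleasePool is an assumed counterpart that does not exist yet:

# Hypothetical sketch: scope-based autorelease pool management from Julia.
# Both libcmt entry points are assumptions, not current API.
function with_autoreleasepool(f)
    pool = ccall((:mtNewAutoreleasePool, "libcmt"), Ptr{Cvoid}, ())
    try
        return f()
    finally
        # Draining the pool releases every object autoreleased while it was active.
        ccall((:mtDrainAutoreleasePool, "libcmt"), Cvoid, (Ptr{Cvoid},), pool)
    end
end

with_autoreleasepool() do
    # ... calls into Metal that autorelease objects ...
end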

XGBoost on Metal.jl

Hi

I know this is probably not the best place to ask this question, but does anyone know of work on porting existing libraries like XGBoost to Metal.jl? It looks like the official XGBoost project has no plans to support Apple M1.

thanks!

Thanks for the previous fix - had a go

julia --project -e 'using Pkg; Pkg.build(); Pkg.instantiate()'
ERROR: Build path for GPUCompiler does not exist: /Users/xxx/.julia/packages/GPUCompiler/wK8OU
Stacktrace:

I then created the wK8OU directory by copying the one that is there. It starts to compile but then I get:

Building Metal → ~/Downloads/Metal.jl/deps/build.log
Precompiling project...
✗ Metal
2 dependencies successfully precompiled in 5 seconds. 19 already precompiled.
1 dependency errored. To see a full report either run import Pkg; Pkg.precompile() or load the package

Stacktrace:
[1] macro expansion
@ ./loading.jl:1047 [inlined]
[2] macro expansion
@ ./lock.jl:223 [inlined]
[3] require(into::Module, mod::Symbol)
@ Base ./loading.jl:1028

latest Mac Ultra - Xcode etc etc .. Julia 1.8-beta3

I can't run 'using Metal' - it doesn't load.

More informative output:

(base) xxx@xxxmacstudio Metal.jl % julia --project -e 'import Pkg; Pkg.precompile()'
Precompiling project...
✗ Metal
0 dependencies successfully precompiled in 3 seconds. 21 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

Metal [dde4c033-4e86-420c-a63e-0dd931031962]

Failed to precompile Metal [dde4c033-4e86-420c-a63e-0dd931031962] to /Users/xxx/.julia/compiled/v1.8/Metal/jl_SWcTNK.
ERROR: LoadError: UndefVarError: MetalCompilerTarget not defined
Stacktrace:
[1] top-level scope
@ ~/Downloads/Metal.jl/src/compiler/gpucompiler.jl:5
[2] include(mod::Module, _path::String)
@ Base ./Base.jl:422
[3] include(x::String)
@ Metal ~/Downloads/Metal.jl/src/Metal.jl:1
[4] top-level scope
@ ~/Downloads/Metal.jl/src/Metal.jl:23
[5] include
@ ./Base.jl:422 [inlined]
[6] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
@ Base ./loading.jl:1400
[7] top-level scope
@ stdin:1
in expression starting at /Users/xxx/Downloads/Metal.jl/src/compiler/gpucompiler.jl:5
in expression starting at /Users/xxx/Downloads/Metal.jl/src/Metal.jl:1
in expression starting at stdin:1
Stacktrace:
[1] pkgerror(msg::String)
@ Pkg.Types /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Pkg/src/Types.jl:67
[2] precompile(ctx::Pkg.Types.Context, pkgs::Vector{String}; internal_call::Bool, strict::Bool, warn_loaded::Bool, already_instantiated::Bool, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Pkg.API /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:1427
[3] precompile
@ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:1060 [inlined]
[4] #precompile#225
@ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:1057 [inlined]
[5] precompile (repeats 2 times)
@ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Pkg/src/API.jl:1057 [inlined]
[6] top-level scope
@ none:1

Segfault when using with GTK4.jl

I'm encountering segfault with Metal.jl when using with Gtk4.jl. A minimal example can be found at https://github.com/habemus-papadum/MetalGtk4Bug

I will try my best to sort this out, but am creating this issue for tracking purposes.

Some quick notes:
In the linked repo julia --project cpu.jl demos what the desired output should be
julia --project mtl.jl is the version that segfaults:

julia --project mtl.jl         

[64279] signal (11.2): Segmentation fault: 11
in expression starting at /Users/nehal/src/MetalGtk4Bug/mtl.jl:32
objc_release at /usr/lib/libobjc.A.dylib (unknown line)
Allocations: 16760639 (Pool: 16750536; Big: 10103); GC: 24
zsh: segmentation fault  julia --project mtl.jl
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 10 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 10
Project MetalGtk4Bug v0.1.0
Status `~/src/MetalGtk4Bug/Project.toml`
  [5ae59095] Colors v0.12.10
  [53c48c17] FixedPointNumbers v0.8.4
  [9db2cae5] Gtk4 v0.3.1
  [dde4c033] Metal v0.1.2 `https://github.com/JuliaGPU/Metal.jl.git#main`
  • I don't think this is related to Gtk4.jl. I have seen similar issues with Gtk.jl and Qml.jl. I think the issue is calling a Metal kernel in a tight loop, and something about libcmt calling a method on a disposed MTLCmdBuffer. (I have seen this issue in a few different contexts with longer stack traces pointing into libcmt, but it is all highly dependent on the Julia version & Metal version.)

I'll let you know if I make any progress, but any suggestions or comments are welcome.

Optimally choosing threads and grid

Thank you for your work!
Following the technical preview at:
https://juliagpu.org/post/2022-06-24-metal/
I see that a kernel is called as:

@metal threads=512 grid=2 memset_kernel(a, 42)

Adapting the code in that preview to my use case (a dynamic programming problem in macroeconomics) works great, but I am wondering how one should choose threads and grid optimally. Thank you in advance.
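For what it's worth, a common pattern is to compile the kernel without launching it, query the pipeline for its maximum threadgroup size, and derive the group count from the problem size. A sketch for a 1D kernel over an array a, assuming a recent Metal.jl where @metal supports launch=false and the launch keyword is groups rather than grid (add a bounds check in the kernel if length(a) is not a multiple of threads):

kernel = @metal launch=false memset_kernel(a, 42)
# Cap the threadgroup size at what the compiled pipeline supports.
threads = min(length(a), kernel.pipeline.maxTotalThreadsPerThreadgroup)
# Launch enough groups to cover every element.
groups = cld(length(a), threads)
kernel(a, 42; threads, groups)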

mapreduce kernel uses too many threads

I'm getting some errors on 1.9-rc1 on my M2 Max device that I can't reproduce on M1 (on main branch).

gpuarrays/reductions/== isequal has 242 passing tests and 6 that error with the exact same error shown below.

I was able to identify the failing tests as lines 188 and 191 in GPUArrays.jl/test/testsuite/reductions.jl, only for ET in [Int16, Int32, Int] and only when sz == (10,10,10).

When I manually reverted #112, keeping the other changes on main, the errors went away.

Error in testset gpuarrays/reductions/== isequal:
Error During Test at /Users/christian/.julia/packages/GPUArrays/6STCb/test/testsuite/reductions.jl:170
  Test threw exception
  Expression: compare(((A, B)->begin
            isequal(A, B)
        end), AT, rand(range, sz), rand(range, sz))
  ArgumentError: Number of threads in group (832) should not exceed 768
  Stacktrace:
    [1] (::Metal.HostKernel{typeof(Metal.partial_mapreduce_device), Tuple{typeof(identity), typeof(&), Bool, Val{0x0000000000000340}, Val{CartesianIndices((10, 10, 10))}, Val{CartesianIndices((1, 1, 1))}, Val{0x0000000000000001}, Val{2}, Val{false}, MtlDeviceArray{Bool, 4, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{3}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(isequal), Tuple{MtlDeviceArray{Int64, 3, 1}, MtlDeviceArray{Int64, 3, 1}}}}})(::Function, ::Vararg{Any}; groups::Int64, threads::UInt64, queue::Metal.MTL.MTLCommandQueueInstance)
      @ Metal ~/.julia/packages/Metal/J1c82/src/compiler/execution.jl:201
    [2] macro expansion
      @ ~/.julia/packages/Metal/J1c82/src/compiler/execution.jl:66 [inlined]
    [3] mapreducedim!(f::typeof(identity), op::typeof(&), R::MtlArray{Bool, 3}, A::Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{3}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(isequal), Tuple{MtlArray{Int64, 3}, MtlArray{Int64, 3}}}; init::Bool)
      @ Metal ~/.julia/packages/Metal/J1c82/src/mapreduce.jl:224
    [4] _mapreduce(::typeof(isequal), ::typeof(&), ::MtlArray{Int64, 3}, ::MtlArray{Int64, 3}; dims::Colon, init::Bool)
      @ GPUArrays ~/.julia/packages/GPUArrays/6STCb/src/host/mapreduce.jl:69
    [5] _mapreduce
      @ ~/.julia/packages/GPUArrays/6STCb/src/host/mapreduce.jl:35 [inlined]
    [6] #mapreduce#31
      @ ~/.julia/packages/GPUArrays/6STCb/src/host/mapreduce.jl:31 [inlined]
    [7] mapreduce
      @ ~/.julia/packages/GPUArrays/6STCb/src/host/mapreduce.jl:31 [inlined]
    [8] isequal
      @ ~/.julia/packages/GPUArrays/6STCb/src/host/mapreduce.jl:110 [inlined]
    [9] (::Main.TestSuite.var"#290#303")(A::MtlArray{Int64, 3}, B::MtlArray{Int64, 3})
      @ Main.TestSuite ./none:0
   [10] compare(::Function, ::Type{MtlArray}, ::Array{Int64, 3}, ::Vararg{Array{Int64, 3}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ Main.TestSuite ~/.julia/packages/GPUArrays/6STCb/test/testsuite.jl:44
   [11] compare(::Function, ::Type{MtlArray}, ::Array{Int64, 3}, ::Vararg{Array{Int64, 3}})
      @ Main.TestSuite ~/.julia/packages/GPUArrays/6STCb/test/testsuite.jl:38
   [12] macro expansion
      @ ~/.julia/juliaup/julia-1.9.0-rc1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Test/src/Test.jl:478 [inlined]
   [13] macro expansion
      @ ~/.julia/packages/GPUArrays/6STCb/test/testsuite/reductions.jl:170 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.9.0-rc1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Test/src/Test.jl:1586 [inlined]
   [15] (::Main.TestSuite.var"#286#299")(AT::Type, eltypes::Vector{DataType})
      @ Main.TestSuite ~/.julia/packages/GPUArrays/6STCb/test/testsuite/reductions.jl:160

Threadgroup memory breaks on small datatypes

MWE:

using Metal

function kernel(out::AbstractArray{T}) where T
    i = thread_position_in_threadgroup_1d()
    temp = MtlThreadGroupArray(T, 1)
    @inbounds temp[i] = 42
    threadgroup_barrier(Metal.MemoryFlagThreadGroup)
    @inbounds out[] = temp[]
    return
end

function main(T=Int16)
    out = MtlArray{T}(undef, 1)
    @metal kernel(out)
    Array(out)[]
end

isinteractive() || @show main()

Shows 0 for Int8/Int16, 42 for Int32 and larger. Posted about this at https://developer.apple.com/forums/thread/708536, but no response yet. Somebody mentioned this might be alignment-related, as the Metal Feature Set Tables mention an alignment of 16 bytes for threadgroup memory, but that doesn't seem to help (also, the Metal compiler uses a smaller alignment...).

This breaks mapreduce, so we have a workaround in place there.

Improved errors when calling device functions on CPU

Currently, calling a device function in CPU code causes an ambiguous error and then an infinite hang:

julia> thread_position_in_grid_1d()
JIT session error: Symbols not found: [ _julia.air.thread_position_in_grid.i32 ]
JIT session error: Symbols not found: [ _julia.air.thread_position_in_grid.i32 ]

I realized this when trying to call thread_execution_width(), not realizing it was device-only, but it wasn't clear to me if it wasn't properly wrapped or if something else was going on.

Maybe a clearer error could be raised?
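One possible improvement, as a sketch (hypothetical, not current Metal.jl code): give each intrinsic a host-side fallback method that throws a descriptive error instead of reaching the JIT, and let the GPU compiler shadow it during kernel compilation.

# Hypothetical host-side fallback; the device method would override this
# inside GPU code via @device_override or similar machinery.
thread_position_in_grid_1d() =
    error("thread_position_in_grid_1d() is a device intrinsic and can only be called from a GPU kernel")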

Silent failure with unsupported(?) Intel Iris Graphics

I tried Metal.jl with my trusty late 2013 13" MacBook Pro with a 2.6 GHz Dual-Core Intel Core i5 and the built-in Intel Iris GPU with 1536 MB VRAM.

I hoped Metal would work, not necessarily faster, but apparently it fails silently. Perhaps you should refuse to run on unsupported hardware. Can Metal support old hardware with simple code modifications? In any case, I admire your great work.

julia-1.8.5|testGPU> using Metal

julia-1.8.5|testGPU> Metal.versioninfo()
macOS 11.7.4, Darwin 21.4.0

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1

1 device:
- Intel Iris Graphics (0 bytes allocated)

julia-1.8.5|testGPU> a = MtlArray([1])
1-element MtlVector{Int64}:
 1

julia-1.8.5|testGPU> a
1-element MtlVector{Int64}:
 1

julia-1.8.5|testGPU> a .+ 1
1-element MtlVector{Int64}:
 0

Improve Objective-C interfacing

libcmt seems useful but also difficult to maintain as Apple makes changes to the Metal.framework. It might make sense to come up with a plan for libcmt.

I would suggest:

  • Minor fixes
    • rename some functions, e.g. mtNewCommandBufferWithDescriptor -> mtCommandBufferWithDescriptor, to better indicate what should be MRR and what is autoreleased
    • add support for autoreleasepools
    • make CMake more pedantic, e.g. -fno-objc-arc
  • Audit libcmt for other memory problems
  • Build knowledge
    • add metalcpp alongside cmt, then create examples to show how one can do similar things with both, and also show how to run apps to debug memory, e.g. OBJC_DEBUG_* and xctrace record
    • use this to improve memory management in libcmt
  • Replace libcmt with a pure Julia implementation (see the sketch after this list)
    • use low-level Objective-C runtime functions (e.g. sel_getName, objc_msgSend, etc.) the way metalcpp does
      • explain why this is not a performance issue (note: Objective-C is highly dynamic; much of what clang does for Objective-C is ensuring correctness rather than static optimization, though it does perform some optimizations based on its understanding of the Objective-C runtime semantics)
    • change the Clang generator to generate pure Julia bindings
    • create a standalone tool that goes from a framework to a Julia package
      • need to support Objective-C blocks...
      • creating Objective-C classes in Julia (e.g. for delegates)
      • most of this has been worked out to some extent in ObjectiveC.jl
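To make the pure-Julia direction concrete, here is a minimal sketch of driving the Objective-C runtime directly via ccall, roughly the way metalcpp (and ObjectiveC.jl) do; objc_getClass, sel_registerName, and objc_msgSend are standard Objective-C runtime functions, and error handling is omitted:

const libobjc = "/usr/lib/libobjc.A.dylib"

getclass(name) = ccall((:objc_getClass, libobjc), Ptr{Cvoid}, (Cstring,), name)
selector(name) = ccall((:sel_registerName, libobjc), Ptr{Cvoid}, (Cstring,), name)
# Message send for zero-argument selectors returning an object.
msgsend(obj, sel) = ccall((:objc_msgSend, libobjc), Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}), obj, sel)

# The equivalent of [[NSAutoreleasePool alloc] init]:
pool = msgsend(msgsend(getclass("NSAutoreleasePool"), selector("alloc")), selector("init"))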

Probably, before spending too much time on revamping libcmt, it would be useful to agree on a scope for Metal.jl. I would be interested in support complete enough to write a game engine in Julia, e.g.:

  • Support render pipeline and also compiling render shaders written in Julia
  • Support for raytracing in compute pipeline
  • Support for debug annotations, gpu events, etc to profile and benchmark fully using Instruments

Poor performance of mapreduce

Probably a known issue to the devs, but just for the record:

using Metal, BenchmarkTools
N = 10_000_000
a = rand(Float32, N)
Ma = MtlArray(a)
@btime sum($a)
# 757.209 μs
@btime sum($Ma)
# 3.173 ms

An in-place operation yields even slower performance:

r = Metal.zeros(Float32, 1)
@btime Metal.@sync sum!($r, $Ma)
# 1.603 s (167108 allocations: 4.20 MiB)

Platform: Mac Studio with Apple M1 Max, Julia v1.8.0.

I just realized I'm not on Ventura but on Monterey; I don't know whether this is the cause of the poor performance. Other matrix operations are pretty fast, though.

Improve use of unified memory

Our buffers are currently allocated as GPU-only buffers by choosing the Private* storage mode. That's OK given our current CUDA-style programming model, where we perform explicit copies to and from the GPU, but it would be nice if we also properly supported buffers that are shared between CPU and GPU by selecting the Shared storage mode: https://developer.apple.com/documentation/metal/resource_fundamentals/choosing_a_resource_storage_mode_for_apple_gpus. This should probably be a kwarg to the MtlArray constructor.

*Since we choose Private storage mode, I'm not sure how the unified memory examples work...
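As a sketch of what the proposed interface could look like (the storage keyword is hypothetical here, not an existing constructor argument):

# Hypothetical kwarg on the MtlArray constructor selecting the storage mode.
a = MtlArray{Float32}(undef, 1024; storage=Metal.Shared)   # visible to both CPU and GPU
b = MtlArray{Float32}(undef, 1024; storage=Metal.Private)  # GPU-only, the current behavior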

slow broadcast copy in 2D

The following code evaluates the performance of copying between two 2D square Metal arrays a and b.
It achieves good bandwidth (360 GB/s) using the kernel version (commented line) but poor bandwidth (46.2 GB/s) using the broadcast expression (a .= b)...

Note that the broadcast expression is OK (equivalent to the kernel copy) for 1D arrays since the last bug fix.

using Metal

function kernel_copy!(a, b)
    (i,j) = thread_position_in_grid_2d()
    @inbounds a[i,j] = b[i,j]
    return
end

function device_copy(n=2^14,nsample=10)

    a = MtlArray(rand(Float32, n,n))
    b = MtlArray(rand(Float32, n,n))

    threads = (32,32)
    grid_size = cld.(n, threads)
    @show threads,grid_size

    ts=zeros(nsample)
    for i ∈ 1:nsample
        ts[i] = @elapsed Metal.@sync begin
            # @metal threads=threads grid=grid_size kernel_copy!(a, b)
            a .= b
        end
    end

    @assert Array(a)==Array(b)

    @show ts
    tmin = minimum(ts)

    size_in_bytes = 2*length(a)*sizeof(Float32) #1R+1W
    byte_per_ns = size_in_bytes / (tmin*1.e9)

    println("GBs: $(round(byte_per_ns; digits=3))")

    # Cleanup memory (is it necessary)
    finalize(a)
    finalize(b)
end
device_copy()

Argument buffer encoding is fragile

Simple kernel arguments are encoded easily and straightforwardly, but argument buffers (details here) are more complicated. To encode them as arguments, one must use an aptly named argument encoder. These encoders are created either from an argument buffer descriptor (which prevents argument buffers from being nested structures) or from a Metal function and a given argument index that must correspond to an argument buffer. The argument buffer fields must then be set exactly according to the structure expected by the argument encoder.

Since we need to handle nested structures, we can't rely on argument descriptors. This necessitates creating argument encoders using the second method (which I believe parses the expected argument buffer structure from the module's metadata). Since we create the Metal kernel LLVM/Apple IR ourselves, though, we need to accurately create the metadata to match the Julia argument's type. The way this happens is currently fragile, and will only become more difficult if moved to metallib-as. A better process must be adopted, but I'm not sure what that is.

Could we pre-generate the LLVM metadata from Metal.jl and pass that in some way to metallib-as which puts it in the module?

Super slow broadcast

Hi, the following MWE evaluates the observed bandwidth for a broadcasted copy between two Metal vectors of Float32.
It returns a value of 0.05 GB/s on my machine (M1 Max with 64 GB), while I observe 350 GB/s for the kernel copy...

using Metal

function slow_broadcast(n=2^20,nsample=10)

    a = MtlArray(rand(Float32, n))
    b = MtlArray(rand(Float32, n))

    nsample = 10
    ts=zeros(nsample)
    for i ∈ 1:nsample
        ts[i] = @elapsed Metal.@sync begin
            a .= b
        end
    end
    @show ts
    tmin = minimum(ts)

    size_in_bytes = 2n*sizeof(Float32) #1R+1W
    byte_per_ns = size_in_bytes / (tmin*1.e9)

    println("GBs: $(round(byte_per_ns; digits=4))")

    # Cleanup memory (is it necessary ?)
    finalize(a)
    finalize(b)
end

slow_broadcast()

error when using

After ]dev GPUCompiler MetalCore, brew install cmake, git clone git@github.com:PhilipVinc/cmt.git, and ln -s /Users/arthurb/src/cmt .julia/dev/MetalCore/cmt, I get the following error:

julia> using MetalCore
[ Info: Precompiling MetalCore [dde4c033-4e86-420c-a63e-0dd931031962]
ERROR: LoadError: LoadError: syntax: invalid iteration specification
Stacktrace:
 [1] top-level scope at /Users/arthurb/.julia/dev/MetalCore/src/execution/kernel.jl:84
 [2] include(::Function, ::Module, ::String) at ./Base.jl:380
 [3] include at ./Base.jl:368 [inlined]
 [4] include(::String) at /Users/arthurb/.julia/dev/MetalCore/src/MetalCore.jl:1
 [5] top-level scope at /Users/arthurb/.julia/dev/MetalCore/src/MetalCore.jl:25
 [6] include(::Function, ::Module, ::String) at ./Base.jl:380
 [7] include(::Module, ::String) at ./Base.jl:368
 [8] top-level scope at none:2
 [9] eval at ./boot.jl:331 [inlined]
 [10] eval(::Expr) at ./client.jl:467
 [11] top-level scope at ./none:3
in expression starting at /Users/arthurb/.julia/dev/MetalCore/src/execution/kernel.jl:84
in expression starting at /Users/arthurb/.julia/dev/MetalCore/src/MetalCore.jl:25
ERROR: Failed to precompile MetalCore [dde4c033-4e86-420c-a63e-0dd931031962] to /Users/arthurb/.julia/compiled/v1.5/MetalCore/ACDsk_Mctac.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1290
 [3] _require(::Base.PkgId) at ./loading.jl:1030
 [4] require(::Base.PkgId) at ./loading.jl:928
 [5] require(::Module, ::Symbol) at ./loading.jl:923

Relax package requirements

Metal.jl currently requires:

  • Julia 1.8
  • macOS 13, providing Metal 3
  • a mac with an M1 device

If people are interested in working on this, some of these can be relaxed:

  • Julia support: by back-porting our LLVM back-end to more LLVM versions: see the metal_release_13 and metal_release_14 branches; they'd need to be applied on top of llvm_release_12 for 1.7 compatibility
  • macOS support: we need the bindless gpuAddress property, which exists on 10.15+ (only officially on Metal 3 though, hence us only officially supporting Ventura), so it would be possible to support more versions of macOS. But this would need testing...
  • hardware support: Metal also supports non-M1 hardware, so it should be possible to support that. However, the floor project (which our LLVM back-end is inspired by) has a whole bunch of NVIDIA/AMD-specific codegen hacks, which I removed, so it's expected that some of those would need to be put back.

Improve performance of Cartesian indexing

Metal GPUs suffer from the way we encode Cartesian indices, presumably because of the integer division that happens when mapping a linear index to a Cartesian one, but there may be other causes. In #100 and JuliaGPU/GPUArrays.jl#454, we worked around some of the more egregious performance issues by putting the indices in the type domain such that they are known to LLVM, allowing the back-end compiler to optimize the code (again, presumably avoiding the division by a constant integer by mapping it onto a bunch of bit operations).
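For reference, the cost comes from the per-thread linear-to-Cartesian conversion, which needs an integer div/rem per dimension. A plain-Julia sketch for a column-major 3D array of size (nx, ny, nz):

# Each GPU thread effectively computes this to find its Cartesian position.
function linear_to_cartesian(i, nx, ny)
    i0 = i - 1                  # zero-based linear index
    x = i0 % nx + 1             # remainder
    y = (i0 ÷ nx) % ny + 1      # the integer divisions are the costly part,
    z = i0 ÷ (nx * ny) + 1      # unless nx/ny are compile-time constants
    return (x, y, z)
end

When the sizes are compile-time constants (as with the type-domain workaround above), the back-end compiler can lower these divisions to multiplies and shifts.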

This isn't ideal because it results in significantly more kernels being compiled. Ideally we figure out a way to better encode Cartesian indices, although it's obviously hard to avoid the integer division at all.

Alternatively, we might want to improve https://github.com/maleadt/StaticCartesian.jl, or something similar, so that we can perform this optimization ourselves instead of relying on the Metal back-end compiler, because relying on such an optimization might be fragile (as observed in JuliaGPU/GPUArrays.jl#454 where we needed additional bounds information for the optimization to trigger).

I have 2 questions about Metal.jl and Flux.jl

1. From the Flux documentation demo:

using Metal,Flux

    function cu(par)
        MtlArray(par)
    end

  W = cu(rand(2, 5)) 
  b = cu(rand(2))

  predict(x) = W*x .+ b
  loss(x, y) = sum((predict(x) .- y).^2)

 x, y = cu(rand(5)), cu(rand(2)) 
 loss(x, y) 
ERROR: LoadError: InvalidIRError: compiling kernel #63#64(Metal.mtlKernelContext, MtlDeviceVector{Float64, 1}, MtlDeviceMatrix{Float64, 1}, MtlDeviceVector{Float64, 1}) resulted in invalid LLVM IR
Reason: unsupported use of double floating-point value
Reason: unsupported use of double floating-point value

Just replacing the function's name with MtlArray works fine! But the loss function seems to have a problem, and I don't know how to solve it.

2. Is there a built-in gpu function like in CUDA.jl?

  julia> using Flux, CUDA

julia> m = Dense(10, 5) |> gpu
Dense(10 => 5)      # 55 parameters

julia> x = rand(10) |> gpu

The model's layers can be defined one by one, but fmap is not working.

Like this:

using Metal,Flux

    function cu(par)
        MtlArray(par)
    end
d = Dense(10 => 5, σ)
d = fmap(cu, d)
d.weight 
d(cu(rand(10))) 

m = Chain(Dense(10 => 5, σ), Dense(5 => 2), softmax)
m = fmap(cu, m)
d(cu(rand(10)))

Base.unsafe_convert is ambiguous

julia> Base.unsafe_convert(Ptr{Float32}, MtlArray(rand(Float32, 5)))
ERROR: MethodError: unsafe_convert(::Type{Ptr{Float32}}, ::MtlArray{Float32, 1}) is ambiguous.

Candidates:
  unsafe_convert(::Type{Ptr{T}}, a::AbstractArray{T}) where T
    @ Base pointer.jl:67
  unsafe_convert(::Type{Ptr{S}}, a::AbstractArray{T}) where {S, T}
    @ Base pointer.jl:66
  unsafe_convert(::Type{<:Ptr}, x::MtlArray)
    @ Metal ~/.julia/packages/Metal/OGOSN/src/array.jl:128

Possible fix, define
  unsafe_convert(::Type{Ptr{T}}, ::MtlArray{T}) where T

Stacktrace:
 [1] top-level scope
   @ REPL[39]:1
 [2] top-level scope
   @ ~/.julia/packages/Metal/OGOSN/src/initialization.jl:25

(jl_764Pyi) pkg> st Metal
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_764Pyi/Project.toml`
  [dde4c033] Metal v0.1.1

I'm not sure whether this should or should not work. But I expected something like either of these two:

julia> using JLArrays  # Pkg.add(url="https://github.com/JuliaGPU/GPUArrays.jl.git", subdir="lib/JLArrays") for now

julia> Base.unsafe_convert(Ptr{Float32}, jl(rand(Float32, 5)))
Ptr{Float32} @0x0000000159404e00

julia> using CUDA

julia> Base.unsafe_convert(Ptr{Float32}, cu(rand(Float32, 5)))
ERROR: ArgumentError: cannot take the CPU address of a CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}
Stacktrace:
 [1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
   @ CUDA ~/.julia/packages/CUDA/tTK8Y/src/array.jl:319
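A sketch of the CUDA.jl-style resolution (assuming erroring is the desired behavior, which the issue leaves open):

# Resolve the ambiguity with a method specific enough to win dispatch,
# and refuse to hand out a CPU pointer to device memory, as CUDA.jl does.
Base.unsafe_convert(::Type{Ptr{T}}, x::MtlArray{T}) where {T} =
    throw(ArgumentError("cannot take the CPU address of a MtlArray"))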

Command buffer callbacks can cause bus error during thread adoption

I encountered an error while running tests for #132 in gpuarrays/linalg that did not appear in the next run. I'm pretty sure this isn't my first time running into this bug, but I never paid attention to it because it would go away after rerunning.

It happened on M2 Max running the 1.9.0-rc1 tests for #126 after rebasing on main.

This is all I can really say about it for now. I'll add more details if I encounter it in different circumstances.

Error output without the expected errors that will be fixed by #136:

     Testing Running tests...
┌ Info: System information:
│ macOS 13.2.1, Darwin 21.4.0
│
│ Toolchain:
│ - Julia: 1.9.0-rc1
│ - LLVM: 14.0.6
│
│ 1 device:
└ - Apple M2 Max (64.000 KiB allocated)
┌ Info: Using Metal LLVM back-end from /Users/christian/.julia/artifacts/3c74b0072cc694992a9d90b5778fb28f7ec53251/bin:
│ LLVM (http://llvm.org/):
│   LLVM version 14.0.0
│   Optimized build.
│   Default target: aarch64-apple-darwin22.3.0
└   Host CPU: cyclone
[ Info: Running 8 tests in parallel. If this is too many, specify the `--jobs` argument to the tests, or set the JULIA_CPU_THREADS environment variable.
                                                  |          | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
      From worker 2:	┌ Warning: Metal does not support Float64 values, try using Float32 instead
      From worker 2:	└ @ Metal ~/.julia/dev/Metal/src/array.jl:38
metal                                         (5) |     1.62 |   0.03 |  1.8 |     226.30 |   480.25 |
mps                                           (6) |     2.50 |   0.06 |  2.2 |     439.30 |   475.64 |
      From worker 10:	2023-03-17 11:40:26.883 julia[29514:132830] Metal GPU Frame Capture Enabled
execution                                     (4) |     7.58 |   0.23 |  3.0 |    1272.39 |   589.84 |
      From worker 10:	[ Info: GPU frame capture saved to /private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_hyw58a/test.gputrace/julia_capture_1.gputrace/
profiling                                    (10) |     6.43 |   0.21 |  3.3 |     969.57 |   558.34 |
gpuarrays/indexing scalar                     (9) |    11.29 |   0.35 |  3.1 |    1880.50 |   624.77 |
array                                         (2) |    15.90 |   0.53 |  3.3 |    2776.12 |   756.33 |
device/intrinsics                             (8) |    17.77 |   0.59 |  3.3 |    3267.70 |   679.89 |
gpuarrays/interface                           (8) |     1.63 |   0.10 |  6.0 |     356.34 |   687.48 |
gpuarrays/indexing multidimensional           (2) |    11.87 |   0.43 |  3.6 |    2322.29 |   825.42 |
gpuarrays/math/power                          (4) |    20.37 |   1.24 |  6.1 |    3803.57 |   891.00 |
gpuarrays/reductions/any all count            (8) |     9.85 |   0.47 |  4.8 |    2528.00 |   738.88 |
gpuarrays/uniformscaling                      (4) |     3.98 |   0.06 |  1.6 |     524.28 |   912.48 |
examples                                      (3) |    37.07 |   0.00 |  0.0 |      11.03 |   447.86 |
gpuarrays/indexing find                      (11) |    23.50 |   1.40 |  5.9 |    5013.99 |   856.53 |
gpuarrays/linalg/mul!/vector-matrix           (9) |    31.28 |   0.90 |  2.9 |    5702.68 |   848.75 |
gpuarrays/math/intrinsics                     (3) |     8.44 |   0.29 |  3.4 |    1333.39 |   618.14 |
gpuarrays/reductions/mapreducedim!_large      (8) |    32.91 |   0.91 |  2.8 |    8352.05 |  1650.42 |
gpuarrays/linalg                              (6) |         failed at 2023-03-17T11:41:32.879
gpuarrays/statistics                          (9) |    31.01 |   1.87 |  6.0 |    7226.16 |  1141.23 |
gpuarrays/reductions/reducedim!               (5) |    72.54 |   2.62 |  3.6 |   13272.53 |  1164.95 |
gpuarrays/linalg/mul!/matrix-matrix           (4) |    42.65 |   1.03 |  2.4 |    6976.63 |  1180.75 |
gpuarrays/constructors                        (8) |    13.78 |   0.49 |  3.6 |    2271.98 |  1779.42 |
gpuarrays/linalg/norm                        (11) |    40.44 |   2.14 |  5.3 |    7487.01 |  1127.58 |
gpuarrays/base                                (9) |    15.35 |   0.99 |  6.5 |    3650.37 |  1309.75 |
gpuarrays/random                             (12) |    15.21 |   0.56 |  3.7 |    2860.24 |   688.80 |
gpuarrays/reductions/== isequal               (5) |         failed at 2023-03-17T11:42:23.920
gpuarrays/reductions/mapreducedim!            (8) |    73.59 |   3.27 |  4.4 |   15854.44 |  2294.77 |
gpuarrays/reductions/minimum maximum extrema  (2) |   130.63 |   6.99 |  5.4 |   25798.04 |  1888.75 |
gpuarrays/reductions/mapreduce                (3) |   130.60 |   5.88 |  4.5 |   31158.60 |  1454.33 |
gpuarrays/reductions/sum prod                 (9) |    90.87 |   4.67 |  5.1 |   18240.08 |  1879.81 |
gpuarrays/broadcasting                        (4) |   110.91 |   6.51 |  5.9 |   20943.60 |  1871.67 |
gpuarrays/reductions/reduce                  (11) |   109.63 |   4.84 |  4.4 |   19502.77 |  1723.31 |
Testing finished in 3 minutes, 12 seconds, 526 milliseconds
Worker 6 failed running test gpuarrays/linalg:
Some tests did not pass: 232 passed, 1 failed, 0 errored, 0 broken.
gpuarrays/linalg: Test Failed at /Users/christian/.julia/packages/GPUArrays/7TiO1/test/testsuite/linalg.jl:32
  Expression: let
    x = rand(Float32, 4, [2 for _ = 2:18]...)
    pm = (18:-1:1...,)
    y = permutedims(x, pm)
    Array(GPUArrays._permutedims!(UInt64, AT(zero(y)), AT(x), pm)) ≈ y
end

Stacktrace:
 [1] backtrace()
   @ Base ./error.jl:114
 [2] record(ts::Test.DefaultTestSet, t::Union{Test.Error, Test.Fail})
   @ Test ~/.julia/juliaup/julia-1.9.0-rc1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Test/src/Test.jl:1041
 [3] top-level scope
   @ ~/.julia/dev/Metal/test/runtests.jl:363
 [4] include(fname::String)
   @ Base.MainInclude ./client.jl:478
 [5] top-level scope
   @ none:6
 [6] eval
   @ ./boot.jl:370 [inlined]
 [7] exec_options(opts::Base.JLOptions)
   @ Base ./client.jl:280
 [8] _start()
   @ Base ./client.jl:522
Worker 5 failed running test gpuarrays/reductions/== isequal:
Some tests did not pass: 242 passed, 0 failed, 6 errored, 0 broken.
### Start mapreduce threads errors
.
.
.
### End mapreduce threads errors
Test Summary:                                  | Pass  Fail  Error  Total  Time
  Overall                                      | 5823     1      6   5830      
    metal                                      |  127                 127      
    mps                                        |    5                   5      
    execution                                  |   17                  17      
    profiling                                  |   22                  22      
    gpuarrays/indexing scalar                  |  398                 398      
    array                                      |  190                 190      
    device/intrinsics                          |   25                  25      
    gpuarrays/interface                        |    7                   7      
    gpuarrays/indexing multidimensional        |   42                  42      
    gpuarrays/math/power                       |   60                  60      
    gpuarrays/reductions/any all count         |  101                 101      
    gpuarrays/uniformscaling                   |   56                  56      
    examples                                   |    3                   3      
    gpuarrays/indexing find                    |   45                  45      
    gpuarrays/linalg/mul!/vector-matrix        |  140                 140      
    gpuarrays/math/intrinsics                  |   10                  10      
    gpuarrays/reductions/mapreducedim!_large   |   40                  40      
    gpuarrays/linalg                           |  232     1           233      
    gpuarrays/statistics                       |   52                  52      
    gpuarrays/reductions/reducedim!            |  160                 160      
    gpuarrays/linalg/mul!/matrix-matrix        |  360                 360      
    gpuarrays/constructors                     |  770                 770      
    gpuarrays/linalg/norm                      |  264                 264      
    gpuarrays/base                             |   73                  73      
    gpuarrays/random                           |   50                  50      
    gpuarrays/reductions/== isequal            |  242            6    248      
    gpuarrays/reductions/mapreducedim!         |  260                 260      
    gpuarrays/reductions/minimum maximum extrema |  555                 555      
    gpuarrays/reductions/mapreduce             |  330                 330      
    gpuarrays/reductions/sum prod              |  636                 636      
    gpuarrays/broadcasting                     |  331                 331      
    gpuarrays/reductions/reduce                |  220                 220      
    FAILURE

Error in testset gpuarrays/linalg:
Test Failed at /Users/christian/.julia/packages/GPUArrays/7TiO1/test/testsuite/linalg.jl:32
  Expression: let
    x = rand(Float32, 4, [2 for _ = 2:18]...)
    pm = (18:-1:1...,)
    y = permutedims(x, pm)
    Array(GPUArrays._permutedims!(UInt64, AT(zero(y)), AT(x), pm)) ≈ y
end

Error in testset gpuarrays/reductions/== isequal:
### The rest is more mapreduce thread errors

Add support for creating an MtlArray using memory allocated by Array

Hi!

The package already supports creating an Array from the memory allocated by an MtlArray, but not the other way around. It is possible, but requires that the allocated memory be page-aligned. The following code is an MWE of how this can be achieved (without worrying about object destruction):

using Metal

# Obtain the page size.
pagesize = ccall(:getpagesize, Cint, ())

# Dimensions and type of the desired array.
dims = (1000, 1000)
T = Float32

# Compute how many pages we need to store the array.
npages = ceil(Int, prod(dims) * sizeof(T) / pagesize)

# Compute the total number of bytes.
nbytes = npages * pagesize

# Allocate memory aligned with the page.
addr = Ref(C_NULL)
ccall(
    :posix_memalign,
    Cint,
    (Ptr{Ptr{Cvoid}}, Csize_t, Csize_t),
    addr,
    pagesize,
    prod(dims) * sizeof(T)
)

# Wrap the allocated memory to a Julia array.
array = unsafe_wrap(Array{T, length(dims)}, reinterpret(Ptr{T}, addr[]), dims, own = false)

# Create the MtlArray.
pbuf = Metal.MTL.mtDeviceNewBufferWithBytesNoCopy(
    current_device(),
    addr[],
    nbytes,
    Metal.Shared | Metal.MTL.DefaultTracking | Metal.MTL.DefaultCPUCache
)
buf = MtlBuffer(pbuf)
marray = MtlArray{T, length(dims)}(buf, dims)

julia> array[1, 1] = 100.0
100.0

julia> marray[1, 1]
100.0

julia> array[1000, 1000] = 1986
1986

julia> marray[1000, 1000]
1986.0

julia> marray[1000, 1000] = 1987
1987

julia> array[1000, 1000]
1987.0

Errors running on M1 Max

Having trouble getting everything running; here's the versioninfo for Julia:

[image: Julia versioninfo output]

I installed the linked fork of GPUCompiler.jl and tried running all three scripts in the examples folder. The following error shows up:

[image: error output]

Any ideas?

Restore mtlcall

We used to have the ability to call pre-compiled kernels (e.g. by passing Metal source code to the appropriate API functions), see https://github.com/JuliaGPU/Metal.jl/tree/9afb62460f8005db00dd3ea71a278758853b24e9/examples/driver. That got lost when we started relying on argument metadata generated by GPUCompiler to configure argument encoders. Ideally we'd still retain the ability to use non-Julia kernels, but it's not clear how to do that (do we want to be able to reconstruct the argument metadata from a tuple type? or do we only allow 'simple' arguments for such kernels, i.e., that do not require argument encoders?).

Copysign intrinsic possibly wrong

# src/device/intrinsics/math.jl
@device_function copysign_fast(x::Float32) = ccall("extern air.fast_copysign.f32", llvmcall, Cfloat, (Cfloat,), x)
@device_override Base.copysign(x::Float32) = ccall("extern air.copysign.f32", llvmcall, Cfloat, (Cfloat,), x)
@device_override Base.copysign(x::Float16) = ccall("extern air.copysign.f16", llvmcall, Float16, (Float16,), x)

There is no function matching this signature (only one argument x) in Julia or the Metal docs:

# 13 methods for generic function "copysign":
[1] copysign(x::Signed, y::Signed) in Base at int.jl:150
[2] copysign(x::Signed, y::Float16) in Base at int.jl:151
[3] copysign(x::Signed, y::Float32) in Base at int.jl:152
[4] copysign(x::Signed, y::Float64) in Base at int.jl:153
[5] copysign(x::Signed, y::Real) in Base at int.jl:154
[6] copysign(x::Rational, y::Rational) in Base at rational.jl:259
[7] copysign(x::Rational, y::Real) in Base at rational.jl:258
[8] copysign(x::Float64, y::Float64) in Base at floatfuncs.jl:5
[9] copysign(x::Float64, y::Real) in Base at floatfuncs.jl:8
[10] copysign(x::Float32, y::Float32) in Base at floatfuncs.jl:6
[11] copysign(x::Float32, y::Real) in Base at floatfuncs.jl:7
[12] copysign(x::BigFloat, y::BigFloat) in Base.MPFR at mpfr.jl:852
[13] copysign(x::Real, y::Real) in Base at number.jl:209
/// Return x with its sign changed to match the sign of y.
T copysign(T x, T y)
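Presumably the bindings were meant to take two arguments, matching the T copysign(T x, T y) signature above. A sketch of the corrected form (an assumption, untested):

# src/device/intrinsics/math.jl (sketch)
@device_function copysign_fast(x::Float32, y::Float32) = ccall("extern air.fast_copysign.f32", llvmcall, Cfloat, (Cfloat, Cfloat), x, y)
@device_override Base.copysign(x::Float32, y::Float32) = ccall("extern air.copysign.f32", llvmcall, Cfloat, (Cfloat, Cfloat), x, y)
@device_override Base.copysign(x::Float16, y::Float16) = ccall("extern air.copysign.f16", llvmcall, Float16, (Float16, Float16), x, y)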

Validation-related back-end crash on macOS Ventura

The following IR, reduced from our test suite, fails under MTL_SHADER_VALIDATOR=1 on macOS Ventura:

; ModuleID = 'broken.ll'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.0.0"

define void @kernel_function({ i8 addrspace(1)*, [1 x i64] } addrspace(1)* %0, float addrspace(1)* %1, { { i8 addrspace(1)*, [1 x i64] }, i8, i8 } addrspace(1)* %2, i64 addrspace(1)* %3, i64 addrspace(1)* %4, i32 %thread_position_in_threadgroup, i32 %threadgroup_position_in_grid, i32 %threads_per_threadgroup, i32 %threadgroups_per_grid) local_unnamed_addr {
conversion:
  %5 = bitcast { i8 addrspace(1)*, [1 x i64] } addrspace(1)* %0 to [2 x float] addrspace(1)* addrspace(1)*
  %.unpack913 = load [2 x float] addrspace(1)*, [2 x float] addrspace(1)* addrspace(1)* %5, align 8
  %6 = load { { i8 addrspace(1)*, [1 x i64] }, i8, i8 }, { { i8 addrspace(1)*, [1 x i64] }, i8, i8 } addrspace(1)* %2, align 8
  %.fca.0.0.extract = extractvalue { { i8 addrspace(1)*, [1 x i64] }, i8, i8 } %6, 0, 0
  %7 = alloca [2 x float], align 16
  %.sub = bitcast [2 x float]* %7 to i8*
  %8 = bitcast i8 addrspace(1)* %.fca.0.0.extract to [2 x float] addrspace(1)*
  %.elt = getelementptr inbounds [2 x float], [2 x float] addrspace(1)* %8, i64 undef, i64 0
  %.unpack = load float, float addrspace(1)* %.elt, align 4
  %.repack = getelementptr inbounds [2 x float], [2 x float]* %7, i64 0, i64 0
  store float %.unpack, float* %.repack, align 16
  %9 = getelementptr i8, i8* %.sub, i64 undef
  %10 = bitcast i8* %9 to float*
  %11 = load float, float* %10, align 4
  %.repack11 = getelementptr inbounds [2 x float], [2 x float] addrspace(1)* %.unpack913, i64 undef, i64 0
  store float %11, float addrspace(1)* %.repack11, align 4
  ret void
}

attributes #0 = { cold noreturn nounwind }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7, !8}
!air.kernel = !{!10}
!llvm.ident = !{!22}
!air.version = !{!23}
!air.language_version = !{!24}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{i32 7, !"air.max_device_buffers", i32 31}
!3 = !{i32 7, !"air.max_constant_buffers", i32 31}
!4 = !{i32 7, !"air.max_threadgroup_buffers", i32 31}
!5 = !{i32 7, !"air.max_textures", i32 128}
!6 = !{i32 7, !"air.max_read_write_textures", i32 8}
!7 = !{i32 7, !"air.max_samplers", i32 16}
!8 = !{i32 2, !"SDK Version", [2 x i32] [i32 13, i32 0]}
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*, float addrspace(1)*, { { i8 addrspace(1)*, [1 x i64] }, i8, i8 } addrspace(1)*, i64 addrspace(1)*, i64 addrspace(1)*, i32, i32, i32, i32)* @kernel_function, !11, !12}
!11 = !{}
!12 = !{!13, !14, !15, !16, !17, !18, !19, !20, !21}
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8}
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 4, !"air.arg_type_align_size", i32 8}
!15 = !{i32 2, !"air.buffer", !"air.location_index", i32 2, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 24, !"air.arg_type_align_size", i32 8}
!16 = !{i32 3, !"air.buffer", !"air.location_index", i32 3, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!17 = !{i32 4, !"air.buffer", !"air.location_index", i32 4, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!18 = !{i32 5, !"air.thread_position_in_threadgroup", !"air.arg_type_name", !"uint"}
!19 = !{i32 6, !"air.threadgroup_position_in_grid", !"air.arg_type_name", !"uint"}
!20 = !{i32 7, !"air.threads_per_threadgroup", !"air.arg_type_name", !"uint"}
!21 = !{i32 8, !"air.threadgroups_per_grid", !"air.arg_type_name", !"uint"}
!22 = !{!"Apple metal version 31001.322 (metalfe-31001.322.1)"}
!23 = !{i32 2, i32 4, i32 0}
!24 = !{!"Metal", i32 2, i32 4, i32 0}

Looks like a back-end issue; from log stream:

LLVM ERROR: unable to legalize instruction: %244:_(p0) = 141 %243:_, 4
Context:
%244:_(p0) = 141 %243:_, 4
%243:_(p0) = 66 %493:_(s64)
%493:_(s64) = 101 %492:_(s32)
%492:_(s32) = 35 %490:_, %495:_
%490:_(s32) = 94 %489:_(s64)
%495:_(s32) = 95 i32 8
%489:_(s64) = 65 %242:_(p0)
%242:_(p0) = 15 $noreg
 (in function: agc.main)

Support for exceptions

For a simple situation where I need to take the floor of a number and convert it to an integer, the compilation step through GPUCompiler fails with:

ERROR: InvalidIRError: compiling kernel #36#37(MtlDeviceVector{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_malloc)

The failing code is the following:

import Metal
Metal.@sync Metal.@metal (xs-> (x = xs[1]; k = Int(x); xs[1] = k; nothing))(Metal.MtlVector(Float32[1,2,3]))
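Until exceptions are supported, one workaround (a sketch) is to use unsafe_trunc, which skips the exactness check that pulls in the exception machinery; note that it truncates toward zero, so combine it with floor if negative values matter:

import Metal
# unsafe_trunc avoids the InexactError branch, so no gpu_malloc is needed.
Metal.@sync Metal.@metal (xs -> (xs[1] = unsafe_trunc(Int32, floor(xs[1])); nothing))(Metal.MtlVector(Float32[1,2,3]))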

Int64 not supported on AMD GPUs?

I am running Julia Version 1.8.0-rc1 (2022-05-27) on OS X 12.4 with an AMD Radeon Pro 5700 XT GPU.

julia> a .+ 1
┌ Warning: Compilation of MetalLib to native code failed.
│ If you think this is a bug, please file an issue and attach /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/jl_OUC1h1KIc6.metallib.
└ @ Metal ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:178
ERROR: MtlError: Compiler encountered an internal error (code 2, CompilerError)

Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Metal/fQowO/lib/core/helpers.jl:68 [inlined]
  [2] MtlComputePipelineState(d::MtlDevice, f::MtlFunction)
    @ Metal.MTL ~/.julia/packages/Metal/fQowO/lib/core/compute_pipeline.jl:25
  [3] mtlfunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry), Tuple{Vector{UInt8}, String}})
    @ Metal ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:172
  [4] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(Metal.mtlfunction_compile), linker::typeof(Metal.mtlfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/iaKrd/src/cache.jl:95
  [5] mtlfunction(f::GPUArrays.var"#broadcast_kernel#15", tt::Type{Tuple{Metal.mtlKernelContext, MtlDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{MtlDeviceVector{Int64, 1}, Tuple{Bool}, Tuple{Int64}}, Int64}}, Int64}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metal ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:143
  [6] mtlfunction
    @ ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:136 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:64 [inlined]
  [8] #launch_heuristic#53
    @ ~/.julia/packages/Metal/fQowO/src/gpuarrays.jl:14 [inlined]
  [9] _copyto!
    @ ~/.julia/packages/GPUArrays/EVTem/src/host/broadcast.jl:73 [inlined]
 [10] copyto!
    @ ~/.julia/packages/GPUArrays/EVTem/src/host/broadcast.jl:56 [inlined]
 [11] copy
    @ ~/.julia/packages/GPUArrays/EVTem/src/host/broadcast.jl:47 [inlined]
 [12] materialize(bc::Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Nothing, typeof(+), Tuple{MtlArray{Int64, 1}, Int64}})
    @ Base.Broadcast ./broadcast.jl:860
 [13] top-level scope
    @ REPL[6]:1
 [14] top-level scope
    @ ~/.julia/packages/Metal/fQowO/src/initialization.jl:25

Here are the details:

 % ./usr/bin/julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-rc1 (2022-05-27)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |

julia> import Pkg; Pkg.add("Metal")
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed GPUArrays ──────────── v8.4.0
   Installed Metal_LLVM_Tools_jll ─ v0.3.0+1
   Installed cmt_jll ────────────── v0.1.0+0
   Installed GPUArraysCore ──────── v0.1.0
   Installed CEnum ──────────────── v0.4.2
   Installed LLVMExtra_jll ──────── v0.0.16+0
   Installed GPUCompiler ────────── v0.16.1
   Installed Metal ──────────────── v0.1.0
   Installed LLVM ───────────────── v4.14.0
  Downloaded artifact: Metal_LLVM_Tools
  Downloaded artifact: LLVMExtra
  Downloaded artifact: cmt
    Updating `~/.julia/environments/v1.8/Project.toml`
  [dde4c033] + Metal v0.1.0
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [79e6a3ab] + Adapt v3.3.3
  [fa961155] + CEnum v0.4.2
  [e2ba6199] + ExprTools v0.1.8
  [0c68f7d7] + GPUArrays v8.4.0
  [46192b85] + GPUArraysCore v0.1.0
  [61eb1bfa] + GPUCompiler v0.16.1
  [692b3bcd] + JLLWrappers v1.4.1
  [929cbde3] + LLVM v4.14.0
  [dde4c033] + Metal v0.1.0
  [21216c6a] + Preferences v1.3.0
  [189a3867] + Reexport v1.2.2
  [a759f4b9] + TimerOutputs v0.5.20
  [dad2f222] + LLVMExtra_jll v0.0.16+0
  [0418c028] + Metal_LLVM_Tools_jll v0.3.0+1
  [65323cdd] + cmt_jll v0.1.0+0
  [0dad84c5] + ArgTools v1.1.1
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads v1.6.0
  [7b1f6079] + FileWatching
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL v0.6.3
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [ca575930] + NetworkOptions v1.2.0
  [44cfe95a] + Pkg v1.8.0
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA v0.7.0
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [fa267f1f] + TOML v1.0.0
  [a4e569a6] + Tar v1.10.0
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll v0.5.2+0
  [deac9b47] + LibCURL_jll v7.81.0+0
  [29816b5a] + LibSSH2_jll v1.10.2+0
  [c8ffd9c3] + MbedTLS_jll v2.28.0+0
  [14a3606d] + MozillaCACerts_jll v2022.2.1
  [4536629a] + OpenBLAS_jll v0.3.20+0
  [83775a58] + Zlib_jll v1.2.12+3
  [8e850b90] + libblastrampoline_jll v5.1.0+0
  [8e850ede] + nghttp2_jll v1.41.0+1
  [3f19e933] + p7zip_jll v17.4.0+0
Precompiling project...
  21 dependencies successfully precompiled in 12 seconds

julia> Metal.versioninfo()
ERROR: UndefVarError: Metal not defined
Stacktrace:
 [1] top-level scope
   @ REPL[2]:1

julia> using Metal

julia> Metal.versioninfo()
macOS 12.4.0, Darwin 21.5.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- AMD Radeon Pro 5700 XT (0 bytes allocated)

julia> a = MtlArray([1])
1-element MtlArray{Int64, 1}:
 1

julia> a .+ 1
┌ Warning: Compilation of MetalLib to native code failed.
│ If you think this is a bug, please file an issue and attach /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/jl_OUC1h1KIc6.metallib.
└ @ Metal ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:178
ERROR: MtlError: Compiler encountered an internal error (code 2, CompilerError)
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Metal/fQowO/lib/core/helpers.jl:68 [inlined]
  [2] MtlComputePipelineState(d::MtlDevice, f::MtlFunction)
    @ Metal.MTL ~/.julia/packages/Metal/fQowO/lib/core/compute_pipeline.jl:25
  [3] mtlfunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry), Tuple{Vector{UInt8}, String}})
    @ Metal ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:172
  [4] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(Metal.mtlfunction_compile), linker::typeof(Metal.mtlfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/iaKrd/src/cache.jl:95
  [5] mtlfunction(f::GPUArrays.var"#broadcast_kernel#15", tt::Type{Tuple{Metal.mtlKernelContext, MtlDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{MtlDeviceVector{Int64, 1}, Tuple{Bool}, Tuple{Int64}}, Int64}}, Int64}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metal ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:143
  [6] mtlfunction
    @ ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:136 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/Metal/fQowO/src/compiler/execution.jl:64 [inlined]
  [8] #launch_heuristic#53
    @ ~/.julia/packages/Metal/fQowO/src/gpuarrays.jl:14 [inlined]
  [9] _copyto!
    @ ~/.julia/packages/GPUArrays/EVTem/src/host/broadcast.jl:73 [inlined]
 [10] copyto!
    @ ~/.julia/packages/GPUArrays/EVTem/src/host/broadcast.jl:56 [inlined]
 [11] copy
    @ ~/.julia/packages/GPUArrays/EVTem/src/host/broadcast.jl:47 [inlined]
 [12] materialize(bc::Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Nothing, typeof(+), Tuple{MtlArray{Int64, 1}, Int64}})
    @ Base.Broadcast ./broadcast.jl:860
 [13] top-level scope
    @ REPL[6]:1
 [14] top-level scope
    @ ~/.julia/packages/Metal/fQowO/src/initialization.jl:25

julia> device(a)
MtlDevice:
 name:             AMD Radeon Pro 5700 XT
 lowpower:         false
 headless:         false
 removable:        false
 unified memory:   false
 registry id:      4294968934
 transfer rate:    0

julia> task_local_storage()[:MtlDevice] = MtlDevice(1)
MtlDevice:
 name:             AMD Radeon Pro 5700 XT
 lowpower:         false
 headless:         false
 removable:        false
 unified memory:   false
 registry id:      4294968934
 transfer rate:    0

julia> 

cmt: Release build fails install

[12:13:11] [34/34] Install the project...
[12:13:11] -- Install configuration: "Release"
[12:13:11] -- Installing: /workspace/destdir/lib/libcmt.dylib
[12:13:11] -- Installing: /workspace/destdir/include/cmt.h
[12:13:11] CMake Error at cmake_install.cmake:66 (file):
[12:13:11]   file INSTALL cannot find
[12:13:11]   "/workspace/srcdir/Metal.jl/deps/cmt/build/libcmt.dylib.dSYM": No such file
[12:13:11]   or directory.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

I get this, my name isn't Tim

% julia --project -e 'using Pkg; Pkg.build(); Pkg.instantiate()'
ERROR: Build path for LLVM does not exist: /Users/tim/Julia/pkg/LLVM
Stacktrace:
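
The likely cause is a Manifest.toml committed with LLVM dev'ed at the original developer's local path (hence the /Users/tim/... in the error). A hedged workaround, assuming that diagnosis is right: from the project directory, delete the stale manifest and re-resolve before building.

julia> import Pkg

julia> rm("Manifest.toml")   # drops the manifest that pins LLVM to /Users/tim/Julia/pkg/LLVM

julia> Pkg.instantiate(); Pkg.build()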

Metal.jl fails to precompile on Linux

Hi,
On a Linux box, ] add Metal leads to a precompile error, e.g.:

(@v1.8) pkg> activate --temp
  Activating new project at `/tmp/jl_F2AG4H`

(jl_F2AG4H) pkg> add Metal
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed GPUArraysCore ──────── v0.1.4
   Installed GPUCompiler ────────── v0.16.8
   Installed cmt_jll ────────────── v0.1.0+0
   Installed Metal_LLVM_Tools_jll ─ v0.3.0+2
   Installed GPUArrays ──────────── v8.6.3
   Installed Metal ──────────────── v0.1.2
    Updating `/tmp/jl_F2AG4H/Project.toml`
  [dde4c033] + Metal v0.1.2
    Updating `/tmp/jl_F2AG4H/Manifest.toml`
  [79e6a3ab] + Adapt v3.5.0
  [fa961155] + CEnum v0.4.2
  [e2ba6199] + ExprTools v0.1.8
  [0c68f7d7] + GPUArrays v8.6.3
  [46192b85] + GPUArraysCore v0.1.4
⌅ [61eb1bfa] + GPUCompiler v0.16.8
  [692b3bcd] + JLLWrappers v1.4.1
  [929cbde3] + LLVM v4.16.0
  [dde4c033] + Metal v0.1.2
  [21216c6a] + Preferences v1.3.0
  [189a3867] + Reexport v1.2.2
  [a759f4b9] + TimerOutputs v0.5.22
  [dad2f222] + LLVMExtra_jll v0.0.16+0
  [0418c028] + Metal_LLVM_Tools_jll v0.3.0+2
  [65323cdd] + cmt_jll v0.1.0+0
  [0dad84c5] + ArgTools v1.1.1
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads v1.6.0
  [7b1f6079] + FileWatching
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL v0.6.3
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [ca575930] + NetworkOptions v1.2.0
  [44cfe95a] + Pkg v1.8.0
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA v0.7.0
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [fa267f1f] + TOML v1.0.0
  [a4e569a6] + Tar v1.10.1
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll v1.0.1+0
  [deac9b47] + LibCURL_jll v7.84.0+0
  [29816b5a] + LibSSH2_jll v1.10.2+0
  [c8ffd9c3] + MbedTLS_jll v2.28.0+0
  [14a3606d] + MozillaCACerts_jll v2022.2.1
  [4536629a] + OpenBLAS_jll v0.3.20+0
  [83775a58] + Zlib_jll v1.2.12+3
  [8e850b90] + libblastrampoline_jll v5.1.1+0
  [8e850ede] + nghttp2_jll v1.48.0+0
  [3f19e933] + p7zip_jll v17.4.0+0
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
Precompiling project...
  ✗ Metal
  5 dependencies successfully precompiled in 7 seconds. 15 already precompiled.
  1 dependency errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the package

(jl_F2AG4H) pkg> precompile Metal
Precompiling project...
  ✗ Metal
  0 dependencies successfully precompiled in 3 seconds. 15 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

Metal [dde4c033-4e86-420c-a63e-0dd931031962]

Failed to precompile Metal [dde4c033-4e86-420c-a63e-0dd931031962] to /home/nehalp/.julia/compiled/v1.8/Metal/jl_TqQ1Rh.
ERROR: LoadError: UndefVarError: libcmt not defined
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base ./Base.jl:31
 [2] top-level scope
   @ ~/.julia/packages/Metal/pfCxO/lib/core/MTL.jl:7
 [3] include(mod::Module, _path::String)
   @ Base ./Base.jl:419
 [4] include(x::String)
   @ Metal ~/.julia/packages/Metal/pfCxO/src/Metal.jl:1
 [5] top-level scope
   @ ~/.julia/packages/Metal/pfCxO/src/Metal.jl:13
 [6] include
   @ ./Base.jl:419 [inlined]
 [7] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
   @ Base ./loading.jl:1554
 [8] top-level scope
   @ stdin:1
in expression starting at /home/nehalp/.julia/packages/Metal/pfCxO/lib/core/MTL.jl:1
in expression starting at /home/nehalp/.julia/packages/Metal/pfCxO/src/Metal.jl:1
in expression starting at stdin:1

(jl_F2AG4H) pkg> status
Status `/tmp/jl_F2AG4H/Project.toml`
  [dde4c033] Metal v0.1.2
  
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 64 virtual cores  

This makes writing code that supports both CUDA and Metal difficult.
It looks like the fix may not be too hard; I'll try creating a PR.
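
In the meantime, one way to keep a project portable is to load the back-end conditionally, so a Linux machine never tries to load Metal. A minimal sketch (select_array_type is a made-up helper, not part of Metal.jl or CUDA.jl):

# Load only the GPU back-end that matches the host platform, and return its
# array type so the rest of the code can stay back-end agnostic.
function select_array_type()
    if Sys.isapple()
        @eval using Metal
        return @eval MtlArray
    else
        @eval using CUDA
        return @eval CuArray
    end
end

# const ArrayT = select_array_type()
# a = ArrayT(rand(Float32, 4))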

first try at metal

Hi, wonderful Metal people.
I wanted to replace CUDA with Metal in https://github.com/mitmath/18337/blob/master/lecture1/the_dream.ipynb
for my class tomorrow, but here's what I see:

To be honest, I'm on macOS 12.5, but I'm willing to upgrade if anyone tells me this would
solve the problem.
Thanks for any advice.

--
InvalidIRError: compiling kernel #partial_mapreduce_device(typeof(identity), typeof(max), Float64, Val{1024}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, MtlDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{MtlDeviceVector{Float64, 1}}}) resulted in invalid LLVM IR
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/base.jl:40
 [2] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:44
 [3] pointerset
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:44
 [4] unsafe_store!
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:84
 [5] arrayset
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:88
 [6] setindex!
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:105
 [7] reduce_group
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:14
 [8] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:80
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/base.jl:40
 [2] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:9
 [3] pointerref
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:9
 [4] unsafe_load
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:81
 [5] arrayref
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:82
 [6] getindex
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:103
 [7] reduce_group
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:23
 [8] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:80
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:80
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/base.jl:40
 [2] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:9
 [3] pointerref
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:9
 [4] unsafe_load
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:81
 [5] arrayref
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:82
 [6] getindex
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:103
 [7] reduce_group
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:27
 [8] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:80
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] <
   @ ./float.jl:412
 [2] >
   @ ./operators.jl:382
 [3] max
   @ ./math.jl:760
 [4] reduce_group
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:27
 [5] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:80
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] signbit
   @ ./floatfuncs.jl:15
 [2] max
   @ ./math.jl:760
 [3] reduce_group
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:27
 [4] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:80
[... dozens of further near-identical "Reason: unsupported use of double floating-point value" entries omitted; they trace the same isnan/ifelse/signbit/max frames, plus arrayref/arrayset and _map_getindex accesses, inside reduce_group and partial_mapreduce_device (mapreduce.jl:27-80) ...]
Reason: unsupported use of double floating-point value
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/base.jl:40
 [2] macro expansion
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:44
 [3] pointerset
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:44
 [4] unsafe_store!
   @ ~/.julia/packages/LLVM/qc3sa/src/interop/pointer.jl:84
 [5] arrayset
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:88
 [6] setindex!
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:105
 [7] setindex!
   @ ~/.julia/packages/Metal/pfCxO/src/device/array.jl:118
 [8] partial_mapreduce_device
   @ ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:84
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl

Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams, GPUCompiler.FunctionSpec{typeof(Metal.partial_mapreduce_device), Tuple{typeof(identity), typeof(max), Float64, Val{1024}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, MtlDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{MtlDeviceVector{Float64, 1}}}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Z5kZC/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/Z5kZC/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/Z5kZC/src/driver.jl:416 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Z5kZC/src/utils.jl:68
  [6] mtlfunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ Metal ~/.julia/packages/Metal/pfCxO/src/compiler/execution.jl:168
  [7] #32
    @ ~/.julia/packages/Metal/pfCxO/src/compiler/execution.jl:161 [inlined]
  [8] JuliaContext(f::Metal.var"#32#33"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams, GPUCompiler.FunctionSpec{typeof(Metal.partial_mapreduce_device), Tuple{typeof(identity), typeof(max), Float64, Val{1024}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, MtlDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{MtlDeviceVector{Float64, 1}}}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Z5kZC/src/driver.jl:76
  [9] mtlfunction_compile(job::GPUCompiler.CompilerJob)
    @ Metal ~/.julia/packages/Metal/pfCxO/src/compiler/execution.jl:160
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(Metal.mtlfunction_compile), linker::typeof(Metal.mtlfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Z5kZC/src/cache.jl:90
 [11] mtlfunction(f::typeof(Metal.partial_mapreduce_device), tt::Type{Tuple{typeof(identity), typeof(max), Float64, Val{1024}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, MtlDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{MtlDeviceVector{Float64, 1}}}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metal ~/.julia/packages/Metal/pfCxO/src/compiler/execution.jl:148
 [12] mtlfunction
    @ ~/.julia/packages/Metal/pfCxO/src/compiler/execution.jl:141 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/Metal/pfCxO/src/compiler/execution.jl:64 [inlined]
 [14] mapreducedim!(f::typeof(identity), op::typeof(max), R::MtlVector{Float64}, A::Base.Broadcast.Broadcasted{Metal.MtlArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{MtlVector{Float64}}}; init::Float64)
    @ Metal ~/.julia/packages/Metal/pfCxO/src/mapreduce.jl:138
 [15] #_mapreduce#33
    @ ~/.julia/packages/GPUArrays/g2pOV/src/host/mapreduce.jl:69 [inlined]
 [16] #mapreduce#31
    @ ~/.julia/packages/GPUArrays/g2pOV/src/host/mapreduce.jl:31 [inlined]
 [17] mapreduce
    @ ~/.julia/packages/GPUArrays/g2pOV/src/host/mapreduce.jl:31 [inlined]
 [18] #_maximum#783
    @ ./reducedim.jl:999 [inlined]
 [19] _maximum
    @ ./reducedim.jl:999 [inlined]
 [20] #_maximum#782
    @ ./reducedim.jl:998 [inlined]
 [21] _maximum
    @ ./reducedim.jl:998 [inlined]
 [22] #maximum#780
    @ ./reducedim.jl:994 [inlined]
 [23] maximum
    @ ./reducedim.jl:994 [inlined]
 [24] eigmax(A::MyMatrix{Float64, MtlVector{Float64}}; tol::Float64, debug::Bool)
    @ Main ./In[9]:5
 [25] eigmax(A::MyMatrix{Float64, MtlVector{Float64}})
    @ Main ./In[9]:4
 [26] top-level scope
    @ In[20]:1

@time eigmax(gpuA)
  0.004720 seconds (4.14 k CPU allocations: 199.016 KiB) (92 GPU allocations: 57.859 KiB, 11.30% memmgmt time)
39991.4350068237
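
For anyone hitting the same errors: they all boil down to Metal GPUs not supporting Float64, which this mapreduce is being asked to compute with, and upgrading macOS won't change that. The usual workaround is to drop to Float32 before uploading; a minimal sketch, assuming the reduced precision is acceptable for the eigmax iteration:

using Metal

A = rand(1000, 1000)          # Float64 on the CPU
gpuA = MtlArray(Float32.(A))  # convert to Float32 before moving to the GPU
maximum(gpuA)                 # the mapreduce kernel now compiles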

@metal docstring out-of-date

It says only launch is a valid kwarg, yet the example in the README uses threads and groups.

help?> @metal
  @metal [kwargs...] func(args...)

  High-level interface for executing code on a GPU. The @metal macro should prefix a call,
  with func a callable function or object that should return nothing. It will be compiled to a
  Metal function upon first use, and to a certain extent arguments will be converted and
  managed automatically using mtlconvert. Finally, a call to mtlcall is performed, creating a
  command buffer in the current global command queue then committing it.

  There is one supported keyword argument that influences the behavior of @metal.

    โ€ข  launch: whether to launch this kernel, defaults to true. If false the returned
       kernel object should be launched by calling it and passing arguments again.
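
Until the docstring is updated, here's a sketch combining the documented launch keyword with the threads/groups keywords from the README (assuming the returned kernel object accepts the same keywords as @metal):

using Metal

function fill1(a)
    i = thread_position_in_grid_1d()
    a[i] = 1f0
    return
end

a = MtlArray(zeros(Float32, 4))
k = @metal launch=false fill1(a)  # compile only; no command buffer committed
k(a; threads=4, groups=1)         # launch explicitly later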

Multiplication with SubArrays

I have an application where I need to compute pairwise distances between vectors that are part of a larger MtlArray (Gaussian processes; kernel methods). The implementation that ultimately gets called uses SubArrays (it is designed for CPU work). Since there is no * method for an MtlArray and a view of an MtlArray, Julia dispatches to the CPU implementation of multiplication, which breaks.

Where should this kind of support be added? If this question has a complicated answer, how can I go about working around this? Thanks a bunch in advance!

using Metal

x = MtlArray(rand(Float32, 1, 1))
# 1×1 MtlArray{Float32, 2}:
#  0.7656045

subarray = view(x, :, :)
# 1×1 view(::MtlArray{Float32, 2}, :, :) with eltype Float32:
#  0.7656045

x * x
# 1×1 MtlArray{Float32, 2}:
#  0.5861502

x * subarray
# ERROR: MethodError: unsafe_convert(::Type{Ptr{Float32}}, ::MtlArray{Float32, 2}) is ambiguous. Candidates:
#   unsafe_convert(::Type{Ptr{T}}, a::AbstractArray{T}) where T in Base at pointer.jl:67
#   unsafe_convert(::Type{Ptr{S}}, a::AbstractArray{T}) where {S, T} in Base at pointer.jl:66
#   unsafe_convert(::Type{<:Ptr}, x::MtlArray) in Metal at /Users/me/.julia/packages/Metal/OGOSN/src/array.jl:128
# Possible fix, define
#   unsafe_convert(::Type{Ptr{T}}, ::MtlArray{T}) where T
# Stacktrace:
#  [1] gemm!(transA::Char, transB::Char, alpha::Float32, A::MtlArray{Float32, 2}, B::SubArray{Float32, 2, MtlArray{Float32, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true}, beta::Float32, C::MtlArray{Float32, 2})
#    @ LinearAlgebra.BLAS ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/blas.jl:1514
#  [2] gemm_wrapper!(C::MtlArray{Float32, 2}, tA::Char, tB::Char, A::MtlArray{Float32, 2}, B::SubArray{Float32, 2, MtlArray{Float32, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
#    @ LinearAlgebra ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:674
#  [3] mul!
#    @ ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:161 [inlined]
#  [4] mul!
#    @ ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:276 [inlined]
#  [5] *(A::MtlArray{Float32, 2}, B::SubArray{Float32, 2, MtlArray{Float32, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true})
#    @ LinearAlgebra ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:148
#  [6] top-level scope
#    @ REPL[30]:1
#  [7] macro expansion
#    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52 [inlined]
#  [8] top-level scope
#    @ ~/.julia/packages/Metal/OGOSN/src/initialization.jl:25
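
As a stopgap, materializing the view before multiplying works around the dispatch problem, assuming contiguous views of MtlArrays broadcast on the device in this version (they should; broadcast handles wrapped GPU arrays). A sketch:

using Metal

x = MtlArray(rand(Float32, 1, 1))
subarray = view(x, :, :)

# Materialize the view into a plain MtlArray via broadcast, so that both
# operands of * are MtlArrays again (at the cost of a device-side copy).
x * identity.(subarray)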

LLVMType of MtlDeviceArray needs changing/manipulation

Core.LLVMPtr's LLVM type is an opaque pointer, which causes improper bitcasts when used in the context of MtlDeviceArrays. To avoid this, some manipulation of the pointer type is done in GPUCompiler.jl/src/metal.jl. I'm opening this issue so there can be some shared discussion of a more systemic resolution of this problem.

Example code:

using Metal
using LLVM

# Inspect the LLVM types these Julia types lower to:
convert(LLVM.LLVMType, MtlDeviceArray{Float32,1,1})
convert(LLVM.LLVMType, Core.LLVMPtr{Float32,1})
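
For comparison, here is what a typed pointer in the device address space looks like when built with LLVM.jl directly (purely illustrative; address space 1 mirrors the Core.LLVMPtr example above, and the context-taking constructors are the LLVM.jl 4.x API):

using LLVM

ctx = LLVM.Context()
# float addrspace(1)* -- a typed device-space pointer, rather than the opaque
# pointer type that Core.LLVMPtr currently lowers to.
typed_ptr = LLVM.PointerType(LLVM.FloatType(ctx), 1)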

Switch to Xcode's Metal compiler?

We're currently using a custom metallib-as to downgrade LLVM IR and generate a metallib containing LLVM 5 bitcode. However, it's possible to just re-use Xcode's Clang-based metal compiler for this:

$ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/metal/macos/bin/metal kernel.(ll|bc)
warning: overriding the module target triple with air64-apple-macosx12.0.0 [-Woverride-module]
1 warning generated

The problem is that we still need to downgrade the LLVM IR (this is more obvious when using textual bitcode inputs, in which case metal will complain about unsupported dereferenceable arguments, spFlags MDNode contents, etc.), but a tool for that could be more broadly valuable, e.g., to also support NVIDIA's NVVM compiler.

Another disadvantage is that we'd then require users to have Xcode installed, and the full installation at that (the metal compiler isn't part of the command-line Xcode tools). That's a multi-gigabyte download and installation, since presumably Apple doesn't allow us to redistribute those parts of the SDK.
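
Shelling out itself would be trivial; a purely illustrative sketch (the path comes from the command above, while the -c/-o flags and the wrapper name are assumptions, not an existing Metal.jl API):

# Hypothetical wrapper around Xcode's metal compiler; not part of Metal.jl.
const XCODE_METAL = "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/metal/macos/bin/metal"

compile_air(input, output) = run(`$XCODE_METAL -c $input -o $output`)

# compile_air("kernel.ll", "kernel.air")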

inputting non-isbits types

A simple modification to the example in the README does not work:

julia> using Metal

julia> function vaddT(T, a, b, c)            ### T is input here; nominally it is a Type
           i = thread_position_in_grid_1d()
           c[i] = a[i] + T(b[i])             ### it is used here
           return
       end
vaddT (generic function with 1 method)

julia> a = MtlArray([1,1,1,1]); b = MtlArray([2,2,2,2]); c = similar(a);

julia> @metal threads=2 groups=2 vaddT(Float32, a, b, c)
ERROR: InvalidIRError: compiling kernel #vaddT(Type{Float32}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_malloc)
Stacktrace:
 [1] malloc
   @ ~/.julia/packages/GPUCompiler/S3TWf/src/runtime.jl:89
 [2] macro expansion
   @ ~/.julia/packages/GPUCompiler/S3TWf/src/runtime.jl:184
 [3] macro expansion
   @ ./none:0
 [4] box
   @ ./none:0
 [5] box_float32
   @ ~/.julia/packages/GPUCompiler/S3TWf/src/runtime.jl:213
 [6] Int64
   @ ./float.jl:788
 [7] convert
   @ ./number.jl:7
 [8] setindex!
   @ ~/.julia/dev/Metal/src/device/array.jl:105
 [9] vaddT
   @ ./REPL[2]:3
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams, GPUCompiler.FunctionSpec{typeof(vaddT), Tuple{Type{Float32}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:417 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool, ctx::LLVM.Context)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:83
  [6] mtlfunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ Metal ~/.julia/dev/Metal/src/compiler/execution.jl:166
  [7] #40
    @ ~/.julia/dev/Metal/src/compiler/execution.jl:161 [inlined]
  [8] JuliaContext(f::Metal.var"#40#41"{GPUCompiler.CompilerJob{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams, GPUCompiler.FunctionSpec{typeof(vaddT), Tuple{Type{Float32}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:76
  [9] mtlfunction_compile(job::GPUCompiler.CompilerJob)
    @ Metal ~/.julia/dev/Metal/src/compiler/execution.jl:160
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(Metal.mtlfunction_compile), linker::typeof(Metal.mtlfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/S3TWf/src/cache.jl:90
 [11] mtlfunction(f::typeof(vaddT), tt::Type{Tuple{Type{Float32}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metal ~/.julia/dev/Metal/src/compiler/execution.jl:148
 [12] mtlfunction(f::typeof(vaddT), tt::Type{Tuple{Type{Float32}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}, MtlDeviceVector{Int64, 1}}})
    @ Metal ~/.julia/dev/Metal/src/compiler/execution.jl:141
 [13] top-level scope
    @ ~/.julia/dev/Metal/src/compiler/execution.jl:64
 [14] top-level scope
    @ ~/.julia/dev/Metal/src/initialization.jl:33

I'm hoping we can get this to work, as it works fine with CUDA.jl:

julia> using CUDA

julia> function vaddT(T, a, b, c)
                  i = threadIdx().x
                  c[i] = a[i] + T(b[i])
                  return
              end
vaddT (generic function with 1 method)

julia> a = CuArray([1,1,1,1]); b = CuArray([2,2,2,2]); c = similar(a);

julia> @cuda threads=4 vaddT(Float32, a,b,c)
CUDA.HostKernel{typeof(vaddT), Tuple{Type{Float32}, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}, CuDeviceVector{Int64, 1}}}(vaddT, CuFunction(Ptr{CUDA.CUfunc_st} @0x000000000578ad50, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005494bd0, CuContext(0x0000000001ca6140, instance 510ba3156c98e3a9))), CUDA.KernelState(Ptr{Nothing} @0x00007fb21ba00000))

julia> c
4-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 3
 3
 3
 3

This is with Julia 1.8.5, a 0-day master of Metal.jl, and an M2 Max.
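
A workaround that sidesteps the failure, based on the trace above: the gpu_malloc comes from boxing a Float32 in the error path of the Int64 conversion inside setindex!, so storing into a Float32 array avoids the fallible conversion entirely. A sketch, untested on the exact setup above:

using Metal

function vaddT(T, a, b, c)
    i = thread_position_in_grid_1d()
    c[i] = a[i] + T(b[i])
    return
end

# With Float32 element types, no Int64 conversion (and thus no InexactError
# box) is compiled into the kernel.
a = MtlArray(Float32[1, 1, 1, 1]); b = MtlArray(Float32[2, 2, 2, 2]); c = similar(a)
@metal threads=2 groups=2 vaddT(Float32, a, b, c)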
