This issue serves as a design document and a discussion of how atomics will/should be implemented.
## CUDA Background
CUDA has had atomics basically forever in the form of a few functions like `atomicAdd`, `atomicCAS`, etc. See the docs on them here. It also has `_system` and `_block` variants of them.
This has always been the overwhelmingly popular way of doing atomic things in CUDA, and for a while it was the only way, until compute 7.x: sm_70 introduced the `.sem` qualifier on the `atom` PTX instruction, which allows users to specify a particular memory ordering for atomic operations.
CUDA decided to implement this by replicating `std::atomic` as its own thing called `cuda::std::atomic`. `atomic` provides a generic container for atomic operations on types such as `int`, offering atomic operations with user-specified orderings.
## Usage of `cuda::std::atomic`
Despite NVIDIA pushing for users to use `atomic`, it has not seen wide adoption, presumably for the following reasons:
- `cuda::std::atomic` is a mess of templates and inheritance, because CUDA wanted to make it compatible with the GPU, the CPU (with every compiler's weird atomic semantics), and user-defined types. This yields weird errors and confusing dependency graphs.
- Every CUDA example, sample, doc, tutorial, course, etc. uses `atomicAdd` and similar. Unless you are deeply knowledgeable about CUDA, you would not switch to `atomic`, if you even knew it existed.
- `atomic` has had a rocky past of sometimes working and sometimes not; for example, CUDA 10.2 had many issues with `std::atomic`.
- `atomic` for some reason does not support float add. I am totally unsure why, since the PTX ISA has instructions for it.
## Importance of great atomics
Atomics are at the core of many algorithms, so it is imperative for a project of this scale to implement them once and implement them well. Otherwise, users might be stuck with a poor implementation forever, as in CUDA's case. Therefore, I believe we should take our time with atomics and get them right the first time.
## Low level implementation
The low-level implementation of such atomics is not very difficult; it can mostly be taken from how `cuda::std::atomic` does it at the low level. It implements them in the following way: if the CUDA arch is >= 7.0, it uses specialized PTX instructions via inline asm:
```cpp
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acq_rel_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acq_rel.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acquire_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acquire.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_relaxed_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.relaxed.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_release_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.release.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_volatile_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
```
With `__ATOMIC_SEQ_CST` additionally containing a fence before it:
```cpp
switch (__memorder) {
  case __ATOMIC_SEQ_CST: __cuda_fence_sc_device(); // intentional fallthrough: fence, then acquire RMW
  case __ATOMIC_CONSUME:
  case __ATOMIC_ACQUIRE: __cuda_fetch_add_acquire_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_ACQ_REL: __cuda_fetch_add_acq_rel_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_RELEASE: __cuda_fetch_add_release_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_RELAXED: __cuda_fetch_add_relaxed_32_device(__ptr, __tmp, __tmp); break;
  default: assert(0);
}
```
This can very easily be replicated by us since we have full support for inline asm.
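For illustration, here is a minimal sketch of what one of these could look like on our side (the function name is hypothetical, and this assumes sm_70+ and nightly's NVPTX inline asm support):

```rust
use core::arch::asm;

/// Sketch of a relaxed, device-scoped 32-bit fetch-add, mirroring
/// `__cuda_fetch_add_relaxed_32_device` above.
///
/// # Safety
/// `ptr` must be valid, properly aligned, and not subject to
/// non-atomic concurrent writes.
#[inline(always)]
pub unsafe fn fetch_add_relaxed_u32_device(ptr: *mut u32, val: u32) -> u32 {
    let old: u32;
    asm!(
        "atom.add.relaxed.gpu.u32 {old}, [{ptr}], {val};",
        old = out(reg32) old,
        ptr = in(reg64) ptr,
        val = in(reg32) val,
    );
    old
}
```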
Otherwise, if the arch is less than 7.0, it "emulates" the orderings with barriers:
```cpp
switch (__memorder) {
  case __ATOMIC_SEQ_CST:
  case __ATOMIC_ACQ_REL: __cuda_membar_device(); // intentional fallthrough: barrier, then op + barrier
  case __ATOMIC_CONSUME:
  case __ATOMIC_ACQUIRE: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); __cuda_membar_device(); break;
  case __ATOMIC_RELEASE: __cuda_membar_device(); __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
  case __ATOMIC_RELAXED: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
  default: assert(0);
}
```
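The emulated path could be sketched similarly in Rust; here `membar_device` wraps `membar.gl`, and `fetch_add_volatile_u32_device` (a hypothetical wrapper over the unordered `atom.add.gpu.u32` above) is assumed to exist:

```rust
use core::arch::asm;

/// Device-wide memory barrier (`membar.gl`), the pre-sm_70 building block.
#[inline(always)]
pub unsafe fn membar_device() {
    asm!("membar.gl;");
}

/// Release fetch-add emulated on pre-7.0 hardware: barrier, then the
/// unordered RMW, mirroring the `__ATOMIC_RELEASE` case above.
/// `fetch_add_volatile_u32_device` is assumed to wrap `atom.add.gpu.u32`.
#[inline(always)]
pub unsafe fn emulated_fetch_add_release_u32_device(ptr: *mut u32, val: u32) -> u32 {
    membar_device();
    fetch_add_volatile_u32_device(ptr, val)
}
```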
The original code for all of this can be found in `CUDA_ROOT\include\cuda\std\detail\libcxx\include\support\atomic\atomic_cuda_generated.h` for CUDA 11.5, and in `CUDA_ROOT\include\cuda\std\detail\__atomic_generated` for older versions.
That file provides functions as intrinsics that the rest of libcu++ builds off of:
```cpp
template<class _Type, typename cuda::std::enable_if<sizeof(_Type)==4, int>::type = 0>
__device__ _Type __atomic_fetch_add_cuda(volatile _Type *__ptr, _Type __val, int __memorder, __thread_scope_device_tag) {
  /* ... */
}
```
## Rust Intrinsic implementation
I propose we follow a similar approach of raw unsafe intrinsics for:
- 32-bit and 64-bit operations: loads, stores, `compare_exchange`, `exchange`, and `fetch_{add, and, max, min, or, sub, xor}`
- block, device, and system operations
sm_70+ intrinsics are implemented in `cuda_std::atomic::intrinsics::sm_70`, and emulated intrinsics in `cuda_std::atomic::intrinsics::emulated`.
Wrappers over the sm-specific intrinsics are in `cuda_std::atomic::intrinsics`. For example:
```rust
pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32;
```
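As a sketch (assuming the hypothetical `sm_70` intrinsics module from above), such a wrapper would essentially be a Rust version of libcu++'s switch, including the SC fence for `SeqCst`:

```rust
use core::sync::atomic::Ordering;

/// Hypothetical wrapper dispatching a device-scoped f32 fetch-add to the
/// correctly-ordered sm_70+ intrinsic. `SeqCst` lowers to an SC fence
/// followed by an acquire RMW, exactly as libcu++ does it.
pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32 {
    match ordering {
        Ordering::SeqCst => {
            sm_70::fence_sc_device();
            sm_70::fetch_add_acquire_f32_device(ptr, val)
        }
        Ordering::Acquire => sm_70::fetch_add_acquire_f32_device(ptr, val),
        Ordering::AcqRel => sm_70::fetch_add_acq_rel_f32_device(ptr, val),
        Ordering::Release => sm_70::fetch_add_release_f32_device(ptr, val),
        Ordering::Relaxed => sm_70::fetch_add_relaxed_f32_device(ptr, val),
        // `Ordering` is #[non_exhaustive]; treat anything else as a bug.
        _ => unreachable!(),
    }
}
```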
## High level types
And finally, we expose high-level types in `cuda_std::atomic` such as `AtomicF32`, `AtomicF64`, etc.
Block atomics (`BlockAtomicF32`) will need to be unsafe. For device atomics, it is up to the caller of the kernel to ensure buffers and kernels do not contain data races, and the surrounding systems help prevent this; block atomics have no such safeguard, and it would be very easy to accidentally cause data races if the accesses are not intra-threadblock.
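As a rough sketch of this distinction (the intrinsics and everything other than the two type names are hypothetical):

```rust
use core::cell::UnsafeCell;
use core::sync::atomic::Ordering;

/// Device-scoped atomic float: safe methods, since race-freedom of the
/// underlying buffer is already the kernel launcher's obligation.
#[repr(transparent)]
pub struct AtomicF32(UnsafeCell<f32>);

// SAFETY: all shared access goes through atomic instructions.
unsafe impl Sync for AtomicF32 {}

impl AtomicF32 {
    pub fn fetch_add(&self, val: f32, ordering: Ordering) -> f32 {
        // `intrinsics::atomic_fetch_add_f32_device` is the wrapper
        // sketched in the previous section.
        unsafe { intrinsics::atomic_fetch_add_f32_device(self.0.get(), ordering, val) }
    }
}

/// Block-scoped atomic float: methods are unsafe, because nothing stops
/// a caller from touching it from more than one thread block.
#[repr(transparent)]
pub struct BlockAtomicF32(UnsafeCell<f32>);

unsafe impl Sync for BlockAtomicF32 {}

impl BlockAtomicF32 {
    /// # Safety
    /// All accesses to this value must come from a single thread block.
    pub unsafe fn fetch_add(&self, val: f32, ordering: Ordering) -> f32 {
        intrinsics::atomic_fetch_add_f32_block(self.0.get(), ordering, val)
    }
}
```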
Atomic types will expose only the operations they specifically allow. For example, per the ISA spec:

- Every type has `fetch_and`, `fetch_or`, `fetch_xor`, `compare_and_swap`, and `exchange`.
- Signed and unsigned integers have `fetch_add`, `fetch_inc`, `fetch_dec`, `fetch_min`, and `fetch_max`.
- Unsigned integers have `fetch_inc` and `fetch_dec` that wrap within `[0..b]`, where `b` is an operand of the instruction (so the wrap point is user-specified, not the type's `MAX`).
- Floats have `fetch_add`.
## Compatibility with core atomics
Core exposes atomics through a couple of mechanisms:
- Every target has a cfg for each atomic width saying whether it is supported on the target. I have not checked what atomic sizes core thinks nvptx has.
- The codegen backend then generates atomic instructions through hooks like this:
```rust
fn atomic_load(
    &mut self,
    ty: &'ll Type,
    ptr: &'ll Value,
    _order: AtomicOrdering,
    _size: Size,
) -> &'ll Value {
    /* ... */
}
```
That is in addition to `atomic_store`, `atomic_rmw`, `atomic_cmpxchg`, and a couple more. We currently trap in all of these functions, partly because libnvvm doesn't support atomic instructions for many types, and partly because we want to bikeshed how to implement them nicely.
However, as expected, things are not quite the same on the CPU and the GPU; there are some very important differences:
- CUDA has 32-bit and 64-bit atomics (16-bit too, if you count f16), while core also expects 8-bit atomics (unless they are cfg-gated off).
- Core expects some operations that are simply not available, such as `fetch_nand`. We could implement this as a CAS loop, but that is a bit of an opaque behavior, so I'm not too happy to do that (see the sketch at the end of this section).
- CUDA has (limited) float atomics, which are by far the most used kind of atomic, since GPU code often deals with floats. Core does not have atomic floats, so we would need a separate type, causing lots of confusion.
- CUDA, as previously mentioned, has block and system atomics, which are unique to it.
Because of these limitations, we have a few options for implementing atomics:
- Try our best to support core atomics, emulate anything that's missing with CAS loops (sketched below), and make `AtomicF32` and `AtomicF64` their own types in cuda_std. Add block and system atomics as their own types in `cuda_std::atomic`. This maintains compat with core but splits up the atomic types, which is not ideal.
- Don't try to support the core intrinsics at all: add everything in `cuda_std::atomic`, and add only the methods that CUDA natively supports, without CAS loops. Don't try to make the atomics work on the CPU. This is the easiest option and has the nicest API, but it doesn't work on the CPU.
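For reference, here is the kind of CAS loop the first option would need for something like `fetch_nand`. The `atomic_load_u32_device` and `atomic_compare_exchange_u32_device` intrinsics are the hypothetical ones proposed earlier, with the CAS assumed to return the value actually observed in memory:

```rust
use core::sync::atomic::Ordering;

/// `fetch_nand` emulated with a compare-exchange loop, since PTX has no
/// native nand RMW. Returns the previous value, like core's `fetch_nand`.
pub unsafe fn atomic_fetch_nand_u32_device(ptr: *mut u32, ordering: Ordering, val: u32) -> u32 {
    // Seed the loop with a plain atomic load of the current value.
    let mut old = atomic_load_u32_device(ptr, Ordering::Relaxed);
    loop {
        let new = !(old & val);
        // If `prev == old`, nobody raced us and our nand was applied.
        let prev = atomic_compare_exchange_u32_device(ptr, ordering, old, new);
        if prev == old {
            return prev;
        }
        // Another thread changed the value first; retry from what it wrote.
        old = prev;
    }
}
```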
## Implementation Roadmap
Atomics will likely be implemented incrementally. Most of the work is porting over the raw intrinsics; after that, the hard part is done and we can focus on the stable public API.
Device float atomics will come first, since they are by far the most used kind of atomic. After that, the order will probably be:

Integer Device Atomics -> Float System Atomics -> Integer System Atomics -> Float Block Atomics -> Integer Block Atomics -> anything that's missing
## Feedback
I'd love to hear any feedback you have! We must make sure this is implemented once and implemented correctly.