Comments (14)
@BrunoQuaresma We may need to write a mini-RFC describing the status quo.
from envbuilder.
@bpmct I tried to use envbuilder in a GPU environment and it worked as expected. Here is how I made it:
- Spin up a k8s cluster with GPU support on GKE
  - GKE version: 1.27.13-gke.1000000
  - Machine type: n1-standard-4
  - GPU accelerators (per node): 2 x NVIDIA T4
- Set up a test repo with a devcontainer using an NVIDIA test image
  - Example: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample
  - NVIDIA example image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
- Set the following envbuilder config:
  - GIT_URL as https://github.com/BrunoQuaresma/envbuilder-gpu-test
  - INIT_SCRIPT as /tmp/vectorAdd
This is the output:
envbuilder - Build development environments from repositories in a container
#1: 📦 Cloning https://github.com/BrunoQuaresma/envbuilder-gpu-test to /workspaces/envbuilder-gpu-test...
#1: Enumerating objects: 4, done.
#1: Counting objects: 25% (1/4)
#1: Counting objects: 50% (2/4)
#1: Counting objects: 75% (3/4)
#1: Counting objects: 100% (4/4)
#1: Counting objects: 100% (4/4), done.
#1: Compressing objects: 50% (1/2)
#1: Compressing objects: 100% (2/2)
#1: Compressing objects: 100% (2/2), done.
#1: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
#1: 📦 Cloned repository! [193.807769ms]
#2: Deleting filesystem...
#2: 🏗️ Building image...
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Retrieving image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 from registry nvcr.io
#2: Built cross stage deps: map[]
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Returning cached image manifest
#2: Executing 0 build triggers
#2: Building stage 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2' [idx: '0', base-idx: '-1']
#2: 🏗️ Built image! [3.019338331s]
#3: no user specified, using root
#3: 🔄 Updating the ownership of the workspace...
#3: 🔄 Updated the ownership of the workspace! [449.651µs]
=== Running the init command /bin/sh [-c /tmp/vectorAdd] as the "root" user...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
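For anyone wanting to reproduce this outside Kubernetes, roughly the same configuration can be approximated with a plain `docker run`. This is only a sketch: the `--gpus all` flag assumes a Docker host with the NVIDIA Container Toolkit installed, and the release tag is an assumption.

```shell
# Sketch: approximate the envbuilder config above on a plain Docker host.
# Assumes the NVIDIA Container Toolkit is installed; tag is an assumption.
docker run --rm --gpus all \
  -e GIT_URL=https://github.com/BrunoQuaresma/envbuilder-gpu-test \
  -e INIT_SCRIPT=/tmp/vectorAdd \
  ghcr.io/coder/envbuilder:0.2.9
```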
@bpmct do you think we can get more details from the user?
@johnstcn I'm seeing the issue on:
- K8s Rev: v1.27.7
- Node image: AKSUbuntu-2204gen2containerd-202401.09.0
- Plugin image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:1.11
- Pod image: ghcr.io/coder/envbuilder:0.2.9
Oh, I fixed it by adding /usr/lib64, where the driver libraries live, to the dynamic linker's search path and rebuilding its cache:
echo "/usr/lib64" > /etc/ld.so.conf.d/customized.conf
ldconfig
@marrotte does the @nikawang fix work for you?
@BrunoQuaresma I don't think I can test that, as my envbuilder fails to build. I believe @nikawang applied that fix either to a running envbuilder container or to the running container built by envbuilder. I did try applying @nikawang's fix to the AKS/K8s GPU node, in case that is where it was applied, but it had no effect.
After talking to @mtojek, I think I have a good plan:
- Try to run envbuilder in a regular environment
  - Spin up a regular k8s cluster on Google Cloud
  - Try to run envbuilder with a hello-world image and see if it works
- Try to reproduce the user's error by running envbuilder with a GPU
  - Spin up a k8s cluster with an NVIDIA GPU on Google Cloud
  - Try to run envbuilder with a hello-world image and see if it works
- Try to find a workaround
  - Investigate possible solutions using different builders besides Kaniko
I am closing this for now until we have more context from the user.
Try using the NVIDIA k8s device plugin (DaemonSet) and not a NVIDIA container image, e.g.:
https://github.com/NVIDIA/k8s-device-plugin
This is the recommended approach when using GPU-enabled node pools for Azure Linux.
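For reference, installing that device plugin is typically a single apply of the project's static manifest; the version pinned below is an assumption, so check the project's releases for the current one.

```shell
# Sketch: deploy the NVIDIA k8s device plugin DaemonSet from its static
# manifest (the v0.15.0 tag is an assumption; pick a current release).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
```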
> Try using the NVIDIA k8s device plugin (DaemonSet)
@marrotte FYI while we tested using GKE, the cluster we tested on does use the device plugin. However, it appears to be a customized version for GKE COS, and I will freely admit that cluster is a bit old.
What Kubernetes version are you seeing issues with on AKS?
> not a NVIDIA container image, e.g.:
This container image is the one NVIDIA recommends for testing GPU support (cf. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs).
What image would you recommend instead as a test? I note that the Azure docs you linked reference a separate MNIST test image.
@BrunoQuaresma
Still having issues using your test repo on AKS against ghcr.io/coder/envbuilder:0.2.9:
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# cd /tmp
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ls
coder.wTqTN7 vectorAdd
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
/usr/lib64# ll
total 176176
drwxr-xr-x 2 root root 4096 Jun 13 14:00 ./
drwxr-xr-x 14 root root 4096 Jun 13 13:49 ../
lrwxrwxrwx 1 root root 42 Jun 13 13:49 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2*
-rwxr-xr-x 1 root root 28392536 Jun 13 12:01 libcuda.so.550.54.15*
-rwxr-xr-x 1 root root 10524136 Jun 13 12:01 libcudadebugger.so.550.54.15*
-rwxr-xr-x 1 root root 168744 Jun 13 12:01 libnvidia-allocator.so.550.54.15*
-rwxr-xr-x 1 root root 398968 Jun 13 12:01 libnvidia-cfg.so.550.54.15*
lrwxrwxrwx 1 root root 36 Jun 13 14:00 libnvidia-ml.so -> /usr/lib64/libnvidia-ml.so.550.54.15*
-rwxr-xr-x 1 root root 2078360 Jun 13 12:01 libnvidia-ml.so.550.54.15*
-rwxr-xr-x 1 root root 86842616 Jun 13 12:01 libnvidia-nvvm.so.550.54.15*
-rwxr-xr-x 1 root root 23293568 Jun 13 12:01 libnvidia-opencl.so.550.54.15*
-rwxr-xr-x 1 root root 10168 Jun 13 12:01 libnvidia-pkcs11.so.550.54.15*
-rwxr-xr-x 1 root root 28670368 Jun 13 12:01 libnvidia-ptxjitcompiler.so.550.54.15*
@marrotte could you please share step by step how I can set up a similar k8s cluster, or a Terraform file I can just run?
I tried to create a Kubernetes GPU cluster on Azure following this tutorial, but without success. During the process, I managed to get the cluster up and running and register the required features and services through step five of the tutorial.
bruno [ ~ ]$ az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
{
"id": "/subscriptions/05e8b285-4ce1-46a3-b4c9-f51ba67d6acc/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/GPUDedicatedVHDPreview",
"name": "Microsoft.ContainerService/GPUDedicatedVHDPreview",
"properties": {
"state": "Registered"
},
"type": "Microsoft.Features/providers/features"
}
However, when I began adding the node pool, I started encountering errors.
az aks nodepool add \
--resource-group bruno \
--cluster-name bruno-gpu \
--name gpunp \
--node-count 1 \
--node-vm-size Standard_NC6s_v3 \
--node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3
(OperationNotAllowed) .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
Code: OperationNotAllowed
Message: .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
I tried searching for the error on Google to find a solution or any information related to properties.nodeProvisioningProfile.mode, but I didn't find anything helpful. I realized that it might be better to ask if you could share a Terraform file or a more straightforward tutorial so we can reproduce your environment.
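The error message itself suggests one way forward: the cluster's node-provisioning mode is Auto, which conflicts with per-pool autoscaling. A sketch of the same command without the autoscaler flags, leaving scaling to node auto-provisioning (untested assumption on my part):

```shell
# Sketch: same node pool add, minus the cluster-autoscaler flags that
# conflict with nodeProvisioningProfile.mode=Auto per the error above.
az aks nodepool add \
  --resource-group bruno \
  --cluster-name bruno-gpu \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints sku=gpu:NoSchedule \
  --aks-custom-headers UseGPUDedicatedVHD=true
```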
So ENVBUILDER_IGNORE_PATHS can be set to /dev,/lib/firmware/nvidia,/usr/bin/nv-,/usr/bin/nvidia-,/usr/lib64/libcuda,/usr/lib64/libnvidia-,/var/run, but we hit the known unlinkat: device or resource busy error.
The easiest way to get the right environment to reproduce this is likely the gpu-operator for Kubernetes or the NVIDIA Container Toolkit for Docker.
I believe #183 (and #249) can provide a workaround here by temporarily remounting the paths out of the way instead of trying to ignore them in Kaniko, although note that mount/umount require privileges.
Currently only read-only mounts are taken care of, but the NVIDIA container runtime mounts devtmpfs filesystems at /var/run/nvidia-container-devices/GPU-<uuid> (the actual mountpoint can be /run, since often /var/run is a symlink to it), so the logic would need to be extended to cover those (I have successfully done that). Edit: no special handling needed, I had probably forgotten to add /var/run back to the ignored paths.
The runtime mounts libraries with symlinks:
libcuda.so -> libcuda.so.1
libcuda.so.1 -> libcuda.so.<driver-version>
libcuda.so.<driver-version>
libcudadebugger.so.1 -> libcudadebugger.so.<driver-version>
libcudadebugger.so.<driver-version>
libnvidia-allocator.so.1 -> libnvidia-allocator.so.<driver-version>
libnvidia-allocator.so.<driver-version>
libnvidia-cfg.so.1 -> libnvidia-cfg.so.<driver-version>
libnvidia-cfg.so.<driver-version>
libnvidia-ml.so.1 -> libnvidia-ml.so.<driver-version>
libnvidia-ml.so.<driver-version>
libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.<driver-version>
libnvidia-nvvm.so.<driver-version>
libnvidia-opencl.so.1 -> libnvidia-opencl.so.<driver-version>
libnvidia-opencl.so.<driver-version>
libnvidia-pkcs11-openssl3.so.<driver-version>
libnvidia-pkcs11.so.<driver-version>
libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.<driver-version>
libnvidia-ptxjitcompiler.so.<driver-version>
The symlinks must also be preserved. The location in the envbuilder image is /usr/lib64, but it differs between distros (for example, on Debian it should be /usr/lib/x86_64-linux-gnu), so the remount process must discover the appropriate location in the new filesystem hierarchy.
I am using this quick-and-dirty script afterward to get things working:
remount_and_resymlink.sh
#!/usr/bin/env bash
set -euo pipefail

# Destination library directory in the new rootfs (Debian multiarch layout).
TARGET=/usr/lib/x86_64-linux-gnu

# Derive the driver version from the firmware directory name,
# e.g. /lib/firmware/nvidia/550.54.15 -> 550.54.15
FIRMWARES=(/lib/firmware/nvidia/*)
VERSION="${FIRMWARES[0]}"
VERSION="${VERSION##*/}"

mount | awk '/\/usr\/lib64/{print $3}' | while read -r path; do
  lib="${path##*/}"
  mkdir -p "${TARGET}"
  touch "${TARGET}/${lib}"
  mount --bind "${path}" "${TARGET}/${lib}"
  umount "${path}"
  # Pick the SONAME suffix for the symlink; the pkcs11 libraries have none.
  n=""
  case "${lib}" in
    libnvidia-pkcs11.so.*) ;;
    libnvidia-pkcs11-openssl3.so.*) ;;
    libnvidia-nvvm.so.*)
      n=4
      ;;
    *)
      n=1
      ;;
  esac
  if [[ -n "${n}" ]]; then
    ln -sf "${lib}" "${TARGET}/${lib%"${VERSION}"}${n}"
  fi
  if [[ "${lib}" == "libcuda.so."* ]]; then
    ln -sf "${lib%"${VERSION}"}${n}" "${TARGET}/${lib%".${VERSION}"}"
  fi
done
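The suffix handling in the script relies on bash suffix-stripping parameter expansions; a standalone illustration with made-up values:

```shell
# Illustration of the suffix-stripping used in the script (example values).
VERSION=550.54.15
lib=libcuda.so.550.54.15
n=1
echo "${lib%"${VERSION}"}${n}"   # strips the version suffix -> libcuda.so.1
echo "${lib%".${VERSION}"}"      # strips ".version"         -> libcuda.so
```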
This is the logic the runtime uses to pick the library directory: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_container.c#L151-L188
And this looks like the libraries it can potentially mount: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_info.c#L75-L139
Once that is all in place, the nvidia-smi command should work and the GPU(s) should be visible, as well as the CUDA version. In an image with pytorch (e.g. nvcr.io/nvidia/pytorch:24.05-py3), python -c 'import torch; print(torch.cuda.is_available())' should return True.
One thing I have not figured out yet is why the container gets all GPUs when only 1 is requested (this works properly for a regular container). Edit: that's because the pod is running with privileges.
The manifest I am using at the moment:
pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder
spec:
  containers:
    - name: envbuilder
      image: ghcr.io/coder/envbuilder-preview
      env:
        - name: FALLBACK_IMAGE
          value: debian
        - name: INIT_SCRIPT
          value: sh -c 'while :; do sleep 86400; done'
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
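A usage sketch for the manifest above (the pod name comes from its metadata; the final exec assumes the remount workaround has already made nvidia-smi resolvable):

```shell
# Apply the pod manifest, wait for it to come up, then check GPU visibility.
kubectl apply -f pod.yaml
kubectl wait --for=condition=Ready pod/envbuilder --timeout=10m
kubectl exec -it envbuilder -- nvidia-smi
```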
If you are on GCP/GKE, the above should be valid for Ubuntu nodes (I think; I am not testing there). I need to investigate GCP's Container-Optimized OS too, since things are wired a little differently there. On Container-Optimized OS the only mount is /usr/local/nvidia, so this path can either be ignored or remounted; no care is given to the PATH or ldconfig search path by default, so it has to be handled by the user's image (e.g. LD_LIBRARY_PATH=/usr/local/nvidia/lib64 /usr/local/nvidia/bin/nvidia-smi should work).