
Comments (14)

mtojek commented on August 11, 2024

@BrunoQuaresma We may need to write a mini-RFC describing the status quo.


BrunoQuaresma commented on August 11, 2024

@bpmct I tried to use envbuilder in a GPU environment and it worked as expected. Here is how I did it:

  • Spin up a k8s cluster with GPU support on GKE
    • GKE version: 1.27.13-gke.1000000
    • Machine type: n1-standard-4
    • GPU accelerators (per node): 2 x NVIDIA T4
  • Set up a test repo with a devcontainer using an NVIDIA test image
  • Set the following envbuilder config (see the pod sketch after this list)
    • GIT_URL as https://github.com/BrunoQuaresma/envbuilder-gpu-test
    • INIT_SCRIPT as /tmp/vectorAdd
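
For reference, a minimal pod along these lines (a sketch: the exact manifest used was not shared, the image tag is an assumption, and your cluster may additionally need GPU tolerations):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder-gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: envbuilder
    image: ghcr.io/coder/envbuilder:0.2.9
    env:
    - name: GIT_URL
      value: https://github.com/BrunoQuaresma/envbuilder-gpu-test
    - name: INIT_SCRIPT
      value: /tmp/vectorAdd
    resources:
      limits:
        nvidia.com/gpu: "1"  # one of the node's two T4s
EOF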

This is the output:

envbuilder - Build development environments from repositories in a container
#1: 📦 Cloning https://github.com/BrunoQuaresma/envbuilder-gpu-test to /workspaces/envbuilder-gpu-test...
#1: Enumerating objects: 4, done.
#1: Counting objects:  25% (1/4)
#1: Counting objects:  50% (2/4)
#1: Counting objects:  75% (3/4)
#1: Counting objects: 100% (4/4)
#1: Counting objects: 100% (4/4), done.
#1: Compressing objects:  50% (1/2)
#1: Compressing objects: 100% (2/2)
#1: Compressing objects: 100% (2/2), done.
#1: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
#1: 📦 Cloned repository! [193.807769ms]
#2: Deleting filesystem...
#2: ๐Ÿ—๏ธ Building image...
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Retrieving image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 from registry nvcr.io
#2: Built cross stage deps: map[]
#2: Retrieving image manifest nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
#2: Returning cached image manifest
#2: Executing 0 build triggers
#2: Building stage 'nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2' [idx: '0', base-idx: '-1']
#2: ๐Ÿ—๏ธ Built image! [3.019338331s]
#3: no user specified, using root
#3: 🔄 Updating the ownership of the workspace...
#3: 👤 Updated the ownership of the workspace! [449.651µs]
=== Running the init command /bin/sh [-c /tmp/vectorAdd] as the "root" user...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

@bpmct do you think we can get more details from the user?


marrotte commented on August 11, 2024

@johnstcn I'm seeing the issue on:

  • K8s Rev: v1.27.7
  • Node image: AKSUbuntu-2204gen2containerd-202401.09.0
  • Plugin image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:1.11
  • Pod image: ghcr.io/coder/envbuilder:0.2.9


nikawang commented on August 11, 2024

Oh, I fixed it by

echo "/usr/lib64" > /etc/ld.so.conf.d/customized.conf 
ldconfig
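
This writes /usr/lib64 into the dynamic linker's search path and rebuilds its cache. To verify it took effect (a sketch, assuming the driver libraries live in /usr/lib64 as in the listing later in this thread):

ldconfig -p | grep libnvidia-ml   # should list libnvidia-ml.so.<driver-version>
nvidia-smi                        # should no longer complain about libnvidia-ml.so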


BrunoQuaresma commented on August 11, 2024

@marrotte does the @nikawang fix work for you?


marrotte commented on August 11, 2024

@BrunoQuaresma I don't think I can test that, as my envbuilder fails to build. I believe @nikawang applied that fix either to a running envbuilder container or to the container envbuilder built. I did try applying @nikawang's fix to the AKS/K8s GPU node, in case that was where it belonged, but it had no effect.


BrunoQuaresma commented on August 11, 2024

After talking to @mtojek, I think I have a good plan:

  • Try to run envbuilder in a regular environment
    • Spin up a regular k8s cluster on Google Cloud
    • Try to run envbuilder with a hello world image and see if it works
  • Try to reproduce the user error by running envbuilder with a GPU
    • Spin up a k8s cluster using an NVIDIA GPU on Google Cloud
    • Try to run envbuilder with a hello world image and see if it works
    • Try to find a workaround
    • Investigate possible solutions using different builders besides kaniko


BrunoQuaresma commented on August 11, 2024

I am closing this for now until we have more context from the user.


marrotte commented on August 11, 2024

Try using the NVIDIA k8s device plugin (DaemonSet) and not an NVIDIA container image, e.g.:

https://github.com/NVIDIA/k8s-device-plugin

https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation

This is the recommended approach when using GPU-enabled node pools for Azure Linux.
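
Per the upstream README, the plugin can be deployed with a single static manifest (the release tag below is an assumption; check the repo for the current one):

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml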


johnstcn commented on August 11, 2024

Try using the NVIDIA k8s device plugin (DaemonSet)

@marrotte FYI while we tested using GKE, the cluster we tested on does use the device plugin. However, it appears to be a customized version for GKE COS, and I will freely admit that cluster is a bit old.

What Kubernetes version are you seeing issues with on AKS?

not a NVIDIA container image, e.g.:

This container image is the one NVIDIA recommends for testing GPU support (cf. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs).

What image would you recommend instead as a test? I note that the Azure docs you linked reference a separate MNIST test image.
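For reference, the GPU test pod from the linked README section is approximately the following (the image tag and tolerations may differ from the current README):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: "1"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF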


nikawang commented on August 11, 2024

@BrunoQuaresma
Still having issues using your test repo on AKS against ghcr.io/coder/envbuilder:0.2.9:

root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/# cd /tmp
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ls
coder.wTqTN7  vectorAdd
root@coder-daniel-kkk-copy-75f96d6c7b-dxxdn:/tmp# ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
/usr/lib64# ll
total 176176
drwxr-xr-x  2 root root     4096 Jun 13 14:00 ./
drwxr-xr-x 14 root root     4096 Jun 13 13:49 ../
lrwxrwxrwx  1 root root       42 Jun 13 13:49 ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2*
-rwxr-xr-x  1 root root 28392536 Jun 13 12:01 libcuda.so.550.54.15*
-rwxr-xr-x  1 root root 10524136 Jun 13 12:01 libcudadebugger.so.550.54.15*
-rwxr-xr-x  1 root root   168744 Jun 13 12:01 libnvidia-allocator.so.550.54.15*
-rwxr-xr-x  1 root root   398968 Jun 13 12:01 libnvidia-cfg.so.550.54.15*
lrwxrwxrwx  1 root root       36 Jun 13 14:00 libnvidia-ml.so -> /usr/lib64/libnvidia-ml.so.550.54.15*
-rwxr-xr-x  1 root root  2078360 Jun 13 12:01 libnvidia-ml.so.550.54.15*
-rwxr-xr-x  1 root root 86842616 Jun 13 12:01 libnvidia-nvvm.so.550.54.15*
-rwxr-xr-x  1 root root 23293568 Jun 13 12:01 libnvidia-opencl.so.550.54.15*
-rwxr-xr-x  1 root root    10168 Jun 13 12:01 libnvidia-pkcs11.so.550.54.15*
-rwxr-xr-x  1 root root 28670368 Jun 13 12:01 libnvidia-ptxjitcompiler.so.550.54.15*
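
Note that in this listing only libnvidia-ml.so has a symlink (created manually at 14:00); the usual .so.1 SONAME links are missing, which matches both errors above. A hypothetical repair for this exact listing (library names and the 550.54.15 version are taken from it; a later comment in this thread describes which links the NVIDIA runtime normally creates):

cd /usr/lib64
# Recreate the SONAME links the NVIDIA runtime would normally provide.
for lib in libcuda libcudadebugger libnvidia-allocator libnvidia-cfg \
           libnvidia-ml libnvidia-opencl libnvidia-ptxjitcompiler; do
  ln -sf "${lib}.so.550.54.15" "${lib}.so.1"
done
ln -sf libnvidia-nvvm.so.550.54.15 libnvidia-nvvm.so.4  # nvvm uses .4
ln -sf libcuda.so.1 libcuda.so                          # unversioned link
echo "/usr/lib64" > /etc/ld.so.conf.d/customized.conf
ldconfig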


BrunoQuaresma commented on August 11, 2024

@marrotte could you please share step-by-step instructions for setting up a similar k8s cluster, or a Terraform file I can just run?


BrunoQuaresma commented on August 11, 2024

I tried to create a Kubernetes GPU cluster on Azure following this tutorial, but without success. During the process, I managed to get the cluster up and running and register the required features and services through step five of the tutorial.

bruno [ ~ ]$ az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
{
  "id": "/subscriptions/05e8b285-4ce1-46a3-b4c9-f51ba67d6acc/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/GPUDedicatedVHDPreview",
  "name": "Microsoft.ContainerService/GPUDedicatedVHDPreview",
  "properties": {
    "state": "Registered"
  },
  "type": "Microsoft.Features/providers/features"
}

However, when I began adding the node pool, I started encountering errors.

az aks nodepool add \
    --resource-group bruno \
    --cluster-name bruno-gpu \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --aks-custom-headers UseGPUDedicatedVHD=true \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
(OperationNotAllowed) .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled
Code: OperationNotAllowed
Message: .properties.nodeProvisioningProfile.mode cannot be Auto while any AgentPools have .properties.enableAutoScaling enabled

I tried searching for the error on Google to find a solution or any information related to properties.nodeProvisioningProfile.mode, but I didn't find anything helpful. So it might be better to ask if you could share a Terraform file or a more straightforward tutorial so we can reproduce your environment.
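
One possible reading of that error (a guess, untested): the cluster seems to have node auto-provisioning set to Auto, which AKS rejects in combination with a pool-level autoscaler. Dropping the autoscaler flags from the same command might get past it:

az aks nodepool add \
    --resource-group bruno \
    --cluster-name bruno-gpu \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --aks-custom-headers UseGPUDedicatedVHD=true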


maxbrunet commented on August 11, 2024

So ENVBUILDER_IGNORE_PATHS can be set to /dev,/lib/firmware/nvidia,/usr/bin/nv-,/usr/bin/nvidia-,/usr/lib64/libcuda,/usr/lib64/libnvidia-,/var/run, but then we hit the known "unlinkat: device or resource busy" error.

The easiest way to get the right environment for reproducing this is likely the gpu-operator for Kubernetes, or the NVIDIA Container Toolkit for Docker.

I believe #183 (and #249) can provide a workaround here by temporarily remounting the paths out of the way instead of trying to ignore them in Kaniko, although note that mount/umount require privileges.

Currently only read-only mounts are taken care of, but the NVIDIA container runtime also mounts devtmpfs filesystems at /var/run/nvidia-container-devices/GPU-<uuid> (the actual mountpoint can be /run, since /var/run is often a symlink to it), so the logic would need to be extended to cover those. I have since done that successfully; no special handling was needed, I had probably just forgotten to add /var/run back to the ignored paths.
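
To see what the runtime actually mounted into the container (a quick sketch; output will vary by driver version and runtime configuration):

mount | grep -E 'nvidia|/usr/lib64'            # library and device mounts
findmnt -lo TARGET,FSTYPE,SOURCE | grep -i nvidia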

The runtime mounts libraries with symlinks:

libcuda.so -> libcuda.so.1
libcuda.so.1 -> libcuda.so.<driver-version>
libcuda.so.<driver-version>
libcudadebugger.so.1 -> libcudadebugger.so.<driver-version>
libcudadebugger.so.<driver-version>
libnvidia-allocator.so.1 -> libnvidia-allocator.so.<driver-version>
libnvidia-allocator.so.<driver-version>
libnvidia-cfg.so.1 -> libnvidia-cfg.so.<driver-version>
libnvidia-cfg.so.<driver-version>
libnvidia-ml.so.1 -> libnvidia-ml.so.<driver-version>
libnvidia-ml.so.<driver-version>
libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.<driver-version>
libnvidia-nvvm.so.<driver-version>
libnvidia-opencl.so.1 -> libnvidia-opencl.so.<driver-version>
libnvidia-opencl.so.<driver-version>
libnvidia-pkcs11-openssl3.so.<driver-version>
libnvidia-pkcs11.so.<driver-version>
libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.<driver-version>
libnvidia-ptxjitcompiler.so.<driver-version>

The symlinks must also be preserved. The location in the envbuilder image is /usr/lib64, but it differs between distros (for example in Debian, it should be /usr/lib/x86_64-linux-gnu), so the remount process must discover the appropriate location in the new filesystem hierarchy.

I am using this quick-and-dirty script afterward to get things working:

remount_and_resymlink.sh
#!/usr/bin/env bash
set -euo pipefail

# Where the libraries should live in the new root (Debian multiarch here).
TARGET=/usr/lib/x86_64-linux-gnu

# Derive the driver version from the firmware directory name,
# e.g. /lib/firmware/nvidia/550.54.15 -> 550.54.15.
FIRMWARES=(/lib/firmware/nvidia/*)
VERSION="${FIRMWARES[0]}"
VERSION="${VERSION##*/}"

mount | awk '/\/usr\/lib64/{print $3}' | while read -r path; do
  lib="${path##*/}"

  # Bind-mount each library out of /usr/lib64 into the target directory,
  # then drop the old mountpoint.
  mkdir -p "${TARGET}"
  touch "${TARGET}/${lib}"
  mount --bind "${path}" "${TARGET}/${lib}"
  umount "${path}"

  # Recreate the SONAME symlink (libfoo.so.N -> libfoo.so.<driver-version>).
  # The pkcs11 libraries have no SONAME symlink; libnvidia-nvvm uses .4.
  n=""
  case "${lib}" in
    libnvidia-pkcs11.so.*) ;;
    libnvidia-pkcs11-openssl3.so.*) ;;
    libnvidia-nvvm.so.*)
      n=4
      ;;
    *)
      n=1
      ;;
  esac

  if [[ -n "${n}" ]]; then
    ln -s "${lib}" "${TARGET}/${lib%"${VERSION}"}${n}"
  fi

  # libcuda also needs the unversioned libcuda.so -> libcuda.so.1 link.
  if [[ "${lib}" == "libcuda.so."* ]]; then
    ln -s "${lib%"${VERSION}"}${n}" "${TARGET}/${lib%".${VERSION}"}"
  fi
done

This is the logic the runtime uses to pick the library directory: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_container.c#L151-L188

And this looks like the libraries it can potentially mount: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_info.c#L75-L139
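
The gist of that selection (a rough approximation; the real logic in the linked source resolves the directory through the container's ld.so.cache):

# Pick the distro's library directory in the new rootfs (sketch).
if [ -d /usr/lib/x86_64-linux-gnu ]; then
  LIBDIR=/usr/lib/x86_64-linux-gnu   # Debian/Ubuntu multiarch
elif [ -d /usr/lib64 ]; then
  LIBDIR=/usr/lib64                  # Fedora/RHEL-style, and the envbuilder image
else
  LIBDIR=/usr/lib
fi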

Once that is all in place, the nvidia-smi command should work, and the GPU(s) and the CUDA version should be visible.

In an image with pytorch (e.g. nvcr.io/nvidia/pytorch:24.05-py3), python -c 'import torch; print(torch.cuda.is_available())' should return True.

One thing I had not figured out at first is why the container gets all GPUs when only 1 is requested (this works properly for a regular container) 😕 It turns out that's because the pod is running privileged.

The manifest I am using at the moment:

pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder 
spec:
  containers:
  - name: envbuilder
    image: ghcr.io/coder/envbuilder-preview
    env:
    - name: FALLBACK_IMAGE
      value: debian
    - name: INIT_SCRIPT
      value: sh -c 'while :; do sleep 86400; done' 
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    resources:
      limits:
        nvidia.com/gpu: "1"
    securityContext:
      # Privileged is needed for the mount/umount workaround above; it is
      # also why all GPUs end up visible in the container.
      privileged: true

If you are on GCP/GKE, the above should be valid for Ubuntu nodes (I think; I am not testing there). I need to investigate GCP's Container-Optimized OS too, since things are wired a little differently there: the only mount is /usr/local/nvidia, so this path can either be ignored or remounted, and by default no care is given to the PATH or ldconfig search path; that has to be handled by the user's image (e.g. LD_LIBRARY_PATH=/usr/local/nvidia/lib64 /usr/local/nvidia/bin/nvidia-smi should work).

