nvidia / nvidia-container-runtime

NVIDIA container runtime

License: Apache License 2.0

nvidia-container-runtime's Introduction

DEPRECATION NOTICE

This project has been superseded by the NVIDIA Container Toolkit. The tooling provided by it has been migrated to the NVIDIA Container Toolkit and this repository is archived.

For further instructions, see the NVIDIA Container Toolkit documentation and specifically the install guide.

Issues and Contributing

Check out the Contributing document!

For questions, feature requests, or bugs, open an issue against the nvidia-container-toolkit repository.

nvidia-container-runtime's Issues

runtime hook using newer cli

Using a kubernetes example:

$ docker build -f nvidia-cuda-vector_Dockerfile -t cuda-vector-add:v0.1 .

Sending build context to Docker daemon  2.046GB
Step 1/5 : FROM nvidia/cuda-ppc64le:8.0-devel-ubuntu16.04
 ---> 62553fb74993
Step 2/5 : RUN apt-get update && apt-get install -y --no-install-recommends         cuda-samples-$CUDA_PKG_VERSION &&     rm -rf /var/lib/apt/lists/*
 ---> Running in 1f1ddbb19617
OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/local/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --compat32 --graphics --utility --video --display --require=cuda>=8.0 --pid=121739 /var/lib/docker/aufs/mnt/8e537edc1ae0f2e5d7190854e90d35eee0d6d5251eb79d21a396811011333e05]\\\\nnvidia-container-cli configure: unrecognized option '--display'\\\\nTry `nvidia-container-cli configure --help' or `nvidia-container-cli configure\\\\n--usage' for more information.\\\\n\\\"\"": unknown

But if I bump my FROM image from 8.0 to 9.2, I don't get that error and my container builds. I see that the --display option was added to the configure subcommand in late February, so I suspect this is a version mismatch: the runtime hook passes --display, but my nvidia-container-cli is too old to recognize it?

I found someone else running 8.0 on x86 has hit the same issue: https://stackoverflow.com/questions/49938024/errors-on-install-driverless-ai

$ nvidia-container-cli --version
version: 1.0.0
build date: 2018-02-07T18:40+00:00
build revision: c4cef33eca7ec335007b4f00d95c76a92676b993
build compiler: gcc-5 5.4.0 20160609
build platform: ppc64le

I'm being a little lazy and not figuring this out myself, but I'm sure you know pretty quickly what caused this so I don't feel too guilty about my laziness. :D
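One way to check the version-mismatch hypothesis above is to compare what the installed CLI reports with the option named in the error. The sketch below parses the key/value lines printed by nvidia-container-cli --version and pulls the rejected option out of the hook's stderr. The function names are illustrative and not part of any NVIDIA tool; the sample strings are copied from this report.

```python
import re


def parse_cli_version(output):
    """Parse the 'key: value' lines printed by `nvidia-container-cli --version`."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def unrecognized_option(stderr):
    """Return the option named in an 'unrecognized option' message, or None."""
    match = re.search(r"unrecognized option '([^']+)'", stderr)
    return match.group(1) if match else None


# Sample data copied from the report above.
version_output = """version: 1.0.0
build date: 2018-02-07T18:40+00:00
build revision: c4cef33eca7ec335007b4f00d95c76a92676b993
build compiler: gcc-5 5.4.0 20160609
build platform: ppc64le"""
hook_stderr = "nvidia-container-cli configure: unrecognized option '--display'"

info = parse_cli_version(version_output)
print(info["build date"], unrecognized_option(hook_stderr))
```

Here the CLI build date (2018-02-07) predates late February, which is consistent with the reporter's guess that the hook passes a flag the installed CLI does not yet know.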

Building for ubuntu 19.04 - can't find Docker image

I am trying to build the runtime for Ubuntu 19.04 by making some modifications to the Makefile and using the same patchset as 18.04. However I am stuck at the point where it is trying to grab a nvidia/base/ubuntu:19.04 image and yields:

docker build --build-arg VERSION_ID="19.04" \
                        --build-arg RUNC_COMMIT="6635b4f0c6af3810594d2770f662f34ddc15b40d" \
                        --build-arg PKG_VERS="2.0.0+docker" \
                        --build-arg PKG_REV="1" \
                        -t "nvidia/runtime/ubuntu:"19.04"-docker" -f Dockerfile.ubuntu .
Sending build context to Docker daemon  69.12kB
Step 1/21 : ARG VERSION_ID
Step 2/21 : FROM nvidia/base/ubuntu:${VERSION_ID}
pull access denied for nvidia/base/ubuntu, repository does not exist or may require 'docker login'
make: *** [Makefile:81: ubuntu] Error 1

I guess this is to be expected, but I was unable to find the Dockerfiles for nvidia/base/ubuntu so that I could make one for ubuntu 19.04. Any tips would be appreciated, thanks.

Add packages for the Docker versions supported by KOPS

Sister issue to NVIDIA/nvidia-docker/issues/689

Syntax error in runtime/Makefile

Same as NVIDIA/nvidia-docker#692

please remove the comma after 18.03.0 in runtime/Makefile

what are needed to install in the container?

Hi,

I installed nvidia-container-runtime on CentOS 7 and ran it successfully with the nvidia/cuda images. However, I need to run a GPU container based on a different image, so I have to install something in the container to use the GPU. Someone said that only CUDA needs to be installed in the container, so I installed CUDA 8.0 and cuDNN 6.0 there, but it did not work.
The error messages are below:

/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
2018-04-16 14:24:08.755238: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-04-16 14:24:08.757483: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUresult(-1)
2018-04-16 14:24:08.757535: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: e6cbf58314a6
2018-04-16 14:24:08.757565: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: e6cbf58314a6
2018-04-16 14:24:08.757635: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.66.0
2018-04-16 14:24:08.757672: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.66 Tue Aug 1 16:02:12 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
"""
2018-04-16 14:24:08.757731: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.66.0
2018-04-16 14:24:08.757748: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.66.0
b'Hello, My TensorFlow!'

Does anyone know why this happened? Thanks!

Error

Ubuntu 16.04

I've been using docker2 with no issues except that I kept receiving this error when running a package.

libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast

I could not troubleshoot the issue. Apparently docker2 does not support this.

I removed docker2 with apt remove and apt purge.

I installed nvidia-container-runtime with apt (which apparently supports libGL).

I created the systemd file, etc.

When I run sudo systemctl restart docker I get the following error.

Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

Not sure what this is?

When I tried running docker I get the following error:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Please help!

Thanks.

gpg: no valid OpenPGP data found.

I got the error "gpg: no valid OpenPGP data found." after running:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -

How can I fix it?
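This message usually means that whatever reached apt-key was not a key at all, for example an empty response or an HTML error page from a proxy; curl -s hides such failures. A minimal sketch, assuming the key is ASCII-armored, that checks the downloaded text before it is piped anywhere (the function name is illustrative):

```python
PGP_BEGIN = "-----BEGIN PGP PUBLIC KEY BLOCK-----"
PGP_END = "-----END PGP PUBLIC KEY BLOCK-----"


def looks_like_armored_key(text):
    """Return True if text contains an ASCII-armored public key block."""
    begin = text.find(PGP_BEGIN)
    end = text.find(PGP_END)
    # Both markers must be present, and in the right order.
    return begin != -1 and end != -1 and begin < end


# An HTML error page or an empty body fails the check.
print(looks_like_armored_key("<html><body>403 Forbidden</body></html>"))
print(looks_like_armored_key(""))
```

Saving the curl output to a file and inspecting it by eye gives the same information; the point is to look at what was actually downloaded.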

How can I install nvidia-docker2, nvidia-container-runtime in other linux distributions?

Hi,

I have been trying to install nvidia-docker2 on Debian 8 (Jessie) without success. I am able to install Docker 17.12.0~ce-0~debian. As I don't see any mention of other supported distributions, I wonder whether this installation is possible.

Best regards,

Failed to install deb pkg

I came across the following problem when installing the package:

# dpkg -i nvidia-container-runtime_2.0.0+docker1.12.6-1_amd64.deb
dpkg-deb: error: 'nvidia-container-runtime_2.0.0+docker1.12.6-1_amd64.deb' is not a debian format archive
dpkg: error processing archive nvidia-container-runtime_2.0.0+docker1.12.6-1_amd64.deb (--install):
 subprocess dpkg-deb --control returned error exit status 2
Errors were encountered while processing:
 nvidia-container-runtime_2.0.0+docker1.12.6-1_amd64.deb

My OS is:

# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial

How can I solve this?
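dpkg-deb prints "not a debian format archive" when the file does not start like a .deb. A .deb is an ar archive, so its first eight bytes are the string !<arch> followed by a newline. A likely cause here is a truncated download or an HTML page saved under a .deb name, though the report does not confirm that. A minimal sketch of the check, with illustrative helper names and throwaway temporary files standing in for the real package:

```python
import os
import tempfile

AR_MAGIC = b"!<arch>\n"  # global header of every ar archive, including .deb files


def is_ar_archive(path):
    """Return True if the file at path starts with the ar magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(AR_MAGIC)) == AR_MAGIC


def write_temp(data):
    """Write data to a new temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


# Demo: one file that starts like a .deb, one that is really an HTML page.
good = write_temp(AR_MAGIC + b"debian-binary   ")
bad = write_temp(b"<!DOCTYPE html>\n")
print(is_ar_archive(good), is_ar_archive(bad))
os.remove(good)
os.remove(bad)
```

If the check fails for the downloaded package, downloading it again and comparing file sizes is a reasonable next step.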

unable to install on Docker

root@2c62044607a9:/# curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey |sudo apt-key add -
OK

root@2c62044607a9:/# curl -s -L https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64 /
deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64 /

root@2c62044607a9:/# sudo apt-get update
Ign:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 InRelease
Hit:2 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 Release
Hit:3 http://security.ubuntu.com/ubuntu xenial-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu xenial InRelease
Hit:6 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
0% [Working]

It just hangs at 0% [Working].

Is this a replacement of nvidia-docker in the future?

@flx42 Hi, I just found this project and tried it locally. It is awesome!

I'm wondering whether this will replace nvidia-docker in the future.
We have used nvidia-docker 2.0 in our staging environment, and I see that the nvidia-docker project is still on its way to a 2.0 release.

What is the goal of this project? Will it replace nvidia-docker?

ppc64le, RedHat7.6 docker-nvidia2

Hi,
I'm trying to find nvidia-docker2 for Red Hat on ppc64le (POWER9 repo). Any tips would be appreciated.
Cheers,
S

Startup Problems on Ubuntu 14.04

Trying to start docker with runtime=nvidia, I get the following error message

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused "process_linux.go:385: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --video --pid=30055 /var/lib/docker/aufs/mnt/e5827a97552f6cd9825b4b3e68fc565e3ae791addeb680394c12de943a5c64d4]\\nnvidia-container-cli: mount error: file creation failed: /var/lib/docker/aufs/mnt/e5827a97552f6cd9825b4b3e68fc565e3ae791addeb680394c12de943a5c64d4/usr/bin/nvidia-smi: file exists\\n\""": unknown.
ERRO[0000] error waiting for container: context canceled

Apparently, the multiple line command is not recognized as such but as a file name for the volume parameter.

PS:
Docker version 18.06.1-ce, build e68fc7a

NVIDIA_VISIBLE_DEVICES makes checkpointing impossible

Issue Description

First of all, thank you for this useful container! Now, my problem: I want to perform a live migration of a running Docker container. My container uses the NVIDIA runtime, with the GPU passed through to the container via ENV NVIDIA_VISIBLE_DEVICES all. Live migration is done through Docker's experimental checkpoint and restore feature, which uses CRIU. So I start my container

nvidia-docker run --name myApp -i app

which runs fine, and also exits normally if not checkpointed. However, if I create a checkpoint using

docker checkpoint create --leave-running=true myApp checkpoint1

I get following error response:

Error response from daemon: Cannot checkpoint container myApp: nvidia-container-runtime did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /var/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/10eb4bf688652c5d8f612fca192c12c80cc59bce605cf3ff0a0e8a0e07ce17da/criu-dump.log: unknown

Inspecting file criu-dump.log leads me to the following error:

Error (criu/mount.c:925): mnt: Mount 448 ./proc/driver/nvidia/gpus/0000:01:00.0 (master_id: 13 shared_id: 0) has unreachable sharing. Try --enable-external-masters.

So it seems to be an issue with CRIU not being able to create a checkpoint of the GPU. The --enable-external-masters option is not suitable for GPUs. Including ENV NVIDIA_DRIVER_CAPABILITIES all in the Dockerfile does not resolve the issue either. So my main question is whether there is a way to include a GPU dump in the checkpoint of an nvidia-docker container. NVIDIA does not seem to support this properly yet, but there is NVIDIA software for live migration between GPUs. So can we expect support for this kind of application, too?

Steps to Reproduce

Set docker to experimental mode, install criu for docker, install nvidia driver, install nvidia-docker, install nvidia-container-runtime.

Create your Dockerfile with the following two lines:

FROM ubuntu:16.04
ENV NVIDIA_VISIBLE_DEVICES all

Build the container

sudo docker build -t app .

When the container is built, run it

nvidia-docker run --name myApp -i app

where you have to omit the -t flag to avoid issues with CRIU. Next, we try to create the checkpoint

docker checkpoint create --leave-running=true myApp checkpoint1

upon which you will receive the error response from the issue description.

System Information

I'm attaching hopefully relevant system information, as suggested in the nvidia-docker issues.

Kernel version from uname -a:

Linux ECS 4.15.0-39-generic #42~16.04.1-Ubuntu SMP Wed Oct 24 17:09:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Driver information from nvidia-smi -a:

==============NVSMI LOG==============

Timestamp                           : Tue Nov 20 09:37:42 2018
Driver Version                      : 384.130

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : GeForce GTX 1080
    Product Brand                   : GeForce

Docker version from docker version:

Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     true

Nvidia packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*':

_or_: command not found
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version          Architecture     Description
+++-======================-================-================-==================================================
ii  libnvidia-container-to 1.0.0-1          amd64            NVIDIA container runtime library (command-line too
ii  libnvidia-container1:a 1.0.0-1          amd64            NVIDIA container runtime library
ii  nvidia-384             384.130-0ubuntu0 amd64            NVIDIA binary driver - version 384.130
ii  nvidia-384-dev         384.130-0ubuntu0 amd64            NVIDIA binary Xorg driver development files
un  nvidia-common          <none>           <none>           (no description available)
ii  nvidia-container-runti 2.0.0+docker18.0 amd64            NVIDIA container runtime
ii  nvidia-container-runti 1.4.0-1          amd64            NVIDIA container runtime hook
un  nvidia-docker          <none>           <none>           (no description available)
ii  nvidia-docker2         2.0.3+docker18.0 all              nvidia-docker CLI wrapper
un  nvidia-driver-binary   <none>           <none>           (no description available)
un  nvidia-legacy-340xx-vd <none>           <none>           (no description available)
un  nvidia-libopencl1-384  <none>           <none>           (no description available)
un  nvidia-libopencl1-dev  <none>           <none>           (no description available)
ii  nvidia-modprobe        384.81-0ubuntu1  amd64            Load the NVIDIA kernel driver and create device fi
un  nvidia-opencl-icd      <none>           <none>           (no description available)
ii  nvidia-opencl-icd-384  384.130-0ubuntu0 amd64            NVIDIA OpenCL ICD
un  nvidia-persistenced    <none>           <none>           (no description available)
ii  nvidia-prime           0.8.2            amd64            Tools to enable NVIDIA's Prime
ii  nvidia-settings        384.81-0ubuntu1  amd64            Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary <none>           <none>           (no description available)
un  nvidia-smi             <none>           <none>           (no description available)
un  nvidia-vdpau-driver    <none>           <none>           (no description available)
dpkg-query: no packages found matching *nvidia*rpm
dpkg-query: no packages found matching -qa

NVIDIA container library version from nvidia-container-cli -V:

version: 1.0.0
build date: 2018-09-20T20:18+00:00
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc-5 5.4.0 20160609
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Nvidia-docker version from nvidia-docker --version:

Docker version 18.06.1-ce, build e68fc7a

The debugging logs from nvidia-container-runtime do not contain relevant information.

Fix compilation from source

It seems that access is denied:
nvidia-container-runtime/runtime$ make ubuntu18.04
make: *** No rule to make target '18.05.0-ubuntu18.04-runc'.  Stop.
runc="" && \
docker build --build-arg VERSION_ID="18.04" \
             --build-arg RUNC_COMMIT="${runc}" \
             --build-arg PKG_VERS="2.0.0+docker18.05.0" \
             --build-arg PKG_REV="1" \
             -t "nvidia/runtime/ubuntu:18.04-docker18.05.0" -f Dockerfile.ubuntu .
Sending build context to Docker daemon 67.07kB
Step 1/21 : ARG VERSION_ID
Step 2/21 : FROM nvidia/base/ubuntu:${VERSION_ID}
pull access denied for nvidia/base/ubuntu, repository does not exist or may require 'docker login'
Makefile:59: recipe for target '18.05.0-ubuntu18.04' failed
make: *** [18.05.0-ubuntu18.04] Error 1

nvidia-container-runtime/runtime$ docker pull nvidia/base/ubuntu
Using default tag: latest
Error response from daemon: pull access denied for nvidia/base/ubuntu, repository does not exist or may require 'docker login'

ubuntu18.04/arm64 Release

Windows support (docker running linux containers)

With the recent improvements to Windows, such as support for running Linux containers, I am wondering whether there is a path to getting this working on Windows in the future.

Setup local mirror for CentOS/RHEL

Hi!

In our company only a few machines have direct access to the internet.

For this reason, I need to set up a local mirror of the CentOS/RHEL packages.

Can you please give me a starting URL for wget --mirror?

Thanks a lot

Dirk

Configuration totally wrong for Fedora

I'm trying to install nvidia-container-runtime to run with the latest version of Docker on Fedora. None of the methods described in the documentation work for registering the nvidia runtime.

The /etc/docker/daemon.json method doesn't work (even the path in the documentation is wrong!). With the file described in the documentation, the Docker daemon doesn't start when I run "systemctl start docker"; the log shows that it rejects the daemon.json.
The override.conf method doesn't work because the dockerd daemon doesn't exist!
The same problem occurs if you try to use the dockerd command, as it doesn't exist!
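When the daemon rejects daemon.json, it helps to first rule out a malformed file. The sketch below is an assumption-laden check, not Docker's own validation: it only tests that the text is valid JSON and that a runtime entry with a path exists. The sample uses the runtime registration as commonly documented for this project, with /usr/bin/nvidia-container-runtime as the assumed binary location.

```python
import json


def check_daemon_config(text, runtime="nvidia"):
    """Return a list of problems found in a daemon.json document (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: %s (line %d)" % (exc.msg, exc.lineno)]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    runtimes = config.get("runtimes")
    if not isinstance(runtimes, dict) or runtime not in runtimes:
        return ["no '%s' entry under 'runtimes'" % runtime]
    entry = runtimes[runtime]
    if not isinstance(entry, dict) or not entry.get("path"):
        return ["runtime '%s' has no 'path'" % runtime]
    return []


# Assumed sample: the binary path may differ on a given distribution.
sample = """{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}"""

print(check_daemon_config(sample))
```

A clean result only shows the file is well formed. It cannot tell whether the Docker package on a given distribution reads that file, or whether its daemon accepts every key in it.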

No hook config file for ppc64le

@flx42 We noticed you removed the steps to manually create the hook config file from your nvidia-docker README (NVIDIA/nvidia-docker@ddb80bf).

However, we also see that the nvidia-container-runtime update that moves the hook config file into the nvidia-container-runtime-hook package (7372924) hasn't been rebuilt for ppc64le.

So at this point we're in limbo: there are no steps to create the file, and the package doesn't include it either. Is it possible to rebuild the ppc64le version of nvidia-container-runtime-hook?

Non-default nvidia-container-runtime-hook config file

Hi I'm not sure if this is correct but it looks like the path to config.toml for nvidia-container-runtime-hook is hard-coded here

nvidia-container-runtime/hook/nvidia-container-runtime-hook/hook_config.go

Line 12 in 03af0a8

configPath = "/etc/nvidia-container-runtime/config.toml"

which means if I want to use a config file located somewhere else I need to edit the code and recompile?

Is it a good idea to make this path configurable via an arg to nvidia-container-runtime or an environment variable? I can submit a PR if that's the case.
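The proposal amounts to a lookup with a fallback. Here is a minimal sketch of that logic, written in Python for brevity rather than in the hook's Go. The environment variable name NVIDIA_CONTAINER_RUNTIME_CONFIG is hypothetical and chosen only for illustration; the default path is the hard-coded one quoted above.

```python
import os

DEFAULT_CONFIG_PATH = "/etc/nvidia-container-runtime/config.toml"
CONFIG_ENV_VAR = "NVIDIA_CONTAINER_RUNTIME_CONFIG"  # hypothetical name


def resolve_config_path(environ=None):
    """Return the override from the environment if set, else the default path."""
    if environ is None:
        environ = os.environ
    override = environ.get(CONFIG_ENV_VAR, "").strip()
    # An unset or blank variable falls back to the built-in default.
    return override or DEFAULT_CONFIG_PATH


print(resolve_config_path({}))
print(resolve_config_path({CONFIG_ENV_VAR: "/opt/nvidia/config.toml"}))
```

Passing the environment as a parameter keeps the function easy to test; a command-line argument could take precedence over the variable in the same way.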

Can't find runc commit for copy during build

It looks like the current master is not working, or I am doing something wrong. When I run make ubuntu16.04 I get:

Step 15/21 : ARG RUNC_COMMIT
 ---> Running in be509c9aae08
Removing intermediate container be509c9aae08
 ---> be1855195d03
Step 16/21 : COPY runc/$RUNC_COMMIT/ /tmp/patches/runc
COPY failed: stat /var/lib/docker/tmp/docker-builder035100033/runc/4fc53a81fb7c994640722ac585fa9ca548971871: no such file or directory
Makefile:65: recipe for target '18.03.0-ubuntu16.04' failed
make[1]: *** [18.03.0-ubuntu16.04] Error 1
make[1]: Leaving directory '/home/gijs/Work/nvidia-container-runtime/runtime'
Makefile:31: recipe for target 'runtime-ubuntu16.04' failed
make: *** [runtime-ubuntu16.04] Error 2

socket: no such device or address

Using nvidia-container-runtime with containerd I am getting this error:

ctr: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=5773 /run/containerd/io.containerd.runtime.v1.linux/default/nvidia-smi/rootfs]\\\\nnvidia-container-cli: mount error: file creation failed: /run/containerd/io.containerd.runtime.v1.linux/default/nvidia-smi/rootfs/run/nvidia-persistenced/socket: no such device or address\\\\n\\\"\"": unknown

Failed to start docker after installing docker container.

I got the same error message as in this post. Following @flx42's solution, I removed the --add-runtime=... part, and now Docker runs again. The --add-runtime=... part comes from the README here.
So could you please update the README?

How to install nvidia-container-runtime in alpine OS?

When running cuda container with v2 I get ECC error from GPU driver

I am using Ubuntu 16.04 on an AWS p2.xl instance.

NVIDIA driver:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    69W / 149W |  10450MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3701      C   python                                     10436MiB |
+-----------------------------------------------------------------------------+

After running a docker command with the nvidia runtime, I get an ECC error from the driver:

[  374.665347] NVRM: Xid (PCI:0000:00:1e): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 0, subpartition 1.
[  374.675104] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000010, engmsk 00000100
[  374.695751] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000010, engmsk 00000100
[  374.716755] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000011, engmsk 00000100
[  374.737608] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000012, engmsk 00000100
[  374.758500] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000013, engmsk 00000100
[  374.779381] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000014, engmsk 00000100
[  374.800005] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000015, engmsk 00000100
[  374.820647] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000016, engmsk 00000100
[  374.838173] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000017, engmsk 00000100
[  374.860733] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000011, engmsk 00000100
[  374.877895] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000012, engmsk 00000100
[  374.953751] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000013, engmsk 00000100
[  374.974460] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000014, engmsk 00000100
[  374.995350] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000015, engmsk 00000100
[  375.017168] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000016, engmsk 00000100
[  375.038242] NVRM: Xid (PCI:0000:00:1e): 45, Ch 00000017, engmsk 00000100
[  376.421044] NVRM: Xid (PCI:0000:00:1e): 64, Dynamic Page Retirement: Page is already pending retirement, reboot to retire page (0x00000000002ce857).
[  376.433349] NVRM: Xid (PCI:0000:00:1e): 64, Dynamic Page Retirement: Page is already pending retirement, reboot to retire page (0x00000000002ce857).
[  376.450507] NVRM: Xid (PCI:0000:00:1e): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce857).

Error: ubuntu18.04/arm64 Release Not Found

I am running Ubuntu 18.04 LTS. When I run

"sudo apt-get update"

I get:

Err:14 https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/arm64  Release
  404  Not Found [IP: 185.199.111.153 443]
E: The repository 'https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/arm64  Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Full error readout attached:
Ubuntu 18.04 LTS Error.pdf

Make SELinux labels persist through reboots

Hi, I am using the nvidia-container-runtime-hook with fedora docker and based on advice in other issues, run the following command to set selinux labels to allow containers access to the nvidia devices:

chcon -t container_file_t /dev/nvidia*

However, on reboot these labels are lost and need to be reset each time. Is there a recommended way of making them persist, for example using udev or an NVIDIA-specific mechanism?

deb

For RHEL 7.5 the repo is empty/not found

Following install instructions @ https://github.com/NVIDIA/nvidia-docker:

'distribution' value is set to 'rhel7.5'.

But curl returns empty .repo file.

curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo

Where is the repo for RHEL 7.5?
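The value rhel7.5 is what results if distribution is built by concatenating ID and VERSION_ID from /etc/os-release, which is an assumption here about what the install instructions do. The sketch below reproduces that derivation so the exact string used in the URL can be inspected. The sample os-release content is made up to match the reported value.

```python
def parse_os_release(text):
    """Parse KEY=value lines in the style of /etc/os-release into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        values[key] = value.strip().strip('"').strip("'")
    return values


def distribution(text):
    """Concatenate ID and VERSION_ID, e.g. 'rhel' + '7.5'."""
    values = parse_os_release(text)
    return values.get("ID", "") + values.get("VERSION_ID", "")


# Illustrative content, not copied from a real RHEL 7.5 host.
sample = 'NAME="Red Hat Enterprise Linux Server"\nID="rhel"\nVERSION_ID="7.5"\n'
print(distribution(sample))
```

Because the minor version ends up in the URL, an empty response suggests that no .repo file is published under that exact name.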

/usr/bin overlay uses `nosuid` argument preventing sudo

Not sure if this repo is the appropriate place to post this issue. When I run the following command:

docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm ubuntu:14.04 cat /etc/mtab | grep usr/bin

The output shows:

overlay /usr/bin overlay rw,nosuid,nodev,relatime,lowerdir=/var/lib/docker/o...

Indicating that /usr/bin has been mounted with the nosuid option. When running in the container as a root user this does not present any issues. However, when running as a regular user and trying to invoke sudo at /usr/bin/sudo it results in the following error:

sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' option set or an NFS file system without root privileges?

The expected behavior is for the sudo command to work.

I am not sure which line is responsible for mounting /usr/bin but presumably it has to do with making commands such as nvidia-smi available within the container. This is useful but certainly not at the expense of sudo capability. If this is outside the scope of this project to fix, then at least having a way to disable /usr/bin mounting would be appreciated to avoid any nosuid mount flags being set.
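To confirm which mounts carry nosuid, the options field of each mtab line can be split out. A minimal sketch, with an illustrative function name, using the (truncated) line quoted above as sample input:

```python
def mount_options(mtab_line):
    """Return (mount_point, options) for one line in /etc/mtab format."""
    fields = mtab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: device, mount point, type, options")
    # Field order: device, mount point, filesystem type, comma-separated options.
    return fields[1], fields[3].split(",")


# Line as quoted in the report; the lowerdir value is truncated in the original.
line = "overlay /usr/bin overlay rw,nosuid,nodev,relatime,lowerdir=/var/lib/docker/o..."
mount_point, options = mount_options(line)
print(mount_point, "nosuid" in options)
```

Running the same check over every line of /etc/mtab inside the container shows whether /usr/bin is the only affected mount.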

Support running at node without GPU devices

Hi, I'm using k8s 1.11 with the device plugin. Some of my nodes have no GPU, and the device-plugin pod crashes on them.
The logs look like this:

container_linux.go:262: starting container process caused "process_linux.go:345: container init caused \"process_linux.go:328: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command:
[/usr/bin/nvidia-container-cli --load-kmods --debug=/dev/stderr configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=6666

One solution is to use affinity in k8s, but then I have to determine which nodes have GPUs.
Another way is to add some logic to the hook that checks whether the node has a GPU and returns success when none is found.

nvidia-container-runtime/hook/nvidia-container-runtime-hook/main.go

Lines 89 to 95 in aa11413

	nvidia := container.Nvidia
	if nvidia == nil {
		// Not a GPU container, nothing to do.
		return
	}

	rootfs := getRootfsPath(container)

For now I made my change at line 94, and it works fine.
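
The same kind of guard could also live outside the hook. The following is a minimal sketch, not the change made at line 94: the function names has_nvidia_devices and run_if_gpu_present are hypothetical, and it assumes that the presence of /dev/nvidiactl is a good enough sign that the node has a usable GPU, which may not hold on every setup.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if an NVIDIA control device node exists.
# The device directory is a parameter (default /dev) so that the check can
# be exercised against a scratch directory.
has_nvidia_devices() {
    dev_dir=${1:-/dev}
    [ -e "$dev_dir/nvidiactl" ]
}

# Hypothetical wrapper: run the given command only when a GPU appears to be
# present. Otherwise print a note to stderr and return success, so that a
# container on a non-GPU node can still start.
run_if_gpu_present() {
    dev_dir=$1
    shift
    if has_nvidia_devices "$dev_dir"; then
        "$@"
    else
        echo "no NVIDIA devices under $dev_dir, skipping" >&2
        return 0
    fi
}
```

The design choice here mirrors the proposal in the issue: a missing GPU is treated as "nothing to do" rather than as an error.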

Usage example fail

Hi, I ran the README usage example, and it failed.
The usage example:

cd $(mktemp -d) && mkdir rootfs
curl -sS http://cdimage.ubuntu.com/ubuntu-base/releases/16.04/release/ubuntu-base-16.04-core-amd64.tar.gz | tar --exclude 'dev/*' -C rootfs -xz
nvidia-container-runtime spec
sed -i 's;"sh";"nvidia-smi";' config.json
sed -i 's;("TERM=xterm");\1, "NVIDIA_VISIBLE_DEVICES=0";' config.json
sudo nvidia-container-runtime run nvidia_smi

The error log:

container_linux.go:345: starting container process caused "process_linux.go:424: container init caused "process_linux.go:407: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=0 --utility --pid=11009 /tmp/tmp.1fN7JGkKQQ/rootfs]\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\n\"""

This error log is very similar to the one I got when I ran the nvidia-docker2 test command docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi, so I guess my problem comes from nvidia-container-runtime:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused "process_linux.go:407: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=10900 /var/lib/docker/aufs/mnt/51dfbfd55f1095ff103cf334aca6459cf1e10a4152494b7e991939fe63b463e2]\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\n\""": unknown.

How can I fix it?

I provide my system information below:
Kernel version from uname -a:

Linux josper-ThinkPad-T460p 4.15.0-47-generic #50~16.04.1-Ubuntu SMP Fri Mar 15 16:06:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Driver information from nvidia-smi -a:

==============NVSMI LOG==============

Timestamp : Thu Apr 18 11:13:20 2019
Driver Version : 418.40.04
CUDA Version : 10.1

Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : GeForce 940MX
Product Brand : GeForce

Docker version from docker version:

Client:
Version: 18.09.5
API version: 1.39
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:44:24 2019
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 18.09.5
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:10:53 2019
OS/Arch: linux/amd64
Experimental: false

Nvidia packages version from dpkg -l '*nvidia*':

NVIDIA container library version from nvidia-container-cli -V:

Nvidia-docker version from nvidia-docker --version:

Docker version 18.09.5, build e8ff056

Failed to do Docker Engine Setup

Hello, I am working on Ubuntu 16.04 with Docker 17.12.1-1. I had already installed nvidia-docker2 before following your post.

I ran the usage example without problems. I then edited the conf file as described in Docker Engine Setup, and after restarting Docker I got the following error:

Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

This is the output of systemctl status docker.service:

● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─override.conf
Active: inactive (dead) (Result: exit-code) since Mon 2018-03-05 17:55:44 CET; 17s ago
Docs: https://docs.docker.com
Process: 25885 ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime (code=exited, status=1/FAILURE)
Main PID: 25885 (code=exited, status=1/FAILURE)

Mar 05 17:55:44 svstation systemd[1]: Failed to start Docker Application Container Engine.
Mar 05 17:55:44 svstation systemd[1]: docker.service: Unit entered failed state.
Mar 05 17:55:44 svstation systemd[1]: docker.service: Failed with result 'exit-code'.
Mar 05 17:55:44 svstation systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Mar 05 17:55:44 svstation systemd[1]: Stopped Docker Application Container Engine.
Mar 05 17:55:44 svstation systemd[1]: docker.service: Start request repeated too quickly.
Mar 05 17:55:44 svstation systemd[1]: Failed to start Docker Application Container Engine.

And this one for journalctl -xe:

-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.service has failed.

-- The result is failed.
Mar 05 18:04:35 svstation systemd[1]: docker.service: Unit entered failed state.
Mar 05 18:04:35 svstation systemd[1]: docker.service: Failed with result 'exit-code'.
Mar 05 18:04:36 svstation systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Mar 05 18:04:36 svstation systemd[1]: Stopped Docker Application Container Engine.
-- Subject: Unit docker.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.service has finished shutting down.
Mar 05 18:04:36 svstation systemd[1]: Closed Docker Socket for the API.
-- Subject: Unit docker.socket has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.socket has finished shutting down.
Mar 05 18:04:36 svstation systemd[1]: Stopping Docker Socket for the API.
-- Subject: Unit docker.socket has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.socket has begun shutting down.
Mar 05 18:04:36 svstation systemd[1]: Starting Docker Socket for the API.
-- Subject: Unit docker.socket has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.socket has begun starting up.
Mar 05 18:04:36 svstation systemd[1]: Listening on Docker Socket for the API.
-- Subject: Unit docker.socket has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.socket has finished starting up.

-- The start-up result is done.
Mar 05 18:04:36 svstation systemd[1]: docker.service: Start request repeated too quickly.
Mar 05 18:04:36 svstation systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit docker.service has failed.

-- The result is failed.
Mar 05 18:04:36 svstation systemd[1]: docker.socket: Unit entered failed state.
Mar 05 18:05:04 svstation sudo[26035]: mqg : TTY=pts/4 ; PWD=/ ; USER=root ; COMMAND=/bin/journalctl -xe
Mar 05 18:05:04 svstation sudo[26035]: pam_unix(sudo:session): session opened for user root by (uid=0)

Thanks in advance, best regards.

Marcos.

Allow to mount nvidia/cuda driver into containers on non-GPU machines

We have a mix of GPU and non-GPU hosts; the nvidia driver and nvidia-container-runtime are installed everywhere for simplicity. We're trying to run tensorflow-gpu, which is statically linked to libcuda.so.1, on non-GPU hosts as well, but it fails on import with ImportError: libcuda.so.1: cannot open shared object file: No such file or directory, so we'd like the nvidia/cuda driver to be mounted inside the container.

docker run --rm -ti -e NVIDIA_VISIBLE_DEVICES=void -e NVIDIA_DRIVER_CAPABILITIES=compute,utility debian bash
root@1a71d40fb5d0:/# find /usr -name "libcuda*"

docker run --rm -ti -e NVIDIA_VISIBLE_DEVICES=none -e NVIDIA_DRIVER_CAPABILITIES=compute,utility debian bash
docker: Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/opt/nvidia-container-runtime/bin/nvidia-container-cli --debug=/var/log/nvidia-container-runtime-hook.log --ldcache=/etc/ld.so.cache configure --ldconfig=@/sbin/ldconfig --compute --utility --pid=11455 /var/lib/docker/overlay2/88c3c96d9df17b257298cfe290b8f59a3477a03b18362f3eafaad26ef8f2f1d3/merged]\\\\nnvidia-container-cli: initialization error: cuda error: unknown error\\\\n\\\"\"".




cat /var/log/nvidia-container-runtime-hook.log

-- WARNING, the following logs are for debugging purposes only --

I0808 14:53:35.970880 14062 nvc.c:281] initializing library context (version=1.0.0, build=e3a2035da5a44b8a83d9568b91a8a0b542ee15d5)
I0808 14:53:35.970944 14062 nvc.c:255] using root /
I0808 14:53:35.970951 14062 nvc.c:256] using ldcache /etc/ld.so.cache
I0808 14:53:35.970956 14062 nvc.c:257] using unprivileged user 65534:65534
I0808 14:53:35.971168 14068 driver.c:136] starting driver service
I0808 14:53:35.975580 14062 driver.c:228] driver service terminated with signal 15

The last command works on hosts with GPU/driver loaded.

docker run --rm -ti -e NVIDIA_VISIBLE_DEVICES=none -e NVIDIA_DRIVER_CAPABILITIES=compute,utility debian bash
root@5c304d8b9dd6:/# find /usr -name "libcuda*"
/usr/lib/x86_64-linux-gnu/libcuda.so.396.37
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so

Is there a way to get it working on non-GPU hosts?

Add Ubuntu 18.04 target

SELinux Module for NVIDIA containers

When we run NVIDIA containers on a SELinux enabled distribution we need a separate SELinux module to run the container contained. Without a SELinux module we have to run the container privileged as this is the only way to allow specific SELinux contexts to interact (read, write, chattr, ...) with the files mounted into the container.

A container running privileged will get the spc_t label that is allowed to rw, chattr of base types. The base types (device_t, bin_t, proc_t, ...) are introduced by the bind mounts of the hook. A bind mount cannot have two different SELinux contexts as SELinux operates on inode level.

I have created the following SELinux nvidia-container.te that works with podman/cri-o/docker.

A prerequisite for the SELinux module to work correctly is that the labels are correct for the mounted files. Therefore I have added an additional line to the oci-nvidia-hook that runs:

nvidia-container-cli -k list | restorecon -v -f -

With this, every time a container is started, the files to be mounted will have the correct SELinux label and the SELinux module will work.

Now I can run NVIDIA containers without --privileged, with --cap-drop=ALL and --security-opt=no-new-privileges.

podman run  --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
docker run  --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

podman run  --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
docker run  --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

Issue installing nvidia-container-runtime

$ sudo apt-get install nvidia-container-runtime
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package nvidia-container-runtime is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
nvidia-container-runtime-hook

E: Package 'nvidia-container-runtime' has no installation candidate

Ubuntu 14.04

Issues installing nvidia-container-runtime

I am trying to install the last dependency for nvidia-docker 2, which is nvidia-container-runtime. I followed the steps for an Ubuntu install, and after I call "sudo apt-get install nvidia-container-runtime" this is the error I get:

Reading package lists... Done
Building dependency tree
Reading state information... Done
Package nvidia-container-runtime is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'nvidia-container-runtime' has no installation candidate

Has anyone encountered this and figured out what went wrong?

Arm64 support?

Hi,
When will arm64 support be included?

Best
Giuseppe

Manual Installation Instructions (Other Linux like CoreOS)

Hi,

I want to install nvidia-docker2 in CoreOS, but there exists no package for it. I'm prepared to follow some manual installation/provisioning steps but unable to find a source to follow. Can someone please help?

From what I pieced together until now, nvidia-docker2 is simply packaging which installs the nvidia-container-runtime and adds a docker daemon drop-in instructing to add the nvidia runtime to the daemon. Dropping a daemon manually won't be a problem, but I cannot find an nvidia-container-runtime binary for CoreOS or instructions to build it. (hence I opened this ticket in this repo).

Can you please help with instructions to install in arbitrary Linux distributions? Is the above understanding correct?

Need Nvidia docker 2 for 18.09.1

Docker CE 18.09.1 has been released; please support it in nvidia-container-runtime. Thanks.

RPM repository is unavailable

The RPM repository identified at NVIDIA-docker (CentOS 7 configuration) points to NVIDIA Container Runtime. Unfortunately, no repository data comes back when I add that to my YUM configuration.

If I try to curl the URL I would normally expect to see repository contents but instead I get a 301 (Moved Permanently) showing a Location header that points to the same path with the HTTP protocol. Querying for the HTTP (not HTTPS) path shows a 301 redirect back to the HTTPS path.

How or from where should I get the nvidia-container-runtime RPMs?

HTTPS query

curl -i https://nvidia.github.io/nvidia-container-runtime/centos7/x86_64
HTTP/1.0 200 Connection established

HTTP/1.1 301 Moved Permanently
Server: GitHub.com
Content-Type: text/html
Location: http://nvidia.github.io/nvidia-container-runtime/centos7/x86_64/
Access-Control-Allow-Origin: *
Expires: Tue, 31 Oct 2017 13:58:48 GMT
Cache-Control: max-age=600
X-GitHub-Request-Id: E052:327E:D6217C2:1356DBD7:59F87F3F
Content-Length: 178
Accept-Ranges: bytes
Date: Tue, 31 Oct 2017 13:48:48 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-iad2123-IAD
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1509457728.107603,VS0,VE6
Vary: Accept-Encoding
X-Fastly-Request-ID: b07db5257415504b3d8ff5838ab4a6e01e21b2fc

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

HTTP query

curl -i http://nvidia.github.io/nvidia-container-runtime/centos7/x86_64
HTTP/1.1 301 Moved Permanently
Server: GitHub.com
Content-Type: text/html
Location: https://nvidia.github.io/nvidia-container-runtime/centos7/x86_64
X-GitHub-Request-Id: 1F1A:7A69:12EDC0A8:1B64C01A:59F87C91
Content-Length: 178
Accept-Ranges: bytes
Date: Tue, 31 Oct 2017 13:51:55 GMT
Via: 1.1 varnish
Age: 872
X-Served-By: cache-iad2643-IAD
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1509457916.920828,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: 65c6947865bb90f76f9d0ffb85d1e1e06f62c8f9
Proxy-Connection: Keep-Alive

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

Plugin requirements

Does this plugin function due to a limitation of runC or containerd settings? Is it possible to configure the prestart hooks at the containerd daemon settings layer instead of replacing runC with this plugin?

can't start docker due to nvidia-container-runtime

I am having no luck getting docker-ce to run on my Ubuntu machine. Every time I try to start the Docker daemon, it fails and prints the command it is executing:

Process: 16402 ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime (code=exite
Main PID: 14715 (code=exited, status=1/FAILURE)

There is something wrong with this picture. First of all, I removed all nvidia-docker related packages, and /usr/bin/nvidia-container-runtime doesn't even exist. Why is dockerd trying to add this as a runtime?

I am guessing this is the problem right now. What do I do? Do I find some configuration file and remove this --add-runtime parameter? Why is docker automatically adding it?
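
One place such a flag can come from is a systemd drop-in: the systemctl status output in the "Failed to do Docker Engine Setup" report above lists /etc/systemd/system/docker.service.d/override.conf for the same ExecStart line. Whether that is the cause here is an assumption. The helper below is a minimal sketch with a hypothetical name, find_add_runtime_dropins, that lists drop-in files mentioning --add-runtime.

```shell
#!/bin/sh
# Hypothetical helper: list files under a systemd drop-in directory that
# mention --add-runtime. The directory defaults to the docker.service
# drop-in path; nothing is printed if the directory does not exist.
find_add_runtime_dropins() {
    dropin_dir=${1:-/etc/systemd/system/docker.service.d}
    [ -d "$dropin_dir" ] || return 0
    # -e keeps grep from treating the pattern as an option;
    # "|| true" hides the non-zero status grep returns when nothing matches.
    grep -l -r -e '--add-runtime' "$dropin_dir" 2>/dev/null || true
}
```

If a file is reported and then edited or removed, systemd needs a systemctl daemon-reload before the Docker service is started again.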

v2.0 GPU isolation does not appear to function when run from a properly isolated cgroup

Hello,
In version 1.0, if we started nvidia-docker from a cgroup that limited the devices and had isolation set up correctly, the nvidia-docker run would start with appropriate isolation.
With version 2.0, this is no longer the case, and unless we set NVIDIA_VISIBLE_DEVICES, the container has access to all devices.

What the system looks like outside of a cgroup:

[dockeradm@centos7-gpu2 ~]$ env | grep VISIBLE
[dockeradm@centos7-gpu2 ~]$ nvidia-smi -L
GPU 0: Tesla M60 (UUID: GPU-abf1c81d-476a-7160-9697-61502dfaf622)
GPU 1: Tesla M60 (UUID: GPU-f47d176f-d1f2-f607-c5b4-bd73205b2c44)

In version 2.0, running nvidia-docker directly while inside an isolated cgroup results in no isolation:

[dockeradm@centos7-gpu2 ~]$ nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:20:16 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:58 2018
  OS/Arch:      linux/amd64
  Experimental: false
[dockeradm@centos7-gpu2 ~]$ cat /sys/fs/cgroup/devices/pbspro.slice/pbspro-36.centos7\\x2dgpu2.slice/devices.list
b *:* rwm
c 10:* rwm
c 4:* rwm
c 5:* rwm
c 1:* rwm
c 7:* rwm
c 247:0 rwm
c 195:255 rwm
c 195:1 rwm
[dockeradm@centos7-gpu2 ~]$ env | grep VISIBLE
CUDA_VISIBLE_DEVICES=1
[dockeradm@centos7-gpu2 ~]$ nvidia-smi -L
GPU 0: Tesla M60 (UUID: GPU-abf1c81d-476a-7160-9697-61502dfaf622)
[dockeradm@centos7-gpu2 ~]$ nvidia-docker run --rm -ti nvidia/cuda              
root@90133636645b:/# env | grep VISIBLE
NVIDIA_VISIBLE_DEVICES=all
root@90133636645b:/# nvidia-smi -L
GPU 0: Tesla M60 (UUID: GPU-abf1c81d-476a-7160-9697-61502dfaf622)
GPU 1: Tesla M60 (UUID: GPU-f47d176f-d1f2-f607-c5b4-bd73205b2c44)
root@90133636645b:/# exit

This works properly on version 1.0 from our DGX-1 system:

jnewman@trdgx1:~$ nvidia-docker version
NVIDIA Docker: 1.0.1

Client:
 Version:       17.12.1-ce
 API version:   1.35
 Go version:    go1.9.4
 Git commit:    7390fc6
 Built: Tue Feb 27 22:17:40 2018
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.1-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   7390fc6
  Built:        Tue Feb 27 22:16:13 2018
  OS/Arch:      linux/amd64
  Experimental: false
jnewman@trdgx1:~$ cat /sys/fs/cgroup/devices/pbspro.slice/pbspro-$PBS_JOBID.slice/devices.list
b *:* rwm
c 10:* rwm
c 4:* rwm
c 5:* rwm
c 1:* rwm
c 7:* rwm
c 240:0 rwm
c 195:255 rwm
c 195:4 rwm
c 195:5 rwm
c 195:6 rwm
c 195:7 rwm
jnewman@trdgx1:~$ env | grep VISIBLE
CUDA_VISIBLE_DEVICES=4,5,6,7
jnewman@trdgx1:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-e0d96df6-ccb9-84f4-c4c8-9be64c6f3d73)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-1efc1dd6-d297-5e1d-a27d-e6a51ee44256)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-fc06a698-e576-7fe1-cf10-5229fc900c6b)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-e5468d8d-c9da-3e14-a9cc-14df9cad0a96)
jnewman@trdgx1:~$ nvidia-docker run --rm -ti nvidia/cuda
root@f75d480c607d:/# env | grep VISIBLE
root@f75d480c607d:/# nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-e0d96df6-ccb9-84f4-c4c8-9be64c6f3d73)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-1efc1dd6-d297-5e1d-a27d-e6a51ee44256)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-fc06a698-e576-7fe1-cf10-5229fc900c6b)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-e5468d8d-c9da-3e14-a9cc-14df9cad0a96)
root@f75d480c607d:/# exit
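
Since version 2.0 only restricts the container when NVIDIA_VISIBLE_DEVICES is set, one workaround is to derive that value from the cgroup's devices.list before starting the container. The helper below is a minimal sketch with a hypothetical name, nvidia_minors_from_devices_list. It assumes that the minor number of an allowed NVIDIA character device (major 195) can be used as the device index in NVIDIA_VISIBLE_DEVICES; that matches the CUDA_VISIBLE_DEVICES values in the transcripts above but should be verified with nvidia-smi on the target system. Wildcard entries such as 195:* are ignored.

```shell
#!/bin/sh
# Hypothetical helper: read a cgroup v1 devices.list on stdin and print the
# minor numbers of the allowed NVIDIA GPU character devices (major 195) as a
# comma-separated list. Minor 255 is skipped because it is the control
# device (nvidiactl), not a GPU.
nvidia_minors_from_devices_list() {
    awk '$1 == "c" {
        split($2, id, ":")
        if (id[1] == "195" && id[2] != "255" && id[2] != "*") {
            out = (out == "" ? id[2] : out "," id[2])
        }
    } END { print out }'
}
```

Fed the first devices.list above, it prints 1; fed the second, it prints 4,5,6,7. The result can then be passed to the container with -e NVIDIA_VISIBLE_DEVICES=... on the run command.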

reference is not a tree while building master for centos7

make centos7

Step 8/20 : RUN git clone https://github.com/opencontainers/runc.git .
 ---> Using cache
 ---> ea0d355579a8
Step 9/20 : ARG PKG_VERS
 ---> Using cache
 ---> d4e8f2c1ab0d
Step 10/20 : ARG PKG_REV
 ---> Using cache
 ---> f2156e75a9ee
Step 11/20 : ENV VERSION $PKG_VERS
 ---> Using cache
 ---> 7be22e4ab3f9
Step 12/20 : ENV RELEASE $PKG_REV
 ---> Using cache
 ---> b97377ee5d3a
Step 13/20 : ENV DIST_DIR=/tmp/nvidia-container-runtime-$PKG_VERS/SOURCES
 ---> Using cache
 ---> 56440d901a67
Step 14/20 : RUN mkdir -p $DIST_DIR /dist
 ---> Using cache
 ---> 836a8916d472
Step 15/20 : ARG RUNC_COMMIT
 ---> Using cache
 ---> 017a8a8f75eb
Step 16/20 : COPY runc/$RUNC_COMMIT/ /tmp/patches/runc
 ---> Using cache
 ---> e855da2586dc
Step 17/20 : RUN git checkout $RUNC_COMMIT &&     git apply /tmp/patches/runc/* &&     if [ -f vendor.conf ]; then vndr; fi &&     make BUILDTAGS="seccomp selinux" &&     mv runc $DIST_DIR/nvidia-container-runtime
 ---> Running in a4a66dd2e2be
fatal: reference is not a tree: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
The command '/bin/sh -c git checkout $RUNC_COMMIT &&     git apply /tmp/patches/runc/* &&     if [ -f vendor.conf ]; then vndr; fi &&     make BUILDTAGS="seccomp selinux" &&     mv runc $DIST_DIR/nvidia-container-runtime' returned a non-zero code: 128
make[1]: *** [17.06.2-centos7] Error 128
make[1]: Leaving directory `/home/opc/nvidia/nvidia-container-runtime/runtime'
make: *** [runtime-centos7] Error 2

ubuntu18.04/arm64 Release Not Found

Just to be clear here. The Transfer Learning Toolkit on the Jetson Nano requires the nvidia-docker and the nvidia-container-runtime 2.0 be installed. The Jetson Nano is an arm64 based architecture. So this means the Transfer Learning Toolkit cannot be installed or run on the Jetson Nano?

Installation instructions invalid on Linux Mint 18.3

Problem lies in the repository configuration:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

This results in a value of linuxmint18.3, which is probably not what was intended. When a github page 404s out like this, the result is a ton of garbage written to nvidia-container-runtime.list:

$ cat /etc/apt/sources.list.d/nvidia-container-runtime.list

...
4OkaxXCp+7drdDBCAdubm6eidX+2WwqT5komwh4YQLk+H4aE93h8Xg2gvHekQZOGSgLZTLyDTLJ4Lx9/KZWKBSainT4Iy3FqQBfnUZR42PKQFksBr9QKVXCPusD3OiA/RkQ5kP8qV/Jl1WywAp/6+dcmPM2zL1UrUahe4JqfnWWKXIul3uUbfP8njAFLW1OFr3gdFtZ72cNH+PtQT7/brW+NXqJAHh0y9V8/U/A1U7AfwIMAD7mS3pCbuWJAAAAAElFTkSuQmCC">
      </a>
    </div>
  </body>
</html>

The contents of /etc/os-release are as follows:

$ cat /etc/os-release
NAME="Linux Mint"
VERSION="18.3 (Sylvia)"
ID=linuxmint
ID_LIKE=ubuntu
PRETTY_NAME="Linux Mint 18.3"
VERSION_ID="18.3"
HOME_URL="http://www.linuxmint.com/"
SUPPORT_URL="http://forums.linuxmint.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/linuxmint/"
VERSION_CODENAME=sylvia
UBUNTU_CODENAME=xenial

The UBUNTU_CODENAME=xenial is what we want, which maps to ubuntu16.04.

In lieu of supporting every distribution under the sun, maybe the installation instructions could add one line to check the distribution variable for sanity, and print something if it looks strange?

Alternatively, check the success of the curl command, and use that to print an error.
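
The sanity check suggested above could look roughly like this. It is a minimal sketch: the function names is_known_distribution and resolve_distribution are hypothetical, and the list of known values is illustrative only (taken from distributions mentioned in these issues), not the authoritative set of supported distributions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the distribution string is in a list of
# known values. The list is illustrative, not authoritative.
is_known_distribution() {
    case "$1" in
        ubuntu14.04|ubuntu16.04|ubuntu18.04|centos7) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: given ID, VERSION_ID and UBUNTU_CODENAME from
# /etc/os-release, print a distribution string. Ubuntu derivatives are
# mapped to their base release through the codename. Returns non-zero and
# prints a message to stderr if the result still looks strange.
resolve_distribution() {
    id=$1
    version_id=$2
    ubuntu_codename=$3
    candidate="$id$version_id"
    if ! is_known_distribution "$candidate"; then
        case "$ubuntu_codename" in
            trusty) candidate=ubuntu14.04 ;;
            xenial) candidate=ubuntu16.04 ;;
            bionic) candidate=ubuntu18.04 ;;
        esac
    fi
    if is_known_distribution "$candidate"; then
        echo "$candidate"
    else
        echo "unrecognized distribution: $id$version_id" >&2
        return 1
    fi
}
```

With the values from the /etc/os-release shown above, resolve_distribution linuxmint 18.3 xenial prints ubuntu16.04. For the second suggestion, curl's -f (--fail) option makes curl exit non-zero on an HTTP error instead of writing the error page to the output.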

Ubuntu18.04/amd64 Release Not Found

https://nvidia.github.io/libnvidia-container/ubuntu18.04/libnvidia-container.list is listing deb https://nvidia.github.io/libnvidia-container/ubuntu18.04/$(ARCH) /. This is all fine but https://nvidia.github.io/libnvidia-container/ubuntu18.04/amd64 reports # Unsupported distribution! # Check https://nvidia.github.io/libnvidia-container instead of a list of packages.

