Comments (23)
I finally was able to make it work; however, I got this error:
So now I am pretty sure it can't be run with CRI-O at the moment. Is there any workaround to make it happen?
from gpu-manager.
gpu-manager wants to connect to Docker to find some information needed to recover topology and usage.
from gpu-manager.
We released a new version which supports the CRI interface. Welcome to have a try.
from gpu-manager.
@mYmNeo
When trying to run gpu-manager with CRI-O as the default runtime I get this error:
Any details on how to fix it?
from gpu-manager.
This error means that gpu-manager doesn't detect a GPU card on your machine. Did you install the driver on your node?
from gpu-manager.
@mYmNeo
Thanks for the response.
Pretty sure I do.
This is what I configured when setting up the NVIDIA Docker runtime hook:
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/local/nvidia/toolkit/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/nvidia/toolkit"
    ]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}
I can run nvidia-smi from inside a simple Jupyter notebook container.
After removing the hook file and running the gpu-manager DaemonSet I get the error I posted. I also tried to add the path:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/nvidia/toolkit
as an environment variable to gpu-manager, and got the same error.
from gpu-manager.
@mYmNeo
I've solved the problem; I had to change:
- name: usr-directory
  hostPath:
    type: Directory
    path: /usr
to:
- name: usr-directory
  hostPath:
    type: Directory
    path: /run/nvidia/driver/usr
However, I came across a new problem when trying to observe the gpu-manager metrics.
I get this error:
E0625 11:52:57.438606 127961 runtime.go:110] can't read /sys/fs/cgroup/memory/kubepods/besteffort/pod6f88cfc6-a9b2-4d51-add4-35a588e4990c/6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1/cgroup.procs, open /sys/fs/cgroup/memory/kubepods/besteffort/pod6f88cfc6-a9b2-4d51-add4-35a588e4990c/6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1/cgroup.procs: no such file or directory
Looking at my host, I saw that the path gpu-manager is looking for is different from the actual one:
/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs
is there any quick fix?
from gpu-manager.
I guess I found the "solution" by setting --cgroup-driver as mentioned in the FAQ.
However, I still get an error message, since the path gpu-manager builds does not match the real one:
E0625 12:28:22.892393 358033 runtime.go:110] can't read /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/cri-o-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs, open /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/cri-o-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs: no such file or directory
The right location is:
/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs
from gpu-manager.
After fixing this line:
gpu-manager/pkg/runtime/runtime.go
Line 146 in 4701c60
to:
return fmt.Sprintf("%s/%s-%s.scope", cgroupName.ToSystemd(), "crio-conmon", containerID), nil
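For context, here is a standalone sketch of what the patched line produces for the pod from my error messages. systemdSlicePath is a simplified stand-in for the kubelet's cgroupName.ToSystemd() helper (the real one handles nesting generically); only the scope-name construction mirrors the patched line:

```go
package main

import (
	"fmt"
	"strings"
)

// systemdSlicePath mimics what cgroupName.ToSystemd() yields for a
// besteffort pod: systemd slice names escape "-" in the pod UID to "_".
func systemdSlicePath(podUID string) string {
	uid := strings.ReplaceAll(podUID, "-", "_")
	return fmt.Sprintf("/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod%s.slice", uid)
}

// conmonScope builds the scope path with the "crio-conmon" prefix,
// matching the patched fmt.Sprintf in runtime.go.
func conmonScope(podUID, containerID string) string {
	return fmt.Sprintf("%s/%s-%s.scope", systemdSlicePath(podUID), "crio-conmon", containerID)
}

func main() {
	fmt.Println(conmonScope(
		"6f88cfc6-a9b2-4d51-add4-35a588e4990c",
		"6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1"))
}
```

The output matches the cgroup path I see on the host (minus the /sys/fs/cgroup/memory mount point and the trailing cgroup.procs).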
the issue is fixed, but after looking at the metrics I realized my pods don't see any GPUs.
I verified this using Python and TensorFlow:
Do you have any idea how to fix it?
@mYmNeo
p.s
sorry for all the messages :(
from gpu-manager.
/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs
Is your CRI-O running with a systemd scope named crio-conmon?
from gpu-manager.
What's your pod YAML? The metrics will not report data points if utilization is 0.
from gpu-manager.
@mYmNeo
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
    env:
    - name: LOGGER_LEVEL
      value: "5"
    ports:
    - containerPort: 8888
      name: notebook
After more debugging I found myself editing more of the gpu-manager source code in order to make it fit my odd use case.
Since my NVIDIA drivers are located in a different place than usual, I needed to modify a few paths in the gpu-manager code, for example:
Original:
const (
	NvidiaCtlDevice    = "/dev/nvidiactl"
	NvidiaUVMDevice    = "/dev/nvidia-uvm"
	NvidiaFullpathRE   = `^/dev/nvidia([0-9]*)$`
	NvidiaDevicePrefix = "/dev/nvidia"
)
My version:
const (
	NvidiaCtlDevice    = "/run/nvidia/driver/dev/nvidiactl"
	NvidiaUVMDevice    = "/run/nvidia/driver/dev/nvidia-uvm"
	NvidiaFullpathRE   = `^/run/nvidia/driver/dev/nvidia([0-9]*)$`
	NvidiaDevicePrefix = "/run/nvidia/driver/dev/nvidia"
)
I also edited the LD_LIBRARY_PATH.
After doing so, my pod managed to see the GPU device and use it. However, I realized there was no enforcement of the memory limit.
I added the env variable LOGGER_LEVEL=5 to my pod, as you can see in the YAML file, to try to debug the vcuda-controller, but there were no logs from it. As far as I understand, the vcuda-controller is triggered by a hook into the CUDA libraries, so there are a few questions I want to ask to locate my problem:
- How can I verify the vcuda-controller is present in my pod?
- How does the vcuda-controller end up present in my pod?
- How do I make TensorFlow, for example, use the vcuda-controller libraries?
- How does the vcuda-controller work?
My assumptions:
- The vcuda-controller is not present in my pod
- Since I was editing all the paths, I forgot something, and the TensorFlow app is using different libraries rather than the vcuda-controller
from gpu-manager.
I don't know why your NVIDIA libraries are located in the tmpfs directory /run. The gpu-manager tries to find the NVIDIA libraries in the directory named /usr/local/host mounted in the gpu-manager pod. After that, gpu-manager will report the mirror libraries in the info log. Since you changed the source code, you need to check whether gpu-manager has detected and copied the libraries.
For 1 and 2: if gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver.
For 3: gpu-manager sets LD_LIBRARY_PATH to load the vcuda-controller library; since you changed the code, you need to guarantee the correct library path.
For 4: the README of the vcuda-controller project (https://github.com/tkestack/vcuda-controller) links to the paper we released on how vcuda-controller works.
from gpu-manager.
@mYmNeo Thank you for the reply!
A few things I didn't understand from your answer:
/usr/local/host in the gpu-manager pod. After that, gpu-manager will report mirror libraries in the info log. Since you changed the source code, you need to check whether the gpu-manager has detected and copied the libraries.
Should this copy take place only once, or every time a new pod that asks for a GPU starts?
When the gpu-manager pod is starting, I indeed see from the logs that files are being copied from /usr/local/host to the path mentioned in this line:
gpu-manager/build/copy-bin-lib.sh
Line 10 in 4701c60
For 1,2, if the gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver.
On the host? In the gpu-manager pod? In the app pod?
Since I think it only makes sense for it to be on the host, I don't understand how it's going to end up in my app pod.
For 3, the gpu-manager set LD_LIBRARY_PATH to load vcuda-controller library, since you changed the code, you need to guarantee the correct library path.
Should LD_LIBRARY_PATH point to the location specified in this line?
gpu-manager/build/copy-bin-lib.sh
Line 10 in 4701c60
from gpu-manager.
@mYmNeo
Is there any documentation or low-level design document you can share?
I've read your paper; it was a good high-level overview, but it is not enough to understand what is going on in your code.
from gpu-manager.
/etc/gpu-manager/vdriver can be found both in the gpu-manager pod and on your host, so in your case libcuda-control.so should be found in the sub-directory nvidia of /etc/gpu-manager/vdriver. If not, that means gpu-manager hasn't located the necessary libraries.
Before your application pod runs, gpu-manager will bind mount the driver directory located in either /etc/gpu-manager/vdriver/nvidia (for a fraction request) or /etc/gpu-manager/vdriver/origin. Since your application pod uses a fraction request, you should find the necessary libraries in /usr/local/nvidia/lib64 or /usr/local/nvidia/lib of your application pod. After that, you have to check whether you have overridden the default LD_LIBRARY_PATH environment variable which gpu-manager sets for your application. In your situation, before your app runs, the wrapped library libcuda-control.so tries to register some information with gpu-manager so that the control system can run correctly. To see the libcuda-control.so log, set the environment variable LOGGER_LEVEL=5 before launching your app.
Given your description that no log is found after setting LOGGER_LEVEL=5, I think the problem may be that your app loaded libraries from another path, not the ones gpu-manager provides.
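One way to check which libcuda a running process actually mapped, e.g. from inside the app pod, is to scan its /proc/<pid>/maps. This is my own diagnostic sketch, not part of gpu-manager; loadedLibraryPaths is a made-up helper:

```go
package main

import (
	"fmt"
	"strings"
)

// loadedLibraryPaths scans the text of a /proc/<pid>/maps file and returns
// the distinct file paths of mappings whose path contains needle. If the
// control hook is active, "libcuda" should resolve under the gpu-manager
// mount (/usr/local/nvidia/...), not some other driver location.
func loadedLibraryPaths(maps, needle string) []string {
	seen := map[string]bool{}
	var out []string
	for _, line := range strings.Split(maps, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 6 { // mappings without a backing file have no path field
			continue
		}
		path := fields[5]
		if strings.Contains(path, needle) && !seen[path] {
			seen[path] = true
			out = append(out, path)
		}
	}
	return out
}

func main() {
	// In the pod you would read /proc/<pid>/maps; a hardcoded sample here.
	sample := `7f30e0000000-7f30e1000000 r-xp 00000000 08:01 1234 /usr/local/nvidia/lib64/libcuda.so.440.64.00
7f30e2000000-7f30e2100000 r--p 00000000 08:01 5678 /lib/x86_64-linux-gnu/libc.so.6`
	fmt.Println(loadedLibraryPaths(sample, "libcuda"))
}
```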
from gpu-manager.
The vcuda-controller project is very simple, with only a small amount of source code, so no further low-level documents are provided.
from gpu-manager.
@mYmNeo
I am adding some screenshots and logs; hopefully they will guide me to the problem.
gpu-manager logs:
copy /usr/local/host/lib/libnvidia-ml.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ml.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ml.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ml.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ml.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ml.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libcuda.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libcuda.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libcuda.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libcuda.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libcuda.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libcuda.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-fatbinaryloader.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-fatbinaryloader.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-opencl.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opencl.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-opencl.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opencl.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-compiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-compiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libvdpau_nvidia.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/vdpau/libvdpau_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/vdpau/libvdpau_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libvdpau_nvidia.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/vdpau/libvdpau_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/vdpau/libvdpau_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-encode.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-encode.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-encode.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-encode.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-encode.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-encode.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvcuvid.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvcuvid.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvcuvid.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvcuvid.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvcuvid.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvcuvid.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-fbc.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-fbc.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-fbc.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-fbc.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-fbc.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-fbc.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-ifr.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ifr.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ifr.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ifr.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ifr.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ifr.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLX_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLX_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLX_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLX_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libEGL_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libEGL_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libEGL_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libEGL_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLESv2_nvidia.so.2 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLESv2_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLESv2_nvidia.so.2 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLESv2_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLESv1_CM_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLESv1_CM_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLESv1_CM_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLESv1_CM_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-eglcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-eglcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-egl-wayland.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-egl-wayland.so.1.1.4 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-glcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-glcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-tls.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-tls.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-glsi.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-glsi.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /run/nvidia/driver/usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0628 11:55:28.387789 168134 server.go:131] Unable to set Type=notify in systemd service file?
E0628 11:55:31.219430 168134 tree.go:337] No topology level found at 0
GPU pod (TensorFlow) logs:
2020-06-28 11:57:24.301082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-28 11:57:24.343386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.343661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.345313: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.346878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.347176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.348832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.349628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.353395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.356657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.356990: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-28 11:57:24.364042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2593995000 Hz
2020-06-28 11:57:24.364771: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f30e0000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.364796: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-28 11:57:24.478234: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a46e40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.478270: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2020-06-28 11:57:24.479479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.479532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.479547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.479559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.479576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.479588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.479599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.479611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.481683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.481736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.483290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-28 11:57:24.483315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-06-28 11:57:24.483323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-06-28 11:57:24.485578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21397 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0001:00:00.0, compute capability: 6.1)
2020-06-28 11:57:25.656053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
from gpu-manager.
/run/nvidia/driver/usr/local/nvidia/lib copy /usr/local/host/lib/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib copy /usr/local/host/lib64/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib64 copy /usr/local/host/lib64/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64 copy /usr/local/host/lib64/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64 copy /usr/local/host/bin/nvidia-cuda-mps-control to /run/nvidia/driver/usr/local/nvidia/bin/ copy /usr/local/host/bin/nvidia-cuda-mps-server to /run/nvidia/driver/usr/local/nvidia/bin/ copy /usr/local/host/bin/nvidia-debugdump to /run/nvidia/driver/usr/local/nvidia/bin/ copy /usr/local/host/bin/nvidia-persistenced to /run/nvidia/driver/usr/local/nvidia/bin/ copy /usr/local/host/bin/nvidia-smi to /run/nvidia/driver/usr/local/nvidia/bin/ rebuild ldcache launch gpu manager E0628 11:55:28.387789 168134 server.go:131] Unable to set Type=notify in systemd service file? E0628 11:55:31.219430 168134 tree.go:337] No topology level found at 0
gpu pod - tensorflow logs:
2020-06-28 11:57:24.301082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-28 11:57:24.343386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1 coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.343661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.345313: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.346878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.347176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.348832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.349628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.353395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.356657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.356990: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-28 11:57:24.364042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2593995000 Hz
2020-06-28 11:57:24.364771: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f30e0000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.364796: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-28 11:57:24.478234: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a46e40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.478270: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2020-06-28 11:57:24.479479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1 coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.479532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.479547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.479559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.479576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.479588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.479599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.479611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.481683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.481736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.483290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-28 11:57:24.483315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-06-28 11:57:24.483323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-06-28 11:57:24.485578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21397 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0001:00:00.0, compute capability: 6.1)
2020-06-28 11:57:25.656053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
The log you provided is incomplete since you only gave the screen output. The full log can be found at /etc/gpu-manager/logs.
from gpu-manager.
@mYmNeo Thank you for the reply!
There are a few things I didn't understand from your answer:
/usr/local/host in the gpu-manager pod. After that, gpu-manager will report mirror libraries in the info log. Since you changed the source code, you need to check whether the gpu-manager has detected and copied the libraries.
Should this copy take place only once, or every time a new pod that requests a GPU starts?
When the gpu-manager pod is starting, I indeed see from the logs that files are being copied from /usr/local/host to the path mentioned in this line:
gpu-manager/build/copy-bin-lib.sh (line 10 in 4701c60)
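Judging from the startup log above, the mirroring happens once when the gpu-manager daemon starts, not per application pod. A minimal stand-alone sketch of such a mirror step (hypothetical scratch paths under /tmp, not the actual copy-bin-lib.sh or its real /usr/local/host and /run/nvidia/driver locations):

```python
# Hypothetical sketch of a one-time library-mirroring step.
# SRC/DST are scratch stand-ins for the real host and vdriver directories.
import os
import shutil

SRC = "/tmp/host-libs"
DST = "/tmp/vdriver-libs"
os.makedirs(SRC, exist_ok=True)
os.makedirs(DST, exist_ok=True)

# Stand-ins for the driver libraries found on the host.
for name in ("libcuda.so.1", "libnvidia-ml.so.1"):
    open(os.path.join(SRC, name), "w").close()

# Mirror each library once, at startup, logging a "copy X to Y" line
# in the same style as the gpu-manager output above.
for lib in sorted(os.listdir(SRC)):
    shutil.copy(os.path.join(SRC, lib), DST)
    print(f"copy {SRC}/{lib} to {DST}")
```
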
For 1,2, if the gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver.
In the host? gpu-manager pod? app pod?
Since I think it only makes sense for it to be on the host, I don't understand how it is going to end up in my app pod.
For 3, the gpu-manager set LD_LIBRARY_PATH to load vcuda-controller library, since you changed the code, you need to guarantee the correct library path.
The LD_LIBRARY_PATH should be pointing to the location specified in this line?
gpu-manager/build/copy-bin-lib.sh (line 10 in 4701c60)
/etc/gpu-manager/vdriver can be found both in the gpu-manager pod and on your host, so libcuda-control.so should be found in the sub-directory nvidia under /etc/gpu-manager/vdriver in your case; if not, that means gpu-manager hasn't located the necessary libraries.
Before your application pod is running, gpu-manager will bind mount the driver directory, located in either /etc/gpu-manager/vdriver/nvidia for a fraction request or /etc/gpu-manager/vdriver/origin. Since your application pod uses a fraction request, you should find your necessary libraries in /usr/local/nvidia/lib64 or /usr/local/nvidia/lib of your application pod. After that, you have to check whether you have overridden the default LD_LIBRARY_PATH environment which gpu-manager set for your application.
For your situation: before your app is running, the wrapped library libcuda-control.so tries to register some information to gpu-manager so that the control system can run correctly. To see the libcuda-control.so log, set the environment variable LOGGER_LEVEL=5 before launching your app.
As you describe that no log is found after setting LOGGER_LEVEL=5, I think the problem may be that your app loaded libraries from another path, not the ones gpu-manager provides.
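Putting that advice together, a pod spec requesting a GPU fraction with the vcuda debug log enabled might look roughly like this (a hedged sketch: the image name is a placeholder, and the per-unit sizes are inferred from the log below, where each vcuda-memory device ID carries 268435456 bytes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
spec:
  containers:
  - name: tf
    image: tensorflow/tensorflow:2.2.0-gpu   # placeholder image
    env:
    - name: LOGGER_LEVEL                     # libcuda-control.so debug log level
      value: "5"
    resources:
      limits:
        tencent.com/vcuda-core: "10"         # ten vcuda-core units
        tencent.com/vcuda-memory: "10"       # ten units of 256 MiB = 2.5 GiB
```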
@mYmNeo
"before your app is running, the wrapped library libcuda-control.so tries to register some information to gpu-manager"
I think this part is not working for me
from gpu-manager.
[root@ocp4-krpkk-worker-gpu-rnw9j log]# cat gpu-manager.INFO | grep -v "util"
Log file created at: 2020/06/28 11:55:28
Running on machine: gpu-manager-daemonset-pjf5q
Binary: Built with gc go1.14.3 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0628 11:55:28.373296 168134 app.go:87] Wait for internal server ready
I0628 11:55:28.376088 168134 volume.go:133] Find binaries: [/usr/bin/gpu-client]
I0628 11:55:28.376139 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376142 168134 volume.go:139] Find 64bit libraries: [/usr/lib64/libcuda-control.so]
I0628 11:55:28.376891 168134 volume.go:133] Find binaries: []
I0628 11:55:28.376927 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376930 168134 volume.go:139] Find 64bit libraries: []
I0628 11:55:28.376946 168134 volume.go:176] Mirror /usr/bin/gpu-client to /etc/gpu-manager/vdriver/nvidia/bin
I0628 11:55:28.386992 168134 volume.go:176] Mirror /usr/lib64/libcuda-control.so to /etc/gpu-manager/vdriver/nvidia/lib64
I0628 11:55:28.387769 168134 volume.go:152] Volume manager is running
E0628 11:55:28.387789 168134 server.go:131] Unable to set Type=notify in systemd service file?
W0628 11:55:28.388430 168134 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0628 11:55:28.390318 168134 logs.go:79] parsed scheme: ""
I0628 11:55:28.390323 168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:28.390327 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/run/crio/crio.sock 0 <nil>}] <nil>}
I0628 11:55:28.390338 168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:28.390391 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8b0, CONNECTING
I0628 11:55:28.390739 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8b0, READY
I0628 11:55:28.391357 168134 runtime.go:69] Container runtime is cri-o
I0628 11:55:28.391364 168134 server.go:155] Container runtime manager is running
I0628 11:55:28.391442 168134 reflector.go:150] Starting reflector *v1.Pod (1m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
I0628 11:55:28.391452 168134 reflector.go:185] Listing and watching *v1.Pod from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
I0628 11:55:29.374625 168134 app.go:87] Wait for internal server ready
I0628 11:55:29.391501 168134 watchdog.go:64] Pod cache is running
I0628 11:55:31.196446 168134 server.go:158] Watchdog is running
I0628 11:55:31.196455 168134 label.go:102] Labeler for hostname ocp4-krpkk-worker-gpu-rnw9j
I0628 11:55:31.211512 168134 label.go:153] Auto label is running
I0628 11:55:31.211559 168134 manager.go:195] Start vDevice watcher
I0628 11:55:31.211912 168134 manager.go:244] Recover vDevice server for /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:55:31.211929 168134 manager.go:191] Virtual manager is running
I0628 11:55:31.211961 168134 manager.go:269] Starting garbage directory collector
I0628 11:55:31.211987 168134 manager.go:360] Starting process vm events
I0628 11:55:31.213970 168134 tree.go:187] Detect 1 gpu cards
E0628 11:55:31.219430 168134 tree.go:337] No topology level found at 0
I0628 11:55:31.219448 168134 tree.go:340] Only one card topology
I0628 11:55:31.219990 168134 tree.go:119] Update device information
I0628 11:55:31.226564 168134 allocator.go:263] Load extra config from /etc/gpu-manager/extra-config.json
W0628 11:55:31.226608 168134 allocator.go:1209] Failed to read from checkpoint due to key is not found
I0628 11:55:31.226652 168134 allocator.go:618] Pods to be removed: []
I0628 11:55:31.249243 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:55:31.287340 168134 allocator.go:978] failed to get pod 2f7545ea-77d8-4e8e-81ef-9135740843bf from allocatedPod cache
I0628 11:55:31.287343 168134 allocator.go:223] failed to get ready annotations for pod 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:55:31.287387 168134 server.go:182] Starting the GRPC server, driver nvidia, queryPort 9400
I0628 11:55:31.287445 168134 server.go:236] Server tencent.com/vcuda-core is running
I0628 11:55:31.287448 168134 server.go:236] Server tencent.com/vcuda-memory is running
I0628 11:55:31.287498 168134 logs.go:79] parsed scheme: ""
I0628 11:55:31.287503 168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:31.287512 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/run/gpu-manager.sock 0 <nil>}] <nil>}
I0628 11:55:31.287526 168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:31.287546 168134 server.go:250] Server is ready at /var/run/gpu-manager.sock
I0628 11:55:31.287557 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8d0, CONNECTING
I0628 11:55:31.287605 168134 vcore.go:81] Server tencent.com/vcuda-core is ready at /var/lib/kubelet/device-plugins/vcore.sock
I0628 11:55:31.287663 168134 vmemory.go:80] Server tencent.com/vcuda-memory is ready at /var/lib/kubelet/device-plugins/vmemory.sock
I0628 11:55:31.287765 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8d0, READY
I0628 11:55:32.191243 168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.191251 168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.191257 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vcore.sock 0 <nil>}] <nil>}
I0628 11:55:32.191281 168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.191326 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, CONNECTING
I0628 11:55:32.192059 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, READY
I0628 11:55:32.192073 168134 server.go:90] Server /var/lib/kubelet/device-plugins/vcore.sock is ready, readyServers: 1
I0628 11:55:32.192080 168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.192088 168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.192092 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vmemory.sock 0 <nil>}] <nil>}
I0628 11:55:32.192098 168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.192124 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, CONNECTING
I0628 11:55:32.192138 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192206 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192595 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, READY
I0628 11:55:32.192613 168134 server.go:90] Server /var/lib/kubelet/device-plugins/vmemory.sock is ready, readyServers: 2
I0628 11:55:32.192625 168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.192628 168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.192631 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/kubelet.sock 0 <nil>}] <nil>}
I0628 11:55:32.192637 168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.192657 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, CONNECTING
I0628 11:55:32.192669 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192768 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192899 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, READY
I0628 11:55:32.192909 168134 server.go:334] Register to kubelet with endpoint vcore.sock
I0628 11:55:32.194889 168134 server.go:334] Register to kubelet with endpoint vmemory.sock
I0628 11:55:32.195089 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.195323 168134 vcore.go:93] ListAndWatch request for vcore
I0628 11:55:32.195413 168134 vmemory.go:97] ListAndWatch request for vmemory
I0628 11:55:45.591024 168134 vmemory.go:87] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-memory-268435456-69 tencent.com/vcuda-memory-268435456-21 tencent.com/vcuda-memory-268435456-14 tencent.com/vcuda-memory-268435456-26 tencent.com/vcuda-memory-268435456-58 tencent.com/vcuda-memory-268435456-18 tencent.com/vcuda-memory-268435456-45 tencent.com/vcuda-memory-268435456-48 tencent.com/vcuda-memory-268435456-23 tencent.com/vcuda-memory-268435456-39],},},} allocation request for vmemory
I0628 11:55:45.591388 168134 vcore.go:88] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-17 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1 tencent.com/vcuda-core-63],},},} allocation request for vcore
I0628 11:55:45.591426 168134 allocator.go:663] Request GPU device: tencent.com/vcuda-core-48,tencent.com/vcuda-core-58,tencent.com/vcuda-core-96,tencent.com/vcuda-core-87,tencent.com/vcuda-core-17,tencent.com/vcuda-core-99,tencent.com/vcuda-core-82,tencent.com/vcuda-core-66,tencent.com/vcuda-core-1,tencent.com/vcuda-core-63
I0628 11:55:45.617421 168134 allocator.go:1131] candidate pod tf-notebook in ns default with timestamp 1593345345000000000 is found.
I0628 11:55:45.617432 168134 allocator.go:715] Found candidate Pod 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook) with device count 10
I0628 11:55:45.617492 168134 allocator.go:618] Pods to be removed: []
I0628 11:55:45.624573 168134 tree.go:119] Update device information
I0628 11:55:45.631080 168134 allocator.go:375] Tree graph: ROOT:1
|---GPU0 (pids: [], usedMemory: 0, totalMemory: 24032378880, allocatableCores: 100, allocatableMemory: 24032378880)
I0628 11:55:45.631089 168134 allocator.go:386] Try allocate for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), vcore 10, vmemory 2684354560
I0628 11:55:45.631095 168134 share.go:58] Pick up 0 mask 1, cores: 100, memory: 24032378880
I0628 11:55:45.631101 168134 allocator.go:479] Allocate /run/nvidia/driver/dev/nvidia0 for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), Meta (0:0)
I0628 11:55:45.631108 168134 tree.go:491] Occupy /run/nvidia/driver/dev/nvidia0 with 10 2684354560, mask 1
I0628 11:55:45.631111 168134 tree.go:518] Occupy /run/nvidia/driver/dev/nvidia0 parent 1
I0628 11:55:45.631115 168134 tree.go:501] /run/nvidia/driver/dev/nvidia0 cores 100->90
I0628 11:55:45.631119 168134 tree.go:507] /run/nvidia/driver/dev/nvidia0 memory 24032378880->21348024320
I0628 11:55:47.875495 168134 vcore.go:103] PreStartContainer request for vcore
I0628 11:55:47.875514 168134 allocator.go:784] get preStartContainer call from k8s, req: &PreStartContainerRequest{DevicesIDs:[tencent.com/vcuda-core-17 tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-63 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1],}
I0628 11:55:47.875889 168134 manager.go:363] process 3d89b0e4-d5c1-43f4-bcc6-98650521894a
I0628 11:55:47.876058 168134 manager.go:352] Start vDevice server for /etc/gpu-manager/vm/3d89b0e4-d5c1-43f4-bcc6-98650521894a
I0628 11:55:47.889934 168134 vmemory.go:107] PreStartContainer request for vmemory
I0628 11:56:01.287588 168134 allocator.go:204] Checking allocation of pods on this node
W0628 11:56:31.212696 168134 manager.go:290] Find orphaned pod 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:56:31.212700 168134 manager.go:296] Remove directory 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:56:31.287568 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:57:01.287570 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:57:31.212208 168134 manager.go:260] Close orphaned server /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:57:31.287580 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:58:01.287580 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:58:31.287573 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:59:01.287574 168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:59:31.287568 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:00:01.287566 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:00:31.287566 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:01:01.287578 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:01:31.287572 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:02:01.287569 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:02:31.287566 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:03:01.287581 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:03:29.410276 168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 12 items received
I0628 12:03:31.287567 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:04:01.287576 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:04:31.287508 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:05:01.287503 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:05:31.287568 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:06:01.287569 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:06:31.287525 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:07:01.287543 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:07:31.287566 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:08:01.287570 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:08:31.287565 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:09:01.287569 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:09:31.287569 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:10:01.287576 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:10:31.287568 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:11:01.287566 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:11:31.287569 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:12:01.287571 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:12:31.287561 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:13:01.287571 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:13:11.412348 168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 0 items received
I0628 12:13:31.287564 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:14:01.287577 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:14:31.287561 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:15:01.287575 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:15:31.287568 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:16:01.287571 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:16:31.287560 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:17:01.287581 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:17:31.287565 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:18:01.287581 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:18:31.287587 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:19:01.287572 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:19:31.287566 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:20:01.287576 168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:20:31.287563 168134 allocator.go:204] Checking allocation of pods on this node
@mYmNeo
I excluded all the logs from util
from gpu-manager.
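As a sanity check on the allocation numbers in the log above: each tencent.com/vcuda-memory device ID carries 268435456 bytes (256 MiB), so the ten requested units should match the 2684354560-byte vmemory grant and the reported drop in GPU0 memory, and the ten vcuda-core units should match the cores 100->90 line:

```python
# Arithmetic check against the gpu-manager allocation log above.
UNIT_BYTES = 268435456        # 256 MiB, from the vcuda-memory device IDs
requested_units = 10          # ten device IDs in the AllocateRequest

vmemory = requested_units * UNIT_BYTES
print(vmemory)                # the "vmemory 2684354560" figure in the log

total = 24032378880           # GPU0 totalMemory before allocation
print(total - vmemory)        # the "memory 24032378880->21348024320" figure

cores = 100 - requested_units
print(cores)                  # the "cores 100->90" figure
```
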
scheme "" not registered, fallback to default scheme I0628 11:55:32.191257 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vcore.sock 0 <nil>}] <nil>} I0628 11:55:32.191281 168134 logs.go:79] ClientConn switching balancer to "pick_first" I0628 11:55:32.191326 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, CONNECTING I0628 11:55:32.192059 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, READY I0628 11:55:32.192073 168134 server.go:90] Server /var/lib/kubelet/device-plugins/vcore.sock is ready, readyServers: 1 I0628 11:55:32.192080 168134 logs.go:79] parsed scheme: "" I0628 11:55:32.192088 168134 logs.go:79] scheme "" not registered, fallback to default scheme I0628 11:55:32.192092 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vmemory.sock 0 <nil>}] <nil>} I0628 11:55:32.192098 168134 logs.go:79] ClientConn switching balancer to "pick_first" I0628 11:55:32.192124 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, CONNECTING I0628 11:55:32.192138 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing" I0628 11:55:32.192206 168134 logs.go:79] transport: loopyWriter.run returning. 
connection error: desc = "transport is closing" I0628 11:55:32.192595 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, READY I0628 11:55:32.192613 168134 server.go:90] Server /var/lib/kubelet/device-plugins/vmemory.sock is ready, readyServers: 2 I0628 11:55:32.192625 168134 logs.go:79] parsed scheme: "" I0628 11:55:32.192628 168134 logs.go:79] scheme "" not registered, fallback to default scheme I0628 11:55:32.192631 168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/kubelet.sock 0 <nil>}] <nil>} I0628 11:55:32.192637 168134 logs.go:79] ClientConn switching balancer to "pick_first" I0628 11:55:32.192657 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, CONNECTING I0628 11:55:32.192669 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing" I0628 11:55:32.192768 168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing" I0628 11:55:32.192899 168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, READY I0628 11:55:32.192909 168134 server.go:334] Register to kubelet with endpoint vcore.sock I0628 11:55:32.194889 168134 server.go:334] Register to kubelet with endpoint vmemory.sock I0628 11:55:32.195089 168134 logs.go:79] transport: loopyWriter.run returning. 
connection error: desc = "transport is closing" I0628 11:55:32.195323 168134 vcore.go:93] ListAndWatch request for vcore I0628 11:55:32.195413 168134 vmemory.go:97] ListAndWatch request for vmemory I0628 11:55:45.591024 168134 vmemory.go:87] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-memory-268435456-69 tencent.com/vcuda-memory-268435456-21 tencent.com/vcuda-memory-268435456-14 tencent.com/vcuda-memory-268435456-26 tencent.com/vcuda-memory-268435456-58 tencent.com/vcuda-memory-268435456-18 tencent.com/vcuda-memory-268435456-45 tencent.com/vcuda-memory-268435456-48 tencent.com/vcuda-memory-268435456-23 tencent.com/vcuda-memory-268435456-39],},},} allocation request for vmemory I0628 11:55:45.591388 168134 vcore.go:88] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-17 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1 tencent.com/vcuda-core-63],},},} allocation request for vcore I0628 11:55:45.591426 168134 allocator.go:663] Request GPU device: tencent.com/vcuda-core-48,tencent.com/vcuda-core-58,tencent.com/vcuda-core-96,tencent.com/vcuda-core-87,tencent.com/vcuda-core-17,tencent.com/vcuda-core-99,tencent.com/vcuda-core-82,tencent.com/vcuda-core-66,tencent.com/vcuda-core-1,tencent.com/vcuda-core-63 I0628 11:55:45.617421 168134 allocator.go:1131] candidate pod tf-notebook in ns default with timestamp 1593345345000000000 is found. 
I0628 11:55:45.617432 168134 allocator.go:715] Found candidate Pod 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook) with device count 10 I0628 11:55:45.617492 168134 allocator.go:618] Pods to be removed: [] I0628 11:55:45.624573 168134 tree.go:119] Update device information I0628 11:55:45.631080 168134 allocator.go:375] Tree graph: ROOT:1 |---GPU0 (pids: [], usedMemory: 0, totalMemory: 24032378880, allocatableCores: 100, allocatableMemory: 24032378880) I0628 11:55:45.631089 168134 allocator.go:386] Try allocate for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), vcore 10, vmemory 2684354560 I0628 11:55:45.631095 168134 share.go:58] Pick up 0 mask 1, cores: 100, memory: 24032378880 I0628 11:55:45.631101 168134 allocator.go:479] Allocate /run/nvidia/driver/dev/nvidia0 for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), Meta (0:0) I0628 11:55:45.631108 168134 tree.go:491] Occupy /run/nvidia/driver/dev/nvidia0 with 10 2684354560, mask 1 I0628 11:55:45.631111 168134 tree.go:518] Occupy /run/nvidia/driver/dev/nvidia0 parent 1 I0628 11:55:45.631115 168134 tree.go:501] /run/nvidia/driver/dev/nvidia0 cores 100->90 I0628 11:55:45.631119 168134 tree.go:507] /run/nvidia/driver/dev/nvidia0 memory 24032378880->21348024320 I0628 11:55:47.875495 168134 vcore.go:103] PreStartContainer request for vcore I0628 11:55:47.875514 168134 allocator.go:784] get preStartContainer call from k8s, req: &PreStartContainerRequest{DevicesIDs:[tencent.com/vcuda-core-17 tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-63 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1],} I0628 11:55:47.875889 168134 manager.go:363] process 3d89b0e4-d5c1-43f4-bcc6-98650521894a I0628 11:55:47.876058 168134 manager.go:352] Start vDevice server for /etc/gpu-manager/vm/3d89b0e4-d5c1-43f4-bcc6-98650521894a I0628 11:55:47.889934 168134 vmemory.go:107] PreStartContainer request for 
vmemory I0628 11:56:01.287588 168134 allocator.go:204] Checking allocation of pods on this node W0628 11:56:31.212696 168134 manager.go:290] Find orphaned pod 2f7545ea-77d8-4e8e-81ef-9135740843bf I0628 11:56:31.212700 168134 manager.go:296] Remove directory 2f7545ea-77d8-4e8e-81ef-9135740843bf I0628 11:56:31.287568 168134 allocator.go:204] Checking allocation of pods on this node I0628 11:57:01.287570 168134 allocator.go:204] Checking allocation of pods on this node I0628 11:57:31.212208 168134 manager.go:260] Close orphaned server /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf I0628 11:57:31.287580 168134 allocator.go:204] Checking allocation of pods on this node I0628 11:58:01.287580 168134 allocator.go:204] Checking allocation of pods on this node I0628 11:58:31.287573 168134 allocator.go:204] Checking allocation of pods on this node I0628 11:59:01.287574 168134 allocator.go:204] Checking allocation of pods on this node I0628 11:59:31.287568 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:00:01.287566 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:00:31.287566 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:01:01.287578 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:01:31.287572 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:02:01.287569 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:02:31.287566 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:03:01.287581 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:03:29.410276 168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 12 items received I0628 12:03:31.287567 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:04:01.287576 168134 allocator.go:204] Checking allocation of pods on this node I0628 
12:04:31.287508 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:05:01.287503 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:05:31.287568 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:06:01.287569 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:06:31.287525 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:07:01.287543 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:07:31.287566 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:08:01.287570 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:08:31.287565 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:09:01.287569 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:09:31.287569 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:10:01.287576 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:10:31.287568 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:11:01.287566 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:11:31.287569 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:12:01.287571 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:12:31.287561 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:13:01.287571 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:13:11.412348 168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 0 items received I0628 12:13:31.287564 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:14:01.287577 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:14:31.287561 168134 allocator.go:204] Checking allocation of pods on this node 
I0628 12:15:01.287575 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:15:31.287568 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:16:01.287571 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:16:31.287560 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:17:01.287581 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:17:31.287565 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:18:01.287581 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:18:31.287587 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:19:01.287572 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:19:31.287566 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:20:01.287576 168134 allocator.go:204] Checking allocation of pods on this node I0628 12:20:31.287563 168134 allocator.go:204] Checking allocation of pods on this node
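As an aside on the numbers in the allocation above: each tencent.com/vcuda-memory device ID appears to encode a 268435456-byte (256 MiB) block, so the ten device IDs in the AllocateRequest line up with the "vmemory 2684354560" figure, and with GPU0's allocatable memory dropping from 24032378880 to 21348024320 bytes. A quick sketch of that arithmetic (the variable names are illustrative, not from gpu-manager itself):

```shell
# Cross-check the vmemory figures in the gpu-manager log.
block_bytes=268435456   # per-device size encoded in the vcuda-memory device IDs (256 MiB)
num_devices=10          # device IDs in the AllocateRequest
echo $((block_bytes * num_devices))                 # 2684354560, matching "vmemory 2684354560"
echo $((24032378880 - block_bytes * num_devices))   # 21348024320, matching "memory 24032378880->21348024320"
```

The same ratio explains "cores 100->90": ten tencent.com/vcuda-core devices out of 100 per card.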
@mYmNeo
I excluded all the log lines matching "util":
I0628 11:55:28.373296 168134 app.go:87] Wait for internal server ready
I0628 11:55:28.376088 168134 volume.go:133] Find binaries: [/usr/bin/gpu-client]
I0628 11:55:28.376139 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376142 168134 volume.go:139] Find 64bit libraries: [/usr/lib64/libcuda-control.so]
I0628 11:55:28.376891 168134 volume.go:133] Find binaries: []
I0628 11:55:28.376927 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376930 168134 volume.go:139] Find 64bit libraries: []
I0628 11:55:28.376946 168134 volume.go:176] Mirror /usr/bin/gpu-client to /etc/gpu-manager/vdriver/nvidia/bin
I0628 11:55:28.386992 168134 volume.go:176] Mirror /usr/lib64/libcuda-control.so to /etc/gpu-manager/vdriver/nvidia/lib64
The log shows that gpu-manager only detected /usr/lib64/libcuda-control.so and /usr/bin/gpu-client, and copied them into /etc/gpu-manager/vdriver/nvidia/lib64. A correct setup should also mirror several NVIDIA driver libraries there. Since you've changed copy-lib.sh, the ldcache rebuild procedure doesn't pick up your changes.
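One way to check what the volume manager actually mirrored is to list the vdriver directories named in the log above (this loop is just an illustrative sketch; the paths are taken from the log, and on a healthy node lib64 should contain a number of NVIDIA driver libraries rather than only libcuda-control.so):

```shell
# Inspect what gpu-manager's volume manager mirrored into its vdriver tree.
for d in /etc/gpu-manager/vdriver/nvidia/bin /etc/gpu-manager/vdriver/nvidia/lib64; do
  echo "== $d =="
  ls -l "$d" 2>/dev/null || echo "(directory not found on this host)"
done
```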
@mYmNeo is it solvable?