Comments (23)

raz-bn commented on September 17, 2024

I finally was able to make it work; however, I got this error:
image

So now I am pretty sure it can't be run with CRI-O at the moment. Is there any workaround to make it happen?

mYmNeo commented on September 17, 2024

gpu-manager wants to connect to Docker to find the information it needs to recover topology and usage.
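
For example, on a Docker node the socket it talks to exists, while a CRI-O node only has the CRI-O socket (default paths, assuming a standard setup):

ls -l /var/run/docker.sock       # what the old gpu-manager expects to find
ls -l /var/run/crio/crio.sock    # what a CRI-O node actually provides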

mYmNeo commented on September 17, 2024

We released a new version which supports the CRI interface. Feel free to give it a try.

raz-bn commented on September 17, 2024

@mYmNeo
When trying to run gpu-manager with CRI-O as the default runtime, I get this error:
image

Any details on how to fix it?

mYmNeo commented on September 17, 2024

This error means gpu-manager didn't detect a GPU card on your machine. Did you install the NVIDIA driver on your node?
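
A quick way to check is to run something like this on the node itself (assuming the driver is installed in the usual location):

nvidia-smi -L                         # should list every GPU the driver sees
ls -l /dev/nvidiactl /dev/nvidia0     # device nodes created by the NVIDIA driver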

raz-bn commented on September 17, 2024

@mYmNeo
Thanks for the response. I'm pretty sure I do.
This is how the nvidia-docker runtime hook is set up:

{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/local/nvidia/toolkit/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/nvidia/toolkit"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

I can run nvidia-smi from inside a simple Jupyter notebook container.

After removing the hook file and running the gpu-manager DaemonSet, I get the error I posted. I also tried adding the path:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/nvidia/toolkit

as an environment variable to gpu-manager, but got the same error.
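
Something along these lines should show where the driver files actually live on this node (just a brute-force search, nothing gpu-manager specific):

find / -name 'libnvidia-ml.so.*' -not -path '/proc/*' 2>/dev/null
find / -name 'nvidiactl' -not -path '/proc/*' 2>/dev/null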

raz-bn commented on September 17, 2024

@mYmNeo
I've solved that problem; I had to change:

        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr

to:

        - name: usr-directory
          hostPath:
            type: Directory
            path: /run/nvidia/driver/usr

However, I came across a new problem when trying to observe the gpu-manager metrics.

I get this error:

E0625 11:52:57.438606  127961 runtime.go:110] can't read /sys/fs/cgroup/memory/kubepods/besteffort/pod6f88cfc6-a9b2-4d51-add4-35a588e4990c/6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1/cgroup.procs, open /sys/fs/cgroup/memory/kubepods/besteffort/pod6f88cfc6-a9b2-4d51-add4-35a588e4990c/6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1/cgroup.procs: no such file or directory

Looking at my host, I saw that the path gpu-manager is looking for differs from the actual one:

/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs

Is there any quick fix?
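
To see which scope directory actually exists for that container, I can search the cgroup tree on the host (container ID taken from the error above):

find /sys/fs/cgroup/memory -type d -name '*6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1*'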

raz-bn commented on September 17, 2024

I guess I found the "solution" by setting --cgroup-driver as mentioned in the FAQ.
However, I still get an error message, since the path gpu-manager builds uses the wrong prefix:

E0625 12:28:22.892393  358033 runtime.go:110] can't read /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/cri-o-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs, open /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/cri-o-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs: no such file or directory

The right location is:

/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs
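
For reference, I can cross-check which cgroup driver is configured on the node; on a typical CRI-O/kubelet install the settings live in places like these (paths are my assumption and may differ on OpenShift):

grep -i cgroup_manager /etc/crio/crio.conf
grep -i cgroupDriver /var/lib/kubelet/config.yaml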

raz-bn commented on September 17, 2024

After fixing this line:

return fmt.Sprintf("%s/%s-%s.scope", cgroupName.ToSystemd(), m.runtimeName, containerID), nil

to:

return fmt.Sprintf("%s/%s-%s.scope", cgroupName.ToSystemd(), "crio-conmon", containerID), nil

the issue is fixed, but after looking at the metrics I realized my pods don't see any GPUs.
I verified it using Python and TensorFlow:
image
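
(The check itself was just something like this inside the notebook container:)

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# prints an empty list when no GPU is visible to TensorFlow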

Do you have any idea how to fix it?
@mYmNeo

P.S. sorry for all the messages :(

mYmNeo commented on September 17, 2024

/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs

Is your cri-o running as a systemd service with the name crio-conmon?

mYmNeo commented on September 17, 2024

What's your pod YAML? The metrics will not report a data point if utilization is 0.

raz-bn commented on September 17, 2024

@mYmNeo Here is the pod YAML:

apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
    env:
      - name: LOGGER_LEVEL
        value: "5"
    ports:
    - containerPort: 8888
      name: notebook
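
(If I read the gpu-manager units right, vcuda-core is in hundredths of a card and vcuda-memory in 256 MiB blocks, so this request should map to 10% of one GPU and roughly 2.5 GiB of memory:)

echo $((10 * 268435456))   # 2684354560 bytes, i.e. about 2.5 GiB of vcuda-memory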

After more debugging, I found myself editing more of the gpu-manager source code to make it fit my odd use case.
Since my NVIDIA drivers are located in a different place than usual, I needed to modify a few paths in the gpu-manager code, for example:
Original:

const (
	NvidiaCtlDevice    = "/dev/nvidiactl"
	NvidiaUVMDevice    = "/dev/nvidia-uvm"
	NvidiaFullpathRE   = `^/dev/nvidia([0-9]*)$`
	NvidiaDevicePrefix = "/dev/nvidia"
)

My version:

const (
	NvidiaCtlDevice    = "/run/nvidia/driver/dev/nvidiactl"
	NvidiaUVMDevice    = "/run/nvidia/driver/dev/nvidia-uvm"
	NvidiaFullpathRE   = `^/run/nvidia/driver/dev/nvidia([0-9]*)$`
	NvidiaDevicePrefix = "/run/nvidia/driver/dev/nvidia"
)

I also edited the LD_LIBRARY_PATH.
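
(Just to confirm the new constants point at real files on my node, something like:)

ls -l /run/nvidia/driver/dev/nvidiactl /run/nvidia/driver/dev/nvidia-uvm /run/nvidia/driver/dev/nvidia0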

After doing so, my pod managed to see the GPU device and use it. However, I realized there was no enforcement of the memory limit.
I added the environment variable LOGGER_LEVEL=5 to my pod, as you can see in the YAML file, to try to debug the vcuda-controller, but there were no logs from it. As far as I understand, the vcuda-controller is triggered by a hook into the CUDA libraries, so there are a few questions I want to ask to locate my problem:

  1. How can I verify the vcuda-controller is present in my pod?
  2. How does the vcuda-controller end up in my pod?
  3. How do I make TensorFlow, for example, use the vcuda-controller libraries?
  4. How does the vcuda-controller work?

My assumptions:

  1. The vcuda-controller is not present in my pod.
  2. Since I was editing all the paths, I forgot something, and the TensorFlow app is using different libraries rather than the vcuda-controller.

mYmNeo commented on September 17, 2024

I don't know why your NVIDIA libraries are located in the tmpfs directory /run. gpu-manager tries to find the NVIDIA libraries in the directory mounted at /usr/local/host inside the gpu-manager pod. After that, gpu-manager will report the mirrored libraries in the info log. Since you changed the source code, you need to check whether gpu-manager has detected and copied the libraries.

For 1 and 2: if gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver (a quick check is sketched below).
For 3: gpu-manager sets LD_LIBRARY_PATH to load the vcuda-controller library; since you changed the code, you need to guarantee the library path is correct.
For 4: the README of the vcuda-controller project (https://github.com/tkestack/vcuda-controller) links to the paper we released about how the vcuda-controller works.
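
For example, on the node (or inside the gpu-manager pod) a check like this should show the wrapped library once everything has been detected:

ls /etc/gpu-manager/vdriver                               # should contain the nvidia and origin sub-directories
ls -l /etc/gpu-manager/vdriver/nvidia/lib64/libcuda-control.so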

raz-bn commented on September 17, 2024

@mYmNeo Thank you for the reply!
There are a few things I didn't understand from your answer.

/usr/local/host in the gpu-manager pod. After that, gpu-manager will report mirror libraries in the info log. Since you changed the source code, you need to check whether the gpu-manager has detected and copied the libraries.

Should this copy take place only once, or every time a new pod that asks for a GPU starts?
When the gpu-manager pod is starting, I do see from the logs that files are being copied from /usr/local/host to the path mentioned in this line:

readonly NV_DIR="/usr/local/nvidia"

For 1,2, if the gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver.
On the host? In the gpu-manager pod? In the app pod?
Since I think it only makes sense for it to be on the host, I don't understand how it is going to end up in my app pod.

For 3, the gpu-manager set LD_LIBRARY_PATH to load vcuda-controller library, since you changed the code, you need to guarantee the correct library path.
Should LD_LIBRARY_PATH point to the location specified in this line?

readonly NV_DIR="/usr/local/nvidia"

raz-bn commented on September 17, 2024

@mYmNeo
Is there any documentation/low-level design document you can share?
I've read your paper; it was a good high-level overview, but it is not enough to understand what is going on in your code.

mYmNeo commented on September 17, 2024

/etc/gpu-manager/vdriver can be found both in the gpu-manager pod and on your host, so libcuda-control.so should be found in the nvidia sub-directory of /etc/gpu-manager/vdriver in your case. If not, that means gpu-manager hasn't located the necessary libraries.
Before your application pod is running, gpu-manager bind-mounts the driver directory located in either /etc/gpu-manager/vdriver/nvidia (for a fractional request) or /etc/gpu-manager/vdriver/origin. Since your application pod uses a fractional request, you should find the necessary libraries in /usr/local/nvidia/lib64 or /usr/local/nvidia/lib of your application pod. After that, you have to check whether you have overridden the default LD_LIBRARY_PATH environment variable that gpu-manager sets for your application. For your situation: before your app is running, the wrapped library libcuda-control.so tries to register some information with gpu-manager, and only then can the control system run correctly. To see the libcuda-control.so log, set the environment variable LOGGER_LEVEL=5 before launching your app.

From your description that no log is found after setting LOGGER_LEVEL=5, I think the problem may be that your app loaded the libraries from another path, not the ones gpu-manager provides.
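
A minimal in-pod check along these lines (assuming a shell inside the tf-notebook container) shows whether the app really picks up the libraries gpu-manager mounted:

echo "$LD_LIBRARY_PATH"                      # should include /usr/local/nvidia/lib64
ls /usr/local/nvidia/lib64 | grep libcuda    # libcuda.so and the wrapped libcuda-control.so
# trace where the loader actually resolves libcuda.so.1 from
LOGGER_LEVEL=5 LD_DEBUG=libs python3 -c "import tensorflow" 2>&1 | grep 'libcuda'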

mYmNeo commented on September 17, 2024

The vcuda-controller project is very simple, with only a small amount of source code, so no further low-level documents are provided.

raz-bn commented on September 17, 2024

@mYmNeo
I am adding some screenshots and logs; hopefully they will help locate the problem.
gpu-manager logs:

copy /usr/local/host/lib/libnvidia-ml.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ml.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ml.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ml.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ml.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ml.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libcuda.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libcuda.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libcuda.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libcuda.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libcuda.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libcuda.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-fatbinaryloader.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-fatbinaryloader.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-opencl.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opencl.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-opencl.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opencl.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-compiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-compiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libvdpau_nvidia.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/vdpau/libvdpau_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/vdpau/libvdpau_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libvdpau_nvidia.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/vdpau/libvdpau_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/vdpau/libvdpau_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-encode.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-encode.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-encode.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-encode.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-encode.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-encode.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvcuvid.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvcuvid.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvcuvid.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvcuvid.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvcuvid.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvcuvid.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-fbc.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-fbc.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-fbc.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-fbc.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-fbc.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-fbc.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-ifr.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ifr.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ifr.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ifr.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ifr.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ifr.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLX_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLX_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLX_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLX_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libEGL_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libEGL_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libEGL_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libEGL_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLESv2_nvidia.so.2 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLESv2_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLESv2_nvidia.so.2 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLESv2_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLESv1_CM_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLESv1_CM_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLESv1_CM_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLESv1_CM_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-eglcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-eglcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-egl-wayland.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-egl-wayland.so.1.1.4 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-glcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-glcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-tls.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-tls.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-glsi.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-glsi.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /run/nvidia/driver/usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0628 11:55:28.387789  168134 server.go:131] Unable to set Type=notify in systemd service file?
E0628 11:55:31.219430  168134 tree.go:337] No topology level found at 0

gpu node:
image

gpu pod:
image

gpu pod - tensorflow logs:

2020-06-28 11:57:24.301082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-28 11:57:24.343386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.343661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.345313: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.346878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.347176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.348832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.349628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.353395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.356657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.356990: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-28 11:57:24.364042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2593995000 Hz
2020-06-28 11:57:24.364771: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f30e0000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.364796: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-28 11:57:24.478234: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a46e40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.478270: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2020-06-28 11:57:24.479479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.479532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.479547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.479559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.479576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.479588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.479599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.479611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.481683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.481736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.483290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-28 11:57:24.483315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-06-28 11:57:24.483323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-06-28 11:57:24.485578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21397 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0001:00:00.0, compute capability: 6.1)
2020-06-28 11:57:25.656053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10

mYmNeo commented on September 17, 2024

The log you provided is incomplete, since you only gave the screen output. The full log can be found at /etc/gpu-manager/logs.

raz-bn commented on September 17, 2024

@mYmNeo
Regarding "before your app is running, the wrapped library libcuda-control.so tries to register some information with gpu-manager":
I think this part is not working for me.
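
To narrow it down I can check both ends of that registration, roughly like this (the jupyter process name is just my guess for this image):

ls /etc/gpu-manager/vm/                                   # on the host: one vdevice directory per GPU pod
grep libcuda /proc/$(pgrep -f jupyter | head -n1)/maps    # in the pod: which libcuda the notebook actually mapped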

raz-bn commented on September 17, 2024

[root@ocp4-krpkk-worker-gpu-rnw9j log]# cat gpu-manager.INFO | grep -v "util"
Log file created at: 2020/06/28 11:55:28
Running on machine: gpu-manager-daemonset-pjf5q
Binary: Built with gc go1.14.3 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0628 11:55:28.373296  168134 app.go:87] Wait for internal server ready
I0628 11:55:28.376088  168134 volume.go:133] Find binaries: [/usr/bin/gpu-client]
I0628 11:55:28.376139  168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376142  168134 volume.go:139] Find 64bit libraries: [/usr/lib64/libcuda-control.so]
I0628 11:55:28.376891  168134 volume.go:133] Find binaries: []
I0628 11:55:28.376927  168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376930  168134 volume.go:139] Find 64bit libraries: []
I0628 11:55:28.376946  168134 volume.go:176] Mirror /usr/bin/gpu-client to /etc/gpu-manager/vdriver/nvidia/bin
I0628 11:55:28.386992  168134 volume.go:176] Mirror /usr/lib64/libcuda-control.so to /etc/gpu-manager/vdriver/nvidia/lib64
I0628 11:55:28.387769  168134 volume.go:152] Volume manager is running
E0628 11:55:28.387789  168134 server.go:131] Unable to set Type=notify in systemd service file?
W0628 11:55:28.388430  168134 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0628 11:55:28.390318  168134 logs.go:79] parsed scheme: ""
I0628 11:55:28.390323  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:28.390327  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/run/crio/crio.sock 0  <nil>}] <nil>}
I0628 11:55:28.390338  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:28.390391  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8b0, CONNECTING
I0628 11:55:28.390739  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8b0, READY
I0628 11:55:28.391357  168134 runtime.go:69] Container runtime is cri-o
I0628 11:55:28.391364  168134 server.go:155] Container runtime manager is running
I0628 11:55:28.391442  168134 reflector.go:150] Starting reflector *v1.Pod (1m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
I0628 11:55:28.391452  168134 reflector.go:185] Listing and watching *v1.Pod from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
I0628 11:55:29.374625  168134 app.go:87] Wait for internal server ready
I0628 11:55:29.391501  168134 watchdog.go:64] Pod cache is running
I0628 11:55:31.196446  168134 server.go:158] Watchdog is running
I0628 11:55:31.196455  168134 label.go:102] Labeler for hostname ocp4-krpkk-worker-gpu-rnw9j
I0628 11:55:31.211512  168134 label.go:153] Auto label is running
I0628 11:55:31.211559  168134 manager.go:195] Start vDevice watcher
I0628 11:55:31.211912  168134 manager.go:244] Recover vDevice server for /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:55:31.211929  168134 manager.go:191] Virtual manager is running
I0628 11:55:31.211961  168134 manager.go:269] Starting garbage directory collector
I0628 11:55:31.211987  168134 manager.go:360] Starting process vm events
I0628 11:55:31.213970  168134 tree.go:187] Detect 1 gpu cards
E0628 11:55:31.219430  168134 tree.go:337] No topology level found at 0
I0628 11:55:31.219448  168134 tree.go:340] Only one card topology
I0628 11:55:31.219990  168134 tree.go:119] Update device information
I0628 11:55:31.226564  168134 allocator.go:263] Load extra config from /etc/gpu-manager/extra-config.json
W0628 11:55:31.226608  168134 allocator.go:1209] Failed to read from checkpoint due to key is not found
I0628 11:55:31.226652  168134 allocator.go:618] Pods to be removed: []
I0628 11:55:31.249243  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:55:31.287340  168134 allocator.go:978] failed to get pod 2f7545ea-77d8-4e8e-81ef-9135740843bf from allocatedPod cache
I0628 11:55:31.287343  168134 allocator.go:223] failed to get ready annotations for pod 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:55:31.287387  168134 server.go:182] Starting the GRPC server, driver nvidia, queryPort 9400
I0628 11:55:31.287445  168134 server.go:236] Server tencent.com/vcuda-core is running
I0628 11:55:31.287448  168134 server.go:236] Server tencent.com/vcuda-memory is running
I0628 11:55:31.287498  168134 logs.go:79] parsed scheme: ""
I0628 11:55:31.287503  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:31.287512  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/run/gpu-manager.sock 0  <nil>}] <nil>}
I0628 11:55:31.287526  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:31.287546  168134 server.go:250] Server is ready at /var/run/gpu-manager.sock
I0628 11:55:31.287557  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8d0, CONNECTING
I0628 11:55:31.287605  168134 vcore.go:81] Server tencent.com/vcuda-core is ready at /var/lib/kubelet/device-plugins/vcore.sock
I0628 11:55:31.287663  168134 vmemory.go:80] Server tencent.com/vcuda-memory is ready at /var/lib/kubelet/device-plugins/vmemory.sock
I0628 11:55:31.287765  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8d0, READY
I0628 11:55:32.191243  168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.191251  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.191257  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vcore.sock 0  <nil>}] <nil>}
I0628 11:55:32.191281  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.191326  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, CONNECTING
I0628 11:55:32.192059  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, READY
I0628 11:55:32.192073  168134 server.go:90] Server /var/lib/kubelet/device-plugins/vcore.sock is ready, readyServers: 1
I0628 11:55:32.192080  168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.192088  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.192092  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vmemory.sock 0  <nil>}] <nil>}
I0628 11:55:32.192098  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.192124  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, CONNECTING
I0628 11:55:32.192138  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192206  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192595  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, READY
I0628 11:55:32.192613  168134 server.go:90] Server /var/lib/kubelet/device-plugins/vmemory.sock is ready, readyServers: 2
I0628 11:55:32.192625  168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.192628  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.192631  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/kubelet.sock 0  <nil>}] <nil>}
I0628 11:55:32.192637  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.192657  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, CONNECTING
I0628 11:55:32.192669  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192768  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192899  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, READY
I0628 11:55:32.192909  168134 server.go:334] Register to kubelet with endpoint vcore.sock
I0628 11:55:32.194889  168134 server.go:334] Register to kubelet with endpoint vmemory.sock
I0628 11:55:32.195089  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.195323  168134 vcore.go:93] ListAndWatch request for vcore
I0628 11:55:32.195413  168134 vmemory.go:97] ListAndWatch request for vmemory
I0628 11:55:45.591024  168134 vmemory.go:87] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-memory-268435456-69 tencent.com/vcuda-memory-268435456-21 tencent.com/vcuda-memory-268435456-14 tencent.com/vcuda-memory-268435456-26 tencent.com/vcuda-memory-268435456-58 tencent.com/vcuda-memory-268435456-18 tencent.com/vcuda-memory-268435456-45 tencent.com/vcuda-memory-268435456-48 tencent.com/vcuda-memory-268435456-23 tencent.com/vcuda-memory-268435456-39],},},} allocation request for vmemory
I0628 11:55:45.591388  168134 vcore.go:88] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-17 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1 tencent.com/vcuda-core-63],},},} allocation request for vcore
I0628 11:55:45.591426  168134 allocator.go:663] Request GPU device: tencent.com/vcuda-core-48,tencent.com/vcuda-core-58,tencent.com/vcuda-core-96,tencent.com/vcuda-core-87,tencent.com/vcuda-core-17,tencent.com/vcuda-core-99,tencent.com/vcuda-core-82,tencent.com/vcuda-core-66,tencent.com/vcuda-core-1,tencent.com/vcuda-core-63
I0628 11:55:45.617421  168134 allocator.go:1131] candidate pod tf-notebook in ns default with timestamp 1593345345000000000 is found.
I0628 11:55:45.617432  168134 allocator.go:715] Found candidate Pod 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook) with device count 10
I0628 11:55:45.617492  168134 allocator.go:618] Pods to be removed: []
I0628 11:55:45.624573  168134 tree.go:119] Update device information
I0628 11:55:45.631080  168134 allocator.go:375] Tree graph: ROOT:1
|---GPU0 (pids: [], usedMemory: 0, totalMemory: 24032378880, allocatableCores: 100, allocatableMemory: 24032378880)
I0628 11:55:45.631089  168134 allocator.go:386] Try allocate for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), vcore 10, vmemory 2684354560
I0628 11:55:45.631095  168134 share.go:58] Pick up 0 mask 1, cores: 100, memory: 24032378880
I0628 11:55:45.631101  168134 allocator.go:479] Allocate /run/nvidia/driver/dev/nvidia0 for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), Meta (0:0)
I0628 11:55:45.631108  168134 tree.go:491] Occupy /run/nvidia/driver/dev/nvidia0 with 10 2684354560, mask 1
I0628 11:55:45.631111  168134 tree.go:518] Occupy /run/nvidia/driver/dev/nvidia0 parent 1
I0628 11:55:45.631115  168134 tree.go:501] /run/nvidia/driver/dev/nvidia0 cores 100->90
I0628 11:55:45.631119  168134 tree.go:507] /run/nvidia/driver/dev/nvidia0 memory 24032378880->21348024320
I0628 11:55:47.875495  168134 vcore.go:103] PreStartContainer request for vcore
I0628 11:55:47.875514  168134 allocator.go:784] get preStartContainer call from k8s, req: &PreStartContainerRequest{DevicesIDs:[tencent.com/vcuda-core-17 tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-63 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1],}
I0628 11:55:47.875889  168134 manager.go:363] process 3d89b0e4-d5c1-43f4-bcc6-98650521894a
I0628 11:55:47.876058  168134 manager.go:352] Start vDevice server for /etc/gpu-manager/vm/3d89b0e4-d5c1-43f4-bcc6-98650521894a
I0628 11:55:47.889934  168134 vmemory.go:107] PreStartContainer request for vmemory
I0628 11:56:01.287588  168134 allocator.go:204] Checking allocation of pods on this node
W0628 11:56:31.212696  168134 manager.go:290] Find orphaned pod 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:56:31.212700  168134 manager.go:296] Remove directory 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:56:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:57:01.287570  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:57:31.212208  168134 manager.go:260] Close orphaned server /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:57:31.287580  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:58:01.287580  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:58:31.287573  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:59:01.287574  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:59:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:00:01.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:00:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:01:01.287578  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:01:31.287572  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:02:01.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:02:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:03:01.287581  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:03:29.410276  168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 12 items received
I0628 12:03:31.287567  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:04:01.287576  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:04:31.287508  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:05:01.287503  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:05:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:06:01.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:06:31.287525  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:07:01.287543  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:07:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:08:01.287570  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:08:31.287565  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:09:01.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:09:31.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:10:01.287576  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:10:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:11:01.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:11:31.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:12:01.287571  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:12:31.287561  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:13:01.287571  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:13:11.412348  168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 0 items received
I0628 12:13:31.287564  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:14:01.287577  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:14:31.287561  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:15:01.287575  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:15:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:16:01.287571  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:16:31.287560  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:17:01.287581  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:17:31.287565  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:18:01.287581  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:18:31.287587  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:19:01.287572  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:19:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:20:01.287576  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:20:31.287563  168134 allocator.go:204] Checking allocation of pods on this node

@mYmNeo
I excluded all the logs from util
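
Side note for anyone reproducing this: the allocation lines above (vcore 10, vmemory 2684354560, i.e. 10 × 268435456-byte chunks ≈ 2.5 GiB) correspond to a pod requesting ten vcuda-core and ten vcuda-memory units. A minimal sketch, assuming the tencent.com/vcuda-* resource names from the log; the image and pod name are placeholders, since the actual tf-notebook spec isn't shown in this thread:

```sh
# Hedged sketch: a pod requesting the same GPU share seen in the log above
# (10 vcuda-core = 10% of one GPU, 10 vcuda-memory units x 256 MiB = 2.5 GiB).
# Image name is hypothetical.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
spec:
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
EOF
```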

from gpu-manager.

mYmNeo avatar mYmNeo commented on September 17, 2024

I0628 11:55:28.373296 168134 app.go:87] Wait for internal server ready
I0628 11:55:28.376088 168134 volume.go:133] Find binaries: [/usr/bin/gpu-client]
I0628 11:55:28.376139 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376142 168134 volume.go:139] Find 64bit libraries: [/usr/lib64/libcuda-control.so]
I0628 11:55:28.376891 168134 volume.go:133] Find binaries: []
I0628 11:55:28.376927 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376930 168134 volume.go:139] Find 64bit libraries: []
I0628 11:55:28.376946 168134 volume.go:176] Mirror /usr/bin/gpu-client to /etc/gpu-manager/vdriver/nvidia/bin
I0628 11:55:28.386992 168134 volume.go:176] Mirror /usr/lib64/libcuda-control.so to /etc/gpu-manager/vdriver/nvidia/lib64


The log shows that gpu-manager only detected /usr/lib64/libcuda-control.so and /usr/bin/gpu-client and copied them into /etc/gpu-manager/vdriver/nvidia, but a correct setup should also mirror several NVIDIA libraries. Since you changed copy-lib.sh, the ldcache rebuild procedure doesn't pick up your changes.
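
As a quick check (a hedged suggestion, not an official procedure), you can compare what the volume manager mirrored against what your driver root actually provides; the paths below follow the ones used earlier in this thread (/run/nvidia/driver as the driver root):

```sh
# 1. What gpu-manager actually mirrored for containers; a working setup should
#    show the NVIDIA runtime libraries in addition to libcuda-control.so.
ls -l /etc/gpu-manager/vdriver/nvidia/lib64 /etc/gpu-manager/vdriver/nvidia/bin

# 2. What the driver root provides (the libraries the copy step is expected to
#    pick up, e.g. libcuda.so* and libnvidia-ml.so*).
find /run/nvidia/driver -name 'libcuda.so*' -o -name 'libnvidia-ml.so*' 2>/dev/null

# If step 1 only lists libcuda-control.so and gpu-client while step 2 finds the
# NVIDIA libraries, the mirror/ldcache step is not seeing the modified copy-lib.sh.
```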

from gpu-manager.

raz-bn avatar raz-bn commented on September 17, 2024

@mYmNeo is it solvable?

from gpu-manager.
