Comments (23)

raz-bn commented on September 17, 2024

I finally was able to make it work; however, I got this error:
image

So now I am pretty sure it can't be run with CRI-O at the moment. Is there any workaround to make it happen?

mYmNeo commented on September 17, 2024

gpu-manager wants to connect to Docker to find the information it needs to recover topology and usage.
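
For example, on a Docker node the socket it talks to exists, while a CRI-O node only has the CRI-O socket (default paths, assuming a standard setup):

ls -l /var/run/docker.sock       # what the old gpu-manager expects to find
ls -l /var/run/crio/crio.sock    # what a CRI-O node actually provides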

mYmNeo commented on September 17, 2024

We released a new version which supports the CRI interface. Feel free to give it a try.

raz-bn commented on September 17, 2024

@mYmNeo
When trying to run gpu-manager with CRI-O as the default runtime, I get this error:
image

Any details on how to fix it?

mYmNeo commented on September 17, 2024

This error means gpu-manager didn't detect a GPU card on your machine. Did you install the NVIDIA driver on your node?
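
A quick way to check is to run something like this on the node itself (assuming the driver is installed in the usual location):

nvidia-smi -L                         # should list every GPU the driver sees
ls -l /dev/nvidiactl /dev/nvidia0     # device nodes created by the NVIDIA driver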

raz-bn commented on September 17, 2024

@mYmNeo
Thanks for the response. I'm pretty sure I do.
This is how the nvidia-docker runtime hook is set up:

{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/local/nvidia/toolkit/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/nvidia/toolkit"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

I can run nvidia-smi from inside a simple Jupyter notebook container.

After removing the hook file and running the gpu-manager DaemonSet, I get the error I posted. I also tried adding the path:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/nvidia/toolkit

as an environment variable to gpu-manager, but got the same error.
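
Something along these lines should show where the driver files actually live on this node (just a brute-force search, nothing gpu-manager specific):

find / -name 'libnvidia-ml.so.*' -not -path '/proc/*' 2>/dev/null
find / -name 'nvidiactl' -not -path '/proc/*' 2>/dev/null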

raz-bn commented on September 17, 2024

@mYmNeo
I've solved that problem; I had to change:

        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr

to:

        - name: usr-directory
          hostPath:
            type: Directory
            path: /run/nvidia/driver/usr

However, I came across a new problem when trying to observe the gpu-manager metrics.

I get this error:

E0625 11:52:57.438606  127961 runtime.go:110] can't read /sys/fs/cgroup/memory/kubepods/besteffort/pod6f88cfc6-a9b2-4d51-add4-35a588e4990c/6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1/cgroup.procs, open /sys/fs/cgroup/memory/kubepods/besteffort/pod6f88cfc6-a9b2-4d51-add4-35a588e4990c/6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1/cgroup.procs: no such file or directory

Looking at my host, I saw that the path gpu-manager is looking for differs from the actual one:

/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs

Is there any quick fix?
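
To see which scope directory actually exists for that container, I can search the cgroup tree on the host (container ID taken from the error above):

find /sys/fs/cgroup/memory -type d -name '*6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1*'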

raz-bn commented on September 17, 2024

I guess I found the "solution" by setting --cgroup-driver as mentioned in the FAQ.
However, I still get an error message, since the path gpu-manager builds uses the wrong prefix:

E0625 12:28:22.892393  358033 runtime.go:110] can't read /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/cri-o-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs, open /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/cri-o-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs: no such file or directory

The right location is:

/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs
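
For reference, I can cross-check which cgroup driver is configured on the node; on a typical CRI-O/kubelet install the settings live in places like these (paths are my assumption and may differ on OpenShift):

grep -i cgroup_manager /etc/crio/crio.conf
grep -i cgroupDriver /var/lib/kubelet/config.yaml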

raz-bn commented on September 17, 2024

After fixing this line:

return fmt.Sprintf("%s/%s-%s.scope", cgroupName.ToSystemd(), m.runtimeName, containerID), nil

to:

return fmt.Sprintf("%s/%s-%s.scope", cgroupName.ToSystemd(), "crio-conmon", containerID), nil

the issue is fixed, but after looking at the metrics I realized my pods don't see any GPUs.
I verified it using Python and TensorFlow:
image
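
(The check itself was just something like this inside the notebook container:)

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# prints an empty list when no GPU is visible to TensorFlow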

Do you have any idea how to fix it?
@mYmNeo

P.S. sorry for all the messages :(

mYmNeo commented on September 17, 2024

/sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6f88cfc6_a9b2_4d51_add4_35a588e4990c.slice/crio-conmon-6641b82b85803724d556d8d8cd39fa68857fa762826a0b854cbacfd01e486ee1.scope/cgroup.procs

Is your cri-o running as a systemd service with the name crio-conmon?

mYmNeo commented on September 17, 2024

What's your pod YAML? The metrics will not report a data point if utilization is 0.

raz-bn commented on September 17, 2024

@mYmNeo Here is the pod YAML:

apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30001
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
    env:
      - name: LOGGER_LEVEL
        value: "5"
    ports:
    - containerPort: 8888
      name: notebook
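
(If I read the gpu-manager units right, vcuda-core is in hundredths of a card and vcuda-memory in 256 MiB blocks, so this request should map to 10% of one GPU and roughly 2.5 GiB of memory:)

echo $((10 * 268435456))   # 2684354560 bytes, i.e. about 2.5 GiB of vcuda-memory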

After more debugging, I found myself editing more of the gpu-manager source code to make it fit my odd use case.
Since my NVIDIA drivers are located in a different place than usual, I needed to modify a few paths in the gpu-manager code, for example:
Original:

const (
	NvidiaCtlDevice    = "/dev/nvidiactl"
	NvidiaUVMDevice    = "/dev/nvidia-uvm"
	NvidiaFullpathRE   = `^/dev/nvidia([0-9]*)$`
	NvidiaDevicePrefix = "/dev/nvidia"
)

My version:

const (
	NvidiaCtlDevice    = "/run/nvidia/driver/dev/nvidiactl"
	NvidiaUVMDevice    = "/run/nvidia/driver/dev/nvidia-uvm"
	NvidiaFullpathRE   = `^/run/nvidia/driver/dev/nvidia([0-9]*)$`
	NvidiaDevicePrefix = "/run/nvidia/driver/dev/nvidia"
)

I also edited the LD_LIBRARY_PATH.
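
(Just to confirm the new constants point at real files on my node, something like:)

ls -l /run/nvidia/driver/dev/nvidiactl /run/nvidia/driver/dev/nvidia-uvm /run/nvidia/driver/dev/nvidia0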

After doing so, my pod managed to see the GPU device and use it. However, I realized there was no enforcement of the memory limit.
I added the environment variable LOGGER_LEVEL=5 to my pod, as you can see in the YAML file, to try to debug the vcuda-controller, but there were no logs from it. As far as I understand, the vcuda-controller is triggered by a hook into the CUDA libraries, so there are a few questions I want to ask to locate my problem:

  1. How can I verify the vcuda-controller is present in my pod?
  2. How does the vcuda-controller end up in my pod?
  3. How do I make TensorFlow, for example, use the vcuda-controller libraries?
  4. How does the vcuda-controller work?

My assumptions:

  1. The vcuda-controller is not present in my pod.
  2. Since I was editing all the paths, I forgot something, and the TensorFlow app is using different libraries rather than the vcuda-controller.

mYmNeo commented on September 17, 2024

I don't know why your NVIDIA libraries are located in the tmpfs directory /run. gpu-manager tries to find the NVIDIA libraries in the directory mounted at /usr/local/host inside the gpu-manager pod. After that, gpu-manager will report the mirrored libraries in the info log. Since you changed the source code, you need to check whether gpu-manager has detected and copied the libraries.

For 1 and 2: if gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver (a quick check is sketched below).
For 3: gpu-manager sets LD_LIBRARY_PATH to load the vcuda-controller library; since you changed the code, you need to guarantee the library path is correct.
For 4: the README of the vcuda-controller project (https://github.com/tkestack/vcuda-controller) links to the paper we released about how the vcuda-controller works.
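
For example, on the node (or inside the gpu-manager pod) a check like this should show the wrapped library once everything has been detected:

ls /etc/gpu-manager/vdriver                               # should contain the nvidia and origin sub-directories
ls -l /etc/gpu-manager/vdriver/nvidia/lib64/libcuda-control.so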

raz-bn commented on September 17, 2024

@mYmNeo Thank you for the reply!
There are a few things I didn't understand from your answer.

/usr/local/host in the gpu-manager pod. After that, gpu-manager will report mirror libraries in the info log. Since you changed the source code, you need to check whether the gpu-manager has detected and copied the libraries.

Should this copy take place only once, or every time a new pod that asks for a GPU starts?
When the gpu-manager pod is starting, I do see from the logs that files are being copied from /usr/local/host to the path mentioned in this line:

readonly NV_DIR="/usr/local/nvidia"

For 1,2, if the gpu-manager has detected and copied the libraries, you will find libcuda-control.so in the directory /etc/gpu-manager/vdriver.
On the host? In the gpu-manager pod? In the app pod?
Since I think it only makes sense for it to be on the host, I don't understand how it is going to end up in my app pod.

For 3, the gpu-manager set LD_LIBRARY_PATH to load vcuda-controller library, since you changed the code, you need to guarantee the correct library path.
Should LD_LIBRARY_PATH point to the location specified in this line?

readonly NV_DIR="/usr/local/nvidia"

raz-bn commented on September 17, 2024

@mYmNeo
Is there any documentation/low-level design document you can share?
I've read your paper; it was a good high-level overview, but it is not enough to understand what is going on in your code.

mYmNeo commented on September 17, 2024

/etc/gpu-manager/vdriver can be found both in the gpu-manager pod and on your host, so libcuda-control.so should be found in the nvidia sub-directory of /etc/gpu-manager/vdriver in your case. If not, that means gpu-manager hasn't located the necessary libraries.
Before your application pod is running, gpu-manager bind-mounts the driver directory located in either /etc/gpu-manager/vdriver/nvidia (for a fractional request) or /etc/gpu-manager/vdriver/origin. Since your application pod uses a fractional request, you should find the necessary libraries in /usr/local/nvidia/lib64 or /usr/local/nvidia/lib of your application pod. After that, you have to check whether you have overridden the default LD_LIBRARY_PATH environment variable that gpu-manager sets for your application. For your situation: before your app is running, the wrapped library libcuda-control.so tries to register some information with gpu-manager, and only then can the control system run correctly. To see the libcuda-control.so log, set the environment variable LOGGER_LEVEL=5 before launching your app.

From your description that no log is found after setting LOGGER_LEVEL=5, I think the problem may be that your app loaded the libraries from another path, not the ones gpu-manager provides.
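
A minimal in-pod check along these lines (assuming a shell inside the tf-notebook container) shows whether the app really picks up the libraries gpu-manager mounted:

echo "$LD_LIBRARY_PATH"                      # should include /usr/local/nvidia/lib64
ls /usr/local/nvidia/lib64 | grep libcuda    # libcuda.so and the wrapped libcuda-control.so
# trace where the loader actually resolves libcuda.so.1 from
LOGGER_LEVEL=5 LD_DEBUG=libs python3 -c "import tensorflow" 2>&1 | grep 'libcuda'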

mYmNeo commented on September 17, 2024

The vcuda-controller project is very simple, with only a small amount of source code, so no further low-level documents are provided.

raz-bn commented on September 17, 2024

@mYmNeo
I am adding some screenshots and logs; hopefully they will help locate the problem.
gpu-manager logs:

copy /usr/local/host/lib/libnvidia-ml.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ml.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ml.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ml.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ml.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ml.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libcuda.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libcuda.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libcuda.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libcuda.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libcuda.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libcuda.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ptxjitcompiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ptxjitcompiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-fatbinaryloader.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-fatbinaryloader.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-opencl.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opencl.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-opencl.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opencl.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-compiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-compiler.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libvdpau_nvidia.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/vdpau/libvdpau_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/vdpau/libvdpau_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libvdpau_nvidia.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/vdpau/libvdpau_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/vdpau/libvdpau_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-encode.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-encode.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-encode.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-encode.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-encode.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-encode.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvcuvid.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvcuvid.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvcuvid.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvcuvid.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvcuvid.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvcuvid.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-fbc.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-fbc.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-fbc.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-fbc.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-fbc.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-fbc.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-ifr.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ifr.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-ifr.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-ifr.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ifr.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-ifr.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLX_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLX_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLX_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLX_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libEGL_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libEGL_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libEGL_nvidia.so.0 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libEGL_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLESv2_nvidia.so.2 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLESv2_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLESv2_nvidia.so.2 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLESv2_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libGLESv1_CM_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libGLESv1_CM_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libGLESv1_CM_nvidia.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libGLESv1_CM_nvidia.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-eglcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-eglcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-egl-wayland.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-egl-wayland.so.1.1.4 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-glcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-glcore.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-tls.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-tls.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-glsi.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-glsi.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib
copy /usr/local/host/lib64/libnvidia-opticalflow.so to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opticalflow.so.1 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/lib64/libnvidia-opticalflow.so.440.64.00 to /run/nvidia/driver/usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /run/nvidia/driver/usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /run/nvidia/driver/usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0628 11:55:28.387789  168134 server.go:131] Unable to set Type=notify in systemd service file?
E0628 11:55:31.219430  168134 tree.go:337] No topology level found at 0

gpu node:
image

gpu pod:
image

gpu pod - tensorflow logs:

2020-06-28 11:57:24.301082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-28 11:57:24.343386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.343661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.345313: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.346878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.347176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.348832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.349628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.353395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.356657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.356990: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-28 11:57:24.364042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2593995000 Hz
2020-06-28 11:57:24.364771: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f30e0000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.364796: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-28 11:57:24.478234: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a46e40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-28 11:57:24.478270: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2020-06-28 11:57:24.479479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2020-06-28 11:57:24.479532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.479547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-28 11:57:24.479559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-28 11:57:24.479576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-28 11:57:24.479588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-28 11:57:24.479599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-28 11:57:24.479611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-28 11:57:24.481683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-28 11:57:24.481736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-28 11:57:24.483290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-28 11:57:24.483315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-06-28 11:57:24.483323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-06-28 11:57:24.485578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21397 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0001:00:00.0, compute capability: 6.1)
2020-06-28 11:57:25.656053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10

mYmNeo commented on September 17, 2024

The log you provided is incomplete, since you only gave the screen output. The full log can be found at /etc/gpu-manager/logs.

raz-bn commented on September 17, 2024

@mYmNeo
Regarding "before your app is running, the wrapped library libcuda-control.so tries to register some information with gpu-manager":
I think this part is not working for me.
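
To narrow it down I can check both ends of that registration, roughly like this (the jupyter process name is just my guess for this image):

ls /etc/gpu-manager/vm/                                   # on the host: one vdevice directory per GPU pod
grep libcuda /proc/$(pgrep -f jupyter | head -n1)/maps    # in the pod: which libcuda the notebook actually mapped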

raz-bn commented on September 17, 2024

[root@ocp4-krpkk-worker-gpu-rnw9j log]# cat gpu-manager.INFO | grep -v "util"
Log file created at: 2020/06/28 11:55:28
Running on machine: gpu-manager-daemonset-pjf5q
Binary: Built with gc go1.14.3 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0628 11:55:28.373296  168134 app.go:87] Wait for internal server ready
I0628 11:55:28.376088  168134 volume.go:133] Find binaries: [/usr/bin/gpu-client]
I0628 11:55:28.376139  168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376142  168134 volume.go:139] Find 64bit libraries: [/usr/lib64/libcuda-control.so]
I0628 11:55:28.376891  168134 volume.go:133] Find binaries: []
I0628 11:55:28.376927  168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376930  168134 volume.go:139] Find 64bit libraries: []
I0628 11:55:28.376946  168134 volume.go:176] Mirror /usr/bin/gpu-client to /etc/gpu-manager/vdriver/nvidia/bin
I0628 11:55:28.386992  168134 volume.go:176] Mirror /usr/lib64/libcuda-control.so to /etc/gpu-manager/vdriver/nvidia/lib64
I0628 11:55:28.387769  168134 volume.go:152] Volume manager is running
E0628 11:55:28.387789  168134 server.go:131] Unable to set Type=notify in systemd service file?
W0628 11:55:28.388430  168134 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0628 11:55:28.390318  168134 logs.go:79] parsed scheme: ""
I0628 11:55:28.390323  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:28.390327  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/run/crio/crio.sock 0  <nil>}] <nil>}
I0628 11:55:28.390338  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:28.390391  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8b0, CONNECTING
I0628 11:55:28.390739  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8b0, READY
I0628 11:55:28.391357  168134 runtime.go:69] Container runtime is cri-o
I0628 11:55:28.391364  168134 server.go:155] Container runtime manager is running
I0628 11:55:28.391442  168134 reflector.go:150] Starting reflector *v1.Pod (1m0s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
I0628 11:55:28.391452  168134 reflector.go:185] Listing and watching *v1.Pod from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105
I0628 11:55:29.374625  168134 app.go:87] Wait for internal server ready
I0628 11:55:29.391501  168134 watchdog.go:64] Pod cache is running
I0628 11:55:31.196446  168134 server.go:158] Watchdog is running
I0628 11:55:31.196455  168134 label.go:102] Labeler for hostname ocp4-krpkk-worker-gpu-rnw9j
I0628 11:55:31.211512  168134 label.go:153] Auto label is running
I0628 11:55:31.211559  168134 manager.go:195] Start vDevice watcher
I0628 11:55:31.211912  168134 manager.go:244] Recover vDevice server for /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:55:31.211929  168134 manager.go:191] Virtual manager is running
I0628 11:55:31.211961  168134 manager.go:269] Starting garbage directory collector
I0628 11:55:31.211987  168134 manager.go:360] Starting process vm events
I0628 11:55:31.213970  168134 tree.go:187] Detect 1 gpu cards
E0628 11:55:31.219430  168134 tree.go:337] No topology level found at 0
I0628 11:55:31.219448  168134 tree.go:340] Only one card topology
I0628 11:55:31.219990  168134 tree.go:119] Update device information
I0628 11:55:31.226564  168134 allocator.go:263] Load extra config from /etc/gpu-manager/extra-config.json
W0628 11:55:31.226608  168134 allocator.go:1209] Failed to read from checkpoint due to key is not found
I0628 11:55:31.226652  168134 allocator.go:618] Pods to be removed: []
I0628 11:55:31.249243  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:55:31.287340  168134 allocator.go:978] failed to get pod 2f7545ea-77d8-4e8e-81ef-9135740843bf from allocatedPod cache
I0628 11:55:31.287343  168134 allocator.go:223] failed to get ready annotations for pod 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:55:31.287387  168134 server.go:182] Starting the GRPC server, driver nvidia, queryPort 9400
I0628 11:55:31.287445  168134 server.go:236] Server tencent.com/vcuda-core is running
I0628 11:55:31.287448  168134 server.go:236] Server tencent.com/vcuda-memory is running
I0628 11:55:31.287498  168134 logs.go:79] parsed scheme: ""
I0628 11:55:31.287503  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:31.287512  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/run/gpu-manager.sock 0  <nil>}] <nil>}
I0628 11:55:31.287526  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:31.287546  168134 server.go:250] Server is ready at /var/run/gpu-manager.sock
I0628 11:55:31.287557  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8d0, CONNECTING
I0628 11:55:31.287605  168134 vcore.go:81] Server tencent.com/vcuda-core is ready at /var/lib/kubelet/device-plugins/vcore.sock
I0628 11:55:31.287663  168134 vmemory.go:80] Server tencent.com/vcuda-memory is ready at /var/lib/kubelet/device-plugins/vmemory.sock
I0628 11:55:31.287765  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00019a8d0, READY
I0628 11:55:32.191243  168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.191251  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.191257  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vcore.sock 0  <nil>}] <nil>}
I0628 11:55:32.191281  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.191326  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, CONNECTING
I0628 11:55:32.192059  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000470690, READY
I0628 11:55:32.192073  168134 server.go:90] Server /var/lib/kubelet/device-plugins/vcore.sock is ready, readyServers: 1
I0628 11:55:32.192080  168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.192088  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.192092  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/vmemory.sock 0  <nil>}] <nil>}
I0628 11:55:32.192098  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.192124  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, CONNECTING
I0628 11:55:32.192138  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192206  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192595  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc00072a680, READY
I0628 11:55:32.192613  168134 server.go:90] Server /var/lib/kubelet/device-plugins/vmemory.sock is ready, readyServers: 2
I0628 11:55:32.192625  168134 logs.go:79] parsed scheme: ""
I0628 11:55:32.192628  168134 logs.go:79] scheme "" not registered, fallback to default scheme
I0628 11:55:32.192631  168134 logs.go:79] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/kubelet.sock 0  <nil>}] <nil>}
I0628 11:55:32.192637  168134 logs.go:79] ClientConn switching balancer to "pick_first"
I0628 11:55:32.192657  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, CONNECTING
I0628 11:55:32.192669  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192768  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.192899  168134 logs.go:79] pickfirstBalancer: HandleSubConnStateChange: 0xc000512120, READY
I0628 11:55:32.192909  168134 server.go:334] Register to kubelet with endpoint vcore.sock
I0628 11:55:32.194889  168134 server.go:334] Register to kubelet with endpoint vmemory.sock
I0628 11:55:32.195089  168134 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0628 11:55:32.195323  168134 vcore.go:93] ListAndWatch request for vcore
I0628 11:55:32.195413  168134 vmemory.go:97] ListAndWatch request for vmemory
I0628 11:55:45.591024  168134 vmemory.go:87] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-memory-268435456-69 tencent.com/vcuda-memory-268435456-21 tencent.com/vcuda-memory-268435456-14 tencent.com/vcuda-memory-268435456-26 tencent.com/vcuda-memory-268435456-58 tencent.com/vcuda-memory-268435456-18 tencent.com/vcuda-memory-268435456-45 tencent.com/vcuda-memory-268435456-48 tencent.com/vcuda-memory-268435456-23 tencent.com/vcuda-memory-268435456-39],},},} allocation request for vmemory
I0628 11:55:45.591388  168134 vcore.go:88] &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-17 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1 tencent.com/vcuda-core-63],},},} allocation request for vcore
I0628 11:55:45.591426  168134 allocator.go:663] Request GPU device: tencent.com/vcuda-core-48,tencent.com/vcuda-core-58,tencent.com/vcuda-core-96,tencent.com/vcuda-core-87,tencent.com/vcuda-core-17,tencent.com/vcuda-core-99,tencent.com/vcuda-core-82,tencent.com/vcuda-core-66,tencent.com/vcuda-core-1,tencent.com/vcuda-core-63
I0628 11:55:45.617421  168134 allocator.go:1131] candidate pod tf-notebook in ns default with timestamp 1593345345000000000 is found.
I0628 11:55:45.617432  168134 allocator.go:715] Found candidate Pod 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook) with device count 10
I0628 11:55:45.617492  168134 allocator.go:618] Pods to be removed: []
I0628 11:55:45.624573  168134 tree.go:119] Update device information
I0628 11:55:45.631080  168134 allocator.go:375] Tree graph: ROOT:1
|---GPU0 (pids: [], usedMemory: 0, totalMemory: 24032378880, allocatableCores: 100, allocatableMemory: 24032378880)
I0628 11:55:45.631089  168134 allocator.go:386] Try allocate for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), vcore 10, vmemory 2684354560
I0628 11:55:45.631095  168134 share.go:58] Pick up 0 mask 1, cores: 100, memory: 24032378880
I0628 11:55:45.631101  168134 allocator.go:479] Allocate /run/nvidia/driver/dev/nvidia0 for 3d89b0e4-d5c1-43f4-bcc6-98650521894a(tf-notebook), Meta (0:0)
I0628 11:55:45.631108  168134 tree.go:491] Occupy /run/nvidia/driver/dev/nvidia0 with 10 2684354560, mask 1
I0628 11:55:45.631111  168134 tree.go:518] Occupy /run/nvidia/driver/dev/nvidia0 parent 1
I0628 11:55:45.631115  168134 tree.go:501] /run/nvidia/driver/dev/nvidia0 cores 100->90
I0628 11:55:45.631119  168134 tree.go:507] /run/nvidia/driver/dev/nvidia0 memory 24032378880->21348024320
I0628 11:55:47.875495  168134 vcore.go:103] PreStartContainer request for vcore
I0628 11:55:47.875514  168134 allocator.go:784] get preStartContainer call from k8s, req: &PreStartContainerRequest{DevicesIDs:[tencent.com/vcuda-core-17 tencent.com/vcuda-core-48 tencent.com/vcuda-core-58 tencent.com/vcuda-core-96 tencent.com/vcuda-core-87 tencent.com/vcuda-core-63 tencent.com/vcuda-core-99 tencent.com/vcuda-core-82 tencent.com/vcuda-core-66 tencent.com/vcuda-core-1],}
I0628 11:55:47.875889  168134 manager.go:363] process 3d89b0e4-d5c1-43f4-bcc6-98650521894a
I0628 11:55:47.876058  168134 manager.go:352] Start vDevice server for /etc/gpu-manager/vm/3d89b0e4-d5c1-43f4-bcc6-98650521894a
I0628 11:55:47.889934  168134 vmemory.go:107] PreStartContainer request for vmemory
I0628 11:56:01.287588  168134 allocator.go:204] Checking allocation of pods on this node
W0628 11:56:31.212696  168134 manager.go:290] Find orphaned pod 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:56:31.212700  168134 manager.go:296] Remove directory 2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:56:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:57:01.287570  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:57:31.212208  168134 manager.go:260] Close orphaned server /etc/gpu-manager/vm/2f7545ea-77d8-4e8e-81ef-9135740843bf
I0628 11:57:31.287580  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:58:01.287580  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:58:31.287573  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:59:01.287574  168134 allocator.go:204] Checking allocation of pods on this node
I0628 11:59:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:00:01.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:00:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:01:01.287578  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:01:31.287572  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:02:01.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:02:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:03:01.287581  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:03:29.410276  168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 12 items received
I0628 12:03:31.287567  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:04:01.287576  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:04:31.287508  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:05:01.287503  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:05:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:06:01.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:06:31.287525  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:07:01.287543  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:07:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:08:01.287570  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:08:31.287565  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:09:01.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:09:31.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:10:01.287576  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:10:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:11:01.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:11:31.287569  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:12:01.287571  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:12:31.287561  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:13:01.287571  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:13:11.412348  168134 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 0 items received
I0628 12:13:31.287564  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:14:01.287577  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:14:31.287561  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:15:01.287575  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:15:31.287568  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:16:01.287571  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:16:31.287560  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:17:01.287581  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:17:31.287565  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:18:01.287581  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:18:31.287587  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:19:01.287572  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:19:31.287566  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:20:01.287576  168134 allocator.go:204] Checking allocation of pods on this node
I0628 12:20:31.287563  168134 allocator.go:204] Checking allocation of pods on this node

@mYmNeo
I excluded all the logs from util
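
Side note for anyone reproducing this: the allocation lines above (vcore 10, vmemory 2684354560, i.e. 10 × 268435456-byte chunks ≈ 2.5 GiB) correspond to a pod requesting ten vcuda-core and ten vcuda-memory units. A minimal sketch, assuming the tencent.com/vcuda-* resource names from the log; the image and pod name are placeholders, since the actual tf-notebook spec isn't shown in this thread:

```sh
# Hedged sketch: a pod requesting the same GPU share seen in the log above
# (10 vcuda-core = 10% of one GPU, 10 vcuda-memory units x 256 MiB = 2.5 GiB).
# Image name is hypothetical.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
spec:
  containers:
  - name: tf-notebook
    image: tensorflow/tensorflow:latest-gpu
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 10
EOF
```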

from gpu-manager.

mYmNeo avatar mYmNeo commented on September 17, 2024

I0628 11:55:28.373296 168134 app.go:87] Wait for internal server ready
I0628 11:55:28.376088 168134 volume.go:133] Find binaries: [/usr/bin/gpu-client]
I0628 11:55:28.376139 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376142 168134 volume.go:139] Find 64bit libraries: [/usr/lib64/libcuda-control.so]
I0628 11:55:28.376891 168134 volume.go:133] Find binaries: []
I0628 11:55:28.376927 168134 volume.go:138] Find 32bit libraries: []
I0628 11:55:28.376930 168134 volume.go:139] Find 64bit libraries: []
I0628 11:55:28.376946 168134 volume.go:176] Mirror /usr/bin/gpu-client to /etc/gpu-manager/vdriver/nvidia/bin
I0628 11:55:28.386992 168134 volume.go:176] Mirror /usr/lib64/libcuda-control.so to /etc/gpu-manager/vdriver/nvidia/lib64


The log shows that gpu-manager only detected /usr/lib64/libcuda-control.so and /usr/bin/gpu-client and copied them into /etc/gpu-manager/vdriver/nvidia, but a correct setup should also mirror several NVIDIA libraries. Since you changed copy-lib.sh, the ldcache rebuild procedure doesn't pick up your changes.
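
As a quick check (a hedged suggestion, not an official procedure), you can compare what the volume manager mirrored against what your driver root actually provides; the paths below follow the ones used earlier in this thread (/run/nvidia/driver as the driver root):

```sh
# 1. What gpu-manager actually mirrored for containers; a working setup should
#    show the NVIDIA runtime libraries in addition to libcuda-control.so.
ls -l /etc/gpu-manager/vdriver/nvidia/lib64 /etc/gpu-manager/vdriver/nvidia/bin

# 2. What the driver root provides (the libraries the copy step is expected to
#    pick up, e.g. libcuda.so* and libnvidia-ml.so*).
find /run/nvidia/driver -name 'libcuda.so*' -o -name 'libnvidia-ml.so*' 2>/dev/null

# If step 1 only lists libcuda-control.so and gpu-client while step 2 finds the
# NVIDIA libraries, the mirror/ldcache step is not seeing the modified copy-lib.sh.
```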

from gpu-manager.

raz-bn avatar raz-bn commented on September 17, 2024

@mYmNeo is it solvable?

from gpu-manager.
