
gpu-manager's Introduction

GPU Manager

Build Status

GPU Manager is used for managing NVIDIA GPU devices in a Kubernetes cluster. It implements the Kubernetes DevicePlugin interface, so it is compatible with Kubernetes release versions 1.9 and above.

Compared with the combination of nvidia-docker and nvidia-k8s-plugin, GPU Manager uses native runc without modification, whereas the NVIDIA solution requires a modified runtime. Besides, we also support metrics reporting without deploying new components.

To schedule a GPU payload correctly, GPU Manager should work with gpu-admission, which is a Kubernetes scheduler plugin.

GPU Manager also supports payloads that use a fraction of a GPU device, such as 0.1 of a card or 100MiB of GPU device memory. If you want this kind of feature, please refer to the vcuda-controller project.

Build

1. Build binary

  • Prerequisite
    • CUDA toolkit
make

2. Build image

  • Prerequisite
    • Docker
make img

Prebuilt image

Prebuilt image can be found at thomassong/gpu-manager

Deploy

GPU Manager runs as a DaemonSet. Because of RBAC restrictions and hybrid clusters, you need to perform the following steps to make this DaemonSet run correctly.

  • service account and clusterrole
kubectl create sa gpu-manager -n kube-system
kubectl create clusterrolebinding gpu-manager-role --clusterrole=cluster-admin --serviceaccount=kube-system:gpu-manager
  • label node with nvidia-device-enable=enable
kubectl label node <node> nvidia-device-enable=enable
  • submit daemonset yaml
kubectl create -f gpu-manager.yaml

Pod template example

There is nothing special about submitting a Pod, except that the GPU resource description is no longer simply 1. GPU resources are described as 100 tencent.com/vcuda-core for 1 GPU and N tencent.com/vcuda-memory for GPU memory (1 tencent.com/vcuda-memory means 256MiB of GPU memory). Because of the extended-resource validation limitations of Kubernetes, to support GPU utilization limits you should add tencent.com/vcuda-core-limit: XX to the annotations field of a Pod.

Notice: the value of tencent.com/vcuda-core must be either a multiple of 100 or a value smaller than 100. For example, 100, 200, or 20 are valid values, but 150 or 250 are invalid.
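
For reference, here is a small Go sketch of the unit arithmetic (a hypothetical helper, not part of gpu-manager): 100 tencent.com/vcuda-core units equal one full GPU, and one tencent.com/vcuda-memory unit equals 256MiB.

package main

import "fmt"

// vcudaCore expresses a GPU fraction as tencent.com/vcuda-core units
// (100 units == 1 full GPU).
func vcudaCore(gpuFraction float64) int {
	return int(gpuFraction * 100)
}

// vcudaMemory expresses a memory request in MiB as tencent.com/vcuda-memory
// units (1 unit == 256 MiB), rounding up.
func vcudaMemory(memMiB int) int {
	return (memMiB + 255) / 256
}

func main() {
	fmt.Println(vcudaCore(0.5))    // 50
	fmt.Println(vcudaMemory(7680)) // 30
}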

  • Submit a Pod with 0.3 GPU utilization and 7680MiB GPU memory with 0.5 GPU utilization limit
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
  annotations:
    tencent.com/vcuda-core-limit: "50"
spec:
  restartPolicy: Never
  containers:
  - image: <test-image>
    name: nvidia
    command:
    - /usr/local/nvidia/bin/nvidia-smi
    - pmon
    - -d
    - "10"
    resources:
      requests:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
  • Submit a Pod with 2 GPU card
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
spec:
  restartPolicy: Never
  containers:
  - image: <test-image>
    name: nvidia
    command:
    - /usr/local/nvidia/bin/nvidia-smi
    - pmon
    - -d
    - "10"
    resources:
      requests:
        tencent.com/vcuda-core: 200
        tencent.com/vcuda-memory: 60
      limits:
        tencent.com/vcuda-core: 200
        tencent.com/vcuda-memory: 60

FAQ

If you have questions about this project, you can first refer to the FAQ to find a solution.

gpu-manager's People

Contributors

aisensiy, dadap, fighterhit, genedna, herobcat, horizen, kitt1987, linquanisaac, mymneo, pokerfacesad


gpu-manager's Issues

cni@v0.6.0 download error

When running make, multiple attempts to download cni failed.

hack/build.sh manager client
go: downloading google.golang.org/grpc v1.25.1
go: downloading github.com/gogo/protobuf v1.1.1
go: downloading github.com/grpc-ecosystem/grpc-gateway v1.12.1
go: downloading golang.org/x/net v0.0.0-20191109021931-daa7c04131f5
go: downloading github.com/golang/protobuf v1.3.2
go: downloading github.com/prometheus/client_golang v1.2.1
go: downloading golang.org/x/sys v0.0.0-20191010194322-b09406accb47
go: downloading github.com/docker/go-connections v0.3.0
go: downloading github.com/kubernetes/kubernetes/staging/src/k8s.io/apimachinery v0.0.0-20190816231410-2d3c76f9091b
go: downloading github.com/kubernetes/kubernetes/staging/src/k8s.io/client-go v0.0.0-20190816231410-2d3c76f9091b
go: downloading github.com/docker/docker v0.7.3-0.20190327010347-be7ac8be2ae0
go: downloading github.com/tkestack/go-nvml v0.0.0-20191217064248-7363e630a33e
go: downloading github.com/kubernetes/kubernetes/staging/src/k8s.io/api v0.0.0-20190816231410-2d3c76f9091b
go: downloading github.com/kubernetes/kubernetes/staging/src/k8s.io/cri-api v0.0.0-20190816231410-2d3c76f9091b
go: downloading github.com/kubernetes/kubernetes/staging/src/k8s.io/apiserver v0.0.0-20190816231410-2d3c76f9091b
go: downloading github.com/opencontainers/runc v0.0.0-20181113202123-f000fe11ece1
go: downloading github.com/docker/distribution v0.0.0-20170726174610-edc3ab29cdff
go: downloading github.com/spf13/afero v0.0.0-20160816080757-b28a7effac97
go: downloading google.golang.org/genproto v0.0.0-20191108220845-16a3f7862a1a
go: downloading github.com/google/cadvisor v0.33.2-0.20190411163913-9db8c7dee20a
go: downloading github.com/kubernetes/kubernetes/staging/src/k8s.io/apiextensions-apiserver v0.0.0-20190816231410-2d3c76f9091b
go: downloading github.com/containernetworking/cni v0.6.0
go: downloading github.com/emicklei/go-restful v0.0.0-20170410110728-ff4f55a20633
go: downloading gopkg.in/square/go-jose.v2 v2.0.0-20180411045311-89060dee6a84
go: finding github.com/coreos/go-systemd v0.0.0-20180511133405-39ca1b05acc7
go: finding github.com/opencontainers/go-digest v0.0.0-20170106003457-a6d0ee40d420
go: finding github.com/vishvananda/netlink v0.0.0-20171020171820-b2de5d10e38e
go: finding k8s.io/utils v0.0.0-20190221042446-c2654d5206da
go: finding github.com/kubernetes/kubernetes/staging/src/k8s.io/cloud-provider v0.0.0-20190816231410-2d3c76f9091b
go: finding github.com/godbus/dbus v0.0.0-20151105175453-c7fdd8b5cd55
go: finding github.com/vishvananda/netns v0.0.0-20171111001504-be1fbeda1936
go: finding github.com/golang/groupcache v0.0.0-20160516000752-02826c3e7903
go: finding github.com/opencontainers/selinux v0.0.0-20170621221121-4a2974bf1ee9
build tkestack.io/gpu-manager/cmd/manager: cannot load github.com/containernetworking/cni/libcni: github.com/containernetworking/cni@v0.6.0: Get https://proxy.golang.org/github.com/containernetworking/cni/@v/v0.6.0.zip: dial tcp 216.58.200.241:443: i/o timeout
Makefile:3: recipe for target 'all' failed
make: *** [all] Error 1

OS: Ubuntu 18.04
Go version: 1.13.3

Cannot submit a Pod

After executing make and make img, I set up gpu-manager as a DaemonSet and followed the README to set up the environment.

When I submit a pod, I get an error as follows:

error: unable to decode "test01.yaml": resource.metadataOnlyObject.ObjectMeta: v1.ObjectMeta.Annotations: ReadString: expects " or n, but found 5, error found in #10 byte of ...|e-limit":50},"name":|..., bigger context ...|":{"annotations":{"tencent.com/vcuda-core-limit":50},"name":"vcuda"},"spec":{"containers":[{"command|...

Pod .yaml

apiVersion: v1
kind: Pod
metadata:
  name: vcuda
  annotations:
    tencent.com/vcuda-core-limit: 50
spec:
  restartPolicy: Never
  containers:
  - image: tf1.15-backend:1.0
    name: nvidia
    command:
    - /usr/local/nvidia/bin/nvidia-smi
    - pmon
    - -d
    - 10
    resources:
      requests:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30

I run the command as a non-root user because my account does not have root privileges.

Lack of developer guide

Describe the bug
There is no document to describe how to set up a development environment, as well as how to build and test the code.

Environment

  • OS: Linux VM_149_11_centos 3.10.107-1-tlinux2_kvm_guest-0049 #1 SMP Tue Jul 30 23:46:29 CST 2019 x86_64 x86_64 x86_64 GNU/Linux
  • golang: go version go1.12.4 linux/amd64

What is the role of gpu-client?

I see that gpu-client is copied to /etc/gpu-manager/vdriver, and the flow chart in the manager.go file indicates that the container will fork and call gpu-client, but I am confused about when gpu-client is called. Why does this happen?

  • pkg/services/virtual-manager/manager.go
//                Host                     |                Container
//                                         |
//                                         |
//  .-----------.                          |
//  | allocator |----------.               |             ___________
//  '-----------'   PodUID |               |             \          \
//                         v               |              ) User App )--------.
//                .-----------------.      |             /__________/         |
//     .----------| virtual-manager |      |                                  |
//     |          '-----------------'      |                                  |
// $VirtualManagerPath/PodUID              |                                  |
//     |                                   |       read /proc/self/cgroup     |
//     |  .------------------.             |       to get PodUID, ContainerID |
//     '->| create directory |------.      |                                  |
//        '------------------'      |      |                                  |
//                                  |      |                                  |
//                 .----------------'      |       .----------------------.   |
//                 |                       |       | fork call gpu-client |<--'
//                 |                       |       '----------------------'
//                 v                       |                   |
//    .------------------------.           |                   |
//   ( wait for client register )<-------PodUID, ContainerID---'
//    '------------------------'           |
//                 |                       |
//                 v                       |
//   .--------------------------.          |
//   | locate pod and container |          |
//   '--------------------------'          |
//                 |                       |
//                 v                       |
//   .---------------------------.         |
//   | write down configure and  |         |
//   | pid file with containerID |         |
//   | as name                   |         |
//   '---------------------------'         |
//                                         |
//                                         |
//                                         v

When I launch gpu-manager, the vcuda-memory limit is invalid.

When I launch gpu-manager, the log shows the error "Unable to set Type=notify in systemd service file?", and I have read issue https://github.com/tkestack/gpu-manager/issues/7. The difference is that the resources already exist on the node, and I can create a pod through kubectl create -f test1.yaml successfully (screenshots omitted).

The problem is that when the pod is running, the vcuda-memory limit is not enforced. For example, when the pod requests 1 vcuda-memory, the actual memory used exceeds this value (screenshots omitted).

Memory allocation exception

Describe

When I give a pod "1" vcuda-memory, it uses 256Mi of GPU memory on each GPU device. Does the limit apply to each GPU card individually, or to the total across all GPUs on the node?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40	  Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:65:00.0 Off |                  N/A |
| 41%   31C    P2    33W / 225W |    267MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:B3:00.0 Off |                  N/A |
| 41%   31C    P2    46W / 225W |    243MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Environment

GPU Info :

GeForce RTX 2080 - 8G * 2

Kubernetes

v1.13.2

Test file

import tensorflow as tf
import timeit

with tf.device('/cpu:0'):
	cpu_a = tf.random.normal([10000, 1000])
	cpu_b = tf.random.normal([1000, 2000])
	print(cpu_a.device, cpu_b.device)

with tf.device('/gpu:0'):
	gpu_a = tf.random.normal([10000, 1000])
	gpu_b = tf.random.normal([1000, 2000])
	print(gpu_a.device, gpu_b.device)

def cpu_run():
	with tf.device('/cpu:0'):
		c = tf.matmul(cpu_a, cpu_b)
	return c

def gpu_run():
	with tf.device('/gpu:0'):
		c = tf.matmul(gpu_a, gpu_b)
	return c
# warm up
cpu_time = timeit.timeit(cpu_run, number=10)
gpu_time = timeit.timeit(gpu_run, number=10)
print('warmup:', cpu_time, gpu_time)

cpu_time = timeit.timeit(cpu_run, number=10)
gpu_time = timeit.timeit(gpu_run, number=10)
print('run time:', cpu_time, gpu_time)

Resource Limit

        resources:
          limits:
            tencent.com/vcuda-core: "30"
            tencent.com/vcuda-memory: "1"
          requests:
            tencent.com/vcuda-core: "30"
            tencent.com/vcuda-memory: "1"

is this plugin only for TKE clusters?

I've installed gpu-manager and gpu-admission per the documentation, but I just can't get it to work. Is this plugin only to be used for TKE clusters, or is it intended for use in any Kubernetes cluster? Thank you.

How is the gpu-client being called?

According to this diagram presented in the gpu-manager/pkg/services/virtual-manager/manager.go file:

//                Host                     |                Container
//                                         |
//                                         |
//  .-----------.                          |
//  | allocator |----------.               |             ___________
//  '-----------'   PodUID |               |             \          \
//                         v               |              ) User App )--------.
//                .-----------------.      |             /__________/         |
//     .----------| virtual-manager |      |                                  |
//     |          '-----------------'      |                                  |
// $VirtualManagerPath/PodUID              |                                  |
//     |                                   |       read /proc/self/cgroup     |
//     |  .------------------.             |       to get PodUID, ContainerID |
//     '->| create directory |------.      |                                  |
//        '------------------'      |      |                                  |
//                                  |      |                                  |
//                 .----------------'      |       .----------------------.   |
//                 |                       |       | fork call gpu-client |<--'
//                 |                       |       '----------------------'
//                 v                       |                   |
//    .------------------------.           |                   |
//   ( wait for client register )<-------PodUID, ContainerID---'
//    '------------------------'           |
//                 |                       |
//                 v                       |
//   .--------------------------.          |
//   | locate pod and container |          |
//   '--------------------------'          |
//                 |                       |
//                 v                       |
//   .---------------------------.         |
//   | write down configure and  |         |
//   | pid file with containerID |         |
//   | as name                   |         |
//   '---------------------------'         |
//                                         |
//                                         |
//                                         v

Somehow, when the user app (pod) is started, gpu-client is called.
Can you please describe how the gpu-client code ends up in the user app (pod), and what process results in it being run?
I went over the gpu-manager repo in order to find more details about it, but couldn't.

TensorFlow program hangs when I use a fractional GPU resource

pod.yaml

Here is my .yaml file for creating pod

apiVersion: v1
kind: Pod
metadata:
  name: tf-vcuda-pod
spec:
  restartPolicy: Never
  hostNetwork: true
  containers:
  - image: tensorflow/tensorflow:1.13.1-gpu-py3
    name: tensorflow-vcuda-test
    command: ["/bin/bash", "-ce", "tail -f /dev/null"]
    volumeMounts:
          - mountPath: /home/gpu
            name: tf-code
    resources:
      requests:
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 30

training code

Here is my tensorflow code, just a simple CNN

import tensorflow as tf
from numpy.random import RandomState

batch_size = 8

w1 = tf.Variable(tf.random_normal([2,3],stddev=1,seed=1))
w2 = tf.Variable(tf.random_normal([3,1],stddev=1,seed=1))

x = tf.placeholder(tf.float32,shape=(None,2),name='x-input')
y_ = tf.placeholder(tf.float32,shape=(None,1),name='y-input')

a = tf.matmul(x,w1)
y = tf.matmul(a,w2)

cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y,1e-10,1.0)))
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)

rdm = RandomState(1)
dataset_size = 128000
X = rdm.rand(dataset_size,2)
Y = [[int(x1+x2 < 1)] for (x1,x2) in X]

with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)

    print(sess.run(w1))
    print(sess.run(w2))

    STEPS = 900000
    for i in range(STEPS):
        start = (i * batch_size) % dataset_size
        end = min(start+batch_size,dataset_size)

        sess.run(train_step,feed_dict={x:X[start:end],y_:Y[start:end]})

        if i%1000 == 0:
            total_cross_entropy = sess.run(cross_entropy,feed_dict={x:X,y_:Y})
            print("After %d training step(s),cross entropy on all data is %g" % (i,total_cross_entropy))

    print(sess.run(w1))
    print(sess.run(w2))

problem

The program hangs after outputting Created TensorFlow device

$ python CNN_TensorFlow.py 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-08 02:28:04.238654: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-08 02:28:04.416921: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x41c9670 executing computations on platform CUDA. Devices:
2020-07-08 02:28:04.416985: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-07-08 02:28:04.422450: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599995000 Hz
2020-07-08 02:28:04.427351: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x431e600 executing computations on platform Host. Devices:
2020-07-08 02:28:04.427406: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-08 02:28:04.438859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.75GiB freeMemory: 11.69GiB
2020-07-08 02:28:04.438910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-08 02:28:04.443735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-08 02:28:04.443779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-07-08 02:28:04.443794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-07-08 02:28:04.452388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)

log

Here is the log; it repeatedly outputs Hijacking nvml...

python CNN_TensorFlow.py 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-08 02:31:11.106661: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/tmp/cuda-control/src/loader.c:941 config file: /etc/vcuda/a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163/vcuda.config
/tmp/cuda-control/src/loader.c:942 pid file: /etc/vcuda/a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163/pids.config
/tmp/cuda-control/src/loader.c:946 register to remote: pod uid: 24993e70-c0c2-11ea-97bf-40167e346bb0, cont id: a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
/tmp/cuda-control/src/loader.c:767 Start hijacking
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuEGLInit
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuDeviceGetNvSciSyncAttributes
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecHostNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecMemcpyNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecMemsetNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecUpdate
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemAddressFree
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemAddressReserve
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemCreate
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemExportToShareableHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAccess
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAllocationGranularity
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAllocationPropertiesFromHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemImportFromShareableHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemMap
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemRelease
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemSetAccess
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemUnmap
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetGridLicensableFeatures_v3
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetHostVgpuMode
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetPgpuMetadataString
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlVgpuInstanceGetEccMode
/tmp/cuda-control/src/hijack_call.c:500 total cuda cores: 851968
/tmp/cuda-control/src/hijack_call.c:217 start utilization_watcher
/tmp/cuda-control/src/hijack_call.c:218 sm: 13, thread per sm: 2048
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
2020-07-08 02:31:11.291318: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f81f30 executing computations on platform CUDA. Devices:
2020-07-08 02:31:11.291408: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-07-08 02:31:11.296666: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599995000 Hz
2020-07-08 02:31:11.302402: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x50d6ed0 executing computations on platform Host. Devices:
2020-07-08 02:31:11.302447: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
/tmp/cuda-control/src/hijack_call.c:399 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:402 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:412 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:425 summary: 27275 used 58982400
/tmp/cuda-control/src/hijack_call.c:432 27275 use memory: 58982400
/tmp/cuda-control/src/hijack_call.c:437 total used memory: 58982400
/tmp/cuda-control/src/hijack_call.c:440 Hijacking nvmlShutdown

2020-07-08 02:31:11.313704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.75GiB freeMemory: 11.69GiB
2020-07-08 02:31:11.313754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
2020-07-08 02:31:11.315833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-08 02:31:11.315868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-07-08 02:31:11.315882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
/tmp/cuda-control/src/hijack_call.c:399 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:402 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:412 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:425 summary: 27275 used 58982400
/tmp/cuda-control/src/hijack_call.c:432 27275 use memory: 58982400
/tmp/cuda-control/src/hijack_call.c:437 total used memory: 58982400
/tmp/cuda-control/src/hijack_call.c:440 Hijacking nvmlShutdown

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

2020-07-08 02:31:11.324331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

gpu-manager failed

I got some problems when running gpu-manager.
The error looks like this:
copy /usr/local/host/lib/nvidia-440/bin/nvidia-smi to /usr/local/nvidia/bin/
cp: not writing through dangling symlink '/usr/local/nvidia/bin/nvidia-smi'

gpu-manager copies some nvidia* files, but some of these files are symbolic links, so the copy fails.

Related files:
nvidia-cuda-mps-control
nvidia-cuda-mps-server
nvidia-debugdump
nvidia-persistenced
nvidia-smi

CUDA version: 10.2
gpu-manager version: master
OS version: ubuntu 16.04

vcuda-controller can not limit GPU utilization

GPU-manager 1.1.0 cannot limit GPU utilization; I have installed vcuda-controller.

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
| N/A   69C    P0    54W /  70W |   2102MiB / 15079MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:86:00.0 Off |                    0 |
| N/A   65C    P0    58W /  70W |   2110MiB / 15079MiB |     70%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3460176      C   python3                                      523MiB |
|    0   3460231      C   python3                                      523MiB |
|    0   3460358      C   python3                                      523MiB |
|    0   3460379      C   python3                                      523MiB |
|    1   3460340      C   python3                                      523MiB |
|    1   3460448      C   python3                                      523MiB |
|    1   3460506      C   python3                                      531MiB |
|    1   3460583      C   python3                                      523MiB |
+-----------------------------------------------------------------------------+

The log of Pod:

/tmp/cuda-control/src/hijack_call.c:277 util: 85, up_limit: 24,  share: 929690, cur: 1310720
/tmp/cuda-control/src/hijack_call.c:277 util: 85, up_limit: 24,  share: 739175, cur: 1310720
/tmp/cuda-control/src/hijack_call.c:277 util: 85, up_limit: 24,  share: 548660, cur: 1096358
/tmp/cuda-control/src/hijack_call.c:277 util: 85, up_limit: 24,  share: 358145, cur: 671260
/tmp/cuda-control/src/hijack_call.c:277 util: 85, up_limit: 24,  share: 167630, cur: 164177
/tmp/cuda-control/src/hijack_call.c:277 util: 85, up_limit: 24,  share: 0, cur: -7459
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 24,  share: 29491, cur: 22032
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 24,  share: 58982, cur: 58557
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 24,  share: 88473, cur: 83669
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 571337, cur: 568058
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 552854, cur: 552652
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 534371, cur: 533840
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 515888, cur: 507754
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 497405, cur: 490251
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 478922, cur: 478331
/tmp/cuda-control/src/hijack_call.c:277 util: 43, up_limit: 24,  share: 460439, cur: 460297
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 24,  share: 489930, cur: 489114
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 24,  share: 519421, cur: 518216

Nvidia node mismatch for pod, pick up:/dev/nvidia1 predicate: /dev/nvidia0, which is unexpected.

I got an error when running a pod; the error is:

Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1  predicate: /dev/nvidia0, which is unexpected.

It seems that gpu-admission assigned /dev/nvidia0 on the node, but gpu-manager picked /dev/nvidia1 on the same node; the two values are not equal, so ... ...

I located the relevant code, as shown (screenshot omitted).

Please help analyze this.

  • example0.yaml
    ... ... 
    resources:
      requests:
        tencent.com/vcuda-core: 60
        tencent.com/vcuda-memory: 25
      limits:
        tencent.com/vcuda-core: 60
        tencent.com/vcuda-memory: 25
    ... ... 

See below for more information:

  • kubectl describe pod example0
[root@node3 truetest]# kubectl describe pods example0
Name:               example0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               node3/
Start Time:         Tue, 14 Apr 2020 16:15:03 +0800
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"example0","namespace":"default"},"spec":{"containers":[{"env":[{"name...
                    tencent.com/gpu-assigned: false
                    tencent.com/predicate-gpu-idx-0: 0
                    tencent.com/predicate-node: node3
                    tencent.com/predicate-time: 1586852103661396020
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1  predicate: /dev/nvidia0, which is unexpected.
IP:                 
Containers:
  example0:
    Image:      test_gpu:v6.6
    Port:       <none>
    Host Port:  <none>
    Limits:
      tencent.com/vcuda-core:    60
      tencent.com/vcuda-memory:  25
    Requests:
      tencent.com/vcuda-core:    60
      tencent.com/vcuda-memory:  25
    Environment:
      LD_LIBRARY_PATH:  /usr/local/cuda-10.0/lib64:/usr/local/nvidia/lib64
      LOGGER_LEVEL:     5
    Mounts:
      /usr/local/cuda-10.0 from cuda-lib (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6jbrl (ro)
Volumes:
  cuda-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/cuda-10.0
    HostPathType:  
  default-token-6jbrl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6jbrl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age               From               Message
  ----     ------                    ----              ----               -------
  Warning  FailedScheduling          20s               default-scheduler  0/3 nodes are available: 3 Insufficient tencent.com/vcuda-core, 3 Insufficient tencent.com/vcuda-memory.
  Normal   Scheduled                 20s               default-scheduler  Successfully assigned default/example0 to node3
  Warning  UnexpectedAdmissionError  20s               kubelet, node3  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod example0(example0), pick up:/dev/nvidia1  predicate: /dev/nvidia0, which is unexpected.
  Warning  FailedMount               4s (x6 over 20s)  kubelet, node3  MountVolume.SetUp failed for volume "default-token-6jbrl" : object "default"/"default-token-6jbrl" not registered

At this time, the GPU usage status of node3 is:

[root@node3 test]# nvidia-smi
Tue Apr 14 16:18:28 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:15:00.0 Off |                  N/A |
| 22%   38C    P8    23W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:21:00.0 Off |                  N/A |
| 23%   42C    P8    11W / 250W |      0MiB / 10997MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Environment

  • Kubernetes version : v1.14.3
  • tensorflow version: tensorflow_1.14_py3_gpu_cuda10.0:latest

Question about link algorithm

I have a question about the link algorithm. I see that the Gaia Scheduler paper says:

 Among the four types of communication methods, the communication overhead of SOC is the largest, followed by PXB, and then PHB. The communication overhead between GPUs in PIX communication mode is the smallest.

But gpu-manager defines them as follows:

//types.go in go-nvml 
const (
	TOPOLOGY_INTERNAL   GpuTopologyLevel = iota
	TOPOLOGY_SINGLE                      = 10
	TOPOLOGY_MULTIPLE                    = 20
	TOPOLOGY_HOSTBRIDGE                  = 30
	TOPOLOGY_CPU                         = 40
	TOPOLOGY_SYSTEM                      = 50
	TOPOLOGY_UNKNOWN                     = 60
)


//tree_util.go
func parseToGpuTopologyLevel(str string) nvml.GpuTopologyLevel {
	switch str {
	case "PIX":
		return nvml.TOPOLOGY_SINGLE
	case "PXB":
		return nvml.TOPOLOGY_MULTIPLE
	case "PHB":
		return nvml.TOPOLOGY_HOSTBRIDGE
	case "SOC":
		return nvml.TOPOLOGY_CPU
	}

	if strings.HasPrefix(str, "GPU") {
		return nvml.TOPOLOGY_INTERNAL
	}

	return nvml.TOPOLOGY_UNKNOWN
}

It shows that the PXB value is smaller than the PHB value, so when requesting multiple GPU cards, gpu-manager's link algorithm will tend to select cards with PXB topology. Does this contradict the paper or the actual situation? Thanks!
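
As an illustration only (a standalone Go sketch with made-up GPU pairs, not the actual gpu-manager allocator code), sorting candidates by the topology constants above always picks the pair with the smallest level first, which is why a PXB (20) pair is preferred over a PHB (30) pair:

package main

import (
	"fmt"
	"sort"
)

// Topology levels as defined in go-nvml; a smaller value means a closer link.
const (
	topologySingle     = 10 // PIX
	topologyMultiple   = 20 // PXB
	topologyHostbridge = 30 // PHB
	topologyCPU        = 40 // SOC
)

type gpuPair struct {
	a, b  int
	level int
}

func main() {
	pairs := []gpuPair{
		{0, 1, topologyHostbridge},
		{2, 3, topologyMultiple},
		{4, 5, topologyCPU},
	}
	// Prefer the pair connected by the closest link, i.e. the smallest level.
	sort.Slice(pairs, func(i, j int) bool { return pairs[i].level < pairs[j].level })
	fmt.Printf("picked GPUs %d,%d (level %d)\n", pairs[0].a, pairs[0].b, pairs[0].level)
}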

Undefined symbol error was reported when I set up gpu-manager

Hi, I tried to set up gpu-manager in my environment, but I found some problems. When gpu-manager runs, it invokes nvml.Init(), which is implemented in go-nvml/wrapper.go, but most of that file is commented out.
I checked the source and found the new implementation in vcuda-controller, but it does not reimplement NVML_D(xxx).
Am I missing something, or is there a version mismatch?

Version: master

Thank you.

Build error on 'make vendor'

Describe the bug
I used the make vendor command to fetch the modules, but errors were output as below:

~/gopath/src/tkestack.io/tkestack/gpu-manager]# make vendor
……
[WARN]  Unable to checkout tkestack.io/tkestack/nvml
[ERROR] Update failed for tkestack.io/tkestack/nvml: Unable to get repository: Cloning into '/root/.glide/cache/src/http-tkestack.io-tkestack-go-nvml.git'...
fatal: http://tkestack.io/tkestack/go-nvml.git/info/refs not valid: is this a git repository?
: exit status 128
[INFO]  --> Fetching updates for golang.org/x/time
[INFO]  --> Fetching updates for gopkg.in/square/go-jose.v2
[INFO]  --> Fetching updates for github.com/modern-go/concurrent
[INFO]  --> Fetching updates for github.com/Rican7/retry
[INFO]  --> Fetching updates for github.com/mattn/go-shellwords
[INFO]  --> Fetching updates for github.com/mesos/mesos-go
[INFO]  --> Fetching updates for github.com/pquerna/ffjson
[INFO]  --> Fetching updates for github.com/containerd/console
[INFO]  --> Fetching updates for github.com/cyphar/filepath-securejoin
[ERROR] Failed to do initial checkout of config: Unable to get repository: Cloning into '/root/.glide/cache/src/http-tkestack.io-tkestack-go-nvml.git'...
fatal: http://tkestack.io/tkestack/go-nvml.git/info/refs not valid: is this a git repository?
: exit status 128
make: *** [vendor] Error 1

Environment

  • OS: Linux VM_149_11_centos 3.10.107-1-tlinux2_kvm_guest-0049 #1 SMP Tue Jul 30 23:46:29 CST 2019 x86_64 x86_64 x86_64 GNU/Linux
  • golang: go version go1.12.4 linux/amd64

The following problem occurs when running make.

[root@instance-kakwck8j ~]# cd gpu-manager-1.1.0/
[root@instance-kakwck8j gpu-manager-1.1.0]# make
hack/build.sh manager client
go: k8s.io/[email protected] requires
k8s.io/[email protected] requires
github.com/ghodss/[email protected]: invalid version: git fetch --unshallow -f https://github.com/ghodss/yaml in /root/go/pkg/mod/cache/vcs/5c75ad62eb9c289b6ed86c76998b4ab8c8545a841036e879d703a2bbc5fcfcea: exit status 128:
fatal: git fetch-pack: expected shallow list
make: *** [all] Error 1

Curious about how to determine the pod container for Allocate RPC in gpu-manager

Hi guys,
I have just gone through the code of the Allocate function in gpu-manager, and I am curious why the selected pod is the right one for allocation. The logic seems to be as follows (a rough sketch follows the list):

  1. List all pending pods which have GPU requirement.
  2. Sort pods by its predicating time.
  3. Find a pod which has container allocating the same count of GPU resources.
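
Here is a rough Go sketch of that matching logic as described above (hypothetical types and helper names, not the actual gpu-manager Allocate code):

package main

import (
	"fmt"
	"sort"
)

type podCandidate struct {
	name          string
	predicateTime int64 // value of the tencent.com/predicate-time annotation
	vcudaCore     int64 // tencent.com/vcuda-core requested by one container
}

// pickPod returns the pending pod whose container requests exactly the same
// number of vcuda-core devices as this Allocate call, preferring the pod
// that was predicated earliest.
func pickPod(pending []podCandidate, requested int64) *podCandidate {
	sort.Slice(pending, func(i, j int) bool {
		return pending[i].predicateTime < pending[j].predicateTime
	})
	for i := range pending {
		if pending[i].vcudaCore == requested {
			return &pending[i]
		}
	}
	return nil
}

func main() {
	pods := []podCandidate{
		{"pod-b", 200, 100},
		{"pod-a", 100, 50},
	}
	if p := pickPod(pods, 50); p != nil {
		fmt.Println("matched", p.name) // matched pod-a
	}
}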

In my mind, the predicate-time annotation can't guarantee the same order in which pods are bound to the node, since the binding process runs concurrently. Besides, kubelet should have its own order for allocating resources to containers (I'm not sure about that). So my doubt is why your solution correctly selects the corresponding pod.

Many thanks if I can get the answer.

Can not make different vcuda-core limits and requests

The article "GaiaGPU: Sharing GPUs in Container Clouds" presents great work on NVIDIA GPU virtualization. Here is a question about the "Elastic Resource Allocation" part.

In this part, an experiment shows that a container with 0.3 GPU can use more resources if another part of the GPU is idle, but in the pod configuration I can't make requests.tencent.com/vcuda-core and limits.tencent.com/vcuda-core different. For example, if I define a pod with limits.tencent.com/vcuda-core: 90 and requests.tencent.com/vcuda-core: 30, it shows an error like this:

The Pod "vgpu-test-1" is invalid: spec.containers[0].resources.requests: Invalid value: "30": must be equal to tencent.com/vcuda-core limit

should gpu-manager deal with LD_LIBRARY_PATH more carefully?

	// LD_LIBRARY_PATH
	ctntResp.Envs["LD_LIBRARY_PATH"] = "/usr/local/nvidia/lib64"
	for _, env := range container.Env {
		if env.Name == "compat32" && strings.ToLower(env.Value) == "true" {
			ctntResp.Envs["LD_LIBRARY_PATH"] = "/usr/local/nvidia/lib"
		}
	}

This causes containers that rely on LD_LIBRARY_PATH to not work correctly.
For example:

I use the tensorflow-serving image to serve a model from HDFS, so I need LD_LIBRARY_PATH for libhdfs.so. My Dockerfile looks like this:

FROM tensorflow/serving:1.14.0-gpu
...

ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
ENV HADOOP_HDFS_HOME /root/hadoop
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server

...

Because of this, I have to rebuild these images, changing them to:

RUN echo '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server' > /etc/ld.so.conf.d/hdfs.conf

So I think gpu-manager should deal with LD_LIBRARY_PATH more carefully. One possible way is to only overwrite the parts of the LD_LIBRARY_PATH value that contain cuda or nvidia paths, as sketched below.
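
For example, a minimal Go sketch of that merge approach (a hypothetical helper, not gpu-manager's current behaviour) would keep the container's own entries and only replace the cuda/nvidia ones:

package main

import (
	"fmt"
	"strings"
)

// mergeLDLibraryPath keeps the user's LD_LIBRARY_PATH entries, drops the ones
// that point at cuda or nvidia paths, and prepends the injected nvidia path.
func mergeLDLibraryPath(userValue, nvidiaPath string) string {
	kept := []string{}
	for _, p := range strings.Split(userValue, ":") {
		if p == "" {
			continue
		}
		lower := strings.ToLower(p)
		if strings.Contains(lower, "cuda") || strings.Contains(lower, "nvidia") {
			continue // replaced by the injected nvidia path
		}
		kept = append(kept, p)
	}
	return strings.Join(append([]string{nvidiaPath}, kept...), ":")
}

func main() {
	user := "/usr/local/cuda/lib64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server"
	fmt.Println(mergeLDLibraryPath(user, "/usr/local/nvidia/lib64"))
	// Output: /usr/local/nvidia/lib64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server
}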

Cannot deploy the gpu-manager with errors

I tried to deploy gpu-manager with the command below:
kubectl apply -f gpu-manager.yaml
There I got an error.
After reading #7, I mounted /root/.kube/ into the DaemonSet,
and I got some additional information in the logs, but it still can't run properly.
Here are the gpu-manager warnings:

E0609 07:18:41.624501 1373 server.go:132] Unable to set Type=notify in systemd service file?
E0609 07:18:42.676956 1373 tree.go:337] No topology level found at 0
W0609 07:18:42.688012 1373 allocator.go:1298] Failed to read from checkpoint due to key is not found

Our env:

CentOS Linux release 7.8.2003 (Core)
3.10.0-1127.8.2.el7.x86_64
kubernetes: 1.14.1
kubeflow:1.0
img: v1.0.0

Contact me if more information is needed.

Nvidia-persistenced error

Hi,

I try to run the plugin on AWS with a T4 card, the 440.33.01 driver, and CUDA 10.2. My GPU Manager version is 1.10. gpu-admission is running, and the gpu-manager pod starts and runs, but I don't have vcuda-memory and vcuda-core in my node resources, so my pod is stuck in FailedScheduling with the error "1 Insufficient tencent.com/vcuda-core, 1 Insufficient tencent.com/vcuda-memory".
When I look at the gpu-manager container logs, I see a problem with nvidia-persistenced:

rebuild ldcache
launch gpu manager
E1112 16:19:24.756540 24007 server.go:120] Can not start volume managerImpl, err /usr/local/nvidia/bin/nvidia-persistenced: bad magic number '[35 33 47 98]' in record at byte 0x0

When I enter the container and try to run /usr/local/nvidia/bin/nvidia-persistenced myself, it tries to access /usr/bin/nvidia-persistenced and stops. After creating a symbolic link in /usr/bin there is no error when running it manually, but the original error stays the same.

Unable to set Type=notify, no topology level

Hi,
After solving the previous issue, a new one occurred.

launch gpu manager
E1116 09:38:02.400951 10735 server.go:133] Unable to set Type=notify in systemd service file?
E1116 09:38:03.446978 10735 tree.go:337] No topology level found at 0

GPU Manager seems to run, but the node description shows vcuda-memory 0 and vcuda-core 0. When I run a pod with requests for vcuda-core and vcuda-memory, I get FailedScheduling with the message

0/1 nodes are available: 1 ExceedsGPUQuota.

I looked at a similar issue with Type=notify and checked the /var/lib/kubelet/device-plugins/ directory; it has vcore.sock and vmemory.sock.

failed to get pod from cache

After upgrading gpu-manager, I get errors like the following and cannot assign a pod to this node, even though one card is empty.

I1203 10:05:06.880207   23804 util.go:178] Pod csi-rbdplugin-8njc7 in namespace rook-ceph does not Request for GPU resource
I1203 10:05:06.880227   23804 allocator.go:978] failed to get pod 7ceaa552-d951-484c-8fcc-54156720a6de from allocatedPod cache
I1203 10:05:06.880241   23804 allocator.go:223] failed to get ready annotations for pod 7ceaa552-d951-484c-8fcc-54156720a6de
I1203 10:05:06.880252   23804 allocator.go:978] failed to get pod a45b5bc6-1b87-47b2-90a9-7d0a89baba92 from allocatedPod cache
I1203 10:05:06.880260   23804 allocator.go:223] failed to get ready annotations for pod a45b5bc6-1b87-47b2-90a9-7d0a89baba92
I1203 10:05:06.880270   23804 allocator.go:978] failed to get pod 5c44bd31-a8f1-494a-abb9-c6fc727489c8 from allocatedPod cache
I1203 10:05:06.880282   23804 allocator.go:223] failed to get ready annotations for pod 5c44bd31-a8f1-494a-abb9-c6fc727489c8
I1203 10:05:06.880293   23804 allocator.go:978] failed to get pod 2071a0cb-083f-47fc-8017-bebae32bbd6c from allocatedPod cache
I1203 10:05:06.880301   23804 allocator.go:223] failed to get ready annotations for pod 2071a0cb-083f-47fc-8017-bebae32bbd6c

nvidia-smi output:

Thu Dec  3 18:06:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 26%   43C    P5    13W / 250W |   4454MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 32%   57C    P2    62W / 250W |   4250MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 30%   54C    P2    76W / 250W |   8688MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   23C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24453      C   python                                      4444MiB |
|    1     24923      C   python                                      4240MiB |
|    2      8403      C   python                                      4434MiB |
|    2     25238      C   python                                      4244MiB |
+-----------------------------------------------------------------------------+

Unable to set Type=notify in systemd service file

Just like #7, I followed this blog https://cloud.tencent.com/developer/article/1685122, and everything works fine and all pods are running, while the gpu-manager-daemon pod logs show it is stuck at server.go:132 (screenshot omitted).
I tried many solutions, such as changing the cgroup driver from systemd to cgroupfs, but it doesn't work. The pod keeps running with no response.
Meanwhile I found the extra flags that need to be set from https://github.com/tkestack/gpu-manager/blob/master/docs/faq.md, but when I configured gpu-manager like this (screenshot omitted), the pod can't get started. It seems like the extra flags here are passed directly to gpu-manager as command-line options, but it doesn't have an option named "cgroup-driver". What am I missing?

Cri-o support

Hi, I am using OpenShift 4.3, which means my default container runtime is CRI-O.
When trying to deploy the gpu-manager DaemonSet, the pod which spawns on the GPU node fails.
How can I verify whether this issue occurs because I am using CRI-O instead of the Docker container runtime, and if so, is there any way to solve it without replacing the default container runtime?

Also, can you please provide all the prerequisites needed in order to deploy gpu-manager (CUDA toolkit, drivers, etc.)?

Can't Running: Unable to set Type=notify in systemd service file

Describe the bug

Container logs:

I1120 02:49:13.645198  262971 volume.go:152] Volume manager is running
E1120 02:49:13.645282  262971 server.go:132] Unable to set Type=notify in systemd service file?
I1120 02:49:14.019519  262971 app.go:87] Wait for internal server ready

And waiting for the internal server to become ready times out after 10s, and then it exits.

Environment

OS: centos 3.10.0-957.27.2.el7.x86_64
kubernetes: 1.13.2

GPU Manager not enforcing memory limits on gpu processes

Hello, I am running GPU Manager and I set a maximum memory of tencent.com/vcuda-memory: 10, but this is not being enforced at the application level. My process can request more GPU memory than this and nothing is done about it.
nvidia-smi shows:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 27678 C ...ath=/some/path 5231MiB |
+-----------------------------------------------------------------------------+

Should the process not be killed or otherwise limited? I am using a pytorch model process.

Nvidia node mismatch for pod, pick up:/dev/nvidia6 predicate: /dev/nvidia1, which is unexpected.

I got a similar problem to issue 18 when I created a pod. Please help analyze it.

  Warning  UnexpectedAdmissionError  16m   kubelet, ai-1080ti-62  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
  • test3.yaml
apiVersion: v1
kind: Pod 
metadata:
  name: test3
  namespace: danlu-efficiency
spec:
  restartPolicy: Never
  schedulerName: gpu-admission
  containers:
  - image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    name: test3
    command:
    - /bin/bash
    - -c
    - sleep 100000000
    resources:
      requests:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
      limits:
        tencent.com/vcuda-core: 10
        tencent.com/vcuda-memory: 40
  • kubectl describe pods test3 -n danlu-efficiency
Name:               test3
Namespace:          danlu-efficiency
Priority:           0
PriorityClassName:  <none>
Node:               ai-1080ti-62/
Start Time:         Wed, 15 Jul 2020 14:54:42 +0800
Labels:             <none>
Annotations:        tencent.com/gpu-assigned: false
                    tencent.com/predicate-gpu-idx-0: 1
                    tencent.com/predicate-node: ai-1080ti-62
                    tencent.com/predicate-time: 1594796082180123795
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.
IP:                 
Containers:
  test3:
    Image:      danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      sleep 100000000
    Limits:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  40
    Requests:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  40
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Volumes:
  default-token-p6lfp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p6lfp
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age   From                   Message
  ----     ------                    ----  ----                   -------
  Normal   Scheduled                 17m   gpu-admission          Successfully assigned danlu-efficiency/test3 to ai-1080ti-62
  Warning  FailedScheduling          17m   gpu-admission          pod test3 had been predicated!
  Warning  UnexpectedAdmissionError  17m   kubelet, ai-1080ti-62  Update plugin resources failed due to rpc error: code = Unknown desc = Nvidia node mismatch for pod test3(test3), pick up:/dev/nvidia6  predicate: /dev/nvidia1, which is unexpected.

  • The information of the ai-1080ti-62 node
Name:               ai-1080ti-62
Roles:              nvidia418
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    hardware=NVIDIAGPU
                    hardware-type=NVIDIAGPU
                    kubernetes.io/hostname=ai-1080ti-62
                    node-role.kubernetes.io/nvidia418=nvidia418
                    nvidia-device-enable=enable
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.90.1.131/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 29 May 2019 18:02:54 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 15 Jul 2020 15:14:58 +0800   Wed, 15 Jul 2020 11:30:46 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.90.1.131
  Hostname:    ai-1080ti-62
Capacity:
 cpu:                       56
 ephemeral-storage:         1152148172Ki
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    264029984Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
Allocatable:
 cpu:                       53
 ephemeral-storage:         1040344917078
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    251344672Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
System Info:
 Machine ID:                                           bf90cb25500346cb8178be49909651e4
 System UUID:                                          00000000-0000-0000-0000-ac1f6b93483c
 Boot ID:                                              97927469-0e92-4816-880c-243a64ef293a
 Kernel Version:                                       4.19.0-0.bpo.8-amd64
 OS Image:                                             Debian GNU/Linux 9 (stretch)
 Operating System:                                     linux
 Architecture:                                         amd64
 Container Runtime Version:                            docker://18.6.2
 Kubelet Version:                                      v1.13.5
 Kube-Proxy Version:                                   v1.13.5
PodCIDR:                                               192.168.20.0/24
Non-terminated Pods:                                   (58 in total)
  Namespace                                            Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                            ----                                                               ------------  ----------  ---------------  -------------  ---

......

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests            Limits
  --------                  --------            ------
  cpu                       51210m (96%)        97100m (183%)
  memory                    105732569856 (41%)  250822036Ki (99%)
  ephemeral-storage         0 (0%)              0 (0%)
  nvidia.com/gpu            8                   8
  tencent.com/vcuda-core    60                  60
  tencent.com/vcuda-memory  30                  30
Events:                     <none>

Can't limit GPU utilization

Describe the bug

GPU Manager limits GPU memory correctly, but it does not limit GPU utilization.

Pod yaml

resources:
  requests:
    tencent.com/vcuda-core: 30
    tencent.com/vcuda-memory: 10
  limits:
    tencent.com/vcuda-core: 30
    tencent.com/vcuda-memory: 10

When I use TensorFlow Python code to test the resource limit:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:65:00.0 Off |                  N/A |
| 41%   36C    P2    33W / 225W |   2379MiB /  7982MiB |      82%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:B3:00.0 Off |                  N/A |
| 41%   32C    P8    18W / 225W |      0MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    30846      C   python                                      2369MiB |
+-----------------------------------------------------------------------------+

Environment

Kubernetes version : v1.13.2
TensorFlow version : 1.13.2-gpu
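
To make the symptom easy to reproduce, here is a minimal busy-loop sketch in the TensorFlow 1.x style matching the image above (the matrix size and iteration count are arbitrary choices of mine); on an unrestricted card it should push utilization close to 100%, so you can watch nvidia-smi on the host to see whether the tencent.com/vcuda-core: 30 request actually caps GPU-Util.

# Minimal GPU busy loop (TensorFlow 1.x style, matching the 1.13.2-gpu image above).
# Watch nvidia-smi on the host while this runs to see whether GPU-Util stays
# near 30% or climbs toward 100%.
import tensorflow as tf

a = tf.random_normal([4096, 4096])
b = tf.random_normal([4096, 4096])
product = tf.matmul(a, b)

with tf.Session() as sess:
    for _ in range(10000):
        sess.run(product)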

make img failed

It failed when I executed the make img command.
I have set a Go module proxy:
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.io,direct
How should I solve this? Or could you provide a prebuilt image that I can docker pull?
(screenshot: make img error output)

A question about Allocate for a container

pkg/services/allocator/nvidia/allocator.go#L781

pods, err := getCandidatePods(ta.k8sClient, ta.config.Hostname)
if err != nil {
	msg := fmt.Sprintf("Failed to find candidate pods due to %v", err)
	glog.Infof(msg)
	return nil, fmt.Errorf(msg)
}

for _, pod := range pods {
	if found {
		break
	}
	for i, c := range pod.Spec.Containers {
		if !utils.IsGPURequiredContainer(&c) {
			continue
		}
		podCache := ta.allocatedPod.GetCache(string(pod.UID))
		if podCache != nil {
			if _, ok := podCache[c.Name]; ok {
				glog.Infof("container %s of pod %s has been allocate, continue to next", c.Name, pod.UID)
				continue
			}
		}
		// Here the candidate is matched only by the container's vcore request. Does that guarantee the current pod is the one that owns the container kubelet is calling Allocate for?
		if utils.GetGPUResourceOfContainer(&pod.Spec.Containers[i], types.VCoreAnnotation) == reqCount {
			glog.Infof("Found candidate Pod %s(%s) with device count %d", pod.UID, c.Name, reqCount)
			candidatePod = pod
			candidateContainer = &pod.Spec.Containers[i]
			found = true
			break
		}
	}
}

When kubelet calls Allocate for a container, the allocator sorts the candidate pods by predicate time or creation time and then picks the first GPU container whose vcore request equals the requested count. But suppose there are two pods, A and B: A is created before B, their vcore requests are the same, but their vmemory requests differ. If pod B is bound to the node before A, the chosen candidateContainer will be wrong.

In the scheduling pipeline, only the Scheduler Thread phase handles pods serially, one pod at a time; in the Wait and Bind phases, pods are processed asynchronously and in parallel.

Reference article: "Getting Started with Kubernetes from Scratch: the Scheduler's Scheduling Process and Algorithms" (从零开始入门 K8s:调度器的调度流程和算法介绍)

Or is A always bound to the node before B?
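
To make the ambiguity concrete, here is a toy illustration in plain Python (illustration only, not the project's Go code): two pending pods request the same vcuda-core but different vcuda-memory, and a match on vcore alone cannot tell which pod kubelet is actually allocating for.

# Toy model of the selection logic discussed above (illustration only).
# Candidates are ordered by predicate/creation time, as getCandidatePods does.
candidates = [
    {"name": "A", "vcuda-core": 10, "vcuda-memory": 40},  # created first
    {"name": "B", "vcuda-core": 10, "vcuda-memory": 20},  # bound to the node first
]

def pick_by_vcore(pods, requested_cores):
    for pod in pods:
        if pod["vcuda-core"] == requested_cores:
            return pod
    return None

# kubelet is allocating for pod B, but matching by vcore alone returns pod A,
# so the devices prepared for A would be handed to B's container.
print(pick_by_vcore(candidates, 10))  # -> {'name': 'A', ...}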

How to collect GPU metrics

When gpu-manager is in use, NVIDIA Data Center GPU Manager (DCGM) cannot collect GPU-related metrics. In this situation, how should metrics be collected into Prometheus?
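
Not an authoritative answer, but before wiring up a Prometheus scrape job it may help to check whether the gpu-manager pod itself is exporting Prometheus-format metrics. The sketch below uses a placeholder node IP, port and path; replace them with whatever endpoint the gpu-manager daemonset on your cluster actually exposes.

# Quick probe of a Prometheus-format metrics endpoint.
# NODE_IP, the port and the path below are placeholders; take the real values
# from the gpu-manager daemonset spec on your cluster.
import urllib.request

NODE_IP = "127.0.0.1"
METRICS_URL = "http://%s:5678/metrics" % NODE_IP  # placeholder port/path

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

# Print only the metric samples, skipping HELP/TYPE comment lines.
for line in body.splitlines():
    if line and not line.startswith("#"):
        print(line)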

Build error on 'make all'

Describe the bug
When I use make all to build the binaries, errors like the ones below occur; it looks like some modules cannot be fetched.

go: modernc.org/[email protected]: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /go/pkg/mod/cache/vcs/3dac616a9d80602010c4792ef9c0e9d9812a1be8e70453e437e9792978075db6: exit status 128:
        error: RPC failed; result=22, HTTP code = 404
        fatal: The remote end hung up unexpectedly
go: modernc.org/[email protected]: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /go/pkg/mod/cache/vcs/9aae2d4c6ee72eb1c6b65f7a51a0482327c927783dea53d4058803094c9d8039: exit status 128:
        error: RPC failed; result=22, HTTP code = 404
        fatal: The remote end hung up unexpectedly
go: modernc.org/[email protected]: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /go/pkg/mod/cache/vcs/f48599000415ab70c2f95dc7528c585820ed37ee15d27040a550487e83a41748: exit status 128:
        error: RPC failed; result=22, HTTP code = 404
        fatal: The remote end hung up unexpectedly
go: finding github.com/ghodss/yaml v0.0.0-20180820084758-c7ce16629ff4
go: finding golang.org/x/time v0.0.0-20161028155119-f51c12702a4d
go: finding golang.org/x/image v0.0.0-20190227222117-0694c2d4d067
go: finding gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7
go: finding modernc.org/xc v1.0.0
go: finding github.com/google/gofuzz v0.0.0-20170612174753-24818f796faf
go: finding github.com/onsi/ginkgo v1.6.0
go: modernc.org/[email protected]: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /go/pkg/mod/cache/vcs/29fc2f846f24ce3630fdd4abfc664927c4ad22f98a3589050facafa0991faada: exit status 128:
        error: RPC failed; result=22, HTTP code = 404
        fatal: The remote end hung up unexpectedly
go: error loading module requirements
make: *** [all] Error 1

Environment

  • OS: Linux VM_149_11_centos 3.10.107-1-tlinux2_kvm_guest-0049 #1 SMP Tue Jul 30 23:46:29 CST 2019 x86_64 x86_64 x86_64 GNU/Linux
  • golang: go version go1.12.4 linux/amd64

Can't call GPU

Hi, I have deployed gpu-manager in my Kubernetes cluster. After some tests, I found that if I use a non-root user in the pod, I can't execute the nvidia-smi command; only the root user can run it.

When I run nvidia-smi as a non-root user and check with ps -ef to see what is happening, I see:

gpu-client --addr /etc/vcuda/vcuda.sock --bus-id  --pod-uid f6d322b6-5480-11ea-a8b9-02f1556084e1 --cont-id 26b311b0df61a7008d52b3

hanging there.

Does the manager only support the root user inside the pod?

GPU manager log storage causes the disk to fill up

Currently gpu-manager has no log cleanup feature, so log files accumulate and fill up the disk.

The default log directory is the host path /etc/gpu-manager/log.

So I recommend adding log rotation; for example, I could configure logs to be kept for one week, and logs older than one week would be deleted automatically.

pod pending with event: Insufficient tencent.com/vcuda-memory, etc.

Thanks for your work.
It fails with
unknown field "annotation" in io.k8s.apimachinery.pkg.apis.meta.v1.ObjectMeta;
with this yaml:

apiVersion: v1
kind: Pod
metadata:
  name: vcuda
  annotation:
    tencent.com/vcuda-core-limit: 100
spec:
  restartPolicy: Never
  hostNetwork: true
  containers:
  - image: tensorflow/tensorflow:latest-gpu-jupyter
    name: nvidia
    ports:
    - containerPort: 8888
      hostPort: 8888
    resources:
      requests:
        tencent.com/vcuda-core: 100
        tencent.com/vcuda-memory: 1
      limits:
        tencent.com/vcuda-core: 100
        tencent.com/vcuda-memory: 1

After removing the annotation field, the cluster fails to schedule the pod with the error:

0/3 nodes are available: 3 Insufficient tencent.com/vcuda-core, 3 Insufficient tencent.com/vcuda-memory.

But I have 3 GPUs on that node and I have labeled the node.
Do you have any solutions?

Call GPU error

Everything seems fine, except that when I call the GPU, gpu-manager prints the following error (info) messages.
Help!

I0924 14:14:40.490124   25430 manager.go:369] UID: 0a0a0d13-5f21-4d53-ac96-431360dade44, cont: c911e4bfc9241085239f5bbebdb1ffc3bf14ae45db632deb533d7744ec80018a want to registration
I0924 14:14:40.490156   25430 manager.go:455] Write /etc/gpu-manager/vm/0a0a0d13-5f21-4d53-ac96-431360dade44/c911e4bfc9241085239f5bbebdb1ffc3bf14ae45db632deb533d7744ec80018a/pids.config
I0924 14:14:40.493778   25430 runtime.go:139] Read from /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0a0a0d13_5f21_4d53_ac96_431360dade44.slice/docker-c911e4bfc9241085239f5bbebdb1ffc3bf14ae45db632deb533d7744ec80018a.scope/cgroup.procs, pids: [16132 21782 22634 26254]
I0924 14:14:40.494887   25430 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I compiled the master branch of gpu-manager and vcuda-controller, and set everything up following the README.
