Comments (11)

pan87232494 commented on August 19, 2024

Hi, I installed Kubernetes 1.14 with kubespray 2.10. The nvidia device plugin beta2 works fine, but I want multiple containers to share a GPU, so I switched to this plugin. Now I see the error below. Why does it say here that no GPU has 1 GB available?

Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-1MiB-to-run\\\\n\\\"\"": unknown

Also, my GPUs are a 1080 Ti plus a 960, so this output doesn't look right, does it?

  [bing@k8s-demo-master1-phycial aliyun_shared_gpu_demo]$ kubectl inspect gpushare
  NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
  k8s-demo-slave2  192.168.2.140  0/1                    0/1                    1                   1/2
  
  [bing@k8s-demo-master1-phycial aliyun_shared_gpu_demo]$ kubectl-inspect-gpushare 
  NAME             IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
  k8s-demo-slave2  192.168.2.140  0/1                    0/1                    0/2
  --------------------------------------------------------------
  Allocated/Total GPU Memory In Cluster:
  0/2 (0%)  
  
  nvidia-smi 
  Thu Oct 10 15:03:38 2019       
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |===============================+======================+======================|
  |   0  GeForce GTX 960     Off  | 00000000:17:00.0 Off |                  N/A |
  | 36%   29C    P8     7W / 120W |      0MiB /  2002MiB |      0%      Default |
  +-------------------------------+----------------------+----------------------+
  |   1  GeForce GTX 108...  Off  | 00000000:66:00.0 Off |                  N/A |
  | 14%   37C    P8    25W / 270W |      0MiB / 11175MiB |      0%      Default |
  +-------------------------------+----------------------+----------------------+
                                                                                 
  +-----------------------------------------------------------------------------+
  | Processes:                                                       GPU Memory |
  |  GPU       PID   Type   Process name                             Usage      |
  |=============================================================================|
  |  No running processes found                                                 |
  +-----------------------------------------------------------------------------+

wzdutd commented on August 19, 2024

Hi! I ran into this problem too. I tried nvidia-device-plugin-daemonset, but it still didn't work. If I only run nvidia-device-plugin-daemonset without gpushare-device-plugin, running create -f 1.yaml (to create the binpack pod) produces no output at all.

So could you provide more detailed instructions and your environment configuration (GPU model, etc.)? Thanks a lot.

Sakuralbj commented on August 19, 2024

I ran into the same problem before. In my case the cluster's scheduler did not have the gpushare-scheduler-extender enabled, so the pod's annotations never received the device ID that should be allocated to it, and when the device plugin performed the actual allocation it reported the unknown device id error. You can describe your pod and check whether its annotations contain a value for ALIYUN_COM_GPU_MEM_IDX.
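
For reference, a minimal way to run that check (assuming the pod is named binpack-1; substitute the actual pod name):

  # Show gpushare-related annotations (and resource limits) on the pod
  kubectl describe pod binpack-1 | grep -i aliyun

  # Or read the device-index annotation directly; an empty result suggests
  # the extender's bind step never set it for this pod
  kubectl get pod binpack-1 -o jsonpath='{.metadata.annotations.ALIYUN_COM_GPU_MEM_IDX}'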

cicijohn1983 commented on August 19, 2024

Hi, when I run the example I get the error nvidia-container-cli: device error: unknown device id: no-gpu-has-1024MiB-to-run. How can I fix this? Is it related to the GPU driver? Thanks.

illusion202 commented on August 19, 2024

@pan87232494, could you paste your YAML so we can take a look?

HistoryGift commented on August 19, 2024

I ran into the same problem. My manifest is modified from the official demo. Has anyone solved this?
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: binpack-1
    labels:
      app: binpack-1
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: binpack-1
    template:
      metadata:
        labels:
          app: binpack-1
      spec:
        nodeName: worker2.testgpu.testgpu.com
        containers:
        - name: binpack-1
          image: nvidia/cuda:9.0-base
          command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]
          resources:
            limits:
              # GiB
              aliyun.com/gpu-mem: 1
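
One thing that may be worth checking with a manifest like this (a hedged guess tied to Sakuralbj's note above, not confirmed in this thread): spec.nodeName pins the pod directly to worker2.testgpu.testgpu.com, and pods created with nodeName already set bypass kube-scheduler, so a scheduler extender would never get the chance to write the device-ID annotation. A quick way to see whether the scheduler actually placed the pod:

  # A pod placed by kube-scheduler gets a "Scheduled" event; a pod created with
  # spec.nodeName pre-set skips scheduling (and any scheduler extenders) entirely.
  kubectl describe pod -l app=binpack-1 | grep -A 5 'Events:'
  kubectl get events --field-selector involvedObject.kind=Pod | grep -i sched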

HistoryGift commented on August 19, 2024

On my side I run gpushare-scheduler-extender on the master node. Since kube-scheduler is started directly as a command, I modified its systemd service:
  ExecStart=/usr/local/bin/kube-scheduler \
    --address=0.0.0.0 \
    --master=http://127.0.0.1:8080 \
    --leader-elect=true \
    --v=2 \
    --use-legacy-policy-config=true \
    --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
I also changed 127.0.0.1 in the JSON to the master's IP:
  {
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
      {
        "urlPrefix": "http://masterIP:32766/gpushare-scheduler",
        "filterVerb": "filter",
        "bindVerb": "bind",
        "enableHttps": false,
        "nodeCacheCapable": true,
        "managedResources": [
          {
            "name": "aliyun.com/gpu-mem",
            "ignoredByScheduler": false
          }
        ],
        "ignorable": false
      }
    ]
  }
All the other plugins were deployed following install.md, but deploying the pod still fails with:
Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run\\n\""": unknown
However, kubectl-inspect-gpushare-v2 shows the resource as allocated. Where could the problem be?
  NAME                         IPADDRESS                    GPU0(Allocated/Total)  GPU1(Allocated/Total)  PENDING(Allocated)  GPU Memory(GiB)
  worker2.testgpu.testgpu.com  worker2.testgpu.testgpu.com  0/11                   0/11                   10                  10/22

  Allocated/Total GPU Memory In Cluster:
  10/22 (45%)

Could anyone help take a look?
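
For anyone debugging a setup like this, a few quick sanity checks (assuming the extender runs in kube-system and kube-scheduler runs as a systemd unit named kube-scheduler, as described above):

  # Is the extender pod/service up and exposed on the port referenced in the policy file?
  kubectl get pods -n kube-system -o wide | grep gpushare
  kubectl get svc -n kube-system | grep gpushare

  # Did kube-scheduler actually load the policy file and register the extender?
  journalctl -u kube-scheduler | grep -iE 'policy|extender'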

baozhiming commented on August 19, 2024

Has anyone solved this problem? It's urgent.

debMan commented on August 19, 2024

Having the same issue

zhichenghe commented on August 19, 2024

same issue

zhichenghe commented on August 19, 2024

Warning Failed 40s (x4 over 85s) kubelet Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: device error: no-gpu-has-6025MiB-to-run: unknown device: unknown
Normal Pulled 40s (x3 over 85s) kubelet Container image "reg.deeproute.ai/deeproute-simulation-services/gpu-player:v2" already present on machine
