Comments (7)

LarryGF commented on August 23, 2024

I had a chance to test it all. After I removed some of the workloads that were running on the node (I wasn't able to solve the "too many open files" error without moving pods off the node), it's working now. This is the config I used:

intel-device-plugins-gpu:

  initImage:
    enable: false
    hub: intel
    tag: ""

  sharedDevNum: 2
  logLevel: 2
  resourceManager: false
  enableMonitoring: false
  allocationPolicy: "none"

  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: 'true'

  nodeFeatureRule: true

I am going to close the issue now, thanks again for your help @tkatila

from intel-device-plugins-for-kubernetes.

eero-t commented on August 23, 2024

I had a chance to test it all and after I removed some stuff that was running on the node (I wasn't able to solve the too many files open error without getting pods out of the node)

This gives a rough idea of which processes are using the most FDs:

awk '
/^Name/ {name=$2}
/^Pid/ {pid=$2}
/^FDSize/ {printf("%5d [%d] %s\n", $2, pid, name); nextfile}
' /proc/*/status | sort -nr | head -20

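This gives more detail, but is much slower:

for i in /proc/[0-9]*/fd/; do
    # Count the entries in the process's fd directory (one per open FD)
    count=$(ls "$i" 2>/dev/null | wc -l)
    pid=$(echo "$i" | cut -d/ -f3)
    # cmdline is NUL-separated; turn it into a readable command line
    cmd=$(tr '\0' ' ' < "${i%/fd/}/cmdline" 2>/dev/null)
    echo "$count [$pid] $cmd"
done | sort -nr | head -20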

(You can just copy-paste the commands above into a shell. You need root to see info for all processes.)
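Once a suspicious process shows up in those lists, it can help to compare its open FD count with its own limit. A minimal sketch, assuming a Linux /proc filesystem; the function name fd_usage is just made up for illustration:

```shell
# fd_usage PID: print the open FD count and the soft "Max open files" limit
# for a single process (hypothetical helper, not part of any tool).
fd_usage() {
    pid=$1
    # One directory entry per open file descriptor
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # /proc/PID/limits columns: name, soft limit, hard limit, units
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$limit"
}

# Example: inspect the current shell
fd_usage $$
```

If open is close to soft_limit, that process is the one likely to hit the "too many open files" error.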

tkatila commented on August 23, 2024

Unless you plan to use GPU Aware Scheduling (GAS), you shouldn't enable resourceManager in the GPU plugin CR. In most cases, GAS is not needed. Though, if the kubelet.crt is missing I don't think it should prevent the plugin from running. The file should get created if it doesn't exist.

For the panic, I believe your host's settings are too tight. You should be able to increase the per-process limit with ulimit -n <value>. Check the initial value and then double it. That should fix the issue.
EDIT: If the system-wide limit is the one being hit, you can raise it with sysctl -w fs.file-max=<value>.
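The "check the initial value and then double it" step could look roughly like this. It's only a sketch: ulimit affects the current shell and the processes started from it, so for a service such as k3s or the container runtime the limit would have to be changed wherever that service is configured, which is not shown here.

```shell
#!/usr/bin/env bash
# Sketch: inspect file-descriptor limits and double the soft limit for this shell.

soft=$(ulimit -Sn)   # per-process soft limit
hard=$(ulimit -Hn)   # per-process hard limit (ceiling for the soft limit)
echo "soft=$soft hard=$hard"

# System-wide ceiling, and current usage (allocated, unused, maximum)
[ -r /proc/sys/fs/file-max ] && cat /proc/sys/fs/file-max
[ -r /proc/sys/fs/file-nr ] && cat /proc/sys/fs/file-nr

# Double the soft limit, but never go above the hard limit
if [ "$soft" != "unlimited" ]; then
    target=$((soft * 2))
    if [ "$hard" != "unlimited" ] && [ "$target" -gt "$hard" ]; then
        target=$hard
    fi
    ulimit -Sn "$target"
fi
echo "new soft limit: $(ulimit -Sn)"
```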

tkatila commented on August 23, 2024

Also remembered that for 0.28.0 there's no need for the initImage anymore. You can set intel-device-plugins-gpu.initImage.enable to false. The default should be false, I think.

LarryGF commented on August 23, 2024

Thanks for the input @tkatila. I thought that it was supposed to create the kubelet.crt, but I don't see it being created. I wanted to use resourceManager in case I wanted to run an additional pod on that node and give it access to the GPU.
You were right about the host's settings. I had a limit of 1024; I increased it to 9000 in steps (2048, 4096, 9000) and the pod was able to start. I also checked sysctl and I have fs.file-max = 9223372036854775807. I thought 9000 might be a little too high and reduced it, and now the pod is unable to start again; it keeps crashing. I will have to look further into this, but once again, thanks for your help.

tkatila commented on August 23, 2024

I thought that it was supposed to create the kubelet.crt but I don't see it being created.

Kubelet.crt should be on the host if the host has kubelet running. That's common for vanilla k8s installations. It might be that k3s doesn't have kubelet or its functionality is part of some other entity.

I wanted to use resourceManager in case I wanted to run an additional pod in that node and give it access to the GPU.

Increase sharedDevNum to whatever number you desire to share a GPU. Your values set it to 2 so two containers can access the same GPU. Resource Manager is not needed for basic sharing.
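To see the effect of sharedDevNum, you can check how many GPU resources the node advertises. This is a sketch that assumes the plugin registers its resource as gpu.intel.com/i915 (check your node's allocatable list if it uses a different name); gpu_allocatable and the node name are placeholders:

```shell
# gpu_allocatable NODE: print the number of gpu.intel.com/i915 resources
# the node reports as allocatable (resource name is an assumption).
gpu_allocatable() {
    kubectl get node "$1" -o jsonpath='{.status.allocatable.gpu\.intel\.com/i915}'
}

# Usage (replace my-node with a real node name):
#   gpu_allocatable my-node
```

With one GPU and sharedDevNum: 2, the reported count should be 2.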

LarryGF commented on August 23, 2024

Kubelet.crt should be on the host if the host has kubelet running. That's common for vanilla k8s installations. It might be that k3s doesn't have kubelet or its functionality is part of some other entity.

If I remember correctly, k3s runs its own kubelet and handles it internally

Increase sharedDevNum to whatever number you desire to share a GPU. Your values set it to 2 so two containers can access the same GPU. Resource Manager is not needed for basic sharing.

Good to know, I hadn't tried that. I received an error when trying to run resourceManager with sharedDevNum: 1, so I assumed it wouldn't work the other way around.