Comments (13)
Hi @cosandr and thanks for the issue!
For some plugins we support annotations in the CR; this would be similar and definitely doable.
from intel-device-plugins-for-kubernetes.
@cosandr Are you running GPU workloads on a control plane node in production, or is this just for being able to test things with a single-node setup?
As that seems to be quite an uncommon practice, do you have any other use case where tolerations would be useful?
from intel-device-plugins-for-kubernetes.
> @cosandr Are you running GPU workloads on a control plane node in production, or is this just for being able to test things with a single-node setup?
> As that seems to be quite an uncommon practice, do you have any other use case where tolerations would be useful?
The example is not from production, no. I would say it's relatively common to taint specialized nodes (for example, the nvidia.com/gpu:NoSchedule taint is added by some cloud providers by default), so it's conceivable someone would want to do a similar thing with Intel accelerators as well.
from intel-device-plugins-for-kubernetes.
OK, that's a really good use case, and NFD actually supports tainting nodes that have specific devices.
I think we would want to support such an option in the operator too:
- Adding an NFD rule to taint nodes with a given device type
- Adding a toleration for that taint to the corresponding device plugin deployment
@tkatila, any comments?
EDIT: I think it should be a per-plugin option, as some nodes might have multiple device types, and multiple different taints per node could be awkward.
from intel-device-plugins-for-kubernetes.
> Ok, that's a really good use-case, and NFD actually supports tainting nodes with specific devices.
#1571 discussed this area too. Perhaps we need to think through the cases.
from intel-device-plugins-for-kubernetes.
> - Adding NFD rule to taint nodes with given device type
I don't understand this. Can you clarify?
> - Adding toleration for that taint to corresponding device plugin deployment
> EDIT: I think it should be a per-plugin option, as some nodes might have multiple device types, and multiple different taints per node could be awkward.
Yup, making it per CR seems like a good solution to me.
from intel-device-plugins-for-kubernetes.
It's an experimental NFD feature: https://nfd.sigs.k8s.io/usage/customization-guide#node-tainting
EDIT: Because it's still experimental, requires NFD to be run with a flag that enables it, and the NFD worker would also need a toleration, it may be better to start by just supporting user-specified tolerations in the operator (and add NFD node tainting once NFD has support for that enabled by default).
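To make the "user-specified toleration" idea concrete, here is a minimal sketch of how an operator could copy tolerations from a CR spec onto the plugin's pod template. Everything in it is an assumption for illustration: `PluginSpec`, `PodSpec`, and `applyTolerations` are hypothetical names, the `Toleration` struct is a local stand-in shaped like the Kubernetes core/v1 type rather than the real API type, and the taint key `example.com/intel-gpu` is made up. It is not the operator's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Toleration is a local stand-in shaped like the Kubernetes core/v1
// Toleration type (assumption: only these four fields are needed here).
type Toleration struct {
	Key      string `json:"key,omitempty"`
	Operator string `json:"operator,omitempty"`
	Value    string `json:"value,omitempty"`
	Effect   string `json:"effect,omitempty"`
}

// PluginSpec is a hypothetical device plugin CR spec carrying
// user-supplied tolerations.
type PluginSpec struct {
	Image       string       `json:"image"`
	Tolerations []Toleration `json:"tolerations,omitempty"`
}

// PodSpec is a minimal stand-in for the pod template of the plugin DaemonSet.
type PodSpec struct {
	Tolerations []Toleration `json:"tolerations,omitempty"`
}

// applyTolerations returns a copy of base with the CR's tolerations
// appended, skipping exact duplicates. The base value is not modified.
func applyTolerations(base PodSpec, spec PluginSpec) PodSpec {
	out := PodSpec{Tolerations: append([]Toleration(nil), base.Tolerations...)}
	seen := make(map[Toleration]bool, len(out.Tolerations))
	for _, t := range out.Tolerations {
		seen[t] = true
	}
	for _, t := range spec.Tolerations {
		if !seen[t] {
			out.Tolerations = append(out.Tolerations, t)
			seen[t] = true
		}
	}
	return out
}

func main() {
	spec := PluginSpec{
		Image: "example.invalid/gpu-plugin:dev", // hypothetical image name
		Tolerations: []Toleration{
			{Key: "example.com/intel-gpu", Operator: "Exists", Effect: "NoSchedule"},
		},
	}
	pod := applyTolerations(PodSpec{}, spec)
	out, err := json.MarshalIndent(pod, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the merge in one small function would let the same logic serve every plugin CR, which fits the per-plugin option suggested earlier in the thread.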
from intel-device-plugins-for-kubernetes.
> It's experimental NFD feature: https://nfd.sigs.k8s.io/usage/customization-guide#node-tainting
The NFD worker will then need a toleration for that taint too, though...
We can try them out, document the use and maybe create examples. But I would keep them as optional/advanced scenarios.
from intel-device-plugins-for-kubernetes.
> I admit I haven't tried to patch the daemonset deployed by the operator; I assumed that's a bad idea and that it would eventually replace it.
@cosandr the operator takes the daemonset "base" at compile time (see deployments/daemonsets.go). If you edit the base daemonset and build your custom operator for testing purposes, it should work without the risk of the actual plugin deployment getting overwritten.
from intel-device-plugins-for-kubernetes.
We also have other taints we put on certain nodes to restrict scheduling for specific workloads. Adding tolerations is a must since the device is present on those nodes.
from intel-device-plugins-for-kubernetes.
@cosandr & @winromulus, a question or concern about this request: if the node has a taint and the plugin has a toleration, the workloads would also require the same toleration. Compared to just having the resource request, that feels bad from a user experience point of view.
Is this something you'd be fine with?
from intel-device-plugins-for-kubernetes.
@tkatila this is actually very much intended. If you need a node to run only certain workloads, you apply taints and give the workloads tolerations.
I'll give a practical example: if I have an Intel GPU-only node and don't want any other kind of workload to be scheduled on it, I apply the taint to the node and give the plugin and the workload tolerations. (This cannot be achieved with node affinity or alternatives because those do not prevent daemon sets or other pods from being scheduled to that node.)
If you check the nvidia operator, it tolerates ANY taint, specifically for this reason. The plugin should start on any node where devices are found regardless of taints, and the workloads can set their own tolerations to target that node.
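The scenario above can be sketched in a few lines of Go. This is a simplified model, not the scheduler's code: `Taint` and `Toleration` are local stand-ins, `tolerates` follows the matching rule as I understand it from the Kubernetes documentation (an empty effect matches every effect, an empty key with operator `Exists` matches every key, and `Equal` also compares values), and the taint key `example.com/intel-gpu` is a made-up example.

```go
package main

import "fmt"

// Taint and Toleration are local stand-ins for the Kubernetes core/v1 types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal": // Equal is the default operator
		return tol.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// schedulable reports whether every blocking taint on a node is tolerated.
// PreferNoSchedule is only a soft preference, so it never blocks.
func schedulable(tols []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		if taint.Effect == "PreferNoSchedule" {
			continue
		}
		matched := false
		for _, tol := range tols {
			if tolerates(tol, taint) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical taint on a GPU-only node.
	nodeTaints := []Taint{{Key: "example.com/intel-gpu", Value: "true", Effect: "NoSchedule"}}

	// Device plugin: tolerate any taint (empty key, operator Exists).
	plugin := []Toleration{{Operator: "Exists"}}
	// GPU workload: tolerate exactly the GPU taint.
	workload := []Toleration{{Key: "example.com/intel-gpu", Operator: "Equal", Value: "true", Effect: "NoSchedule"}}

	fmt.Println("plugin:", schedulable(plugin, nodeTaints))
	fmt.Println("workload with toleration:", schedulable(workload, nodeTaints))
	fmt.Println("workload without toleration:", schedulable(nil, nodeTaints))
}
```

With these inputs the plugin and the tolerating workload pass the check, while a pod without any toleration is kept off the node, which is the isolation described above.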
from intel-device-plugins-for-kubernetes.
Thanks @winromulus
So, to summarize: run the GPU plugin on all nodes with GPU hardware, regardless of taints. Workloads request the GPU resource and have toleration(s) for the tainted node.
I will look into adding tolerations support to the operator.
from intel-device-plugins-for-kubernetes.