Question: how does the GPU device number count in the container with "gpu-count"? about gpushare-scheduler-extender

aliyuncontainerservice commented on August 19, 2024

Question: how does the GPU device number count in the container with "gpu-count"?

from gpushare-scheduler-extender.

Comments (8)

nareshganesan commented on August 19, 2024

I think we should use the ALIYUN_COM_GPU_MEM_IDX environment variable inside the container to get the index of the GPU assigned to the current container.

References: here and here

cheyang commented on August 19, 2024

> Given a server with 8 GPUs, if we start a pod with "aliyun.com/gpu-count:2" and the scheduler assigns GPU3 and GPU7 to this pod, what are the GPU numbers for these 2 GPU cards in the pod? 0 and 1?

aliyun.com/gpu-count is not used for scheduling; it is only used to calculate the number of GPUs. If you are concerned about the GPU ID, as @nareshganesan said, it is in the environment variable ALIYUN_COM_GPU_MEM_IDX.

tjliupeng commented on August 19, 2024

So the value of ALIYUN_COM_GPU_MEM_IDX should be the actual ID of the GPU on the host server, such as 3 or 7 in my example, right? And in the container, the application will access the GPU according to its real ID on the server.

nareshganesan commented on August 19, 2024

@tjliupeng, yes, exactly, as per the code. I will verify it in a container (on a multi-GPU host machine) and confirm here 👍
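For anyone landing here, a minimal sketch of reading the variable discussed above from inside the container. It assumes only that ALIYUN_COM_GPU_MEM_IDX is set to an integer index, as described in this thread; the helper name `assigned_gpu_index` is hypothetical, and whether the value matches the host GPU ID is still unverified at this point in the discussion.

```python
import os


def assigned_gpu_index(env=None):
    """Return the GPU index from ALIYUN_COM_GPU_MEM_IDX, or None if unset or not an integer.

    `env` defaults to the process environment; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    raw = env.get("ALIYUN_COM_GPU_MEM_IDX")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        # The variable is present but does not hold an integer index.
        return None


if __name__ == "__main__":
    idx = assigned_gpu_index()
    if idx is None:
        print("ALIYUN_COM_GPU_MEM_IDX is not set or is not an integer")
    else:
        print(f"Assigned GPU index: {idx}")
```

Running this in a pod scheduled by the extender should show which index was assigned; outside such a pod it reports that the variable is missing.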

kmac8361 commented on August 19, 2024

Great work... We have servers with 8 GPUs, which can be a mix of models (e.g. P100, V100, RTX 6000). It would be a nice enhancement to be able to specify a GPU model preference. Since the P100 is not an MPS architecture, by default we want the pod to choose a V100 or RTX 6000.

cheyang commented on August 19, 2024

I think you can add a node label to specify the GPU model, and use a node selector to choose the node.
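A rough sketch of that suggestion, assuming a hypothetical label key `gpu-model` that has been applied to the node beforehand (for example with `kubectl label nodes <node-name> gpu-model=v100`). The pod name and image below are placeholders. The script only builds a pod manifest with a matching `nodeSelector` and prints it as JSON, which `kubectl apply -f` accepts.

```python
import json


def pod_with_gpu_model(name, image, gpu_model, label_key="gpu-model"):
    """Build a minimal Pod manifest pinned to nodes labeled label_key=gpu_model.

    The label key is an assumption for illustration; use whatever key
    your nodes are actually labeled with.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only nodes carrying this label are eligible for scheduling.
            "nodeSelector": {label_key: gpu_model},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    manifest = pod_with_gpu_model("gpu-demo", "example/image:latest", "v100")
    print(json.dumps(manifest, indent=2))
```

This selects nodes, not individual GPUs, so it only helps when each node holds a single GPU model.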

alasdairtran commented on August 19, 2024

> I think you can add a node label to specify the GPU model, and use a node selector to choose the node.

This would work if the different GPUs reside in different nodes. If two types of GPUs reside in the same node/machine, is there currently any way to tell the pod which GPU to pick? Can we request GPUs by their ID (ALIYUN_COM_GPU_MEM_IDX)?

cheyang commented on August 19, 2024

I think the current implementation doesn't work for your scenario. The design assumes that all GPUs in the same node are of the same type.
