bartzi / labshare
Django Tool that helps everyone to get their fair share of GPU time
License: GNU General Public License v2.0
We should also display fan speed and GPU temperature in our new nvidia-smi viewer.
Hey,
it would be great to be notified by email when the exclamation mark appears on the server.
Best,
Goncalo
In light of recent cluster restructurings, we will not need the main functionality of LabShare anymore.
Instead, we might need a functionality that allows us to do system monitoring.
One of the most important parts is to monitor the output of nvidia-smi.
nvidia-smi data to the server.
Our device_query script needs to be rewritten to not be a server anymore, but rather a tool that pushes to the server automatically. Furthermore, LabShare needs a new POST endpoint that takes nvidia-smi data (this will need some kind of authentication, maybe an API key?). We then need push logic that sends new GPU data to the clients via WebSockets (similar to this code).
We'll also need a user interface for this.
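A minimal sketch of what the pushing side could look like, assuming the client calls nvidia-smi with its CSV query output and sends the result as JSON. The payload layout, the `Token` authorization scheme, and the function names are assumptions made for illustration, not part of LabShare.

```python
import csv
import io
import json
import subprocess
import urllib.request

# Fields requested from nvidia-smi; all of them are standard --query-gpu properties.
QUERY_FIELDS = [
    "index", "name", "temperature.gpu", "fan.speed",
    "utilization.gpu", "memory.used", "memory.total",
]


def parse_nvidia_smi_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        gpus.append(dict(zip(QUERY_FIELDS, (value.strip() for value in row))))
    return gpus


def query_gpus():
    """Run nvidia-smi on this machine and return the parsed GPU data."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_nvidia_smi_csv(subprocess.check_output(cmd, text=True))


def build_request(url, api_key, gpus):
    """Build the POST request; the header scheme is an assumption."""
    body = json.dumps({"gpus": gpus}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Token " + api_key,
        },
    )


def push_once(url, api_key):
    """Query the GPUs and push the data once; returns the HTTP status code."""
    request = build_request(url, api_key, query_gpus())
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `push_once` from a loop or a timer would turn this into the automatic pusher described above. The server would then check the key before accepting the data.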
E.g. saying "ETA in 3 days". Could be rendered as an info-(i) next to the current user's name.
Fix the timing problem in test_template_tag_position_in_queue by adding time.sleep(0.1) or something like that.
Add information on how to set up the Bower components to the README.
should be something like: 27.11 10:05 p.m.
We need to have tests for LDAP authentication. To achieve this, heavy usage of the mock library is required.
Right now we only check whether the number of mail addresses changed when authenticating a user via mail, but we should rather check whether any mail address changed at all!
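A small sketch of the stricter check, comparing the addresses as sets instead of comparing counts. The function name is hypothetical, and treating addresses as case-insensitive is an assumption.

```python
def mail_addresses_changed(stored, from_directory):
    """Return True if any address differs, even when the counts match."""
    def normalize(addresses):
        # Assumption: addresses are compared ignoring case and surrounding spaces.
        return {address.strip().lower() for address in addresses}

    return normalize(stored) != normalize(from_directory)
```

With this, swapping one address for another is detected, although the number of addresses stays the same.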
@hendraet It seems to me that the correct endpoint is not configured in the example.ini file for device_query. Is this intended? If not, we either need to add this /gpu/update to the file example.ini or add it to the README.
Awesome django app, which helps a lot ;)
We should add a hint that a user might need to login, if there is no gpu data available.
Since we have continuous updates, there is actually no need to persist GPU state information anymore, because we do not query it at all.
It would be good to get rid of all this unnecessary saving and directly push the updated GPU info to the clients.
Sometimes, a GPU utilization close to zero means that the code is not working correctly. It would be nice to add an indicator to the UI that is shown when the utilization falls below a certain threshold.
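One way such an indicator could be decided is sketched below: the GPU is only flagged when several consecutive utilization samples stay below the threshold, so a short dip between batches does not trigger it. The threshold of 5 percent, the window of 3 samples, and the function name are assumptions.

```python
def utilization_is_suspicious(samples, threshold=5.0, window=3):
    """Return True if the last `window` utilization samples (in percent)
    are all below `threshold`."""
    if len(samples) < window:
        return False  # not enough data to judge yet
    return all(value < threshold for value in samples[-window:])
```

The UI could then show the warning for every GPU that is in use and for which this returns True.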
This might be accomplished by allowing more than one E-Mail per user object, or by using Groups
Hey @Bartzi
Good job man! Thank you for sharing on github!
I am wondering how to show cpu usage memory in grid as well as gpu!
Thanks in advance!
M.R
We should have a functionality where an admin can send a notification to users if they are sitting on a GPU and doing nothing.
Right now, it is quite tedious to add a new Device.
You need to go into the admin, create a new Device and a new user.
We could use a script that does both, by just saying manage.py create_device <devicename>
A new LDAP user should be assigned to the correct group, based on his group membership in the LDAP directory.
If a user reserves a spot on a GPU, they get this spot forever. This leads to problems, as oftentimes the users do not use the GPU and other people have to wait and lose precious time for their experiments.
We should change the reservation behavior in such a way that a reservation is based on a predefined time slot that can also be adjusted at the time of reservation by the user. Once the time slot is over the GPU will be freed and the next user is invited to perform his/her experiments. This should not include the forceful shutdown of any trainings, but rather make sure that people are not blocking a GPU.
Furthermore, we should compute statistics for the reservation period and show them to the admins. With this we might encourage users to really use their GPU time instead of just idling around.
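The time-slot idea could be sketched roughly as follows, independent of the Django models. The class and function names and the default slot length of 24 hours are assumptions. Nothing here shuts down a running training; an expired reservation is only removed from the queue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_SLOT = timedelta(hours=24)  # assumed default, adjustable per reservation


@dataclass
class Reservation:
    user: str
    duration: timedelta = DEFAULT_SLOT
    start: Optional[datetime] = None  # None while the user is still waiting

    def is_expired(self, now):
        return self.start is not None and now >= self.start + self.duration


def advance_queue(queue, now):
    """Free the GPU once the head's slot is over and start the next slot.

    `queue` is a list of reservations for one GPU, head first.
    Returns the reservation that currently holds the GPU, or None.
    """
    if queue and queue[0].start is None:
        queue[0].start = now
    while queue and queue[0].is_expired(now):
        queue.pop(0)
        if queue:
            queue[0].start = now
    return queue[0] if queue else None


def eta_for(queue, user, now):
    """Estimate how long `user` has to wait; None if the user is not queued."""
    wait = timedelta(0)
    for reservation in queue:
        if reservation.user == user:
            return wait
        if reservation.start is None:
            wait += reservation.duration
        else:
            remaining = reservation.start + reservation.duration - now
            wait += max(remaining, timedelta(0))
    return None
```

The value returned by `eta_for` could also feed an "ETA in 3 days" hint next to a waiting user's name.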
It would be nice to have an API for (a subset of):
The date in the last update field (sometimes?) misses a dot at the end, e.g. "2 p.m. 07.1"
device_query should also read the RAM usage of the servers and report the process with the highest RAM usage. We'll need the following things:
Usage of a GPU should be determined based on the telemetry gained from the clients
Similar to the sinfo command for GresUsed we could try to integrate the console output:
GRES_USED NODELIST
gpu:1080ti:2(IDX:0-1) fb10dl03
gpu:1080ti:3(IDX:0-2) fb10dl06
gpu:1080ti:4(IDX:0-3) fb10dl07
gpu:2080ti:1(IDX:0) fb10dl08
gpu:3090:3(IDX:0-2) fb10dl09
gpu:1080ti:1(IDX:0),gpu:980gtx:1(IDX:1) resterampe
gpu:titanx:0(IDX:N/A) fb10dl[04-05]
add ldap authentication, maybe using this lib: https://django-auth-ldap.readthedocs.io/en/latest/
Such sinfo outputs do not work, yet: fb10dl09 gpu:3090:3(IDX:0-1,3)
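A possible way to parse the GRES_USED column, including index lists such as 0-1,3, is sketched below. It matches each gpu:...(IDX:...) entry with a regular expression instead of splitting on commas, because the comma also appears inside the index list. The function names and the result layout are assumptions, and node lists like fb10dl[04-05] are not expanded here.

```python
import re

# One entry looks like gpu:<model>:<count>(IDX:<indices>)
GRES_PATTERN = re.compile(r"gpu:([^:(),]+):(\d+)\(IDX:([^)]*)\)")


def expand_indices(spec):
    """Expand '0-1,3' to [0, 1, 3]; 'N/A' or '' gives an empty list."""
    if spec in ("", "N/A"):
        return []
    indices = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            indices.extend(range(int(first), int(last) + 1))
        else:
            indices.append(int(part))
    return indices


def parse_gres_used(text):
    """Return one dict per GPU type found in a GRES_USED string."""
    return [
        {"model": model, "count": int(count), "indices": expand_indices(spec)}
        for model, count, spec in GRES_PATTERN.findall(text)
    ]
```

Because the pattern is searched anywhere in the line, it does not matter whether the node name comes before or after the GRES entry.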
Variable time lengths should probably be constants stored in the settings.py.
This should actually be user/group is allowed to use that device.