run-ai / genv
GPU environment and cluster management with LLM support
Home Page: https://www.genv.dev
License: GNU Affero General Public License v3.0
genv does not actually seem to be compatible with conda. If you run genv activate and then activate your conda environment, your genv reservation just dies. Similarly, if you first activate your conda environment and then install genv, you cannot even use genv activate anymore, as it rejects the command. It's kind of messy.
As part of an AI-focused institute, we find genv a great tool for handling single machines. However, our fleet is composed of many machines. Are there any plans on your side to support or develop features in the direction of handling GPU availability across a cluster, or simply across multiple machines?
Upon executing genv activate
I get:
Traceback (most recent call last):
  File "/mnt/mass_disk1/e.abc/miniconda3/bin/genv", line 10, in <module>
    sys.exit(main())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/__main__.py", line 116, in main
    activate.run(args.shell, args)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/activate.py", line 67, in run
    genv.core.envs.activate(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/envs.py", line 88, in activate
    with State() as envs:
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 56, in __enter__
    return self.load()
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 35, in load
    self._state = genv.utils.load_state(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/utils/utils.py", line 45, in load_state
    o = json.load(f, cls=json_decoder)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
No idea where it stems from, but it looks like a genv and/or environment issue, hence filing it here. I have genv installed via conda and its version is 1.2.0.
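For what it's worth, this exact message from json.load usually means the file being parsed exists but is empty (or truncated), which would point at an empty genv state file rather than the Python environment. A minimal sketch of the failure mode and a possible guard (an assumption about a fix, not genv's actual code):

```python
import json

def load_state_text(text):
    """An empty or truncated state file reproduces the reported crash:
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
    This sketch treats an empty file as 'no state' instead of crashing."""
    if not text.strip():
        return {}
    return json.loads(text)

print(load_state_text(""))              # {} instead of a JSONDecodeError
print(load_state_text('{"envs": []}'))  # {'envs': []}
```

Inspecting (and, if empty, removing) the JSON file wherever genv keeps its state may be a quicker workaround than reinstalling.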
@genv.ray.remote
works on functions but not on classes.
When it decorates a function, it works:
@genv.ray.remote(num_gpus=1)
def train():
    ...
But when it decorates a class:
@genv.ray.remote(num_gpus=1)
class Trainer:
    def train(self):
        ...
it throws the following error:
2023-07-21 08:44:38,183 INFO worker.py:1636 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/home/ekinkarabulut/genv/ray_genv.py", line 105, in <module>
    main_task = trainer.train.remote()
                ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ray._raylet.ObjectRef' object has no attribute 'train'
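The error shape suggests the wrapper treats the class like a task rather than an actor: calling .remote() returns a reference to the result of Trainer() instead of an actor handle with callable methods. A conceptual sketch (no Ray required; names are stand-ins, not genv's actual code) that reproduces the same AttributeError:

```python
class ObjectRef:
    """Stand-in for ray._raylet.ObjectRef: holds a result, exposes no methods."""
    def __init__(self, value):
        self._value = value

def function_style_remote(target):
    # A wrapper written for *functions*: .remote() invokes the target and
    # returns a reference to whatever it produced.
    class Wrapper:
        def remote(self, *args, **kwargs):
            return ObjectRef(target(*args, **kwargs))
    return Wrapper()

@function_style_remote
def train():
    return "trained"

print(train.remote()._value)  # fine for functions

@function_style_remote
class Trainer:
    def train(self):
        return "trained"

trainer = Trainer.remote()  # an ObjectRef wrapping a Trainer *instance*
# trainer.train.remote()    # AttributeError: 'ObjectRef' object has no attribute 'train'
```

In plain Ray, @ray.remote on a class returns an ActorHandle whose methods are themselves callable with .remote(); the genv wrapper presumably needs a separate actor code path to match that.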
From the documentation, it should be possible to serve an LLM as one user ("user1") and then have other users, e.g. "user2", attach to it via genv llm attach modelname.
However, in practice this fails on Linux hosts: when genv runs as "user2", it cannot determine the ollama port from the ollama process ID if "user1" hosts the model, because /proc/<pid>/fd is only readable by the user who owns the process, in this case "user1".
A workaround is to punch holes in Linux's process isolation, but that's far from ideal. Ideally, genv would track the ollama port alongside the process ID and make it available to other users, or solve this differently altogether.
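A sketch of the suggested direction (an assumption about a possible design, not genv's actual code): record the port alongside the pid in a world-readable state file at serve time, so attaching users never need access to /proc/<pid>/fd.

```python
import json
import os
import tempfile

def publish_llm(state_path, model, pid, port):
    # Written by the serving user; made readable by every user on the host.
    with open(state_path, "w") as f:
        json.dump({"model": model, "pid": pid, "port": port}, f)
    os.chmod(state_path, 0o644)

def attached_port(state_path):
    # Any other user can now look up the port without touching /proc.
    with open(state_path) as f:
        return json.load(f)["port"]

path = os.path.join(tempfile.mkdtemp(), "llm-state.json")
publish_llm(path, "llama3", pid=12345, port=11434)  # 11434 is ollama's default port
print(attached_port(path))  # 11434
```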
I just added a fourth GPU to my desktop. genv devices
does not show the new index, and I can't attach to it with --index 3.
I tried pip uninstall and pip install (latest release), but that didn't fix it. Any suggestions would be much appreciated.
(base) anindya@SGPUW2:~$ nvidia-smi
Tue Mar 19 15:33:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:04:00.0 Off | N/A |
| 0% 34C P8 2W / 165W | 19MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4080 Off | 00000000:17:00.0 Off | N/A |
| 0% 42C P8 19W / 320W | 9MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4060 Ti Off | 00000000:65:00.0 Off | N/A |
| 0% 35C P8 3W / 165W | 8MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Quadro RTX 8000 Off | 00000000:B3:00.0 Off | Off |
| 33% 31C P8 6W / 260W | 5MiB / 49152MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2737 G /usr/lib/xorg/Xorg 14MiB |
| 1 N/A N/A 2737 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2737 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2737 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
(base) anindya@SGPUW2:~$ genv devices
ID ENV ID ENV NAME ATTACHED
0
1
2
(base) anindya@SGPUW2:~$
Hi,
Great project. I started to play with it, and it seems to work well in bash, but it fails immediately in Zsh.
> genv activate
_genv_backup_env:2: bad substitution
You probably used a bash-specific idiom that doesn't work in Zsh.
As generic advice, running shellcheck
might give you improvement ideas:
https://github.com/koalaman/shellcheck
When genv attach
fails, the shell thinks the environment is detached (i.e. CUDA_VISIBLE_DEVICES
is -1
), but it is still attached in genv.
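A sketch of one way to keep the two in sync (an assumption about a fix, not genv's actual code): update the shell-visible variable only after the core attach succeeds, and roll the core state back if the export step fails, so shell and genv can never disagree.

```python
import os

def attach(env_id, indices, core_attach, core_detach):
    """Transactional attach: core_attach/core_detach are hypothetical
    callables standing in for genv's internal state updates."""
    core_attach(env_id, indices)  # may raise; nothing exported yet
    try:
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, indices)) or "-1"
    except Exception:
        core_detach(env_id, indices)  # roll back so state stays consistent
        raise
```

If core_attach raises, the shell never sees a changed CUDA_VISIBLE_DEVICES; if the export fails, the core state is detached again.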
Hello, I'm in an academic AI lab that uses genv to manage access to our compute resources. The tool has been great so far, but one feature that would be really convenient for us is the ability to set different quotas/constraints for different users, e.g. a lower max_devices
for certain students (we want PhD students to have access to more GPUs than undergrads, for example). A simple per-user config file that defines each user's quotas would suffice.
This kind of feature would be greatly appreciated. Thanks!
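A hypothetical config format and lookup for such per-user quotas (not an existing genv feature; the file layout and field names here are assumptions):

```python
import json

DEFAULT_QUOTA = {"max_devices": 1}

def quota_for(user, config):
    # Per-user entries override the default; unknown users get the default.
    return {**DEFAULT_QUOTA, **config.get("users", {}).get(user, {})}

config = json.loads("""
{
  "users": {
    "phd_student": {"max_devices": 4},
    "undergrad":   {"max_devices": 1}
  }
}
""")
print(quota_for("phd_student", config))  # {'max_devices': 4}
print(quota_for("visitor", config))      # {'max_devices': 1}
```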
Installed in a jupyterhub/singleuser:3.1.0
docker container with conda install -y -c conda-forge genv
. Running genv
, genv config
, or genv activate
returns:
genv-help: command not found
Hey, first of all, really cool idea for a tool.
One of the most annoying aspects of sharing a workstation with others is managing different needs for library versions (cuDNN, CUDA Toolkit, etc.). While PyTorch handles this by bundling the required libraries into its Python package, TensorFlow still requires users to install the necessary library versions themselves.
Would you consider supporting the management of different library/driver versions per environment through this tool?
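A hedged sketch of one possible direction (not an existing genv feature; the layout is an assumption): give each environment its own library prefix and prepend it to the dynamic linker's search path on activation, so TensorFlow users can pin their own cuDNN/CUDA Toolkit builds per environment.

```python
import os

def activate_libs(env_root, environ=os.environ):
    """Prepend <env_root>/lib to LD_LIBRARY_PATH, preserving existing entries."""
    lib_dir = os.path.join(env_root, "lib")
    current = environ.get("LD_LIBRARY_PATH", "")
    environ["LD_LIBRARY_PATH"] = lib_dir + (":" + current if current else "")
    return environ["LD_LIBRARY_PATH"]
```

Driver versions are a different story, since the kernel driver is shared host-wide, but user-space libraries could plausibly be scoped this way.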
I am running a script with genv and allocating one GPU. After starting the script, I ran the enforcement command with a rule of 0 devices per user. genv detects that I am using more than I am allowed:
User ekinkarabulut is using 1 devices which is 1 more than the maximum allowed
Detaching environment 43155 of user ekinkarabulut from device 0
It detaches the genv environment from the device, and I no longer see any device attached when I run genv devices
:
ID ENV ID ENV NAME ATTACHED
0
1
However, it doesn't terminate the process, so my job is still running (I can see it when I check nvidia-smi
):
Wed Aug 2 09:47:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 73C P0 75W / 149W | 505MiB / 11441MiB | 43% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:00:05.0 Off | 0 |
| N/A 38C P8 28W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 43155 C ray::_wrapper 502MiB |
+-----------------------------------------------------------------------------+
Enforcement with sudo, using sudo -E env PATH="$PATH" genv enforce --interval 3 --max-devices-per-user 0
, gives the same result.
P.S.: To be sure this is not a general issue, I also ran a normal script (without Ray) within a genv environment and enforced the same rule; the process was terminated smoothly. It seems to be an issue specific to the Ray integration.
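A sketch of the behavior the report expects (an assumption about a fix, not genv's current enforcement): after detaching an environment, also signal the processes still holding GPU memory, e.g. the ray::_wrapper pid shown above.

```python
import os
import signal

def terminate_pids(pids, sig=signal.SIGTERM):
    """Signal each pid; return the pids that could not be signalled."""
    failed = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass                 # process already exited
        except PermissionError:
            failed.append(pid)   # another user's process; needs privileges
    return failed
```

For Ray specifically, the raylet may respawn or keep workers alive, so enforcement might instead need to go through Ray's own job/actor termination APIs rather than raw signals.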