
run-ai / genv


GPU environment and cluster management with LLM support

Home Page: https://www.genv.dev

License: GNU Affero General Public License v3.0

Python 96.87% Shell 2.75% Dockerfile 0.38%
bash container-runtime containers data-science deep-learning docker gpu gpus jupyter-notebook jupyterlab-extension k8s kubernetes llm-inference llms nvidia-gpu ollama ray vscode vscode-extension zsh

genv's People

Contributors

davidlif, ekinkarabulut, gitter-badger, razrotenberg


genv's Issues

Incompatible with Anaconda

It does not seem like genv is actually compatible with conda. If you run genv activate and then activate your conda environment, your genv reservation dies. Similarly, if you activate your conda environment first and then install genv, you cannot use genv activate at all anymore; the command is rejected. It's kind of messy.

Multi-machine support

As part of an AI-focused institute, we find genv a great tool for handling single machines. However, we have a fleet of many machines. Are there any plans on your side to support or develop features for managing GPU availability across a cluster, or simply across multiple machines?

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Upon executing genv activate I get:

Traceback (most recent call last):
  File "/mnt/mass_disk1/e.abc/miniconda3/bin/genv", line 10, in <module>
    sys.exit(main())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/__main__.py", line 116, in main
    activate.run(args.shell, args)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/cli/activate.py", line 67, in run
    genv.core.envs.activate(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/envs.py", line 88, in activate
    with State() as envs:
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 56, in __enter__
    return self.load()
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/core/utils.py", line 35, in load
    self._state = genv.utils.load_state(
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/site-packages/genv/utils/utils.py", line 45, in load_state
    o = json.load(f, cls=json_decoder)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/mnt/mass_disk1/e.abc/miniconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

No idea where it stems from, but it looks like a genv and/or environment issue, hence this report. I installed genv via conda and its version is 1.2.0.
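
This particular exception means json.load received input that does not start with a JSON value, most commonly an empty file, so the state file genv reads here was likely empty or truncated. A minimal reproduction of the same error, independent of genv:

```python
import json

# json raises exactly this error when the input is empty, which is what an
# empty or truncated state file would produce.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```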

genv.ray.remote does not work on classes

@genv.ray.remote works on functions but not on classes.

When it is used on functions, it works:

@genv.ray.remote(num_gpus=1)
def train():
    ...

But when it is used on classes:

@genv.ray.remote(num_gpus=1)
class Trainer:
    def train(self):
        ...

it throws the following error:

2023-07-21 08:44:38,183	INFO worker.py:1636 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/home/ekinkarabulut/genv/ray_genv.py", line 105, in <module>
    main_task = trainer.train.remote()
                ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ray._raylet.ObjectRef' object has no attribute 'train'
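
For context, this is the classic symptom of a decorator that handles functions but not classes: if the wrapper submits the class like a task, .remote() hands back a result reference instead of an actor handle, so method access fails. A minimal sketch of that failure mode, with all names hypothetical and no Ray required (genv's actual internals may differ):

```python
class FakeObjectRef:
    """Stand-in for the reference Ray returns from a task submission."""

class function_only_remote:
    """Hypothetical decorator that treats every target, even a class, as a task."""

    def __init__(self, target):
        self._target = target

    def remote(self, *args, **kwargs):
        self._target(*args, **kwargs)  # runs the class's __init__ like a function call
        return FakeObjectRef()         # caller gets a ref, not an actor handle

@function_only_remote
class Trainer:
    def train(self):
        return "trained"

trainer = Trainer.remote()
print(hasattr(trainer, "train"))  # False, so trainer.train.remote() raises AttributeError
```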

LLM attach fails in multi user scenarios because of Linux /proc permissions

From the documentation it should be possible to serve an LLM as one user "user1" and then have other users, e.g. "user2", attach to it via genv llm attach modelname.

However, in practice this fails on Linux hosts because genv, when run as "user2", cannot determine the ollama port from the ollama process ID if "user1" hosts the model. This is because /proc/<pid>/fd is only readable by the user who owns the process, in this case "user1".

A workaround is to punch holes in Linux's process isolation, but that's far from ideal. Ideally, genv would track the ollama port alongside the process ID and make it available to other users, or solve this differently altogether.
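
The restriction is easy to verify; a minimal probe (assuming a Linux host, the helper name is made up):

```python
import os

def can_read_fds(pid: int) -> bool:
    """True if /proc/<pid>/fd is listable, i.e. we own the process or are root."""
    try:
        os.listdir(f"/proc/{pid}/fd")
        return True
    except PermissionError:
        return False

# Our own process is always readable; "user1"'s ollama process would not be
# when this runs as "user2".
print(can_read_fds(os.getpid()))  # True
```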

New GPU Addition Not Showing

I just added a fourth GPU to my desktop. genv devices does not show the new index, and I can't attach with --index 3.

I tried pip uninstall and pip install (latest release), but that didn't work. Any suggestions would be much appreciated.

(base) anindya@SGPUW2:~$ nvidia-smi
Tue Mar 19 15:33:50 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:04:00.0 Off |                  N/A |
|  0%   34C    P8              2W /  165W |      19MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4080        Off |   00000000:17:00.0 Off |                  N/A |
|  0%   42C    P8             19W /  320W |       9MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:65:00.0 Off |                  N/A |
|  0%   35C    P8              3W /  165W |       8MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Quadro RTX 8000                Off |   00000000:B3:00.0 Off |                  Off |
| 33%   31C    P8              6W /  260W |       5MiB /  49152MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                             14MiB |
|    1   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
(base) anindya@SGPUW2:~$ genv devices
ID      ENV ID      ENV NAME        ATTACHED
0
1
2
(base) anindya@SGPUW2:~$

zsh error

Hi,

Great project. I started to play with it, and it seems to work well in Bash, but it fails immediately in Zsh.

> genv activate                     
_genv_backup_env:2: bad substitution

You probably used a Bash-specific idiom that doesn't work in Zsh.

As general advice, running ShellCheck on the scripts could surface improvement ideas:
https://github.com/koalaman/shellcheck
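
For what it's worth, one common culprit is Bash's indirect expansion ${!name}, which Zsh rejects with exactly "bad substitution"; whether that is the idiom genv uses here is an assumption. A quick probe of both shells:

```python
import shutil
import subprocess

# "${!name}" (indirect expansion) is a bash-only idiom; zsh fails on it
# with "bad substitution". Whether genv's hooks use this exact idiom is
# an assumption.
snippet = 'name=HOME; echo "${!name}"'

bash = subprocess.run(["bash", "-c", snippet], capture_output=True, text=True)
print(bash.returncode)  # 0: bash expands the idiom without complaint

if shutil.which("zsh"):
    zsh = subprocess.run(["zsh", "-c", snippet], capture_output=True, text=True)
    print(zsh.returncode != 0)  # True: zsh rejects it with "bad substitution"
```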

Feature request: support for per-user enforcement

Hello, I'm in an academic AI lab that uses genv to manage access to our compute resources. The tool has been great so far. One thing that would be really convenient for our lab is the ability to set different quotas/constraints per user, e.g. a lower max_devices for certain students (we want PhD students to have access to more GPUs than undergrads, for example). A simple per-user config file that defines each user's quotas would suffice.

This kind of feature would be greatly appreciated. Thanks!
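
A sketch of what such a per-user quota lookup could look like; the file format, keys, and helper are all hypothetical, and nothing like this exists in genv today:

```python
import json

# Hypothetical per-user quota file, e.g. /etc/genv/quotas.json (made-up path).
QUOTAS = json.loads("""
{
  "phd-student": {"max_devices": 4},
  "undergrad":   {"max_devices": 1}
}
""")

def max_devices_for(user: str, default: int = 2) -> int:
    """Quota for a user, falling back to a site-wide default."""
    return QUOTAS.get(user, {}).get("max_devices", default)

print(max_devices_for("phd-student"))  # 4
print(max_devices_for("visitor"))      # 2 (the default)
```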

genv-help: command not found

Installed genv in a jupyterhub/singleuser:3.1.0 Docker container with conda install -y -c conda-forge genv.

Running genv, genv config, or genv activate returns:

genv-help: command not found

Configure CUDA and other library versions

Hey, first of all, really cool idea for a tool.

One of the most annoying aspects of sharing a workstation with others is managing different needs for library versions (cuDNN, CUDA Toolkit, etc.). While PyTorch handles this by bundling the required libraries into the Python package, TensorFlow still requires users to install the necessary library versions themselves.

Would you consider supporting the management of different library/driver versions per environment through this tool?
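
For reference, the usual per-environment approach today is to prepend an environment-specific CUDA install to the dynamic loader's search path before launching the job. The paths below are hypothetical, and this is not a genv feature:

```python
import os

# Hypothetical per-environment CUDA/cuDNN install; prepending its lib
# directory to LD_LIBRARY_PATH makes the dynamic loader pick these
# versions first for the launched process.
cuda_home = "/opt/cuda-11.8"
env = dict(os.environ)
env["LD_LIBRARY_PATH"] = f"{cuda_home}/lib64:" + env.get("LD_LIBRARY_PATH", "")

print(env["LD_LIBRARY_PATH"].split(":")[0])  # /opt/cuda-11.8/lib64
# subprocess.run([...training command...], env=env) would then use them.
```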

genv enforce is not terminating the process when using ray

I am running a script with genv, with 1 GPU allocated. I then ran the enforcement command with 0 devices as the enforcement rule. genv detects that I am using more than I am allowed:

User ekinkarabulut is using 1 devices which is 1 more than the maximum allowed
Detaching environment 43155 of user ekinkarabulut from device 0

It detaches the genv environment from the device. I can't see any device attached when I run genv devices:

ID      ENV ID      ENV NAME        ATTACHED
0
1

However, it doesn't terminate the process, so my job is still running (I can see it when I check nvidia-smi):
Wed Aug  2 09:47:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    75W / 149W |    505MiB / 11441MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   38C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43155      C   ray::_wrapper                     502MiB |
+-----------------------------------------------------------------------------+

Enforcing with sudo, using sudo -E env PATH="$PATH" genv enforce --interval 3 --max-devices-per-user 0, gives the same result.

P.S.: To make sure this is not a general issue, I also ran another script within a genv environment and enforced the same rule; the process is terminated smoothly for normal scripts without Ray. It seems to be specific to the Ray integration.
