Hi, I found something weird when training EfficientZero. I trained the agent on a P40

All memory seems on the first GPU about efficientzero HOT 9 OPEN

yewr commented on August 22, 2024

All memory seems on the first GPU

from efficientzero.

Comments (9)

YeWR commented on August 22, 2024

It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?

By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).

geekyutao commented on August 22, 2024

It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?

By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).

Thank you for your reply. I set '--num_gpus 4 --num_cpus 28' and @ray.remote(num_cpus=0.5), but the problem is still there. Have you ever trained EfficientZero on other servers? Is the GPU memory distribution normal? I still cannot figure out why all the memory is on the first GPU. Thanks.

geekyutao commented on August 22, 2024

I switched to a V100-16G server, but the problem is still there. It's really weird; I have never seen this before. I can share some screenshots.

geekyutao commented on August 22, 2024

It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?
By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).

Thank you for your reply. I set '--num_gpus 4 --num_cpus 28' and @ray.remote(num_cpus=0.5), but the problem is still there. Have you ever trained EfficientZero on other servers? Is the GPU memory distribution normal? I still cannot figure out why all the memory is on the first GPU. Thanks.

btw, I changed @ray.remote to @ray.remote(num_cpus=0.5) at line 14 of EfficientZero/core/reanalyze_worker.py (commit a0c0948). Is this the right place?

YeWR commented on August 22, 2024

I noticed that you set --gpu_actor 4 here. That is why only one GPU is in use: each reanalyze GPU actor takes 0.125 GPU, so 4 x 0.125 = 0.5 GPU in total. Could you use more actors and share the full screenshot of nvidia-smi?
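To make the arithmetic above concrete, here is a rough sketch of how many GPUs a given number of fractional actors would occupy. It assumes the scheduler packs fractional requests onto one GPU before moving to the next, and it ignores any GPU memory used by the main process or other workers. The function name gpus_needed is hypothetical and is not part of EfficientZero or ray.

```python
from fractions import Fraction
import math


def gpus_needed(num_actors, gpu_per_actor):
    """Estimate how many GPUs are occupied by `num_actors` actors.

    Assumption: fractional GPU requests are packed densely, so the
    answer is simply the total request rounded up to a whole GPU.
    """
    if num_actors <= 0:
        return 0
    # Use exact fractions to avoid floating-point rounding surprises.
    total = Fraction(str(gpu_per_actor)) * num_actors
    return math.ceil(total)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, "actors x 0.125 GPU ->", gpus_needed(n, 0.125), "GPU(s)")
```

Under these assumptions, 4 actors at 0.125 GPU each fit on a single GPU, which would match the observation that only the first GPU shows memory usage.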

Also, are you using ray 1.0.0? If you are using the latest version of ray, the main process will share a GPU with the remote processes.

To figure out the GPU usages, you can refer to https://docs.ray.io/en/releases-1.0.0/using-ray-with-gpus.html.
Here is a simple demo in Python:

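import os
import ray

# Tell ray that 4 GPUs are available on this machine.
ray.init(num_gpus=4)

# Each task reserves one whole GPU, so the 4 tasks can be placed on different GPUs.
@ray.remote(num_gpus=1)
def use_gpu():
    # Print the GPU ids ray assigned to this worker and what CUDA will see.
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

# Launch 4 tasks and wait for all of them to finish.
ray.get([use_gpu.remote() for _ in range(4)])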

You will find that you are able to use multiple GPUs in ray.

Hope this will help you :)

YeWR commented on August 22, 2024

It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?
By the way, if you find CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).

Thank you for your reply. I set '--num_gpus 4 --num_cpus 28' and @ray.remote(num_cpus=0.5), but the problem is still there. Have you ever trained EfficientZero on other servers? Is the GPU memory distribution normal? I still cannot figure out why all the memory is on the first GPU. Thanks.

btw, I changed @ray.remote to @ray.remote(num_cpus=0.5) at line 14 of EfficientZero/core/reanalyze_worker.py (commit a0c0948). Is this the right place?

That's fine. You can also change line 266 of reanalyze_worker.py to @ray.remote(num_gpus=0.125, num_cpus=0.5). However, your issue does not seem to be caused by this.

geekyutao commented on August 22, 2024

Thank you for your detailed reply. I really appreciate it. Here are some observations:

I use ray 1.0.0, as pinned in your requirements.txt (EfficientZero/requirements.txt, line 2 in a0c0948: ray==1.0.0).
I also tried upgrading ray to version 1.9, but ran into new errors.
ray 1.0.0 actually works well. I ran the code with ray 1.0.0 on my local machine (2 x 2080Ti) and the results seemed normal, apart from running out of memory.
Unfortunately, the servers in my GPU cluster (4 x 24G P40, 4 x 16G V100, etc.) cannot show the full nvidia-smi output, for reasons I don't understand.
It's weird that my local machine can allocate memory/workers while the servers cannot. I'm still confused about how ray works.
I'll try the demo in your reply.

Many thanks! This must have taken you a lot of time.

geekyutao commented on August 22, 2024

I would say this is magic. Perhaps it is because I use Docker on my server, in which case some of ray's detection functions may not work. For example:

YeWR commented on August 22, 2024

That can happen when the remote functions finish quickly. Maybe you can try a remote class instead.

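import os
import ray
import time

# Tell ray that 4 GPUs are available on this machine.
ray.init(num_gpus=4)

# Each actor reserves one whole GPU for as long as it is alive.
@ray.remote(num_gpus=1)
class Test:
    def use_gpu(self):
        # Print the GPU ids ray assigned to this actor and what CUDA will see.
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
        print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
        time.sleep(1)

# Create 4 actors first, then call use_gpu on each and wait for the results.
testers = [Test.remote() for _ in range(4)]
ray.get([tester.use_gpu.remote() for tester in testers])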
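When reading the output of the demo above, it may help to check how many distinct GPUs the workers actually report. The following is only a sketch using the standard library. The helper names visible_gpus and distinct_gpus are hypothetical, and the sample values in the usage part are made up for illustration, not taken from this thread.

```python
import os


def visible_gpus(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable not set: CUDA does not restrict which GPUs are visible.
        return None
    # An empty string yields an empty list, i.e. no GPU is visible.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def distinct_gpus(per_worker):
    """Count distinct GPU ids across the workers' visible-device lists."""
    seen = set()
    for ids in per_worker:
        seen.update(ids or [])
    return len(seen)


if __name__ == "__main__":
    # Illustrative values only: four workers that all report GPU "0".
    reported = [["0"], ["0"], ["0"], ["0"]]
    print("distinct GPUs in use:", distinct_gpus(reported))
    print("this process sees:", visible_gpus())
```

If every worker reports the same id, all work lands on one GPU. If the count equals the number of workers, each worker has its own device.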
