Comments (9)

YeWR commented on August 22, 2024

It seems that only one GPU is allocated. Have you set --num_gpus 4 --num_cpus 28?

By the way, if CPU resources are not enough, you can set @ray.remote(num_cpus=0.5).

from efficientzero.

geekyutao commented on August 22, 2024

Thank you for your reply. I set '--num_gpus 4 --num_cpus 28' and @ray.remote(num_cpus=0.5), but the problem is still there. Have you ever trained EfficientZero on other servers? Is the GPU memory distribution normal? I still cannot figure out why all the memory is on the first GPU. Thanks.

geekyutao commented on August 22, 2024

I switched to a V100-16G server, but the problem is still there. It's really weird; I have never run into this before. Here are some screenshots.
[three screenshots attached]

geekyutao commented on August 22, 2024

By the way, I modified this ("@ray.remote(num_cpus=0.5)") in [linked code snippet not preserved]. Is this the right place?

YeWR commented on August 22, 2024

I noticed that you set --gpu_actor 4 here. That's why only one GPU is in use: each reanalyze GPU actor takes 0.125 GPU, so 4 x 0.125 = 0.5 GPU in total. Could you use more actors and share the full nvidia-smi screenshot?
Like this:
image
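
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not EfficientZero code: the helper name gpu_demand is hypothetical, and it assumes fractional GPU requests are packed onto as few devices as possible. It also ignores any GPU share taken by other workers or the main process.

```python
import math


def gpu_demand(num_actors, gpu_per_actor):
    """Return (total GPU fraction requested, whole GPUs that fraction spans).

    Hypothetical helper for illustration; assumes fractional requests are
    packed onto as few GPUs as possible.
    """
    total = num_actors * gpu_per_actor
    return total, math.ceil(total)


# 4 reanalyze GPU actors at 0.125 GPU each fit on a single device.
print(gpu_demand(4, 0.125))   # (0.5, 1)
# It would take 32 such actors to fill 4 GPUs.
print(gpu_demand(32, 0.125))  # (4.0, 4)
```

Under that assumption, seeing memory on only one GPU with --gpu_actor 4 is the expected outcome rather than a sign of a broken setup.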

Also, are you using ray 1.0.0? With the latest version of ray, the main process shares a GPU with the remote processes.

To understand how ray assigns GPUs, you can refer to https://docs.ray.io/en/releases-1.0.0/using-ray-with-gpus.html.
Here is a simple demo in Python:

import os
import ray

# Tell ray that 4 GPUs are available.
ray.init(num_gpus=4)

# Each task reserves one whole GPU.
@ray.remote(num_gpus=1)
def use_gpu():
    print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
    print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

# Launch 4 tasks and wait for all of them to finish.
ray.get([use_gpu.remote() for _ in range(4)])

You will find that you are able to use multiple GPUs in ray.
image

Hope this will help you :)

YeWR commented on August 22, 2024

Yes, that place is fine. You can also change line 266 of reanalyze_worker.py to @ray.remote(num_gpus=0.125, num_cpus=0.5). But your issue does not seem to be caused by this.

geekyutao commented on August 22, 2024

Thank you for your detailed reply. I really appreciate it. Here are some observations:

  1. I use ray 1.0.0, as specified in your requirements.txt.
  2. I also tried upgrading ray to 1.9, but ran into new errors such as the following.
    image
  3. ray 1.0.0 actually works well. I ran the code with ray 1.0.0 on my local machine (2 x 2080Ti), and the results looked normal apart from running out of memory.
    image
    image
  4. Unfortunately, the servers in my GPU cluster (4 x 24G P40, 4 x 16G V100, etc.) cannot show the full nvidia-smi output, for reasons I don't know.
  5. It's weird that my local machine can allocate memory/workers while the servers cannot. I'm still confused about how ray works.
  6. I'll try the demo in your reply.

Many thanks! This must have taken you a lot of time.

geekyutao commented on August 22, 2024

This looks like magic to me. Perhaps it is because I use Docker on my server; in that case, some of ray's detection functions may not work. For example:
image
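
If GPU detection inside a container is the suspect, one quick check is to look at what the process environment exposes before ray starts. The sketch below is an illustration under assumptions, not part of EfficientZero or ray: visible_gpu_ids is a hypothetical helper that only parses the CUDA_VISIBLE_DEVICES variable, the same variable the demo in this thread prints.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device id strings.

    Hypothetical helper. Returns None when the variable is unset (no
    restriction is expressed), and [] when it is set but empty (no GPU
    is exposed to the process).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # ['0', '1', '2', '3']
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": ""}))         # []
print(visible_gpu_ids({}))                                   # None
```

If the list is shorter than expected inside the container, passing the GPU count explicitly, as ray.init(num_gpus=4) does in the demo, may be worth trying.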

YeWR commented on August 22, 2024

That can happen when the remote functions finish quickly. Maybe you can try a remote class instead.

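import os
import ray
import time

# Tell ray that 4 GPUs are available.
ray.init(num_gpus=4)

# Each actor reserves one whole GPU.
@ray.remote(num_gpus=1)
class Test():
    def __init__(self):
        pass

    def use_gpu(self):
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
        print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
        # Keep the call running for a moment so it does not return instantly.
        time.sleep(1)

# Create 4 actors, then call use_gpu on each and wait for the results.
testers = [Test.remote() for _ in range(4)]
ray.get([tester.use_gpu.remote() for tester in testers])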
