Comments (12)

xuanlinli17 commented on June 9, 2024

It's a resource limitation on scaling up envs. Which env are you running, and what are the GPU VRAM, CPU, and DRAM usage right before the error occurs? It might not be due to the GPU but to insufficient DRAM (64 GB of DRAM might not be enough for 64 parallel envs for RL training).

For visual environments (rgbd/pointcloud obs mode), it's challenging to scale to a large number of envs on a single GPU... @fbxiang
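
If it helps, something like the following (a rough sketch, not part of ManiSkill; it assumes psutil is installed and nvidia-smi is on PATH) can be called around env.step(), so the last line printed shows the usage right before the crash:

```python
import subprocess
import psutil

def log_usage(tag=""):
    # CPU and DRAM from psutil; VRAM per GPU from nvidia-smi
    vm = psutil.virtual_memory()
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    ).stdout.strip().replace("\n", " | ")
    print(f"[{tag}] CPU {psutil.cpu_percent():.0f}% | "
          f"DRAM {vm.used / 1e9:.1f}/{vm.total / 1e9:.1f} GB | "
          f"VRAM used/total MiB: {gpu}")
```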

fbxiang commented on June 9, 2024

It is most likely a GPU memory issue caused by either running out of VRAM or using an old GPU driver. However, I believe with 16 CPU cores, you do not gain much performance when scaling from 32 envs to 64 envs.

xuanlinli17 commented on June 9, 2024

Yes, you can run experiments on multiple GPUs. See ManiSkill2-Learn.

fbxiang commented on June 9, 2024

I have run PickCube on 128 envs and PickSingleYCB on 64 envs on a single RTX 4090 GPU. However, I did run into issues with older NVIDIA drivers, so all tests were performed on the latest driver.
The precise issue would be quite hard to find, as there are too many things that could go wrong. For example, on a previous driver version, the application seemed to hit a rendering semaphore/fence limit when using 128 environments, and such problems are very hard to understand even for us.
If you are doing state-based environments, there should be no difference between using our parallel environments and an external parallel wrapper, since our implementation is only optimized for rendering.
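
For example, for state observations you could drive the envs with a generic vectorized wrapper. A minimal sketch (assuming ManiSkill2's PickCube-v0, obs_mode="state", and the older gym API with AsyncVectorEnv; not tested here):

```python
import gym
import mani_skill2.envs  # noqa: F401  (registers the ManiSkill2 environments)

def make_env():
    return gym.make("PickCube-v0", obs_mode="state")

if __name__ == "__main__":
    # One subprocess per env; no rendering is needed for state observations.
    envs = gym.vector.AsyncVectorEnv([make_env for _ in range(32)])
    obs = envs.reset()
    obs, rewards, dones, infos = envs.step(envs.action_space.sample())
    envs.close()
```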

xuanlinli17 commented on June 9, 2024

Could you compare the setup differences between the Xfast remote desktop and your local desktop, besides the existence of graphical windows? Some factors like NVIDIA driver versions could play a role, and I recommend upgrading to the latest (530+).

fbxiang commented on June 9, 2024

I would also like to point out that the error

[2023-07-12 18:17:09.759] [svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[2023-07-12 18:17:09.759] [svulkan2] [warning] Continue without GLFW.

should be completely harmless. This error means that the simulator automatically detects that it should not connect to a graphical window. If you do not want to see this error, you can pass offscreen_only to SapienRenderer, but this should not make a difference.
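
For reference, constructing the renderer offscreen would look roughly like this (a sketch assuming SAPIEN 2.x; the only relevant part is the offscreen_only flag):

```python
import sapien.core as sapien

engine = sapien.Engine()
# offscreen_only avoids any attempt to create a GLFW/X11 window
renderer = sapien.SapienRenderer(offscreen_only=True)
engine.set_renderer(renderer)
```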

yichao-liang commented on June 9, 2024

Thanks so much for your prompt reply! I was running the PickCube environment in state-based mode. With 32 environments, it had about 47.90% CPU efficiency and 41.68% memory efficiency. I also tried using more CPUs and memory (e.g., 32 CPUs and 128 GB of memory), but it still wouldn't work, so I suspect it's unlikely to be a CPU/DRAM issue.

Interestingly, I was able to run it successfully once among dozens of attempts, and that run used only about 4 of the 80 GB of GPU memory.

I can also request more GPUs. Is there a way to run these experiments on multiple GPUs?

Thanks so much for your help!

yichao-liang commented on June 9, 2024

Thanks, Xuan Lin, for your pointers! They are incredibly helpful. I'll look into them.

However, I am still having difficulty diagnosing the precise cause of the error. The majority of my tasks rely on state-based observations, which I believe should not consume much GPU memory, and the GPU memory logs seem to corroborate this.

I was wondering if you could recommend any specific tests or diagnostics that could assist me in this process. I am eager to learn more about how to accurately determine the causes of these types of issues.

Additionally, I am curious about how the environments scale with hardware resources. Have you, or anyone on the team, successfully run 64 parallel environments on a single GPU, given adequate CPU and memory resources? Any insights or experiences you could share would be greatly appreciated.

Thank you once again for your assistance and support!

xuanlinli17 commented on June 9, 2024

@fbxiang

We haven't run 64 parallel envs on a single GPU (instead we split them across different GPUs). The job is also CPU-bound anyway when the total number of envs is large.
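
Roughly, the splitting looks like this (a simplified sketch, not our actual training code; it assumes each worker process can be pinned to a GPU via CUDA_VISIBLE_DEVICES before the env is created, which may need adjustment for how the Vulkan renderer picks its device):

```python
import os
import gym

def make_env(gpu_id):
    def _init():
        # Restrict this worker process to one GPU before the env is created.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        import mani_skill2.envs  # noqa: F401  (registers envs)
        return gym.make("PickCube-v0", obs_mode="rgbd")
    return _init

n_gpus, n_envs = 4, 64
env_fns = [make_env(i % n_gpus) for i in range(n_envs)]  # 16 envs per GPU
envs = gym.vector.AsyncVectorEnv(env_fns)
```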

yichao-liang commented on June 9, 2024

Thanks so much for your help with the issue!

I assume your experiments were using state-based environments?

I'll try to update the NVIDIA driver and run it again. Meanwhile, I'll look into using ManiSkill2-Learn. Thanks!

xuanlinli17 commented on June 9, 2024

Yes we are mainly using visual environments.

yichao-liang commented on June 9, 2024

Thank you once again for your prompt response. After discussing the issue with my system administrator, they pointed out that the error could be caused by the simulator attempting to open a graphical window without the necessary environment to do so, as indicated in the error message below.

[2023-07-12 18:17:09.759] [svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[2023-07-12 18:17:09.759] [svulkan2] [warning] Continue without GLFW.

While I understand that for state-based environments the simulator should not need to open any graphical windows, I ran an experiment through an Xfast remote desktop connected to the server to test this. Interestingly, the run succeeded in that setup.

Further, when I attempted to scale the number of environments from 64 to 128, I encountered a "Maximum number of clients reached" error on the Xfast server. This led me to wonder whether the issue could be bypassed by disabling the connection to a graphical window, given that our environments do not seem to need one.

Therefore, my question is: is it feasible to stop the simulator from attempting to connect to a graphical window, given the nature of our simulation? Would this approach make sense in our case?
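
Concretely, what I have in mind is something like this (hypothetical; I have not verified whether this is the right hook, or whether the offscreen switch should instead be passed to the renderer):

```python
import os

# Drop DISPLAY before the simulator is imported, so GLFW never tries to
# reach the X/Xfast server (untested; just illustrating the idea).
os.environ.pop("DISPLAY", None)

import gym
import mani_skill2.envs  # noqa: F401

env = gym.make("PickCube-v0", obs_mode="state")
```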

Thank you very much for your invaluable assistance and guidance!
