Comments (5)

ege-k commented on July 17, 2024

Hi,

Thanks for the answer. After my colleague posted this question, we actually managed to add DataParallel logic to the learner, and get roughly 2.5x-3x speedup (when distributing the learner across 3 GPUs while using another GPU for the actor). We would be happy to issue a pull request if this would be useful to the community.
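
For a rough picture of that kind of change, the learner network can be wrapped along the following lines. This is a minimal sketch only, with a placeholder network rather than the actual torchbeast learner code; the GPU split is the one described above (three learner GPUs, one actor GPU).

```python
import torch
from torch import nn

# Minimal sketch, not the actual torchbeast code: a stand-in learner network.
class Net(nn.Module):
    def __init__(self, num_actions: int = 6):
        super().__init__()
        self.policy = nn.Linear(128, num_actions)

    def forward(self, x):
        return self.policy(x)

learner_gpus = [0, 1, 2]  # GPUs used by the learner
# (a fourth GPU, e.g. cuda:3, stays reserved for the actor model)

model = Net()
if torch.cuda.is_available():
    # DataParallel scatters each learner batch across device_ids, runs the
    # replicas in parallel, and gathers outputs on device_ids[0] before the
    # loss/backward step -- the training loop itself stays unchanged.
    model = nn.DataParallel(model, device_ids=learner_gpus)
    model = model.cuda(learner_gpus[0])
```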

Having done that, we are looking for other ways to speed up training further, and would greatly appreciate feedback from your end on the following ideas:

  1. Switching from DataParallel to DistributedDataParallel, like you suggested. I haven't used DistributedDataParallel before, but from the docs, it seems that I would have to wrap the train() function from polybeast_learner.py with DDP. Could you confirm this intuition?
  2. Also distributing the actor model across multiple GPUs. We have already tried this, but currently it does not work due to dynamic batching. AFAIK, DataParallel should handle indivisible batch sizes, but in our case it seems to divide the batch incorrectly, leading to errors.
  3. Increasing num_actors. It seems like the current bottleneck for us is actually the number of actors. Is there a way to decide on the optimal value for this hyperparameter (in terms of runtime), or should I just try a bunch of different values and pick the fastest? For reference, our machine has 128 cores. Also, is it possible with the current code to run agents across multiple machines? Finally, does this hyperparameter affect the learning dynamics?

Thanks a lot for your time.

heiner commented on July 17, 2024

That's a great question, and doing so is not currently easy. One way to go about this would be to add PyTorch DistributedDataParallel logic to the learner. Another good (perhaps easier) way to use several GPUs is to run several experiments and have an additional "evolution" controller on top, which copies hyperparameters from one agent to the other.

In practice, we've used our GPU fleet to run more experiments in parallel instead. Later this year we also hope to share an update to TorchBeast that allows using more GPUs for a single learner, but it isn't quite ready yet.

heiner commented on July 17, 2024

Hey @ege-k,

That's great and we're more than happy to have a pull request for this.

As for your questions:

DataParallel vs DistributedDataParallel: I forget the exact details of these implementations, but I believe the main difference is that the latter uses multiprocessing to get around contention on the Python GIL. This issue is more acute for RL agents, and slightly harder to deal with, since in this case the data itself is produced by the model and hence cannot just be loaded from disk to be ready when needed. Getting this to work in TorchBeast likely requires not using the top-level abstractions but the underlying tools.
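
For a rough idea of what the underlying tools look like, a minimal, torchbeast-agnostic sketch of the DistributedDataParallel pattern follows. The network, batch, and loss are placeholders; in TorchBeast each process would instead pull its batches from the dynamic-batching queue rather than generating random data.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(nn.Module):  # placeholder learner network
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 6)

    def forward(self, x):
        return self.fc(x)

def learner(rank: int, world_size: int):
    # One process per learner GPU; NCCL performs the gradient all-reduce.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(Net().cuda(rank), device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for _ in range(10):
        # Placeholder batch and loss; in TorchBeast each process would pull
        # its own batches here instead of loading/creating data locally.
        batch = torch.randn(32, 128, device=f"cuda:{rank}")
        loss = model(batch).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are averaged across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 3  # number of learner GPUs
    mp.spawn(learner, args=(world_size,), nprocs=world_size, join=True)
```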

Distributing the actor model: I'd assume a ratio of 1:3 or 1:4 for actor GPUs to learner GPUs is ideal in a typical setting. Once you want to use many more learner GPUs, distributing the actor model makes sense. This could be done by having different learners with different addresses and telling each actor which one to use. Dynamic batching would still happen, but only on that learner.
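
As a toy illustration of that routing idea (not the torchbeast API; the addresses and the round-robin assignment are made up):

```python
# Illustration only, not the torchbeast API: hand each actor exactly one
# learner address at launch time, so dynamic batching happens per learner.
learner_addresses = [
    "unix:/tmp/learner0",  # hypothetical learner endpoints
    "unix:/tmp/learner1",
]

num_actors = 8
actor_to_learner = {
    actor_id: learner_addresses[actor_id % len(learner_addresses)]
    for actor_id in range(num_actors)
}

for actor_id, address in actor_to_learner.items():
    # In a real setup this address would be passed to the actor process,
    # e.g. as a command-line flag, and that actor would only ever talk to
    # its assigned learner.
    print(f"actor {actor_id} -> {address}")
```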

Your third question is the hardest one. Unfortunately, in RL often "everything depends on everything", so I cannot rule out that the number of actors influences the learning dynamics and therefore also changes the optimal hyperparameter setting. It certainly would if you also change batch sizes, which is likely required in order to find the best throughput. I don't think I know of a better way than to try various settings -- aiming to slightly overshoot, as modern Linux kernels are quite efficient at thread/process scheduling, so the context switching does not generate much waste.
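
A crude way to run such a search is to time a short run per candidate value and compare throughput. The sketch below assumes a hypothetical launch command and flag names, so adapt it to whatever your setup actually uses.

```python
import subprocess
import time

# Hypothetical sweep over num_actors: launch a short profiling run for each
# candidate value and compare wall-clock time (or, better, the steps/second
# reported in the logs). Command and flags below are placeholders.
candidates = [64, 128, 192, 256]

for num_actors in candidates:
    start = time.time()
    subprocess.run(
        [
            "python", "-m", "torchbeast.polybeast",
            "--num_actors", str(num_actors),
            "--total_steps", "1000000",  # keep the profiling run short
        ],
        check=True,
    )
    print(f"num_actors={num_actors}: {time.time() - start:.1f}s")
```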

As for the second part of your question: TorchBeast can be fully distributed if you find a way to tell each node the names of its peers. E.g., if you know your setup and have fixed IP addresses, you could hardcode them. Often that's not the case, and you'll need some other means of communicating the names/addresses. E.g., you could use a shared file system (lots of issues around that, but it can work "in practice"), or a real name service, or a lock service on top of something like etcd.
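
As a toy illustration of the shared-file-system variant (with all the caveats above; the directory layout and file format are made up):

```python
import json
import os
import socket
import time
from pathlib import Path

# Toy illustration of peer discovery via a shared file system: every node
# writes its own address to a shared directory and polls until it has seen
# all expected peers. Path and file format here are made up.
SHARED_DIR = Path("/shared/experiment/peers")
EXPECTED_PEERS = 4

def register_self(rank: int, port: int) -> None:
    SHARED_DIR.mkdir(parents=True, exist_ok=True)
    entry = {"rank": rank, "host": socket.gethostname(), "port": port}
    # Write to a temp file first and rename, so peers never read a
    # half-written entry.
    tmp = SHARED_DIR / f".{rank}.tmp"
    tmp.write_text(json.dumps(entry))
    os.replace(tmp, SHARED_DIR / f"{rank}.json")

def wait_for_peers() -> list:
    while True:
        entries = sorted(SHARED_DIR.glob("*.json"))
        if len(entries) >= EXPECTED_PEERS:
            return [json.loads(p.read_text()) for p in entries]
        time.sleep(1.0)
```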

BTW, we are working on somewhat alternative designs currently and might have an update on that in a few weeks. Feel free to drop me an email if you would like to get an early idea of what we want to do.

kshitijkg commented on July 17, 2024

Are there any updates on this @heiner?

I was also wondering if it would be possible for @ege-k @Batom3 to share their code with me?

heiner commented on July 17, 2024

Hey everyone!

I've since left Facebook, but my amazing colleagues have written https://github.com/facebookresearch/moolib, which you should check out for a multi-GPU IMPALA.
