Comments (6)

Mirian-Hipolito avatar Mirian-Hipolito commented on May 14, 2024 1

Hello @AymenTlili131,

We really appreciate your feedback. I was able to reproduce the same error on my end, and it seems this is not a matter of NCCL setup but of the number of GPUs you're trying to assign. When running on NCCL, torch distributed takes the --nproc_per_node argument as the number of GPUs available in your system to run the simulation. However, FLUTE requires at least 2 in order to launch: 1 Server and 1 Worker that can execute many clients. I can see you only have 1 available (GPU 0).

This is the stack trace; as you can see, the problem occurs at assignment time.

[Screenshot: stack trace]

I took a look at the NCCL tests repo and noticed that the -g argument corresponds to the number of available GPUs. That is the reason for the failure: since you only have 1 GPU available, it cannot run with a higher number.

[Screenshot: NCCL tests -g argument]
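
To make the constraint above concrete, here is a minimal pre-flight sketch in Python. It is not part of FLUTE: the names `count_visible_gpus`, `find_launch_problems`, and `MIN_PROCESSES` are hypothetical, and the minimum of 2 processes (1 Server plus 1 Worker) is taken from the explanation in this comment. It assumes one process per GPU and relies only on PyTorch's `torch.cuda.device_count()`, falling back to 0 when PyTorch is not installed.

```python
from typing import List

# Assumption from the explanation above: 1 Server + 1 Worker.
MIN_PROCESSES = 2


def count_visible_gpus() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if PyTorch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def find_launch_problems(nproc_per_node: int, available_gpus: int) -> List[str]:
    """List the reasons a launch with this --nproc_per_node value would be rejected.

    Hypothetical helper: it mirrors the two conditions described in the thread,
    not any check that FLUTE or torch distributed performs internally.
    """
    problems = []
    if nproc_per_node < MIN_PROCESSES:
        problems.append(
            f"--nproc_per_node={nproc_per_node} is below the minimum of "
            f"{MIN_PROCESSES} (1 Server + 1 Worker)"
        )
    if nproc_per_node > available_gpus:
        problems.append(
            f"--nproc_per_node={nproc_per_node} exceeds the "
            f"{available_gpus} GPU(s) available"
        )
    return problems
```

Calling `find_launch_problems(2, count_visible_gpus())` before launching gives an empty list on a machine with two or more GPUs, and a single message about exceeding the available GPUs on a machine with only one.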

You can find more information about the FLUTE architecture here. There is already an open issue for this situation: #15. We apologize for the inconvenience in the meantime.

Regarding the comments about the requirements and Python versions, we will make sure to update them in the next commit.

Let me know if this information is useful or if we can provide more support on this. 🙂

Thanks,
Mirian

from msrflute.

AymenTlili131 avatar AymenTlili131 commented on May 14, 2024 1

Hey @Mirian-Hipolito,
Things are up and running on my end. I'm grateful for your explanation and support, and I hope you and the maintainers have a wonderful rest of the week.
I'll make sure to cite the FLUTE team if I find anything useful!
Thanks again.
Kind regards

Mirian-Hipolito avatar Mirian-Hipolito commented on May 14, 2024 1

Hello @AymenTlili131, we are happy to share that we have removed the restriction on the minimum number of GPUs needed to run FLUTE in our latest release. For more documentation about how to run an experiment using a single GPU, please refer to the README.

AymenTlili131 avatar AymenTlili131 commented on May 14, 2024 1

Hey Mirian,
This is great news. I gained access to other GPUs in the meantime and experimented with working on them remotely, but thanks to your efforts and your colleagues', I can now try out tweaks and proofs of concept at a much smaller scale. Greatly appreciated, and thanks to the entire Microsoft family.

AymenTlili131 avatar AymenTlili131 commented on May 14, 2024

Thanks for writing back so soon.
I'll request access to a workstation with 2 or more GPUs and test it for myself, but this is a solid explanation of why the error was raised, thanks!
I'd still like to keep the issue open until I confirm that it indeed works (no more than a week).
From reading the linked FLUTE architecture, it should work; hopefully I won't take long with the environment setup and testing before I get back to you.

Mirian-Hipolito avatar Mirian-Hipolito commented on May 14, 2024

Thanks @AymenTlili131! Let us know if this issue persists.

Regards,
Mirian.
