
Comments (6)

rashikakheria commented on July 28, 2024

@mvpatel2000 We have followed up internally to get back an answer for you. Thanks for your patience.

from aws-ofi-nccl.

bwbarrett commented on July 28, 2024

No problem at all; glad that all made some sense :). We'd love feedback if you find an approach that works in the long term.


mvpatel2000 commented on July 28, 2024

Maybe @rashikakheria?


mvpatel2000 commented on July 28, 2024

Thanks! I appreciate the help :)


bwbarrett commented on July 28, 2024

Unfortunately, it will not be easy to get what you're looking for. It's possible, but would require you to build some packages yourself, and likely skip at least one of the two installer scripts. The core issue is that operating systems are relatively slow to update the rdma-core package in their distributions, and there are new features both AWS and Mellanox are releasing all the time, which require updating rdma-core. The solution is that both of us ship an rdma-core in our installers, which creates a conflict.

The EFA installer currently ships rdma-core v43. Other than applying Ubuntu's packaging scripts, it is unmodified from the official upstream, and it includes all the providers (i.e., drivers) for all NICs that are upstreamed. The EFA packaging names correspond to the packaging names used by Ubuntu when they package rdma-core.

The Mellanox installer currently ships rdma-core v40, although it appears that it includes Mellanox-specific patches. Unfortunately, it does not include all the providers in upstream, but only installs the mlx5 provider. Additionally (but generally not importantly), the Mellanox packaging seems to conform to the upstream dpkg files, rather than how Ubuntu packages things.

So the difference in package naming is why you get the weird conflict above (and part of why you likely can't use at least one of the installers). But you have another problem with rdma-core: the EFA installer package includes the mlx5 provider (the one you need for modern IB) but does not include whatever patches Mellanox adds that aren't upstreamed. The Mellanox installer package includes the Mellanox patches, but not the efa provider. So neither build gets you entirely what you want on both platforms (although it is likely that if you are only using NCCL, you are using a subset of the Mellanox stack small enough that you aren't using anything that is patched, but I'm not going to be able to answer that question).

I can see two options for solving the rdma-core problem:

  1. Use the EFA installer rdma-core. You may miss out on a new feature of the Mellanox cards, so I wouldn't recommend this if you're running really advanced users of the Mellanox stack like Open MPI, MVAPICH, or UCX, but for NCCL it should be fine. And obviously it will work on EFA.
  2. The source artifacts for MOFED do include the EFA provider, so you could build from source and end up with an rdma-core that worked with EFA and included all the Mellanox extensions on IB. Of course, support on EFA with that MOFED version would be difficult and certainly isn't something AWS tests on a regular basis.
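
To make the trade-off between the builds easier to see, here is a rough sketch that models each rdma-core build as a set of capabilities and reports what each one lacks. The capability labels and variable names are made up for illustration and only restate what is described above; they are not real package metadata, and nothing here inspects an actual system.

```python
# Illustrative model only: the labels below are shorthand for the points made
# in this thread, not names that appear in any package.
WANTED = {"efa provider", "mlx5 provider", "mellanox patches"}

BUILDS = {
    # Unmodified upstream rdma-core with all upstreamed providers.
    "EFA installer rdma-core": {"efa provider", "mlx5 provider"},
    # Patched rdma-core that only installs the mlx5 provider.
    "Mellanox installer rdma-core": {"mlx5 provider", "mellanox patches"},
    # Option 2: MOFED sources include the EFA provider (untested by AWS).
    "MOFED built from source": {"efa provider", "mlx5 provider", "mellanox patches"},
}


def missing(build):
    """Return the wanted capabilities that a build does not provide."""
    return sorted(WANTED - build)


for name, build in BUILDS.items():
    gaps = missing(build)
    print(f"{name}: missing {gaps if gaps else 'nothing'}")
```

In this model, option 1 is missing only the Mellanox patches, which is the gap that probably does not matter for NCCL-only use, while option 2 closes every gap at the cost of running a combination that is not regularly tested on EFA.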

Once you have a working MOFED, I think it's a matter of installing the other packages manually from the installer tarballs. That's a bit of a pain, but should be relatively straightforward. The EFA Libfabric and Open MPI installs will end up in /opt/amazon and the Mellanox packages in /opt/mellanox, and everything should be happy. But, like I said, the installers will go back to clobbering each other over rdma-core, so that has to be avoided.
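
After a manual install like this, a quick sanity check can confirm which rdma-core packages ended up on the system and that both install prefixes exist. The sketch below assumes a Debian/Ubuntu host with `dpkg-query` available, and assumes the Ubuntu-style package names `rdma-core`, `libibverbs1`, and `ibverbs-providers`; the Mellanox packaging may use different names, so treat the list as a placeholder to adjust. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed Ubuntu-style package names; adjust for the packaging actually in use.
PACKAGES = ("rdma-core", "libibverbs1", "ibverbs-providers")
# Install prefixes mentioned above.
PREFIXES = ("/opt/amazon", "/opt/mellanox")


def parse_dpkg_versions(text):
    """Parse 'package version' lines into a dict, skipping incomplete lines."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def installed_versions(packages=PACKAGES):
    """Ask dpkg-query for installed versions; empty dict if dpkg is absent."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", r"-f=${Package} ${Version}\n", *packages],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return {}
    return parse_dpkg_versions(result.stdout)


def prefixes_present(prefixes=PREFIXES):
    """Report which of the expected install prefixes exist as directories."""
    return {p: Path(p).is_dir() for p in prefixes}


print("rdma-core packages:", installed_versions())
print("install prefixes:", prefixes_present())
```

If the reported rdma-core version changes after running one of the installers again, that is a sign the installer has replaced the package you meant to keep.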


mvpatel2000 commented on July 28, 2024

@bwbarrett thank you so much for the detailed answer and explanation -- this is incredibly helpful.

Given all this context, I think the best way to proceed is probably to build separate images for EFA and Mellanox, and come back to merging them later given the challenges. So, I will close this issue in the meantime.

When I have some time, I would like to try to merge them as it solves some maintenance issues for us. Given your explanation, I will try to proceed by installing EFA (and keeping rdma-core unpatched) and then try to layer on top what I need for Mellanox. I will then benchmark NCCL and verify it doesn't cause any performance degradations. I'll share whatever findings I get in this thread later if you or anyone else looking at this in the future finds any need for it :)

Separately, I want to quickly thank you and the rest of the people who have helped me on this repo over GitHub issues. It's been invaluable in digging up some of the esoterics here, and I really appreciate it!

