
Comments (14)

sjeaugey commented on July 20, 2024

That would be heavily dependent on the PCI topology of your systems. I can't comment without a precise description or a NCCL topology dump (NCCL_TOPO_DUMP_FILE=system.xml)
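
For reference, one way to generate such a dump from nccl-tests is a single-node run along these lines (the binary and flags below are just an example; adjust them to your setup):

NCCL_TOPO_DUMP_FILE=system.xml ./build/alltoall_perf -b 8 -e 128M -f 2 -g 8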

russilwvong commented on July 20, 2024

Thanks, Sylvain. Interesting. I've attached a NCCL topology dump.
system.xml.txt

sjeaugey commented on July 20, 2024

Thanks. It seems the GPUs and NICs are attached directly to the CPU, so the GPU-NIC association isn't really direct. Also, because there is no direct PCI connection between NICs and GPUs, PXN wouldn't be used.

So I would expect each GPU to pick one NIC (or maybe the two that are local) and send its data to all the others using that NIC. I don't see how the alltoall could complete otherwise.

russilwvong commented on July 20, 2024

Hmm. Okay, so a GPU picks one of two local NICs when sending outgoing data. Can I ask how it determines what receiving NIC to send to, in order to reach a destination GPU?

sjeaugey commented on July 20, 2024

When a GPU picks a NIC to receive from, it will get the handle of that NIC and pass it to the other side which will connect to it.

Are you using RoCE? If so, how did you configure the IP addresses on the different interfaces? Did you use one IP subnet per NIC or did you put all of them in the same subnet?
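
For example (interface names and addresses below are purely illustrative), "one subnet per NIC" would look like

ip addr add 10.0.1.2/24 dev eth1
ip addr add 10.0.2.2/24 dev eth2

while "all NICs in the same subnet" would look like

ip addr add 10.0.0.2/16 dev eth1
ip addr add 10.0.0.3/16 dev eth2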

russilwvong commented on July 20, 2024

sjeaugey commented on July 20, 2024

> Each NIC has its own subnet - 32.0.1.2/24, 32.0.2.2/24, and so on.

Thanks for confirming.

> I had been assuming that each GPU would always use the same NIC to send and receive packets, and it sounds like that's not the case at all.

As a general design, a GPU will use all the NICs that are most local in the topology, and round-robin over them based on various factors. If two GPUs share two NICs, then each GPU should start with a different NIC and then round-robin.

russilwvong commented on July 20, 2024

russilwvong commented on July 20, 2024

Hmm. Actually, there must be something I'm missing. Say that we have four NICs on one server, with IP addresses 32.0.1.2, 32.0.2.2, 32.0.3.2, and 32.0.4.2, and with four GPUs, two of them (A1 and A2) sharing 32.0.1.2 and 32.0.2.2, and two of them (A3 and A4) sharing 32.0.3.2 and 32.0.4.2.

Similarly we have a second server with four GPUs B1 to B4, and with four NICs with IP addresses 32.0.5.2 to 32.0.8.2.

If we run an all-to-all collective with all eight GPUs, we see that 32.0.1.2 is sending to 32.0.4.2, and 32.0.2.2 is sending to 32.0.3.2. So A1 can send to A3 and A4 (which could happen using either 32.0.1.2 or 32.0.2.2). Similarly A2 can send to A3 and A4 (using either 32.0.2.2 or 32.0.1.2).

But how does A1 send to A2? We don't see any packets going from 32.0.1.2 to 32.0.2.2, or from 32.0.2.2 to 32.0.1.2.

We've disabled NVLink, P2P, and shared memory. But maybe there's something I've missed.
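
For reference, NCCL exposes these switches as environment variables: NCCL_P2P_DISABLE=1 turns off the P2P transport (which covers both NVLink and PCIe peer-to-peer) and NCCL_SHM_DISABLE=1 turns off the shared-memory transport, so a run with them disabled typically looks like this (flags illustrative):

NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 ./build/alltoall_perf -b 8 -e 128M -f 2 -g 4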

russilwvong commented on July 20, 2024

Going through the trace logs, it looks like A1 is sending to A2, sometimes with A1 sending on NIC 0 (NET/IB/0) and A2 receiving on NIC 1 (NET/IB/1), and sometimes with A1 sending on NIC 1 and A2 receiving on NIC 0 - but no packets appear on the wire.

lambda-server-1:13354:13387 [1] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
lambda-server-1:13354:13387 [1] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA
lambda-server-1:13354:13387 [1] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
lambda-server-1:13354:13387 [1] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
lambda-server-1:13353:13386 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA
lambda-server-1:13353:13422 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13353:13422 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA/Shared

The last four lines appear odd: for channel 02/1, it looks like A1 is sending via NIC 0 and A2 is receiving on the same NIC, NIC 0 (!). Same for channel 03/1.

lambda-server-1:13353:13422 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13353:13422 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [send] via NET/IB/1/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 02/1 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA/Shared
lambda-server-1:13354:13425 [1] NCCL INFO Channel 03/1 : 0[0] -> 1[1] [receive] via NET/IB/1/GDRDMA/Shared
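
For reference, the "Channel XX/Y : ... via NET/IB/..." lines above are NCCL's standard connection logging: running with NCCL_DEBUG=INFO is enough to produce them, and NCCL_DEBUG_SUBSYS can optionally narrow the output. An illustrative invocation (flags are just an example):

NCCL_DEBUG=INFO ./build/alltoall_perf -b 8 -e 128M -f 2 -g 2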

sjeaugey commented on July 20, 2024

> it looks like A1 is sending to A2, sometimes with A1 sending on NIC 0 (NET/IB/0) and A2 receiving on NIC 1 (NET/IB/1), and sometimes with A1 sending on NIC 1 and A2 receiving on NIC 0 - but no packets appear on the wire.

I'm not expert enough to comment on that. RoCE relies on the Linux kernel's routing table and ARP to know how to reach a destination. There could be optimizations/bugs which would end up with this kind of behavior. I don't know how to debug that though.

> it looks like A1 is sending via NIC 0 and A2 is receiving on the same NIC, NIC 0

Why is that odd? Round-robin may end up with the same NIC for both, which will just go through NIC loopback and not even reach the wire.
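
If you want to double-check where the data actually goes: RoCE traffic bypasses the kernel network stack, so it generally won't show up in a plain tcpdump on the host. The RDMA port counters are a more reliable way to see how much data each NIC moved (device and port names below are illustrative; the values are reported in 4-byte words):

cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data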

russilwvong commented on July 20, 2024

> Why is that odd? Round-robin may end up with the same NIC for both, which will just go through NIC loopback and not even reach the wire.

Very interesting, I hadn't realized this earlier. Thanks for taking the time to respond, this has been very illuminating.

Can I ask, what's the difference between Channel 00/0 and Channel 00/1? Is Channel 00/1 used for doing the actual data transfer?

We ran some tests with different versions of NCCL (2.18.1, 2.19.1, 2.20.3, 2.21.5). 2.18 is the only one which exhibits this behavior. For all the other versions, each rank talks to all other ranks in an eight-rank all-to-all collective.

Comparing the log files for 2.18.1 and 2.21.5, using a four-rank collective on a single server (to cut down the amount of data to look at), and focusing only on Channel XX/1 logs:

  • With 2.18.1, adjacent GPUs (0 and 1, 2 and 3) always select the same NIC for sending and receiving, so there are no packets on the wire. With 2.21.5, adjacent GPUs always select different NICs.
  • With 2.18.1, focusing on rank 0, GPUs which aren't adjacent always select one of the following combinations for sending and receiving: NIC 0 and NIC 2, or NIC 1 and NIC 3. So NIC 0 and NIC 2 never talk to NIC 1 or NIC 3. Presumably this is just an artifact of the way the first NIC is selected by each rank, and then round-robin means that the pattern is consistent. With 2.21.5, we see all four combinations - 1 and 2, 0 and 3, 0 and 2, 1 and 3.

So my guess at this point is

  • NCCL uses Channel XX/1 connections to transfer data.
  • On the servers where this test is running, each pair of GPUs shares a pair of NICs.
  • For NCCL 2.18 (but not subsequent versions), both GPUs in a pair end up choosing the same NIC to start with, for sending and for receiving. The result is that (a) adjacent GPUs never send any packets to each other on the wire, and (b) non-adjacent GPUs always transfer data using either the first NIC from a pair, or the second NIC. The first NIC in a pair never talks to the second NIC in the same pair or any other pair, as sketched below.
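
To make the last point concrete, here is a purely hypothetical sketch (not NCCL's actual selection code) for two GPUs A1 and A2 sharing NIC 0 and NIC 1 and alternating NICs per channel:

  • Same starting NIC (2.18-like): channel 0 has A1 sending on NIC 0 and A2 receiving on NIC 0, channel 1 has A1 on NIC 1 and A2 on NIC 1; every channel stays on a single NIC, so the transfer is NIC loopback and nothing shows up on the wire.
  • Offset starting NICs (later versions): channel 0 has A1 sending on NIC 0 and A2 receiving on NIC 1, channel 1 has A1 on NIC 1 and A2 on NIC 0; every channel crosses NICs, so the traffic appears on the wire.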

I don't suppose there's an option to tell NCCL to always assign a specific NIC to a specific GPU when sending or receiving?

sjeaugey commented on July 20, 2024

> what's the difference between Channel 00/0 and Channel 00/1?

The second number is the connection index. connIndex 1 uses shared buffers and is used for send/recv operations, while connIndex 0 uses dedicated buffers and is used for Rings and Trees.

> We ran some tests with different versions of NCCL (2.18.1, 2.19.1, 2.20.3, 2.21.5). 2.18 is the only one which exhibits this behavior.

Indeed at some point we changed the channel selection logic to use NICs in a more efficient manner and improve the round-robin. I can't recall exactly which version did that, but it could have been 2.19.

> I don't suppose there's an option to tell NCCL to always assign a specific NIC to a specific GPU when sending or receiving?

Not really. Unless you want to cook up a topology file which declares that each GPU only has one local NIC. But that can cause trouble closing the rings, so it may have adverse consequences.
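
If you did want to experiment with that, the rough (untested) recipe would be: dump the detected topology with NCCL_TOPO_DUMP_FILE, edit the resulting XML so that each GPU only sees the NIC you want it to use, and feed the edited file back in with NCCL_TOPO_FILE, e.g. (path and flags illustrative):

NCCL_TOPO_FILE=/path/to/edited_topo.xml ./build/alltoall_perf -b 8 -e 128M -f 2 -g 4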

russilwvong commented on July 20, 2024

> The second number is the connection index. connIndex 1 uses shared buffers and is used for send/recv operations, while connIndex 0 uses dedicated buffers and is used for Rings and Trees.

Great, thanks for confirming. And of course, thank you for all your work on the nccl library!
