Comments (21)

songh11 avatar songh11 commented on August 20, 2024

I found that NCCL v2.17.1 works, but v2.22.3 crashes with a core dump. What could be the reason for this? I would appreciate your help.

from nccl-tests.

sjeaugey avatar sjeaugey commented on August 20, 2024

Can you analyze the coredump with:

gdb ./build/all_reduce_perf <core file>

Then inside gdb:

bt

That would help us a lot. Thanks!

songh11 avatar songh11 commented on August 20, 2024

Thank you for your reply. This is my output:

(gdb) set args -b 8 -e 128M -f 2 -g 4
(gdb) r
Starting program: /home/sh9/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

# Using devices
[New Thread 0x7fffe09aa000 (LWP 376734)]
# Rank 0 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 0 [0x65] NVIDIA L20
# Rank 1 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 1 [0x67] NVIDIA L20
# Rank 2 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 2 [0x69] NVIDIA L20
# Rank 3 Group 0 Pid 376717 on iv-yd9wibvbpc5i3z3c67ow device 3 [0x6b] NVIDIA L20
[New Thread 0x7fffcffff000 (LWP 376735)]
[New Thread 0x7fffcf7fe000 (LWP 376736)]
[New Thread 0x7fffce990000 (LWP 376737)]
[New Thread 0x7fffcdcee000 (LWP 376738)]
[New Thread 0x7fffcd04c000 (LWP 376739)]
[New Thread 0x7fffc1fff000 (LWP 376740)]
[New Thread 0x7fffc17fe000 (LWP 376741)]
[New Thread 0x7fffc0ffd000 (LWP 376742)]
[New Thread 0x7fff91fff000 (LWP 376743)]
[New Thread 0x7fff917fe000 (LWP 376744)]
[New Thread 0x7fff90ffd000 (LWP 376750)]
[Thread 0x7fffc1fff000 (LWP 376740) exited]

Thread 10 "all_reduce_perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc0ffd000 (LWP 376742)]
0x00007fffe97c4161 in addInterStep (system=system@entry=0x7ffefb432990, tx=tx@entry=3, ix=<optimized out>, t1=t1@entry=0, i1=i1@entry=1, t2=t2@entry=0, i2=0)
at graph/paths.cc:186
186 for (int i=0; i<cpuNode->paths[t2][i2].count; i++) srcNode->paths[t2][i2].list[l++] = cpuNode->paths[t2][i2].list[i];
(gdb) bt
#0 0x00007fffe97c4161 in addInterStep (system=system@entry=0x7ffefb432990, tx=tx@entry=3, ix=<optimized out>, t1=t1@entry=0, i1=i1@entry=1, t2=t2@entry=0, i2=0)
at graph/paths.cc:186
#1 0x00007fffe97c6e19 in ncclTopoComputePaths (system=0x7ffefb432990, comm=comm@entry=0x55555837a650) at graph/paths.cc:579
#2 0x00007fffe97893ef in initTransportsRank (comm=comm@entry=0x55555837a650, parent=0x0, timers=timers@entry=0x7fffc0ff67c0) at init.cc:811
#3 0x00007fffe978d5f9 in ncclCommInitRankFunc (job_=0x5555583b1890) at init.cc:1408
#4 0x00007fffe97824bc in ncclAsyncJobMain (arg=0x5555583b1890) at group.cc:68
#5 0x00007fffe970b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fffe92ff353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

sjeaugey avatar sjeaugey commented on August 20, 2024

Would you be able to dump the node topology with NCCL_TOPO_DUMP_FILE=system.txt and post the system.txt here? That should allow us to reproduce the issue.

songh11 avatar songh11 commented on August 20, 2024

Here is the system.txt
system.txt

kiskra-nvidia avatar kiskra-nvidia commented on August 20, 2024

Thank you! With the information you provided, we were able to reproduce the problem locally. This is an issue we've encountered before and we hope to include a fix in the next NCCL release!

songh11 avatar songh11 commented on August 20, 2024

Thanks for your reply. Could you tell me what caused the problem?

kiskra-nvidia avatar kiskra-nvidia commented on August 20, 2024

It's due to a somewhat unusual topology of your system. You have GPUs without P2P capability attached under one NUMA node and then the NIC is under another NUMA node. Our graph searching code gets confused by it, but the fix is actually a trivial one-liner:

--- a/src/graph/paths.cc
+++ b/src/graph/paths.cc
@@ -162,7 +162,7 @@ static ncclResult_t getLocalCpu(struct ncclTopoSystem* system, int gpu, int* ret
   struct ncclTopoLinkList* paths = system->nodes[GPU].nodes[gpu].paths[CPU];
   for (int c=0; c<system->nodes[CPU].count; c++) {
     int hops = paths[c].count;
-    if (minHops == 0 || hops < minHops) {
+    if ((minHops == 0 || hops < minHops) && paths[c].type < PATH_NET) {
       localCpu = c;
       minHops = hops;
     }

Can you try the above and see if it fixes things for you?

BTW, unusually, the numaid being reported in the topo file for the NUMA node with the NIC is -1. I was going to follow up with you on that, especially since your nvidia-smi topo -m output shows only one NUMA node. What does numactl -H show?
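
To illustrate the guard added in the patch above, here is a minimal standalone sketch. It is not NCCL code: the `Path` struct, the `kPath*` constants (standing in for NCCL's `PATH_*` values) and `pickLocalCpu` are hypothetical names chosen for illustration. The sketch assumes that a path with no route reports a hop count of 0, which is what would let an unreachable CPU win the "fewest hops" comparison without the extra type check.

```cpp
#include <vector>

// Hypothetical stand-ins for NCCL's path types; only their ordering matters here.
constexpr int kPathSys = 7;  // reachable across the system interconnect
constexpr int kPathNet = 8;  // reachable only through the network
constexpr int kPathDis = 9;  // disconnected

struct Path {
  int count;  // number of hops; assumed to be 0 when no path was found
  int type;   // one of the kPath* values above
};

// Returns the index of the CPU with the fewest hops, skipping CPUs that are
// not locally reachable (type >= kPathNet). Returns -1 if none qualifies.
int pickLocalCpu(const std::vector<Path>& pathsToCpus) {
  int localCpu = -1;
  int minHops = 0;
  for (int c = 0; c < static_cast<int>(pathsToCpus.size()); ++c) {
    int hops = pathsToCpus[c].count;
    if ((minHops == 0 || hops < minHops) && pathsToCpus[c].type < kPathNet) {
      localCpu = c;
      minHops = hops;
    }
  }
  return localCpu;
}
```

With one CPU reachable in two hops and a second, disconnected CPU reporting zero hops, the guarded version keeps the first CPU; dropping the type check would select the disconnected one.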

kiskra-nvidia avatar kiskra-nvidia commented on August 20, 2024

Could you also post the output of lspci -tv? Thanks!

songh11 avatar songh11 commented on August 20, 2024

Thank you. Here is the output of numactl -H and lspci -tv:
image

songh11 avatar songh11 commented on August 20, 2024

I tried it, and it looks like the crash has moved to a different place.

(base) sh9@iv-yd9wibvbpc5i3z3c67ow:~/workspace/nccl-tests$ gdb ./build/all_reduce_perf
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.2) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/all_reduce_perf...
(gdb) set ar
architecture args
(gdb) set args -b 8 -e 128M -f 2 -g 4
(gdb) r
Starting program: /home/sh9/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

# Using devices
[New Thread 0x7fffe09aa000 (LWP 961886)]
# Rank 0 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 0 [0x65] NVIDIA L20
# Rank 1 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 1 [0x67] NVIDIA L20
# Rank 2 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 2 [0x69] NVIDIA L20
# Rank 3 Group 0 Pid 961869 on iv-yd9wibvbpc5i3z3c67ow device 3 [0x6b] NVIDIA L20
[New Thread 0x7fffcffff000 (LWP 961887)]
[New Thread 0x7fffcf7fe000 (LWP 961888)]
[New Thread 0x7fffce990000 (LWP 961889)]
[New Thread 0x7fffcdcee000 (LWP 961890)]
[New Thread 0x7fffcd04c000 (LWP 961891)]
[New Thread 0x7fffc1fff000 (LWP 961897)]
[New Thread 0x7fffc17fe000 (LWP 961898)]
[New Thread 0x7fffc0ffd000 (LWP 961899)]
[New Thread 0x7fff91fff000 (LWP 961900)]
[New Thread 0x7fff917fe000 (LWP 961901)]
[New Thread 0x7fff90ffd000 (LWP 961902)]
[Thread 0x7fffc1fff000 (LWP 961897) exited]

Thread 10 "all_reduce_perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc0ffd000 (LWP 961899)]
ncclTopoCheckNet (system=system@entry=0x7ffef3432930, id1=421904, id2=413712, net=net@entry=0x7fffc0ff6084) at graph/paths.cc:462
462 if (path->type <= PATH_PXB && path->bw > netSpeed1) netSpeed1 = path->bw;
(gdb) bt
#0 ncclTopoCheckNet (system=system@entry=0x7ffef3432930, id1=421904, id2=413712, net=net@entry=0x7fffc0ff6084) at graph/paths.cc:462
#1 0x00007fffe97e1ed6 in shmCanConnect (graph=<optimized out>, info2=0x7ffef3432700, info1=0x7ffef3432768, topo=0x7ffef3432930, ret=0x7fffc0ff6114)
at transport/shm.cc:58
#2 shmCanConnect (ret=0x7fffc0ff6114, topo=0x7ffef3432930, graph=<optimized out>, info1=0x7ffef3432768, info2=0x7ffef3432700)
at transport/shm.cc:51
#3 0x00007fffe97c6f94 in ncclTopoComputePaths (system=0x7ffef3432930, comm=comm@entry=0x55555837a620) at graph/paths.cc:593
#4 0x00007fffe97893ef in initTransportsRank (comm=comm@entry=0x55555837a620, parent=0x0, timers=timers@entry=0x7fffc0ff67c0) at init.cc:811
#5 0x00007fffe978d5f9 in ncclCommInitRankFunc (job_=0x5555583b1860) at init.cc:1408
#6 0x00007fffe97824bc in ncclAsyncJobMain (arg=0x5555583b1860) at group.cc:68
#7 0x00007fffe970b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fffe92ff353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

kiskra-nvidia avatar kiskra-nvidia commented on August 20, 2024

Right, sorry about that! I saw that the code was crashing due to a known bug, but I hadn't verified that it wouldn't crash due to another bug further down...

@sjeaugey, I could use your help to figure out the root cause here. Basically, NCCL is crashing all over the place because of many paths being missing in the graph -- including the paths between GPUs and the (only) NET. That NET is also showing up in the topo dump (see #233 (comment)) as being attached to a fake CPU numa node (id -1). Is this because this is being run in a VM and an appropriate topo file should always be loaded when invoking NCCL? Or is the VM misconfigured? In particular, lspci -tv is not showing the root PCI bridge (see #233 (comment)) -- is that what's causing all the paths to be SYS in the output of nvidia-smi topo -m (see #233 (comment))? What's the right way to address it?

sjeaugey avatar sjeaugey commented on August 20, 2024

That's weird, I can't repro the issue with the topo file. @songh11 could you run outside of GDB with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH and provide the log?

Edit: I can repro actually, my bad. No need to provide the log.

@kiskra-nvidia devices being attached to CPU -1 is indeed common in VMs, for virtual devices. It should not be a problem. I see we're also failing to figure out the PCI width (width="0"), but again, that should fall back to default values.

sjeaugey avatar sjeaugey commented on August 20, 2024

One weird thing though: the NIC shows up as:

      <pci busid="0000:00:00.0" class="0x060000" vendor="0x8086" device="0x1237" subsystem_vendor="0x1af4" subsystem_device="0x1100" link_speed="" link_width="0">

Usually NICs have a PCI class of 0x02XXXX, not 0x060000. That may be causing trouble down the line.
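
For readers unfamiliar with PCI class codes, the following small sketch shows how the 24-bit value splits into a base class in its top byte. The helper names `pciBaseClass` and `pciBaseClassName` are made up for illustration and are not part of NCCL; the base-class values themselves (0x02 network controller, 0x03 display controller, 0x06 bridge) come from the standard PCI class code assignments.

```cpp
#include <cstdint>
#include <string>

// Extracts the base class (top byte) from a 24-bit PCI class code such as 0x060000.
constexpr std::uint8_t pciBaseClass(std::uint32_t classCode) {
  return static_cast<std::uint8_t>((classCode >> 16) & 0xff);
}

// Maps a few well-known PCI base classes to human-readable names.
std::string pciBaseClassName(std::uint32_t classCode) {
  switch (pciBaseClass(classCode)) {
    case 0x02: return "network controller";
    case 0x03: return "display controller";
    case 0x06: return "bridge";
    default:   return "other";
  }
}
```

Under this mapping, the class 0x060000 reported in the topology entry above decodes as a bridge, whereas a typical NIC (0x02XXXX) decodes as a network controller.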

sjeaugey avatar sjeaugey commented on August 20, 2024

Ok, it looks like all our problems come from the fact that the "unknown NUMA node", a.k.a. "-1", translates into "Node -1 / NUMA node -1", which causes the new code to consider that the NIC and GPU are on different nodes.

Using 0xffff instead of -1 should fix the problem:

diff --git a/src/graph/xml.cc b/src/graph/xml.cc
index c2c6a1c81..d6fd91dfe 100644
--- a/src/graph/xml.cc
+++ b/src/graph/xml.cc
@@ -678,13 +678,13 @@ ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml*
       }
     } else {
       // No information on /sys, attach GPU to unknown CPU
-      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "-1"));
+      NCCLCHECK(xmlFindTagKv(xml, "cpu", &parent, "numaid", "0xffff"));
       if (parent == NULL) {
         struct ncclXmlNode* topNode;
         NCCLCHECK(xmlFindTag(xml, "system", &topNode));
         NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
         NCCLCHECK(xmlSetAttrLong(parent, "host_hash", getHostHash()));
-        NCCLCHECK(xmlSetAttr(parent, "numaid", "-1"));
+        NCCLCHECK(xmlSetAttr(parent, "numaid", "0xffff"));
         NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
       }
     }
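
As a rough illustration of why -1 can be misread as a different node while 0xffff is not, here is a minimal sketch. It assumes, hypothetically, that a 64-bit ID packs a system (host) index in the top 8 bits and a local NUMA ID in the low 56 bits; this layout and the names `makeNodeId`, `systemOf` and `localOf` are assumptions for illustration and are not confirmed by this thread.

```cpp
#include <cstdint>

// Hypothetical layout: top 8 bits = system (host) index, low 56 bits = local ID.
constexpr int kSystemShift = 56;
constexpr std::uint64_t kLocalMask = (std::uint64_t{1} << kSystemShift) - 1;

// Packs a system index and a (possibly negative) local ID into one 64-bit ID.
// The arithmetic is unsigned, so a negative local ID wraps around and
// spills into the system bits.
constexpr std::uint64_t makeNodeId(std::uint64_t system, std::int64_t local) {
  return (system << kSystemShift) + static_cast<std::uint64_t>(local);
}

// Recovers the system index and the local ID from a packed ID.
constexpr std::uint64_t systemOf(std::uint64_t id) { return id >> kSystemShift; }
constexpr std::uint64_t localOf(std::uint64_t id) { return id & kLocalMask; }
```

Under this assumed layout, `makeNodeId(0, -1)` sets every bit, so `systemOf` no longer returns 0 and the "unknown" CPU appears to belong to another system. `makeNodeId(0, 0xffff)` leaves the system bits untouched while still being an unlikely value for a real NUMA ID.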

Tison-Liu avatar Tison-Liu commented on August 20, 2024

@songh11
By the way, I use the same GPU as you in a two-machine, four-card environment, but my all-reduce test only reaches about 22 GB/s. Is this normal? I would like to compare against your test results.

songh11 avatar songh11 commented on August 20, 2024

I only used one machine with four L20 GPUs, and it runs on NCCL v2.17.1. My all-reduce test reaches about 17 GB/s, so I think there might be something wrong with my results.

songh11 avatar songh11 commented on August 20, 2024

Thank you. This is my log:

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

# Using devices
# Rank 0 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 0 [0x65] NVIDIA L20
# Rank 1 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 1 [0x67] NVIDIA L20
# Rank 2 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 2 [0x69] NVIDIA L20
# Rank 3 Group 0 Pid 1539966 on iv-yd9wibvbpc5i3z3c67ow device 3 [0x6b] NVIDIA L20
iv-yd9wibvbpc5i3z3c67ow:1539966:1539966 [0] NCCL INFO Bootstrap : Using eth0:10.41.72.35<0>
iv-yd9wibvbpc5i3z3c67ow:1539966:1539966 [0] NCCL INFO cudaDriverVersion 12020
iv-yd9wibvbpc5i3z3c67ow:1539966:1539966 [3] NCCL INFO NCCL version 2.22.3+cuda12.1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eth0:10.41.72.35<0>
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Using network IB
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO ncclCommInitRank comm 0x55b0a9764440 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 65010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO ncclCommInitRank comm 0x55b0a97d4820 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 69010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO ncclCommInitRank comm 0x55b0a980bb40 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 6b010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO ncclCommInitRank comm 0x55b0a979d4c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 67010 commId 0xa3c2b2129f65fbe0 - Init START
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:66/0000:66:00.0/0000:67:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:64/0000:64:00.0/0000:65:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:68/0000:68:00.0/0000:69:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:6a/0000:6a:00.0/0000:6b:01.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:64/0000:64:00.0/0000:65:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:66/0000:66:00.0/0000:67:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:68/0000:68:00.0/0000:69:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:6a/0000:6a:00.0/0000:6b:01.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_speed, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:00.0/../max_link_width, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539989 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539988 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
iv-yd9wibvbpc5i3z3c67ow:1539966:1539991 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO KV Convert to int : could not find value of '0x060000' in dictionary, falling back to 1
iv-yd9wibvbpc5i3z3c67ow:1539966:1539990 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
Segmentation fault (core dumped)

songh11 avatar songh11 commented on August 20, 2024

Thank you for your help, but strangely, this change also results in an error.

sjeaugey avatar sjeaugey commented on August 20, 2024

Do you mean that this change did not fix your problem? If that is the case, can you provide the log?

kiskra-nvidia avatar kiskra-nvidia commented on August 20, 2024

Can you also provide the NCCL_TOPO_DUMP_FILE with @sjeaugey's fix in place?
