nvidia / nccl-tests
NCCL Tests
License: BSD 3-Clause "New" or "Revised" License
Hi @sjeaugey
Greetings! I am a newbie to GPU programming and am not sure whether this is the correct place to submit this issue; if I am wrong, please forgive me, thanks!
After reading "Fast Multi-GPU collectives with NCCL" and using nccl-tests, I find NCCL seems most powerful in the following case:
A[0] = A[0] + B[0] + C[0] + D[0];
......
A[n] = A[n] + B[n] + C[n] + D[n];
where the array count is less than the GPU count and n is a large number. But it does not seem applicable to the following case:
sum = a1 + a2 + ... + an;
Is my understanding correct? If I want to utilize multiple GPUs to accelerate calculating sum = a1 + a2 + ... + an, is there any canonical method?
Thanks very much in advance!
Best Regards
Nan Xiao
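For readers who land here with the same question: the first pattern above is exactly what ncclAllReduce computes, where each GPU contributes one array and every GPU ends up with the elementwise sum. A minimal single-process sketch, assuming 4 visible GPUs and eliding error checks:

#include <nccl.h>
#include <cuda_runtime.h>

int main() {
  const int nDev = 4;                 // assumes 4 GPUs holding A, B, C, D
  const size_t count = 1 << 20;       // n elements per array
  int devs[4] = {0, 1, 2, 3};
  ncclComm_t comms[4];
  float* buf[4];
  cudaStream_t streams[4];
  ncclCommInitAll(comms, nDev, devs); // one communicator per device
  for (int i = 0; i < nDev; i++) {
    cudaSetDevice(i);
    cudaMalloc((void**)&buf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }
  // After this group completes, buf[i] on every GPU holds the
  // elementwise sum A[j] + B[j] + C[j] + D[j] -- the first case above.
  ncclGroupStart();
  for (int i = 0; i < nDev; i++)
    ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  ncclGroupEnd();
  for (int i = 0; i < nDev; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < nDev; i++) ncclCommDestroy(comms[i]);
  return 0;
}

For the scalar case sum = a1 + ... + an, the usual approach is a per-GPU partial reduction kernel (or a library such as CUB/Thrust), followed by an allreduce of the single partial sums; NCCL is not the tool for the intra-GPU reduction itself.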
I'm trying to run nccl-tests and then parse the output, but it's kind of a pain if debug output is enabled because it mixes with the actual output. Would it be possible to have an option to write to a CSV file instead of just stdout?
This may be incredibly dumb, but when I run
mpirun -np 2 --hostfile hosts -mca btl ^openib -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=ens3 all_reduce_perf -b 8 -e 128M -f 2 -g 1 -c 1 -i 20
with the host file:
172.31.12.213 max_slots=1
172.31.6.103 max_slots=1
I can see the program finish normally:
134217728 33554432 float sum 2.237 60.00 0.00 0e+00 0.001 139715.53 0.00 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 0
However, when I monitor 172.31.12.213's NIC bandwidth, it's all 0!
Please advise!
Thanks.
The following results are generated when I run a quick test:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
Cuda failure common.cu:891 'CUDA driver version is insufficient for CUDA runtime version'
**Running with NCCL_DEBUG=WARN produces the following results:**
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
Cuda failure common.cu:891 'CUDA driver version is insufficient for CUDA runtime version'
**Results from Nvidia-smi:**
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:06:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:07:00.0 Off | 0 |
| N/A 37C P0 41W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:0A:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:0B:00.0 Off | 0 |
| N/A 32C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:85:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:86:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I am using a Docker container. What should be the fix for this problem?
I'm seeing 125 Gbps reported by all_reduce_perf (log), which is surprising because the AWS EFA network is advertised as only 100 Gbps.
Does this mean the advertised 100 Gbps limit is being exceeded? If not, how do I figure out the busbw achievable by all_reduce_perf on a 100 Gbps network?
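One piece of context that may resolve this: nccl-tests reports two figures per run, algbw (bytes moved divided by time) and busbw, which for allreduce rescales algbw by 2(n-1)/n to approximate per-link wire traffic (see PERFORMANCE.md in this repo). busbw, not algbw, is the column designed to be compared against the 100 Gbps figure. A small sketch of the conversion:

// Allreduce conversion used by nccl-tests (per PERFORMANCE.md):
// each byte crosses every link roughly 2*(n-1)/n times, so busbw
// is the figure to compare against the per-link wire speed.
double allreduceBusBw(double algbwGBs, int nranks) {
  return algbwGBs * 2.0 * (nranks - 1) / nranks;
}
// Example: 6.5 GB/s algbw on 16 ranks -> ~12.2 GB/s busbw, just
// under a 12.5 GB/s (100 Gbps) link.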
Linux: Ubuntu 20.04 LTS
GPU driver: newest NVIDIA driver for Linux
CUDA 10.1, cuDNN 7.6.5
NCCL 2.6.4
Hardware:
CPU: Intel 9400F, MB: Z370, RAM: 64 GB dual-channel, GPU: two 2080 Ti cards on two PCIe 3.0 x8 slots, with an NVLink bridge between them
I ran all the nccl-tests and NCCL seems to be working. But while each test is running (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything else. I can only move the mouse, but the system doesn't respond to mouse clicks or keyboard input.
When a test finishes running, the system goes back to normal and the log prints to the console.
The log is here:
# ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 3795 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 3795 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 7.18 0.00 0.00 0e+00 7.02 0.00 0.00 0e+00
16 4 float sum 7.00 0.00 0.00 0e+00 7.02 0.00 0.00 0e+00
32 8 float sum 7.28 0.00 0.00 0e+00 7.19 0.00 0.00 0e+00
64 16 float sum 7.20 0.01 0.01 0e+00 7.05 0.01 0.01 0e+00
128 32 float sum 7.30 0.02 0.02 0e+00 7.19 0.02 0.02 0e+00
256 64 float sum 7.30 0.04 0.04 0e+00 7.20 0.04 0.04 0e+00
512 128 float sum 7.47 0.07 0.07 0e+00 7.12 0.07 0.07 0e+00
1024 256 float sum 8.14 0.13 0.13 0e+00 7.92 0.13 0.13 0e+00
2048 512 float sum 8.56 0.24 0.24 0e+00 8.43 0.24 0.24 0e+00
4096 1024 float sum 9.72 0.42 0.42 0e+00 9.49 0.43 0.43 0e+00
8192 2048 float sum 11.99 0.68 0.68 0e+00 11.92 0.69 0.69 0e+00
16384 4096 float sum 14.36 1.14 1.14 0e+00 14.21 1.15 1.15 0e+00
32768 8192 float sum 16.79 1.95 1.95 0e+00 16.64 1.97 1.97 0e+00
65536 16384 float sum 21.14 3.10 3.10 0e+00 20.55 3.19 3.19 0e+00
131072 32768 float sum 35.56 3.69 3.69 0e+00 35.43 3.70 3.70 0e+00
262144 65536 float sum 41.23 6.36 6.36 0e+00 41.21 6.36 6.36 0e+00
524288 131072 float sum 50.66 10.35 10.35 0e+00 50.82 10.32 10.32 0e+00
1048576 262144 float sum 72.54 14.45 14.45 0e+00 72.45 14.47 14.47 0e+00
2097152 524288 float sum 120.7 17.37 17.37 0e+00 118.4 17.71 17.71 0e+00
4194304 1048576 float sum 215.2 19.49 19.49 0e+00 214.7 19.53 19.53 0e+00
8388608 2097152 float sum 411.3 20.39 20.39 0e+00 399.1 21.02 21.02 0e+00
16777216 4194304 float sum 865.3 19.39 19.39 0e+00 779.6 21.52 21.52 0e+00
33554432 8388608 float sum 1547.9 21.68 21.68 0e+00 1699.3 19.75 19.75 0e+00
67108864 16777216 float sum 3115.1 21.54 21.54 0e+00 3007.4 22.31 22.31 0e+00
134217728 33554432 float sum 5994.3 22.39 22.39 0e+00 5991.9 22.40 22.40 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.43886
./all_gather_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 9119 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 9119 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 1 float 7.14 0.00 0.00 0e+00 7.06 0.00 0.00 0e+00
16 2 float 7.03 0.00 0.00 0e+00 7.00 0.00 0.00 0e+00
32 4 float 6.96 0.00 0.00 0e+00 7.07 0.00 0.00 0e+00
64 8 float 7.10 0.00 0.00 0e+00 7.07 0.00 0.00 0e+00
128 16 float 7.10 0.01 0.01 0e+00 7.14 0.01 0.01 0e+00
256 32 float 7.18 0.02 0.02 0e+00 7.23 0.02 0.02 0e+00
512 64 float 7.49 0.03 0.03 0e+00 7.47 0.03 0.03 0e+00
1024 128 float 7.03 0.07 0.07 0e+00 6.96 0.07 0.07 0e+00
2048 256 float 6.97 0.15 0.15 0e+00 6.97 0.15 0.15 0e+00
4096 512 float 7.41 0.28 0.28 0e+00 7.00 0.29 0.29 0e+00
8192 1024 float 9.59 0.43 0.43 0e+00 8.80 0.47 0.47 0e+00
16384 2048 float 11.41 0.72 0.72 0e+00 10.78 0.76 0.76 0e+00
32768 4096 float 13.39 1.22 1.22 0e+00 11.85 1.38 1.38 0e+00
65536 8192 float 16.57 1.98 1.98 0e+00 13.83 2.37 2.37 0e+00
131072 16384 float 23.07 2.84 2.84 0e+00 18.39 3.56 3.56 0e+00
262144 32768 float 31.38 4.18 4.18 0e+00 30.27 4.33 4.33 0e+00
524288 65536 float 36.00 7.28 7.28 0e+00 35.30 7.43 7.43 0e+00
1048576 131072 float 47.38 11.06 11.06 0e+00 46.84 11.19 11.19 0e+00
2097152 262144 float 70.44 14.89 14.89 0e+00 69.77 15.03 15.03 0e+00
4194304 524288 float 120.1 17.46 17.46 0e+00 115.5 18.16 18.16 0e+00
8388608 1048576 float 212.5 19.73 19.73 0e+00 210.2 19.95 19.95 0e+00
16777216 2097152 float 418.5 20.05 20.05 0e+00 414.0 20.26 20.26 0e+00
33554432 4194304 float 817.8 20.51 20.51 0e+00 785.1 21.37 21.37 0e+00
67108864 8388608 float 1568.3 21.40 21.40 0e+00 1560.9 21.50 21.50 0e+00
134217728 16777216 float 3298.6 20.34 20.34 0e+00 3070.3 21.86 21.86 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 6.6972
./broadcast_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 26256 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 26256 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type root time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float 0 7.24 0.00 0.00 0e+00 7.50 0.00 0.00 0e+00
16 4 float 0 8.31 0.00 0.00 0e+00 7.69 0.00 0.00 0e+00
32 8 float 0 8.15 0.00 0.00 0e+00 8.23 0.00 0.00 0e+00
64 16 float 0 7.19 0.01 0.01 0e+00 7.13 0.01 0.01 0e+00
128 32 float 0 7.25 0.02 0.02 0e+00 7.45 0.02 0.02 0e+00
256 64 float 0 7.08 0.04 0.04 0e+00 7.16 0.04 0.04 0e+00
512 128 float 0 7.47 0.07 0.07 0e+00 7.39 0.07 0.07 0e+00
1024 256 float 0 7.19 0.14 0.14 0e+00 32.19 0.03 0.03 0e+00
2048 512 float 0 7.36 0.28 0.28 0e+00 7.03 0.29 0.29 0e+00
4096 1024 float 0 7.25 0.57 0.57 0e+00 7.07 0.58 0.58 0e+00
8192 2048 float 0 9.11 0.90 0.90 0e+00 8.10 1.01 1.01 0e+00
16384 4096 float 0 10.97 1.49 1.49 0e+00 10.52 1.56 1.56 0e+00
32768 8192 float 0 13.36 2.45 2.45 0e+00 11.73 2.79 2.79 0e+00
65536 16384 float 0 17.03 3.85 3.85 0e+00 14.24 4.60 4.60 0e+00
131072 32768 float 0 22.66 5.78 5.78 0e+00 22.60 5.80 5.80 0e+00
262144 65536 float 0 28.48 9.21 9.21 0e+00 28.45 9.21 9.21 0e+00
524288 131072 float 0 40.26 13.02 13.02 0e+00 40.08 13.08 13.08 0e+00
1048576 262144 float 0 63.48 16.52 16.52 0e+00 63.19 16.59 16.59 0e+00
2097152 524288 float 0 110.1 19.04 19.04 0e+00 109.3 19.19 19.19 0e+00
4194304 1048576 float 0 205.7 20.39 20.39 0e+00 237.1 17.69 17.69 0e+00
8388608 2097152 float 0 425.1 19.73 19.73 0e+00 386.7 21.69 21.69 0e+00
16777216 4194304 float 0 815.0 20.59 20.59 0e+00 824.0 20.36 20.36 0e+00
33554432 8388608 float 0 1536.8 21.83 21.83 0e+00 1508.2 22.25 22.25 0e+00
67108864 16777216 float 0 3139.2 21.38 21.38 0e+00 3124.3 21.48 21.48 0e+00
134217728 33554432 float 0 6283.5 21.36 21.36 0e+00 5873.1 22.85 22.85 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.99748
$ ./reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 4810 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 4810 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop root time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 0 7.16 0.00 0.00 0e+00 7.35 0.00 0.00 0e+00
16 4 float sum 0 7.74 0.00 0.00 0e+00 7.67 0.00 0.00 0e+00
32 8 float sum 0 7.08 0.00 0.00 0e+00 7.07 0.00 0.00 0e+00
64 16 float sum 0 7.13 0.01 0.01 0e+00 7.14 0.01 0.01 0e+00
128 32 float sum 0 7.15 0.02 0.02 0e+00 7.06 0.02 0.02 0e+00
256 64 float sum 0 7.14 0.04 0.04 0e+00 7.12 0.04 0.04 0e+00
512 128 float sum 0 7.14 0.07 0.07 0e+00 7.11 0.07 0.07 0e+00
1024 256 float sum 0 7.09 0.14 0.14 0e+00 7.09 0.14 0.14 0e+00
2048 512 float sum 0 7.11 0.29 0.29 0e+00 7.12 0.29 0.29 0e+00
4096 1024 float sum 0 7.28 0.56 0.56 0e+00 7.20 0.57 0.57 0e+00
8192 2048 float sum 0 8.72 0.94 0.94 0e+00 8.59 0.95 0.95 0e+00
16384 4096 float sum 0 10.80 1.52 1.52 0e+00 10.78 1.52 1.52 0e+00
32768 8192 float sum 0 12.89 2.54 2.54 0e+00 12.64 2.59 2.59 0e+00
65536 16384 float sum 0 16.42 3.99 3.99 0e+00 15.88 4.13 4.13 0e+00
131072 32768 float sum 0 23.17 5.66 5.66 0e+00 23.27 5.63 5.63 0e+00
262144 65536 float sum 0 29.13 9.00 9.00 0e+00 28.88 9.08 9.08 0e+00
524288 131072 float sum 0 40.93 12.81 12.81 0e+00 40.93 12.81 12.81 0e+00
1048576 262144 float sum 0 64.30 16.31 16.31 0e+00 64.25 16.32 16.32 0e+00
2097152 524288 float sum 0 110.5 18.98 18.98 0e+00 110.6 18.97 18.97 0e+00
4194304 1048576 float sum 0 202.1 20.76 20.76 0e+00 202.1 20.76 20.76 0e+00
8388608 2097152 float sum 0 386.5 21.70 21.70 0e+00 386.3 21.71 21.71 0e+00
16777216 4194304 float sum 0 752.6 22.29 22.29 0e+00 752.5 22.30 22.30 0e+00
33554432 8388608 float sum 0 1485.2 22.59 22.59 0e+00 1529.3 21.94 21.94 0e+00
67108864 16777216 float sum 0 2947.4 22.77 22.77 0e+00 2945.2 22.79 22.79 0e+00
134217728 33554432 float sum 0 5873.8 22.85 22.85 0e+00 5873.8 22.85 22.85 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.22671
$ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 5435 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 5435 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 1 float sum 7.21 0.00 0.00 0e+00 7.28 0.00 0.00 0e+00
16 2 float sum 7.12 0.00 0.00 0e+00 7.18 0.00 0.00 0e+00
32 4 float sum 7.14 0.00 0.00 0e+00 7.22 0.00 0.00 0e+00
64 8 float sum 7.20 0.00 0.00 0e+00 7.15 0.00 0.00 0e+00
128 16 float sum 7.14 0.01 0.01 0e+00 7.12 0.01 0.01 0e+00
256 32 float sum 7.16 0.02 0.02 0e+00 7.12 0.02 0.02 0e+00
512 64 float sum 7.18 0.04 0.04 0e+00 7.12 0.04 0.04 0e+00
1024 128 float sum 7.53 0.07 0.07 0e+00 7.27 0.07 0.07 0e+00
2048 256 float sum 7.28 0.14 0.14 0e+00 7.23 0.14 0.14 0e+00
4096 512 float sum 7.64 0.27 0.27 0e+00 7.57 0.27 0.27 0e+00
8192 1024 float sum 9.35 0.44 0.44 0e+00 9.24 0.44 0.44 0e+00
16384 2048 float sum 11.33 0.72 0.72 0e+00 11.23 0.73 0.73 0e+00
32768 4096 float sum 12.66 1.29 1.29 0e+00 12.62 1.30 1.30 0e+00
65536 8192 float sum 15.39 2.13 2.13 0e+00 15.31 2.14 2.14 0e+00
131072 16384 float sum 21.02 3.12 3.12 0e+00 21.35 3.07 3.07 0e+00
262144 32768 float sum 32.36 4.05 4.05 0e+00 31.98 4.10 4.10 0e+00
524288 65536 float sum 39.63 6.61 6.61 0e+00 39.76 6.59 6.59 0e+00
1048576 131072 float sum 57.11 9.18 9.18 0e+00 56.88 9.22 9.22 0e+00
2097152 262144 float sum 92.96 11.28 11.28 0e+00 92.54 11.33 11.33 0e+00
4194304 524288 float sum 166.4 12.60 12.60 0e+00 165.9 12.64 12.64 0e+00
8388608 1048576 float sum 308.5 13.59 13.59 0e+00 504.4 8.32 8.32 0e+00
16777216 2097152 float sum 1050.1 7.99 7.99 0e+00 693.5 12.10 12.10 0e+00
33554432 4194304 float sum 1533.4 10.94 10.94 0e+00 1414.8 11.86 11.86 0e+00
67108864 8388608 float sum 2529.2 13.27 13.27 0e+00 2314.2 14.50 14.50 0e+00
134217728 16777216 float sum 5619.2 11.94 11.94 0e+00 4905.4 13.68 13.68 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 4.44552
I first submitted this bug to TensorFlow; here is the link: tensorflow/tensorflow#40027
It shows that when I remove the NVLink bridge, the TF code runs well,
and when I use the NVLink bridge but not NCCL, the TF code runs well too.
But when I use NCCL and the NVLink bridge, the system halts, forcing me to reboot.
When compiling in a container without CUDA 11...
Step 11/28 : RUN make MPI=1 MPI_HOME=/usr/local/mpi
---> Running in bc5f12a8f538
make -C src build
make[1]: Entering directory '/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
Compiling all_gather.cu > ../build/all_gather.o
Linking ../build/all_gather.o > ../build/all_gather_perf
Compiling broadcast.cu > ../build/broadcast.o
Linking ../build/broadcast.o > ../build/broadcast_perf
Compiling reduce_scatter.cu > ../build/reduce_scatter.o
Linking ../build/reduce_scatter.o > ../build/reduce_scatter_perf
Compiling reduce.cu > ../build/reduce.o
Linking ../build/reduce.o > ../build/reduce_perf
Compiling alltoall.cu > ../build/alltoall.o
alltoall.cu(69): error: identifier "ncclSend" is undefined
alltoall.cu(70): error: identifier "ncclRecv" is undefined
2 errors detected in the compilation of "/tmp/tmpxft_00000461_00000000-10_alltoall.compute_70.cpp1.ii".
Makefile:71: recipe for target '../build/alltoall.o' failed
make[1]: *** [../build/alltoall.o] Error 1
make[1]: Leaving directory '/nccl-tests/src'
make: *** [src.build] Error 2
Makefile:17: recipe for target 'src.build' failed
The command '/bin/sh -c make MPI=1 MPI_HOME=/usr/local/mpi' returned a non-zero code: 2
Is it expected that we won't be able to compile this code without CUDA 11?
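For context: alltoall.cu uses the point-to-point API (ncclSend/ncclRecv) that first appeared in NCCL 2.7, so the failure tracks the NCCL headers available in the container rather than CUDA 11 as such. A sketch of the pattern that cannot compile against older nccl.h (buffer names here are illustrative):

#include <nccl.h>
#include <cuda_runtime.h>

// One chunk sent to and received from every peer. ncclSend/ncclRecv
// do not exist before NCCL 2.7, hence the 'identifier is undefined'
// errors when building against an older nccl.h.
ncclResult_t allToAllSketch(const float* sendbuf, float* recvbuf,
                            size_t chunk, int nranks,
                            ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < nranks; peer++) {
    ncclSend(sendbuf + peer * chunk, chunk, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * chunk, chunk, ncclFloat, peer, comm, stream);
  }
  return ncclGroupEnd();
}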
Hello everyone! I have two questions regarding nccl-tests:
1. Why is a shift made in common.cu:375?
2. Is it enough to check the correctness of NCCL with nccl-tests when the number of iters and the number of warm-up iters are set to 0?
Best,
I've compiled nccl, then tried with the following command
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
Then I see the following error. What's the problem?
[kyoungrok-ryzen:12576] *** Process received signal ***
[kyoungrok-ryzen:12576] Signal: Segmentation fault (11)
[kyoungrok-ryzen:12576] Signal code: Address not mapped (1)
[kyoungrok-ryzen:12576] Failing at address: 0x44000098
[kyoungrok-ryzen:12576] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f339f39c890]
[kyoungrok-ryzen:12576] [ 1] /usr/lib/x86_64-linux-gnu/libmpi.so.20(MPI_Comm_size+0x42)[0x7f33a4d353b2]
[kyoungrok-ryzen:12576] [ 2] ./build/all_reduce_perf[0x402101]
[kyoungrok-ryzen:12576] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f339e010b97]
[kyoungrok-ryzen:12576] [ 4] ./build/all_reduce_perf[0x40398a]
[kyoungrok-ryzen:12576] *** End of error message ***
[1] 12576 segmentation fault (core dumped) ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
OS Platform: ubuntu 16.04
Cuda: 8.0
NCCL: 1.0
Driver Version: 384.66
Build
root@node03:~/nvidia/nccl/nccl-tests/src# make CUDA_HOME=/usr/local/cuda-8.0 NCCL_HOME=/root/nvidia/nccl/nccl-1.3.4-1/build
When I run all_reduce_perf on 8 GPUs, it fails with an error:
root@node03:~/nvidia/nccl/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 1.0
# Using devices
# Rank 0 on node03 device 0 [0x00] Tesla P100-PCIE-16GB
# Rank 1 on node03 device 1 [0x00] Tesla P100-PCIE-16GB
# Rank 2 on node03 device 2 [0x00] Tesla P100-PCIE-16GB
# Rank 3 on node03 device 3 [0x00] Tesla P100-PCIE-16GB
# Rank 4 on node03 device 4 [0x00] Tesla P100-PCIE-16GB
# Rank 5 on node03 device 5 [0x00] Tesla P100-PCIE-16GB
# Rank 6 on node03 device 6 [0x00] Tesla P100-PCIE-16GB
# Rank 7 on node03 device 7 [0x00] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
NCCL failure all_reduce.cu:95 'unhandled cuda error'
Hi,
I am trying to run the distributed tests with MPI. I am using OpenMPI.
First problem: I am able to run across nodes but not within a node using MPI. Is this expected behavior?
Second problem: Across nodes, I run with -n 2 but I see Rank 0, 1, and 2, so my question is: why are there 3 processes when I am launching only 2?
The nodes have 1 K80 each, so we should be able to use 2 CUDA devices within a node.
Any thoughts?
Within a Node:
mpirun -n 2 -npernode 2 -hostfile ./hosts -x LD_LIBRARY_PATH=/opt/nccl2/lib:$LD_LIBRARY_PATH ./build/all_reduce_perf -g 2 -c 0
nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
Cuda failure common.cu:891 'invalid device ordinal'
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[61045,1],1]
Exit code: 1
--------------------------------------------------------------------------
Across Nodes:
mpirun -n 2 -npernode 1 -hostfile ./hosts -x LD_LIBRARY_PATH=/opt/nccl2/lib:$LD_LIBRARY_PATH ./build/all_reduce_perf -g 2 -c 0
nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
# Using devices
# Rank 0 on gpu07 device 0 [0x05] Tesla K80
# Rank 1 on gpu07 device 1 [0x06] Tesla K80
# Rank 2 on gpu08 device 0 [0x05] Tesla K80
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 3 on gpu08 device 1 [0x06] Tesla K80
33554432 8388608 float sum 8.642 3.88 5.82 N/A 8.153 4.12 6.17 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 5.99874
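A detail that likely explains the rank count above: nccl-tests creates one NCCL rank per GPU per thread per process, so the totals multiply rather than matching -n alone. A tiny illustration of the arithmetic (following the nProcs x nThreads x nGpus layout the test banner prints):

// Total NCCL ranks in an nccl-tests run:
// ranks = (MPI processes) * (-t threads) * (-g GPUs per thread).
// e.g. mpirun -n 2 ... -g 2  ->  2 * 1 * 2 = 4 ranks (0..3 above).
int totalRanks(int nProcs, int nThreads, int nGpus) {
  return nProcs * nThreads * nGpus;
}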
I see that NCCL has released send recv API. When will nccl-test include send/recv benchmarking code? Thanks!
I was trying this code using 2 nodes, each with 8 GPUs. The code ran well on a single node, but when I tried two nodes, the code either hangs or runs indefinitely. I tried three different cases, as follows, but none were successful.
nThread 1 nGpus 1 minBytes 33554432 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
Cuda failure common.cu:891 'invalid device ordinal'
nThread 1 nGpus 8 minBytes 33554432 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
nThread 8 nGpus 1 minBytes 33554432 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
I used SLURM to run the code using the following. In all the above cases, I set ntasks equal to np.
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --gres=gpu:8
I set the following in the sbatch script.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/xyz/nccl/build/lib/
export LD_LIBRARY_PATH=/home/xyz/openmpi-2.0.1-sm-gcc48-cuda-8.0-slurm-14.11.7/lib:/home/xyz/cuda-8.0.61/lib64:$LD_LIBRARY_PATH
export PATH=/home/xyz/openmpi-2.0.1-sm-gcc48-cuda-8.0-slurm-14.11.7/bin:/home/xyz/cuda-8.0.61/bin:$PATH
I used openmpi-2.0.1 and cuda-8.0.61 during compilation and at run time. NCCL 2.0 was used. Thanks for your help.
Today I ran all_reduce_perf with Open MPI and GPUDirect RDMA support. I have two GPU servers; each server has 4 NVIDIA V100 GPUs and a Mellanox MT27800 NIC, so the hardware certainly supports the GPUDirect RDMA technique. The error message is:
gpu5:35917:35994 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu5:35917:35994 [0] NCCL INFO include/net.h:34 -> 2
gpu5:35917:35994 [0] NCCL INFO transport/net.cu:537 -> 2
gpu5:35917:35994 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu4:35902:35977 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu4:35902:35977 [0] NCCL INFO include/net.h:34 -> 2
gpu4:35902:35977 [0] NCCL INFO transport/net.cu:537 -> 2
gpu4:35902:35977 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu5:35917:35917 [0] NCCL INFO Destroyed comm 0x7f24c4001af0 rank 4
gpu4:35902:35902 [0] NCCL INFO Destroyed comm 0x7f1de8001af0 rank 0
gpu5:35917:35917 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu5:35917:35917 [1] NCCL INFO init.cu:1143 -> 1
gpu5: Test NCCL failure common.cu:343 'unhandled cuda error'
.. gpu5: Test failure common.cu:393
.. gpu5: Test failure common.cu:492
.. gpu5: Test failure all_reduce.cu:103
.. gpu5: Test failure common.cu:518
.. gpu5: Test failure common.cu:839
gpu4:35902:35902 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu4:35902:35902 [1] NCCL INFO init.cu:1143 -> 1
gpu4: Test NCCL failure common.cu:343 'unhandled cuda error'
.. gpu4: Test failure common.cu:393
.. gpu4: Test failure common.cu:492
.. gpu4: Test failure all_reduce.cu:103
.. gpu4: Test failure common.cu:518
.. gpu4: Test failure common.cu:839
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[36982,1],1]
Exit code: 3
--------------------------------------------------------------------------
I checked line 788 in transport/net_ib.cu; it seems that the return value of ibv_poll_cq is bad.
command to run all_reduce_perf:
/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
-mca orte_base_help_aggregate 0 \
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0
I have searched for this error on Google and few people have met this problem, so could you help me solve it?
environment:
GPU: V100-SXM2-32GB x 8 + NVLink
CUDA driver version: 418.87.01
CUDA version: 10.0
NCCL version: 2.4.2, 2.5.6
test case:
all_gather_perf -g 8 -b 4294967296 -e 4294967296 -n 100
out-of-place in-place
size count type time algbw busbw error time algbw busbw error
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
4294967296 134217728 float 29122 129.05 129.05 1e+00 29169 128.84 128.84 1e+00
Out of bounds values : 2 FAILED
Avg bus bandwidth : 128.942
#nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X NV1 NV1 NV2 SYS SYS NV2 SYS 0-95
GPU1 NV1 X NV2 NV1 SYS SYS SYS NV2 0-95
GPU2 NV1 NV2 X NV2 NV1 SYS SYS SYS 0-95
GPU3 NV2 NV1 NV2 X SYS NV1 SYS SYS 0-95
GPU4 SYS SYS NV1 SYS X NV2 NV1 NV2 0-95
GPU5 SYS SYS SYS NV1 NV2 X NV2 NV1 0-95
GPU6 NV2 SYS SYS SYS NV1 NV2 X NV1 0-95
GPU7 SYS NV2 SYS SYS NV2 NV1 NV1 X 0-95
# make
make -C src build
make[1]: Entering directory '/root/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
In file included from all_reduce.cu:8:0:
common.h:9:18: fatal error: nccl.h: No such file or directory
 #include "nccl.h"
          ^
compilation terminated.
make[1]: *** [../build/all_reduce.o] Error 1
make[1]: Leaving directory '/root/nccl-tests/src'
make: *** [src.build] Error 2
make MPI=1 MPI_HOME=/nasa/hpcx/2.5.0/ompi-pgi PREFIX=/nobackupp16/swbuild/dkokron/cuda/nccl-tests/install CUDA_HOME=/nasa/cuda/10.2 NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70" NCCL_HOME=/nobackupp16/swbuild/dkokron/cuda/nccl/build
make -C src build
make[1]: Entering directory '/nobackupp16/swbuild/dkokron/cuda/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
/nasa/pgi/19.10/linux86-64-llvm/19.10/include/edg/xmmintrin.h(2514): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r440/TC440_70/drivers/compiler/edg/EDG_5.0/src/sys_predef.c", line 574
1 catastrophic error detected in the compilation of "/tmp/pbs.8178378.pbspl1.nas.nasa.gov/tmpxft_00005bae_00000000-4_all_reduce.cpp4.ii".
Compilation aborted.
nvcc error : 'cudafe++' died due to signal 6
On CentOS 7, openmpi-devel has libmpi.so at /usr/lib64/openmpi/lib/libmpi.so, and mpi.h resides at /usr/include/openmpi-x86_64/mpi.h.
Hello. It hangs when I try to test MPI on a Ryzen 2700X system. Please refer to the comment below.
On two different GPU clusters, nv_peer_mem + NCCL2 failed to pass the NCCL sanity tests.
MVAPICH2-GDR + gdrcopy passed the tests with the same HW/SW.
This is a mirror issue of https://github.com/Mellanox/nv_peer_memory/issues/38, and related to #7.
Can anyone help?
Thanks.
MPI: OpenMPI 1.8.8/2.1.3/3.0.1
CUDA lib: CUDA 8.0/9.0/9.1
NCCL lib: NCCL 2.0.5/2.1.15
GDR lib: nv_peer_memory master
OFED: MLNX_OFED_LINUX-4.2-1
OS: Ubuntu1604/CentOS7.4
GPU: Kepler K80/Pascal P100
Server: Supermicro 4028-TR/4028-TR2
Topo interconnect: PIX
Driver Version: 390.30
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1
[15:37:58](root):~ # /root/mpi/cuda-9.0/ompi3-cuda/bin/mpirun -v --allow-run-as-root \
-x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=1 \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1 \
-x LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/lib:/usr/local/cuda-9.0/lib64 \
-mca btl_openib_if_include mlx5_3:1 \
-np 2 -host clx-mld-45,clx-mld-46 -pernode --oversubscribe \
/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/ompi3tests/all_reduce_perf -b 9 -e 4M -g 1 -c 1 -z 0
nThread 1 nGpus 1 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on clx-mld-45 device 0 [0x04] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 1 on clx-mld-46 device 0 [0x04] Tesla P100-PCIE-16GB
8 2 float sum 0.144 0.00 0.00 0e+00 0.015 0.00 0.00 0e+00
1048584 262146 float sum 0.212 4.95 4.95 2e+00 0.209 5.02 5.02 2e+00
2097160 524290 float sum 0.379 5.53 5.53 2e+00 0.379 5.53 5.53 2e+00
3145736 786434 float sum 0.549 5.73 5.73 2e+00 0.548 5.74 5.74 2e+00
Out of bounds values : 24 FAILED
Avg bus bandwidth : 4.06216
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[2940,1],0]
Exit code: 1
--------------------------------------------------------------------------
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1
[15:50:24](root):~/mpi # /root/mpi/cuda-9.0/ompi3-cuda/bin/mpirun -v --allow-run-as-root \
-x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=1 \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1 \
-x LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/lib:/usr/local/cuda-9.0/lib64 \
-mca btl_openib_if_include mlx5_3:1 \
-np 2 -host clx-mld-45,clx-mld-46 -pernode --oversubscribe \
/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/ompi3tests/all_reduce_perf -b 9 -e 4M -g 1 -c 1 -z 0
nThread 1 nGpus 1 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on clx-mld-45 device 0 [0x04] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 1 on clx-mld-46 device 0 [0x04] Tesla P100-PCIE-16GB
8 2 float sum 0.087 0.00 0.00 0e+00 0.018 0.00 0.00 0e+00
1048584 262146 float sum 0.396 2.65 2.65 0e+00 0.394 2.66 2.66 0e+00
2097160 524290 float sum 0.772 2.72 2.72 0e+00 25.292 0.08 0.08 0e+00
3145736 786434 float sum 27.539 0.11 0.11 0e+00 69.042 0.05 0.05 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 1.03398
cd /root/mpi/cuda-8.0/ompi3.0.1 && \
rm -fr /root/mpi/cuda-8.0/ompi3.0.1/* && git checkout v3.0.1 && git reset --hard && \
./autogen.pl && \
CC=/usr/bin/gcc CXX=/usr/bin/g++ FC=/usr/bin/gfortran ./configure --with-verbs --with-cuda=/usr/local/cuda-8.0 --prefix=/root/mpi/cuda-8.0/ompi3-cuda && \
time make -j $(nproc) install
cd /root/mpi/cuda-9.1/git/nccl-tests && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi1-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi1tests -j $(nproc) && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi2-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi2tests -j $(nproc) && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi3-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi3tests -j $(nproc)
-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1
[16:06:47](root):~/mpi # /opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/bin/mpirun \
-genv LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.0.5-3+cuda9.0_amd64/lib:/usr/local/cuda-9.0/lib64:/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/lib64 \
-genv MV2_GPUDIRECT_GDRCOPY_LIB=/root/mpi/cuda-9.0/gdr/lib64/libgdrapi.so \
-genv GDRCOPY_ENABLE_LOGGING=1 -genv GDRCOPY_LOG_LEVEL=5 -genv MV2_USE_GPUDIRECT=1 \
-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1 -genv NCCL_DEBUG=0 -genv NCCL_SOCKET_IFNAME=enp5s0f0 \
-np 2 -host clx-mld-45,clx-mld-46 /root/mpi/cuda-9.0/nccl_2.0.5-3+cuda9.0_amd64/mvapich2tests/all_reduce_perf -b 9 -e 4M -g 4 -c 1 -z 0
nThread 1 nGpus 4 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 2.0
# Using devices
# Rank 0 on clx-mld-45 device 0 [0x04] Tesla P100-PCIE-16GB
# Rank 1 on clx-mld-45 device 1 [0x06] Tesla P100-PCIE-16GB
# Rank 2 on clx-mld-45 device 2 [0x07] Tesla P100-PCIE-16GB
# Rank 3 on clx-mld-45 device 3 [0x08] Tesla P100-PCIE-16GB
# Rank 4 on clx-mld-46 device 0 [0x04] Tesla P100-PCIE-16GB
# Rank 5 on clx-mld-46 device 1 [0x06] Tesla P100-PCIE-16GB
# Rank 6 on clx-mld-46 device 2 [0x07] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 7 on clx-mld-46 device 3 [0x08] Tesla P100-PCIE-16GB
8 2 float sum 0.149 0.00 0.00 0e+00 0.151 0.00 0.00 0e+00
1048584 262146 float sum 0.308 3.41 5.96 1e-06 0.304 3.45 6.04 1e-06
2097160 524290 float sum 0.491 4.27 7.48 1e-06 0.486 4.32 7.56 1e-06
3145736 786434 float sum 0.678 4.64 8.12 1e-06 0.678 4.64 8.12 1e-06
Out of bounds values : 0 OK
Avg bus bandwidth : 5.40981
cd /root/mpi/cuda-8.0/git/gdrcopy && \
make PREFIX=/root/mpi/cuda-8.0/gdr CUDA=/usr/local/cuda-8.0 -j $(nproc) all install
cd /root/mpi/cuda-9.0/git/gdrcopy && \
make PREFIX=/root/mpi/cuda-9.0/gdr CUDA=/usr/local/cuda-9.0 -j $(nproc) all install
cd /root/mpi/cuda-9.0/git/nccl-tests && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64 CUDA_HOME=/usr/local/cuda-9.0 MPI_HOME=/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5 LIBRARY_PATH=/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/lib64 DST_DIR=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/mvapich2tests -j $(nproc)
Hi, when I use the command "mpirun -v --allow-run-as-root -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_tcp_if_include ib0 -np 2 -host 192.168.1.14,192.168.1.13 -pernode ./build/all_reduce_perf -b 9 -e 32M -g 8 -c 1 -z 0", the result shows "Out of bounds values : 248 FAILED". I have no idea about this error, so I would be very grateful for any information you can give me about it.
The result:
WARNING: There are more than one active ports on host 'seedsmed-f02hpc02', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[62619,1],1] (PID 3535)
[seedsmed-f02hpc01:03622] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[seedsmed-f02hpc01:03622] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[seedsmed-f02hpc01:03622] 1 more process has sent help message help-opal-runtime.txt / opal_init:warn-fork
8 2 float sum 0.445 0.00 0.00 1e-06 0.106 0.00 0.00 0e+00
1048584 262146 float sum 0.624 1.68 3.15 1e+01 0.624 1.68 3.15 1e+01
2097160 524290 float sum 1.026 2.04 3.83 1e+01 1.019 2.06 3.86 1e+01
3145736 786434 float sum 1.359 2.31 4.34 1e+01 1.358 2.32 4.34 1e+01
4194312 1048578 float sum 1.732 2.42 4.54 1e+01 1.728 2.43 4.55 1e+01
5242888 1310722 float sum 2.070 2.53 4.75 1e+01 2.078 2.52 4.73 1e+01
6291464 1572866 float sum 2.447 2.57 4.82 1e+01 2.445 2.57 4.82 1e+01
7340040 1835010 float sum 2.794 2.63 4.93 1e+01 2.798 2.62 4.92 1e+01
8388616 2097154 float sum 3.165 2.65 4.97 1e+01 3.168 2.65 4.97 1e+01
9437192 2359298 float sum 3.529 2.67 5.01 1e+01 3.526 2.68 5.02 1e+01
10485768 2621442 float sum 3.918 2.68 5.02 1e+01 3.938 2.66 4.99 1e+01
11534344 2883586 float sum 4.308 2.68 5.02 1e+01 4.308 2.68 5.02 1e+01
12582920 3145730 float sum 4.682 2.69 5.04 1e+01 4.718 2.67 5.00 1e+01
13631496 3407874 float sum 5.051 2.70 5.06 1e+01 5.065 2.69 5.05 1e+01
14680072 3670018 float sum 5.455 2.69 5.05 1e+01 5.462 2.69 5.04 1e+01
15728648 3932162 float sum 5.864 2.68 5.03 1e+01 5.841 2.69 5.05 1e+01
16777224 4194306 float sum 6.239 2.69 5.04 1e+01 6.228 2.69 5.05 1e+01
17825800 4456450 float sum 6.612 2.70 5.06 1e+01 6.631 2.69 5.04 1e+01
18874376 4718594 float sum 7.052 2.68 5.02 1e+01 7.022 2.69 5.04 1e+01
19922952 4980738 float sum 7.419 2.69 5.03 1e+01 7.420 2.69 5.03 1e+01
20971528 5242882 float sum 7.826 2.68 5.02 1e+01 7.811 2.68 5.03 1e+01
22020104 5505026 float sum 8.214 2.68 5.03 1e+01 8.182 2.69 5.05 1e+01
23068680 5767170 float sum 8.629 2.67 5.01 1e+01 8.590 2.69 5.04 1e+01
24117256 6029314 float sum 9.007 2.68 5.02 1e+01 8.979 2.69 5.04 1e+01
25165832 6291458 float sum 9.368 2.69 5.04 1e+01 9.441 2.67 5.00 1e+01
26214408 6553602 float sum 9.750 2.69 5.04 1e+01 9.749 2.69 5.04 1e+01
27262984 6815746 float sum 10.181 2.68 5.02 1e+01 10.150 2.69 5.04 1e+01
28311560 7077890 float sum 10.546 2.68 5.03 1e+01 10.535 2.69 5.04 1e+01
29360136 7340034 float sum 10.940 2.68 5.03 1e+01 10.944 2.68 5.03 1e+01
30408712 7602178 float sum 11.317 2.69 5.04 1e+01 11.296 2.69 5.05 1e+01
31457288 7864322 float sum 11.695 2.69 5.04 1e+01 11.688 2.69 5.05 1e+01
32505864 8126466 float sum 12.052 2.70 5.06 1e+01 12.092 2.69 5.04 1e+01
Out of bounds values : 248 FAILED
Avg bus bandwidth : 4.72189
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
I use NCCL 1.3 and CUDA 9.2 and followed the instructions from the README, but I cannot run the test. Is there any additional advice someone can give me?
$ make MPI=1 MPI_HOME=/usr/local/mvapich2 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/cuda/3rdparty/nccl1 -j64
make -C src build
make[1]: Entering directory '/home/somebody/source/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Compiling all_gather.cu > ../build/all_gather.o
Compiling broadcast.cu > ../build/broadcast.o
Compiling reduce_scatter.cu > ../build/reduce_scatter.o
Compiling reduce.cu > ../build/reduce.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
Linking ../build/all_gather.o > ../build/all_gather_perf
Linking ../build/broadcast.o > ../build/broadcast_perf
Linking ../build/reduce_scatter.o > ../build/reduce_scatter_perf
Linking ../build/reduce.o > ../build/reduce_perf
$ export LD_LIBRARY_PATH=/usr/local/mvapich2/lib:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/3rdparty/nccl1/lib
$ NCCL_DEBUG=WARN ./build/all_reduce_perf -b 8 -e 128M -f2 -g4
nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL version 1.3.5 compiled with CUDA 9.2
# NCCL Tests compiled with NCCL 1.0
# Using devices
# Rank 0 on dell-gpu142 device 0 [0x1a] Tesla V100-SXM2-16GB
# Rank 1 on dell-gpu142 device 1 [0x1c] Tesla V100-SXM2-16GB
# Rank 2 on dell-gpu142 device 2 [0x1d] Tesla V100-SXM2-16GB
# Rank 3 on dell-gpu142 device 3 [0x1e] Tesla V100-SXM2-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
WARN src/all_reduce.cu:212 Cuda failure 'invalid resource handle'
NCCL failure all_reduce.cu:95 'unhandled cuda error'
Is there a special way of building the tests if I want to run them over Ethernet? I am trying to run them on the AWS 100 Gbps network and getting the following error:
ubuntu@ip-172-31-15-234:~$ /usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 2 -N 1 -x LD_LIBRARY_PATH=~/nccl/nccl-2.3.7/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.3.7/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
--------------------------------------------------------------------------
[[33448,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: ip-172-31-3-83
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[ip-172-31-15-234:73547] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-172-31-15-234:73547] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I compiled with MPI=1, but it hangs when I run the following:
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 2 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# hang here
I see that the process is busy all the time from top:
12405 ubuntu 20 0 540652 18376 10436 R 100.5 0.0 0:04.27 all_reduce_perf
12404 ubuntu 20 0 540652 18292 10352 R 100.0 0.0 0:04.27 all_reduce_perf
It does not hang if I run:
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 12275 on ip-172-32-45-72 device 0 [0x00] Tesla V100-SXM2-32GB
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 4.43 0.00 0.00 0e+00 0.31 0.03 0.00 0e+00
16 4 float sum 4.34 0.00 0.00 0e+00 0.31 0.05 0.00 0e+00
32 8 float sum 4.30 0.01 0.00 0e+00 0.30 0.11 0.00 0e+00
64 16 float sum 4.36 0.01 0.00 0e+00 0.30 0.21 0.00 0e+00
128 32 float sum 4.27 0.03 0.00 0e+00 0.30 0.42 0.00 0e+00
256 64 float sum 4.26 0.06 0.00 0e+00 0.30 0.85 0.00 0e+00
512 128 float sum 4.36 0.12 0.00 0e+00 0.30 1.70 0.00 0e+00
1024 256 float sum 4.33 0.24 0.00 0e+00 0.30 3.38 0.00 0e+00
2048 512 float sum 4.32 0.47 0.00 0e+00 0.31 6.66 0.00 0e+00
4096 1024 float sum 4.27 0.96 0.00 0e+00 0.30 13.47 0.00 0e+00
8192 2048 float sum 4.36 1.88 0.00 0e+00 0.30 26.96 0.00 0e+00
16384 4096 float sum 4.30 3.81 0.00 0e+00 0.30 53.84 0.00 0e+00
32768 8192 float sum 4.27 7.67 0.00 0e+00 0.30 108.23 0.00 0e+00
65536 16384 float sum 4.36 15.04 0.00 0e+00 0.30 215.47 0.00 0e+00
131072 32768 float sum 4.35 30.13 0.00 0e+00 0.30 430.24 0.00 0e+00
262144 65536 float sum 4.37 59.96 0.00 0e+00 0.30 864.73 0.00 0e+00
524288 131072 float sum 4.34 120.92 0.00 0e+00 0.30 1728.04 0.00 0e+00
1048576 262144 float sum 6.10 171.98 0.00 0e+00 0.30 3444.73 0.00 0e+00
2097152 524288 float sum 8.78 238.96 0.00 0e+00 0.31 6866.90 0.00 0e+00
4194304 1048576 float sum 14.64 286.48 0.00 0e+00 0.30 13751.82 0.00 0e+00
8388608 2097152 float sum 25.98 322.90 0.00 0e+00 0.30 27786.05 0.00 0e+00
16777216 4194304 float sum 47.48 353.35 0.00 0e+00 0.30 55942.70 0.00 0e+00
33554432 8388608 float sum 90.58 370.45 0.00 0e+00 0.30 111107.39 0.00 0e+00
67108864 16777216 float sum 176.7 379.73 0.00 0e+00 0.30 221517.95 0.00 0e+00
134217728 33554432 float sum 349.2 384.30 0.00 0e+00 0.30 443841.69 0.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
I am able to ssh to localhost. How do I fix it?
The tests rely on MPI to work with multiple processes. Must multiple processes rely on MPI? Can NCCL2 work with RPC when using two or more machines?
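For what it's worth, only the test harness depends on MPI; NCCL itself just needs some out-of-band channel to distribute one ncclUniqueId before the ranks join a communicator. A hedged sketch of that bootstrap, where exchangeId is a placeholder you could back with MPI, an RPC framework, or even a shared file:

#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical out-of-band exchange: rank 0 generates the id and every
// other rank receives it over whatever transport is available.
extern void exchangeId(ncclUniqueId* id, int rank);

ncclComm_t initComm(int rank, int nranks, int device) {
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);  // created once per job
  exchangeId(&id, rank);                // placeholder transport (RPC, file, MPI...)
  cudaSetDevice(device);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}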
I tested nccl-tests in the nvidia-pytorch docker image (downloaded from the NVIDIA cloud). The compilation went through. However, it failed when I tested "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8".
The error message is:
"nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL failure common.cu:874 'unhandled system error'"
It seems that the problem is caused by "NCCLCHECK(ncclGetUniqueId(&ncclId));"
Thanks for the help.
Apologies if this is not the correct place to post this.
Sat in on the excellent Connect with the Experts NCCL session this morning. I may not have heard correctly but I was under the impression there was a page somewhere with expected performance numbers for comparison. What I thought I heard was to run the Perf tests from this github link and compare with the numbers at ????? "to check your hardware is working at the speed it should be". I was expecting to find some table with expected numbers for typical platforms but couldn't immediately find anything. Did I misunderstand something?
When I run the following command:
make MPI=1 MPI_HOME=/usr/include/mpi NCCL_HOME=/usr/local
I get the following error:
common.h:14:17: fatal error: mpi.h: No such file or directory
# include "mpi.h"
However, I am able to verify that mpi.h is in that directory.
Hi, I'm trying to build the NCCL benchmarks and I want to run some tests on multiple nodes, but the present tests don't seem to support that. How can I change that?
I'm trying to build nccl-tests + NCCL 2.7 and for some reason it ends up linking against libmpi.so.12, whereas NCCL 2.6 gets linked to libmpi.so.40. My build procedure is here, and I'm running it in an identical way, except for specifying a different checkout command for NCCL.
Any suggestions on how to debug this? What is bringing in libmpi.so.12?
NCCL version 2.6
[ec2-user@ip-172-31-24-194 ~]$ ldd $FOLDER_ROOT/nccl-tests/build/all_reduce_perf
linux-vdso.so.1 => (0x00007ffc9a35f000)
libcudart.so.10.0 => /usr/local/cuda-10.0/lib64/libcudart.so.10.0 (0x00007f6ba3f49000)
librt.so.1 => /lib64/librt.so.1 (0x00007f6ba3d41000)
libmpi.so.40 => /opt/amazon/efa/lib64/libmpi.so.40 (0x00007f6ba3a46000)
libcurand.so.10.0 => /usr/local/cuda-10.0/lib64/libcurand.so.10.0 (0x00007f6b9f8df000)
libnccl.so.2 => /home/ec2-user/nccl/nccl-2.4.6/nccl/build/lib/libnccl.so.2 (0x00007f6b9ae6a000)
libnvToolsExt.so.1 => /usr/local/cuda-10.0/lib64/libnvToolsExt.so.1 (0x00007f6b9ac61000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6b9aa45000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f6b9a841000)
libstdc++.so.6 => /home/ec2-user/anaconda3/lib/libstdc++.so.6 (0x00007f6b9a507000)
libm.so.6 => /lib64/libm.so.6 (0x00007f6b9a205000)
libgcc_s.so.1 => /home/ec2-user/anaconda3/lib/libgcc_s.so.1 (0x00007f6ba43bf000)
libc.so.6 => /lib64/libc.so.6 (0x00007f6b99e38000)
libopen-rte.so.40 => /opt/amazon/efa/lib64/libopen-rte.so.40 (0x00007f6b99b85000)
libopen-pal.so.40 => /opt/amazon/efa/lib64/libopen-pal.so.40 (0x00007f6b9987f000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f6b9967c000)
libz.so.1 => /home/ec2-user/anaconda3/lib/libz.so.1 (0x00007f6b99465000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6ba41c3000)
NCCL version 2.7
[ec2-user@ip-172-31-24-194 ~]$ ldd $FOLDER_ROOT/nccl-tests/build/all_reduce_perf
linux-vdso.so.1 => (0x00007ffcf057d000)
libcudart.so.10.0 => /usr/local/cuda-10.0/lib64/libcudart.so.10.0 (0x00007f119bc4b000)
librt.so.1 => /lib64/librt.so.1 (0x00007f119ba43000)
libmpi.so.12 => not found
libcurand.so.10.0 => /usr/local/cuda-10.0/lib64/libcurand.so.10.0 (0x00007f11978dc000)
libnccl.so.2 => /home/ec2-user/nccl/nccl-2.4.6/nccl/build/lib/libnccl.so.2 (0x00007f1192e67000)
libnvToolsExt.so.1 => /usr/local/cuda-10.0/lib64/libnvToolsExt.so.1 (0x00007f1192c5e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f1192a42000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f119283e000)
libstdc++.so.6 => /home/ec2-user/anaconda3/lib/libstdc++.so.6 (0x00007f1192504000)
libm.so.6 => /lib64/libm.so.6 (0x00007f1192202000)
libgcc_s.so.1 => /home/ec2-user/anaconda3/lib/libgcc_s.so.1 (0x00007f119c0c1000)
libc.so.6 => /lib64/libc.so.6 (0x00007f1191e35000)
/lib64/ld-linux-x86-64.so.2 (0x00007f119bec5000)
[ec2-user@ip-172-31-24-194 ~]$ export FOLDER_ROOT=/home/ec2-user/nccl/nccl-2.4.6
After a build, build/ will show up in git:
$ git status
HEAD detached at c864b73
Untracked files:
(use "git add <file>..." to include in what will be committed)
build/
nothing added to commit but untracked files present (use "git add" to track)
This is particularly annoying when using nccl-tests as a submodule, as the submodule will be marked as "modified" in parent repo:
modified: nccl-tests (untracked content)
I have installed CUDA 8.0.61, cuDNN 5.1.10, and NCCL 2.1.15 on Ubuntu 14.04. I have successfully verified CUDA and cuDNN using the official examples.
However, I run into errors using nccl-tests:
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
NCCL failure common.cu:908 'unhandled cuda error'.
I have tried installing NCCL2 both locally and using the network repo, but both ways failed.
Can anyone help?
[barajasc@gpunode102 nccl-tests]$ make
make -C src build
make[1]: Entering directory `/users/barajasc/tests/nccl/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
ld: ../build/common.o: in function `threadRunTests(threadArgs*)':
/users/barajasc/tests/nccl/nccl-tests/src/common.cu:520: undefined reference to `ncclTestEngine'
ld: /users/barajasc/tests/nccl/nccl-tests/src/common.cu:520: undefined reference to `ncclTestEngine'
ld: ../build/common.o: in function `run()':
/users/barajasc/tests/nccl/nccl-tests/src/common.cu:763: undefined reference to `ncclTestEngine'
make[1]: *** [../build/all_reduce_perf] Error 1
make[1]: Leaving directory `/users/barajasc/tests/nccl/nccl-tests/src'
make: *** [src.build] Error 2
I have both CUDA_HOME and NCCL_HOME set. I'm currently using CUDA 10.1.243 and NCCL 2.7.6. I've tried both icc 2019a and gcc 8.2.0, and it just will not compile. I cannot seem to find any other information related to this. I've tried both GNU Make 3.0 and 3.4 just in case, but nothing worked.
I'm hoping it is something simple, but I cannot imagine what. I should also say that I compiled NCCL from source and had no errors at compile time.
I ran the test after installing NCCL from RPM:
sudo yum install libnccl-2.4.8-1+cuda10.0 libnccl-devel-2.4.8-1+cuda10.0 libnccl-static-2.4.8-1+cuda10.0
The error does not tell me anything I understand:
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
Using devices
458398: Test CUDA failure common.cu:730 'no CUDA-capable device is detected'
The machine I run on contains 8 Tesla T4 GPUs.
Could you help me solve this?
I'm seeing the same problem as others, on two Amazon EC2 p3.2xlarge instances with V100 GPUs. Here is the command I have:
mpirun -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME -np 2 -mca btl ^openib --hostfile hosts all_reduce_perf -b 1K -e 100M -c 0
where the hosts file is simply:
ip1 max_slots=1
ip2 max_slots=1
and NCCL_SOCKET_IFNAME=ens3
The symptom: this is all I see:
nThread 1 nGpus 1 minBytes 1024 maxBytes 104857600 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
Nothing else gets printed. Any ideas?
Docker base image: nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
Cuda: 8.0
Driver Version: 375.39
NCCL: 2.1.15-1+cuda8.0
Has anyone encountered this failure in the output log: misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
Here is the log:
root@node4:/opt/nccl-tests# mpirun --allow-run-as-root -np 2 \
> -host host_6176,host_6177\
> -byslot -v -x NCCL_SHM_DISABLE=1 \
> -mca btl_openib_want_cuda_gdr 1 \
> -mca btl_tcp_if_include ib0 -x NCCL_DEBUG=INFO \
> build/all_reduce_perf -b 16M -e 32M -g 4 -c 0 -z 0 -d float
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:
Command line options:
Deprecated: --byslot, -byslot
Replacement: --map-by slot
Equivalent MCA parameter:
Deprecated: rmaps_base_byslot
Replacement: rmaps_base_mapping_policy=slot
The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'node4', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
nThread 1 nGpus 4 minBytes 16777216 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
[node4:01848] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[node4:01848] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
node4:1853:1853 [0] INFO NET : Using interface ib0:192.168.56.134<0>
node4:1853:1853 [0] INFO NET/IB : Using interface ib0 for sideband communication
node4:1853:1853 [0] INFO NET/IB: [0] mlx4_1:1/IB
node4:1853:1853 [0] INFO NET/IB: [1] mlx4_0:1/IB
node4:1853:1853 [0] INFO Using internal Network IB
node4:1854:1854 [4] INFO Using NCCL Low-latency algorithm for sizes below 16384
node4:1853:1880 [0] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1880 [0] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1883 [1] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1883 [1] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1880 [0] INFO NET/IB: Dev 0 Port 1 qpn 3970 mtu 5 LID 26
node4:1853:1883 [1] INFO NET/IB: Dev 0 Port 1 qpn 3971 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3974 mtu 5 LID 26
node4:1853:1880 [0] INFO CUDA Dev 0, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3979 mtu 5 LID 26
node4:1853:1883 [1] INFO CUDA Dev 1, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1853:1885 [2] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1885 [2] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1885 [2] INFO NET/IB: Dev 0 Port 1 qpn 3982 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3985 mtu 5 LID 26
node4:1853:1885 [2] INFO CUDA Dev 2, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1854:1884 [4] INFO NET/IB: Dev 0 Port 1 qpn 3988 mtu 5 LID 26
node4:1854:1886 [5] INFO NET/IB: Dev 0 Port 1 qpn 3989 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3992 mtu 5 LID 26
node4:1854:1884 [4] INFO CUDA Dev 4, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3997 mtu 5 LID 26
node4:1854:1886 [5] INFO CUDA Dev 5, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1853:1887 [3] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1887 [3] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1887 [3] INFO NET/IB: Dev 0 Port 1 qpn 4000 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 4003 mtu 5 LID 26
node4:1853:1887 [3] INFO CUDA Dev 3, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1854:1888 [6] INFO NET/IB: Dev 0 Port 1 qpn 4006 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 4009 mtu 5 LID 26
node4:1854:1888 [6] INFO CUDA Dev 6, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1854:1889 [7] INFO NET/IB: Dev 0 Port 1 qpn 4012 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 4015 mtu 5 LID 26
node4:1854:1889 [7] INFO CUDA Dev 7, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1853:1880 [0] INFO Using 256 threads
node4:1853:1880 [0] INFO Min Comp Cap 5
node4:1853:1880 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
node4:1853:1880 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
node4:1853:1880 [0] INFO Ring 01 : 0 1 2 3 4 5 6 7
node4:1853:1883 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node4:1854:1888 [6] INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node4:1853:1880 [0] INFO 7 -> 0 via NET/IB/1
node4:1853:1885 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node4:1854:1884 [4] INFO 3 -> 4 via NET/IB/0
node4:1853:1880 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node4:1854:1886 [5] INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node4:1854:1884 [4] INFO Ring 00 : 4[4] -> 5[5] via P2P/direct pointer
node4:1854:1889 [7] INFO NET/IB: Dev 0 Port 1 qpn 4018 mtu 5 LID 26
node4:1853:1887 [3] INFO NET/IB: Dev 1 Port 1 qpn 1418 mtu 5 LID 31
node4:1854:1886 [5] INFO Ring 01 : 5[5] -> 6[6] via P2P/direct pointer
node4:1853:1883 [1] INFO Ring 01 : 1[1] -> 2[2] via P2P/direct pointer
node4:1853:1885 [2] INFO Ring 01 : 2[2] -> 3[3] via P2P/direct pointer
node4:1854:1888 [6] INFO Ring 01 : 6[6] -> 7[7] via P2P/direct pointer
node4:1853:1880 [0] INFO 7 -> 0 via NET/IB/1
node4:1854:1884 [4] INFO 3 -> 4 via NET/IB/0
node4:1853:1880 [0] INFO Ring 01 : 0[0] -> 1[1] via P2P/direct pointer
node4:1854:1884 [4] INFO Ring 01 : 4[4] -> 5[5] via P2P/direct pointer
node4:1853:1887 [3] INFO NET/IB: Dev 1 Port 1 qpn 1421 mtu 5 LID 31
node4:1854:1889 [7] INFO NET/IB: Dev 0 Port 1 qpn 4021 mtu 5 LID 26
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on node4 device 0 [0x04] GeForce GTX TITAN X
# Rank 1 on node4 device 1 [0x05] GeForce GTX TITAN X
# Rank 2 on node4 device 2 [0x08] GeForce GTX TITAN X
# Rank 3 on node4 device 3 [0x09] GeForce GTX TITAN X
# Rank 4 on node4 device 4 [0x85] GeForce GTX TITAN X
# Rank 5 on node4 device 5 [0x86] GeForce GTX TITAN X
# Rank 6 on node4 device 6 [0x89] GeForce GTX TITAN X
# Rank 7 on node4 device 7 [0x8a] GeForce GTX TITAN X
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
node4:1853:1853 [0] INFO Launch mode Group
16777216 4194304 float sum 23.582 0.71 1.25 N/A 22.709 0.74 1.29 N/A
17825792 4456448 float sum 23.434 0.76 1.33 N/A 25.079 0.71 1.24 N/A
18874368 4718592 float sum 23.700 0.80 1.39 N/A 23.739 0.80 1.39 N/A
19922944 4980736 float sum 26.029 0.77 1.34 N/A 27.943 0.71 1.25 N/A
20971520 5242880 float sum 26.255 0.80 1.40 N/A 26.225 0.80 1.40 N/A
22020096 5505024 float sum 27.565 0.80 1.40 N/A 27.557 0.80 1.40 N/A
23068672 5767168 float sum 31.003 0.74 1.30 N/A 32.167 0.72 1.26 N/A
24117248 6029312 float sum 33.432 0.72 1.26 N/A 33.461 0.72 1.26 N/A
25165824 6291456 float sum 35.148 0.72 1.25 N/A 34.220 0.74 1.29 N/A
26214400 6553600 float sum 36.412 0.72 1.26 N/A 35.262 0.74 1.30 N/A
27262976 6815744 float sum 37.865 0.72 1.26 N/A 38.155 0.71 1.25 N/A
28311552 7077888 float sum 39.746 0.71 1.25 N/A 39.151 0.72 1.27 N/A
29360128 7340032 float sum 40.784 0.72 1.26 N/A 40.785 0.72 1.26 N/A
30408704 7602176 float sum 39.909 0.76 1.33 N/A 42.975 0.71 1.24 N/A
31457280 7864320 float sum 44.535 0.71 1.24 N/A 42.986 0.73 1.28 N/A
32505856 8126464 float sum 45.921 0.71 1.24 N/A 45.149 0.72 1.26 N/A
33554432 8388608 float sum 47.618 0.70 1.23 N/A 47.459 0.71 1.24 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 1.29001
Hi, I encountered this error while compiling:
/usr/bin/ld: cannot find -lmpi
I have installed Open MPI 4.0.2 to /openmpi4.0 and set MPI_HOME to point there.
Thank you!
GPU driver: 440.82
CUDA: 11.0
NCCL: 2704
One node with 4 Tesla V100-SXM2 GPUs.
I've tried running the tests with one node and 4 GPUs with no problems.
However, my test with MPI on 40 processes failed. Here's the command I ran:
mpirun -np 40 ./build/all_gather_perf -b 8 -e 128M -f 2 -g 4
The error I got was:
nThread 1 nGpus 1 minBytes 0 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20089,1],20]
Exit code: 2
However, when I tried to run the same command with 4 or fewer processes, I had no problems. I'm wondering if this is expected behavior at all. I was assuming the 4 GPUs on one node could support more than 4 processes, especially since the README only mentions that running with MPI could potentially support multiple nodes with 4 GPUs each. I would really appreciate it if anyone could help me figure this out. Thanks!
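For what it's worth, my reading of the README (an assumption worth double-checking) is that the total number of NCCL ranks is the product of MPI processes, threads per process, and GPUs per thread, and that NCCL generally expects one rank per GPU within a communicator. Worked out for the failing command:

\text{ranks} = n_\text{procs} \times n_\text{threads} \times n_\text{gpus} = 40 \times 1 \times 4 = 160

which is far more than the 4 physical GPUs on the node; the usual multi-process pattern on a single 4-GPU node would be something like mpirun -np 4 ./build/all_gather_perf ... -g 1.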
I am running nccl-tests on my bare-metal machines. I have set up RoCE and GPUDirect RDMA. As far as I know, the allreduce implementation in NCCL is ring-based, so in theory the all_reduce_perf result should be in line with reduce_scatter_perf and all_gather_perf. However, in my case, when I set NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=1, the all_reduce_perf result is far worse than either reduce_scatter_perf or all_gather_perf. How can that be, if an allreduce operation is made up of a reduce-scatter and an all-gather?
NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=1, GPU0 on node 1 <-> GPU0 on node 2
all_reduce_perf result
Interestingly, when I turn off GPUDirect RDMA by setting NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=0, the all_reduce_perf result is even better than both reduce_scatter_perf and all_gather_perf.
NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=0, GPU0 on node 1 <-> GPU0 on node 2
all_reduce_perf result
On top of the initial question: isn't GPUDirect RDMA supposed to speed up the communication? Is it because the GPU PCIe topology in my environment doesn't support GPUDirect RDMA at all?
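For context, the model in the repo's PERFORMANCE.md (which defines the busbw column) already encodes the ring decomposition the question relies on. Under that model, for a buffer of size S on n ranks:

t_\text{allreduce} \approx t_\text{reducescatter} + t_\text{allgather}

\text{busbw}_\text{allreduce} = \frac{2(n-1)}{n} \cdot \frac{S}{t}, \qquad \text{busbw}_\text{reducescatter} = \text{busbw}_\text{allgather} = \frac{n-1}{n} \cdot \frac{S}{t}

so with identical hardware behavior, the busbw columns of the three tests should roughly match: allreduce takes about twice the time, but its formula carries the compensating factor of 2. A markedly lower allreduce busbw therefore really is anomalous under the pure ring model, as the question suggests.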
I'm unable to get the nccl-test to run successfully. It fails with an internal error.
Singularity> ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 108039 on ip-0AB85F07 device 0 [0x00] Tesla V100-PCIE-16GB
# Rank 1 Pid 108039 on ip-0AB85F07 device 1 [0x00] Tesla V100-PCIE-16GB
# Rank 2 Pid 108039 on ip-0AB85F07 device 2 [0x00] Tesla V100-PCIE-16GB
# Rank 3 Pid 108039 on ip-0AB85F07 device 3 [0x00] Tesla V100-PCIE-16GB
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
ip-0AB85F07: Test NCCL failure common.cu:775 'internal error'
Trying to collect an nvprof profile for nccl-tests errors out with 'an illegal memory access was encountered' in multi-GPU scenarios.
nvprof -o All_reduce.4GPU.nvvp ./all_reduce_perf -g 4
nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
==102728== NVPROF is profiling process 102728, command: ./all_reduce_perf -g 4
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on paiws7 device 0 [0x04] Tesla V100-SXM2-16GB
# Rank 1 on paiws7 device 1 [0x05] Tesla V100-SXM2-16GB
# Rank 2 on paiws7 device 2 [0x03] Tesla V100-SXM2-16GB
# Rank 3 on paiws7 device 3 [0x04] Tesla V100-SXM2-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
Cuda failure common.cu:510 'an illegal memory access was encountered'
==102728== Error: Internal profiling error 4054:1.
======== Error: CUDA profiling error.
However, it works fine for 1 GPU:
nvprof -f -o All_reduce.1GPU.nvvp ./all_reduce_perf -g 1
nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
==102808== NVPROF is profiling process 102808, command: ./all_reduce_perf -g 1
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on paiws7 device 0 [0x04] Tesla V100-SXM2-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
33554432 8388608 float sum 0.090 372.75 0.00 0e+00 0.003 12402.53 0.00 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 0
==102808== Generated result file: /root/nccl-tests/build/All_reduce.1GPU.nvvp
Hi everyone!
I would like to know what exactly datacheck represents in nccl-tests. In the README it is written that the -c option is for checking the correctness of results. This is a bit unclear to me, since the tests are essentially for checking correctness anyway (so what is the point of datacheck?). And I am wondering: what happens if one sets -c to 0?
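To illustrate the general idea (a minimal sketch, not the tests' actual code): the timing columns only say how fast the collective ran, while the datacheck verifies it computed the right answer, by filling the inputs deterministically, deriving the expected output, and counting elements outside a tolerance. With -c 0 that comparison is skipped, which is useful for pure benchmarking, but silent data corruption would then go unnoticed.

#include <math.h>
#include <stdio.h>

/* Sketch of a datacheck: after an allreduce(sum) over deterministic
 * per-rank inputs, every output element has a known expected value;
 * count elements whose relative error exceeds a tolerance. The real
 * tests initialize buffers and compute expectations differently. */
static size_t count_out_of_bounds(const float *out, size_t n,
                                  float expected, double tol) {
  size_t bad = 0;
  for (size_t i = 0; i < n; i++) {
    double err = fabs((double)out[i] - (double)expected);
    if (err > tol * fmax(fabs((double)expected), 1.0)) bad++;
  }
  return bad;
}

int main(void) {
  /* Pretend 4 ranks each contributed 1.0f, so allreduce(sum) must yield 4
   * everywhere; the last element is deliberately corrupted. */
  float out[8] = {4, 4, 4, 4, 4, 4, 4, 3};
  printf("Out of bounds values : %zu\n",
         count_out_of_bounds(out, 8, 4.0f, 1e-6));
  return 0;
}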
Hi. I am trying to reproduce the allreduce performance in the official document, which reports a bandwidth of about 40 GB/s, by running the test in a similar environment. Here is the environment setting:
The NCCL version is 2.4.7 and the CUDA version is 10.0
And the command I used to run the test
mpirun -np 32 -H server1:8,server2:8,server3:8,server4:8 \
-x NCCL_DEBUG=INFO \
./build/all_reduce_perf -b 8 -e 512M -f 2
But the results are far from expectation: the busbw is about 6 GB/s. I think the only difference between my setup and a DGX-1 is that I have only 2 NVLinks within each server, which I don't think is the bottleneck.
Here is what I am confused about:
Thank you for your attention.
According to https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#allgather:
algbw = S/t
busbw = algbw * (n-1)/n
According to the code: https://github.com/NVIDIA/nccl-tests/blob/master/src/all_gather.cu#L50
algbw = S/t * (n-1)/n
busbw = algbw
Am I reading the code correctly? It seems like either the code or the document is incorrect.
I think the same issue may exist for reduce_scatter.
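Working through the busbw column with nothing but the two sets of formulas quoted above:

\text{busbw}_\text{doc} = \text{algbw}_\text{doc} \cdot \frac{n-1}{n} = \frac{S}{t} \cdot \frac{n-1}{n} = \text{algbw}_\text{code} = \text{busbw}_\text{code}

so the two conventions agree on busbw and differ only in what gets reported as algbw, by the factor (n-1)/n; the discrepancy is real but confined to that one column, plausibly reflecting whether S is taken as the per-rank or the total buffer size.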
Hello everyone!
After running nccl-tests linked against my NCCL fork, I got the result you can see in this photo:
Although the error column in the attached photo has meaningful numbers rather than 0, the final 'Out of bounds values' result is 0 and it seems all tests passed. I would like to know why that happens.
Hi,
When running all_reduce_perf, NCCL sets different CPU masks when the test runs through the Slurm scheduler as opposed to running directly with OpenMPI. Both runs use the same NCCL_TOPO_FILE variable.
# Running with slurm:
[0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.0/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
[0] NCCL INFO Setting affinity for GPU 0 to ff
# Running with OpenMPI
[0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.0/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
[0] NCCL INFO Setting affinity for GPU 0 to ffffff
Why would this happen? Are we missing any Slurm configuration or NCCL environment variable we need to set?
Thank you!
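One way to narrow this down is to print the CPU set each launcher actually hands the process, since NCCL only reports the mask it inherits. A standalone diagnostic sketch (assuming Linux; not part of nccl-tests):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Print this process's CPU affinity mask as hex, in the same spirit as
 * NCCL's "ff" / "ffffff" log lines, so the Slurm and mpirun launches can
 * be compared directly. */
int main(void) {
  cpu_set_t mask;
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
    perror("sched_getaffinity");
    return 1;
  }
  int started = 0;
  for (int cpu = CPU_SETSIZE - 4; cpu >= 0; cpu -= 4) {
    int nibble = 0;
    for (int b = 0; b < 4; b++)
      if (CPU_ISSET(cpu + b, &mask)) nibble |= 1 << b;
    if (nibble || started) { printf("%x", nibble); started = 1; }
  }
  if (!started) printf("0");
  printf("\n");
  return 0;
}

If the masks differ, one plausible cause (an assumption, not a diagnosis) is Slurm's CPU binding: with cgroup/affinity plugins or --cpu-bind, a job step only sees the CPUs Slurm allocated to it, which would shrink ffffff down to ff.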
I ran all_reduce_perf with OpenMPI and GPUDirect RDMA on an Azure Standard_NC24rs_v3.
My command is:
/usr/bin/mpirun --hostfile HOSTFILE --mca btl tcp,self --mca btl_tcp_if_exclude docker0,lo --bind-to none -N 1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=all -x LD_LIBRARY_PATH=$HOME/nccl/build/lib:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH nccl-tests/build/all_reduce_perf --minbytes 8 --maxbytes 256M --stepfactor 2 --ngpus 1 --check 0 --nthreads 1
And the error message is:
# nThread 1 nGpus 1 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
#
# Using devices
# Rank 0 Pid 82431 on pkb-b52c0368-0 device 0 [0x00] Tesla V100-PCIE-16GB
# Rank 1 Pid 82676 on pkb-b52c0368-1 device 0 [0x00] Tesla V100-PCIE-16GB
pkb-b52c0368-0:82431:82431 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.4<0> [1]eth1:fe80::215:5dff:fe33:ff27%eth1<0>
pkb-b52c0368-0:82431:82431 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pkb-b52c0368-0:82431:82431 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.4<0>
NCCL version 2.5.6+cuda10.0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.5<0> [1]eth1:fe80::215:5dff:fe33:ff70%eth1<0>
pkb-b52c0368-1:82676:82676 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pkb-b52c0368-1:82676:82676 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.5<0>
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 545505 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 582757 mtu 5 LID 57
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/df5e6558-97d1-4698-9de8-a5f603d2bef7/pci0002:00/0002:00:02.0 -> 0/0/0/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/63787d95-43b1-4da0-adf3-de0aba056eb9/pci0002:00/0002:00:02.0 -> 0/0/0/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
pkb-b52c0368-1:82676:82696 [0] NCCL INFO CPU/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/DE
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/47505500
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - GPU/5B6C00000 (1)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/63787D95
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - NIC/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + NET[12] - NET/0 (0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO ==========================================
pkb-b52c0368-1:82676:82696 [0] NCCL INFO GPU/5B6C00000 :GPU/5B6C00000 (0/5000/0) CPU/0 (4/12/2) NET/0 (5/12/2)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/0 :GPU/5B6C00000 (5/12/2) CPU/0 (5/12/2) NET/0 (0/5000/0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 24/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO CPU/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/DE
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/47505500
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - GPU/92FB00000 (0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 12/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/DF5E6558
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - NIC/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + NET[12] - NET/0 (0)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO ==========================================
pkb-b52c0368-0:82431:82448 [0] NCCL INFO GPU/92FB00000 :GPU/92FB00000 (0/5000/0) CPU/0 (4/12/2) NET/0 (5/12/2)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/0 :GPU/92FB00000 (5/12/2) CPU/0 (5/12/2) NET/0 (0/5000/0)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 24/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 12/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Channel 00/02 : 0 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Channel 01/02 : 0 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Threads per block : 512/640/512
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Latency/AlgBw | Tree/ LL | Tree/ LL128 | Tree/Simple | Ring/ LL | Ring/ LL128 | Ring/Simple |
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Broadcast | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 3.0| 6.1/ 11.2| 15.0/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Reduce | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 3.0| 6.1/ 11.2| 15.0/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO AllGather | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 6.0| 6.1/ 22.5| 15.0/ 24.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO ReduceScatter | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 6.0| 6.1/ 22.5| 15.0/ 24.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO AllReduce | 14.4/ 3.6| 19.4/ 8.4| 100.0/ 10.8| 5.4/ 3.0| 8.6/ 11.2| 21.6/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Threads per block : 512/640/512
pkb-b52c0368-1:82676:82696 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 742314 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 159581 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 92fb00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 5b6c00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 00 : 1[5b6c00000] -> 0[92fb00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 00 : 0[92fb00000] -> 1[5b6c00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 00 : 0[92fb00000] -> 1[5b6c00000] [send] via NET/IB/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 00 : 1[5b6c00000] -> 0[92fb00000] [send] via NET/IB/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 126471 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 424932 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 470331 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 490679 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 92fb00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 5b6c00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 01 : 1[5b6c00000] -> 0[92fb00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 01 : 0[92fb00000] -> 1[5b6c00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 01 : 0[92fb00000] -> 1[5b6c00000] [send] via NET/IB/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 01 : 1[5b6c00000] -> 0[92fb00000] [send] via NET/IB/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 179520 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 542420 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO comm 0x7f7724001aa0 rank 0 nranks 2 cudaDev 0 busId 2fb00000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO Launch mode Parallel
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-1:82676:82696 [0] NCCL INFO comm 0x7fb6cc001aa0 rank 1 nranks 2 cudaDev 0 busId b6c00000 - Init COMPLETE
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-0:82431:82449 [0] transport/net_ib.cc:774 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 32631, vendor err 129
pkb-b52c0368-0:82431:82449 [0] NCCL INFO include/net.h:28 -> 2
pkb-b52c0368-0:82431:82449 [0] NCCL INFO transport/net.cc:377 -> 2
pkb-b52c0368-0:82431:82449 [0] NCCL INFO transport.cc:166 -> 2 [Proxy Thread]
pkb-b52c0368-1:82676:82697 [0] transport/net_ib.cc:774 NCCL WARN NET/IB : Got completion with error 12, opcode 671036413, len 0, vendor err 129
pkb-b52c0368-1:82676:82697 [0] NCCL INFO include/net.h:28 -> 2
pkb-b52c0368-1:82676:82697 [0] NCCL INFO transport/net.cc:377 -> 2
pkb-b52c0368-1:82676:82697 [0] NCCL INFO transport.cc:166 -> 2 [Proxy Thread]
pkb-b52c0368-1: Test NCCL failure common.cu:345 'unhandled system error'
.. pkb-b52c0368-1: Test failure common.cu:393
.. pkb-b52c0368-1: Test failure common.cu:492
.. pkb-b52c0368-1: Test failure all_reduce.cu:103
.. pkb-b52c0368-1: Test failure common.cu:518
.. pkb-b52c0368-1: Test failure common.cu:839
pkb-b52c0368-0: Test NCCL failure common.cu:345 'unhandled system error'
.. pkb-b52c0368-0: Test failure common.cu:393
.. pkb-b52c0368-0: Test failure common.cu:492
.. pkb-b52c0368-0: Test failure all_reduce.cu:103
.. pkb-b52c0368-0: Test failure common.cu:518
.. pkb-b52c0368-0: Test failure common.cu:839
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44150,1],1]
Exit code: 3
--------------------------------------------------------------------------
Could you help me solve this problem?
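For reference, the 'error 12' in those NET/IB warnings is the raw ibv_wc_status value from the completion queue, and in libibverbs status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), which typically points at connectivity or fabric trouble between the two endpoints rather than at NCCL itself. A one-liner to confirm the mapping against your own headers (assumes the libibverbs development headers are installed):

#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
  /* Print the human-readable name for work-completion status 12. */
  printf("wc status 12 = %s\n", ibv_wc_status_str((enum ibv_wc_status)12));
  return 0;
}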
user in nccl-tests at ubuntu on master
➜ make
make -C src build
make[1]: Entering directory '/home/user/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
common.cu:335:38: error: missing binary operator before token "("
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,4,0)
^
Makefile:72: recipe for target '../build/common.o' failed
make[1]: *** [../build/common.o] Error 1
make[1]: Leaving directory '/home/user/nccl-tests/src'
Makefile:17: recipe for target 'src.build' failed
make: *** [src.build] Error 2
This error occurs when I build.
Env:
user in ~ at ubuntu
➜ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
user in ~ at ubuntu
➜ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
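The 'missing binary operator before token "("' error is what the preprocessor emits when NCCL_VERSION is not defined as a function-like macro, i.e. when common.cu is compiled against an nccl.h that predates it; pointing NCCL_HOME at headers from NCCL 2.4 or newer is presumably the real fix. A guarded form of the check (a sketch) would at least preprocess cleanly against old headers:

/* Sketch: only expand the function-like NCCL_VERSION(X,Y,Z) macro when the
 * installed nccl.h actually defines it; on older headers the whole block
 * is skipped instead of failing to preprocess. */
#if defined(NCCL_VERSION) && (NCCL_VERSION_CODE >= NCCL_VERSION(2,4,0))
/* ... code that relies on NCCL >= 2.4 APIs ... */
#endif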
When I compile and run nccl-tests with
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
it runs successfully, but an error occurred when running the program under cuda-gdb.
Rank 0 Pid 12414 on 51e6e7896958 device 0 [0x1a] Tesla V100-SXM3-32GB
Rank 1 Pid 12414 on 51e6e7896958 device 1 [0x1b] Tesla V100-SXM3-32GB
Rank 2 Pid 12414 on 51e6e7896958 device 2 [0x1c] Tesla V100-SXM3-32GB
Rank 3 Pid 12414 on 51e6e7896958 device 3 [0x1e] Tesla V100-SXM3-32GB
[New Thread 0x7fffe6d73700 (LWP 12424)]
[New Thread 0x7fffe6572700 (LWP 12425)]
[New Thread 0x7fffe5a55700 (LWP 12426)]
[New Thread 0x7fffe523c700 (LWP 12427)]
[New Thread 0x7fffe4a23700 (LWP 12428)]
warning: Cuda API error detected: cudaDeviceEnablePeerAccess returned (0x2c0)
warning: Cuda API error detected: cudaGetLastError returned (0x2c0)
warning: Cuda API error detected: cudaDeviceEnablePeerAccess returned (0x2c0)
Does anyone have a similar problem, or is there any solution?
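One observation that may help: 0x2c0 is 704 decimal, which in recent CUDA runtimes is cudaErrorPeerAccessAlreadyEnabled. Code commonly treats that return as benign and clears it, and cuda-gdb simply surfaces every non-success return from the API, so these warnings are not necessarily fatal. A minimal sketch of the usual pattern (an illustration of the convention, not nccl-tests' exact code):

#include <cuda_runtime.h>
#include <stdio.h>

/* Enabling peer access twice returns cudaErrorPeerAccessAlreadyEnabled
 * (704 == 0x2c0); treat it as benign and clear the sticky error so later
 * error checks don't trip over it. */
static void enable_peer(int dev, int peer) {
  cudaSetDevice(dev);
  cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
  if (err == cudaErrorPeerAccessAlreadyEnabled) {
    cudaGetLastError(); /* clear the benign error */
  } else if (err != cudaSuccess) {
    fprintf(stderr, "peer access %d->%d failed: %s\n", dev, peer,
            cudaGetErrorString(err));
  }
}

int main(void) {
  enable_peer(0, 1);
  enable_peer(0, 1); /* second call reproduces the 0x2c0 return */
  return 0;
}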