nvidia / nccl-tests
NCCL Tests
License: BSD 3-Clause "New" or "Revised" License
Hi @sjeaugey
Greetings! I am a newbie to GPU programming and am not sure whether this is the correct place to submit this issue; if I am wrong, please forgive me, thanks!
After reading "Fast Multi-GPU collectives with NCCL" and using nccl-tests, I find NCCL seems most powerful in the following case:
A[0] = A[0] + B[0] + C[0] + D[0];
......
A[n] = A[n] + B[n] + C[n] + D[n];
where the array count is less than the GPU count and n is a large number. But it does not seem applicable to the following case:
sum = a1 + a2 + ... + an;
Is my understanding correct? If I want to utilize multiple GPUs to accelerate calculating sum = a1 + a2 + ... + an, is there any canonical method?
Thanks very much in advance!
Best Regards
Nan Xiao
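For readers who land here with the same question: the first pattern above is exactly what ncclAllReduce computes, where each GPU contributes one array and every GPU ends up with the elementwise sum. A minimal single-process sketch, assuming 4 visible GPUs and eliding error checks:

#include <nccl.h>
#include <cuda_runtime.h>

int main() {
  const int nDev = 4;                 // assumes 4 GPUs holding A, B, C, D
  const size_t count = 1 << 20;       // n elements per array
  int devs[4] = {0, 1, 2, 3};
  ncclComm_t comms[4];
  float* buf[4];
  cudaStream_t streams[4];
  ncclCommInitAll(comms, nDev, devs); // one communicator per device
  for (int i = 0; i < nDev; i++) {
    cudaSetDevice(i);
    cudaMalloc((void**)&buf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }
  // After this group completes, buf[i] on every GPU holds the
  // elementwise sum A[j] + B[j] + C[j] + D[j] -- the first case above.
  ncclGroupStart();
  for (int i = 0; i < nDev; i++)
    ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  ncclGroupEnd();
  for (int i = 0; i < nDev; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < nDev; i++) ncclCommDestroy(comms[i]);
  return 0;
}

For the scalar case sum = a1 + ... + an, the usual approach is a per-GPU partial reduction kernel (or a library such as CUB/Thrust), followed by an allreduce of the single partial sums; NCCL is not the tool for the intra-GPU reduction itself.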
I'm trying to run nccl-tests and then parse the output, but it's kind of a pain if debug output is enabled because it mixes with the actual output. Would it be possible to have an option to write to a CSV file instead of just stdout?
This may be incredibly dumb, but when I run
mpirun -np 2 --hostfile hosts -mca btl ^openib -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=ens3 all_reduce_perf -b 8 -e 128M -f 2 -g 1 -c 1 -i 20
with the host file:
172.31.12.213 max_slots=1
172.31.6.103 max_slots=1
I can see the program finish normally:
134217728 33554432 float sum 2.237 60.00 0.00 0e+00 0.001 139715.53 0.00 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 0
However, when I monitor 172.31.12.213's NIC bandwidth, it's all 0!
Please advise!
Thanks.
The following results are generated when I run a quick test:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
Cuda failure common.cu:891 'CUDA driver version is insufficient for CUDA runtime version'
**Running with NCCL_DEBUG=WARN produces the following results:**
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
Cuda failure common.cu:891 'CUDA driver version is insufficient for CUDA runtime version'
**Results from Nvidia-smi:**
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:06:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:07:00.0 Off | 0 |
| N/A 37C P0 41W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:0A:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:0B:00.0 Off | 0 |
| N/A 32C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:85:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:86:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 34C P0 43W / 300W | 10MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I am using a Docker container. What should be the fix for this problem?
I'm seeing 125 Gbps reported by all_reduce_perf (log), which is surprising because the AWS EFA network is advertised as only 100 Gbps.
Does this mean the advertised 100 Gbps limit is being exceeded? If not, how do I figure out the busbw achievable by all_reduce_perf on a 100 Gbps network?
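One piece of context that may resolve this: nccl-tests reports two figures per run, algbw (bytes moved divided by time) and busbw, which for allreduce rescales algbw by 2(n-1)/n to approximate per-link wire traffic (see PERFORMANCE.md in this repo). busbw, not algbw, is the column designed to be compared against the 100 Gbps figure. A small sketch of the conversion:

// Allreduce conversion used by nccl-tests (per PERFORMANCE.md):
// each byte crosses every link roughly 2*(n-1)/n times, so busbw
// is the figure to compare against the per-link wire speed.
double allreduceBusBw(double algbwGBs, int nranks) {
  return algbwGBs * 2.0 * (nranks - 1) / nranks;
}
// Example: 6.5 GB/s algbw on 16 ranks -> ~12.2 GB/s busbw, just
// under a 12.5 GB/s (100 Gbps) link.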
Linux: Ubuntu 20.04 LTS
GPU driver: newest NVIDIA driver for Linux
CUDA 10.1, cuDNN 7.6.5
NCCL 2.6.4
Hardware:
CPU: Intel 9400F, MB: Z370, RAM: 64 GB dual-channel, GPU: two 2080 Ti cards on two PCIe 3.0 x8 slots, with an NVLink bridge between them
I ran all the nccl-tests and NCCL seems to be working. But while each test is running (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything else. I can only move the mouse, but the system doesn't respond to mouse clicks or keyboard input.
When a test finishes running, the system goes back to normal and the log prints to the console.
The log is here:
# ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 3795 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 3795 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 7.18 0.00 0.00 0e+00 7.02 0.00 0.00 0e+00
16 4 float sum 7.00 0.00 0.00 0e+00 7.02 0.00 0.00 0e+00
32 8 float sum 7.28 0.00 0.00 0e+00 7.19 0.00 0.00 0e+00
64 16 float sum 7.20 0.01 0.01 0e+00 7.05 0.01 0.01 0e+00
128 32 float sum 7.30 0.02 0.02 0e+00 7.19 0.02 0.02 0e+00
256 64 float sum 7.30 0.04 0.04 0e+00 7.20 0.04 0.04 0e+00
512 128 float sum 7.47 0.07 0.07 0e+00 7.12 0.07 0.07 0e+00
1024 256 float sum 8.14 0.13 0.13 0e+00 7.92 0.13 0.13 0e+00
2048 512 float sum 8.56 0.24 0.24 0e+00 8.43 0.24 0.24 0e+00
4096 1024 float sum 9.72 0.42 0.42 0e+00 9.49 0.43 0.43 0e+00
8192 2048 float sum 11.99 0.68 0.68 0e+00 11.92 0.69 0.69 0e+00
16384 4096 float sum 14.36 1.14 1.14 0e+00 14.21 1.15 1.15 0e+00
32768 8192 float sum 16.79 1.95 1.95 0e+00 16.64 1.97 1.97 0e+00
65536 16384 float sum 21.14 3.10 3.10 0e+00 20.55 3.19 3.19 0e+00
131072 32768 float sum 35.56 3.69 3.69 0e+00 35.43 3.70 3.70 0e+00
262144 65536 float sum 41.23 6.36 6.36 0e+00 41.21 6.36 6.36 0e+00
524288 131072 float sum 50.66 10.35 10.35 0e+00 50.82 10.32 10.32 0e+00
1048576 262144 float sum 72.54 14.45 14.45 0e+00 72.45 14.47 14.47 0e+00
2097152 524288 float sum 120.7 17.37 17.37 0e+00 118.4 17.71 17.71 0e+00
4194304 1048576 float sum 215.2 19.49 19.49 0e+00 214.7 19.53 19.53 0e+00
8388608 2097152 float sum 411.3 20.39 20.39 0e+00 399.1 21.02 21.02 0e+00
16777216 4194304 float sum 865.3 19.39 19.39 0e+00 779.6 21.52 21.52 0e+00
33554432 8388608 float sum 1547.9 21.68 21.68 0e+00 1699.3 19.75 19.75 0e+00
67108864 16777216 float sum 3115.1 21.54 21.54 0e+00 3007.4 22.31 22.31 0e+00
134217728 33554432 float sum 5994.3 22.39 22.39 0e+00 5991.9 22.40 22.40 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.43886
./all_gather_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 9119 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 9119 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 1 float 7.14 0.00 0.00 0e+00 7.06 0.00 0.00 0e+00
16 2 float 7.03 0.00 0.00 0e+00 7.00 0.00 0.00 0e+00
32 4 float 6.96 0.00 0.00 0e+00 7.07 0.00 0.00 0e+00
64 8 float 7.10 0.00 0.00 0e+00 7.07 0.00 0.00 0e+00
128 16 float 7.10 0.01 0.01 0e+00 7.14 0.01 0.01 0e+00
256 32 float 7.18 0.02 0.02 0e+00 7.23 0.02 0.02 0e+00
512 64 float 7.49 0.03 0.03 0e+00 7.47 0.03 0.03 0e+00
1024 128 float 7.03 0.07 0.07 0e+00 6.96 0.07 0.07 0e+00
2048 256 float 6.97 0.15 0.15 0e+00 6.97 0.15 0.15 0e+00
4096 512 float 7.41 0.28 0.28 0e+00 7.00 0.29 0.29 0e+00
8192 1024 float 9.59 0.43 0.43 0e+00 8.80 0.47 0.47 0e+00
16384 2048 float 11.41 0.72 0.72 0e+00 10.78 0.76 0.76 0e+00
32768 4096 float 13.39 1.22 1.22 0e+00 11.85 1.38 1.38 0e+00
65536 8192 float 16.57 1.98 1.98 0e+00 13.83 2.37 2.37 0e+00
131072 16384 float 23.07 2.84 2.84 0e+00 18.39 3.56 3.56 0e+00
262144 32768 float 31.38 4.18 4.18 0e+00 30.27 4.33 4.33 0e+00
524288 65536 float 36.00 7.28 7.28 0e+00 35.30 7.43 7.43 0e+00
1048576 131072 float 47.38 11.06 11.06 0e+00 46.84 11.19 11.19 0e+00
2097152 262144 float 70.44 14.89 14.89 0e+00 69.77 15.03 15.03 0e+00
4194304 524288 float 120.1 17.46 17.46 0e+00 115.5 18.16 18.16 0e+00
8388608 1048576 float 212.5 19.73 19.73 0e+00 210.2 19.95 19.95 0e+00
16777216 2097152 float 418.5 20.05 20.05 0e+00 414.0 20.26 20.26 0e+00
33554432 4194304 float 817.8 20.51 20.51 0e+00 785.1 21.37 21.37 0e+00
67108864 8388608 float 1568.3 21.40 21.40 0e+00 1560.9 21.50 21.50 0e+00
134217728 16777216 float 3298.6 20.34 20.34 0e+00 3070.3 21.86 21.86 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 6.6972
./broadcast_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 26256 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 26256 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type root time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float 0 7.24 0.00 0.00 0e+00 7.50 0.00 0.00 0e+00
16 4 float 0 8.31 0.00 0.00 0e+00 7.69 0.00 0.00 0e+00
32 8 float 0 8.15 0.00 0.00 0e+00 8.23 0.00 0.00 0e+00
64 16 float 0 7.19 0.01 0.01 0e+00 7.13 0.01 0.01 0e+00
128 32 float 0 7.25 0.02 0.02 0e+00 7.45 0.02 0.02 0e+00
256 64 float 0 7.08 0.04 0.04 0e+00 7.16 0.04 0.04 0e+00
512 128 float 0 7.47 0.07 0.07 0e+00 7.39 0.07 0.07 0e+00
1024 256 float 0 7.19 0.14 0.14 0e+00 32.19 0.03 0.03 0e+00
2048 512 float 0 7.36 0.28 0.28 0e+00 7.03 0.29 0.29 0e+00
4096 1024 float 0 7.25 0.57 0.57 0e+00 7.07 0.58 0.58 0e+00
8192 2048 float 0 9.11 0.90 0.90 0e+00 8.10 1.01 1.01 0e+00
16384 4096 float 0 10.97 1.49 1.49 0e+00 10.52 1.56 1.56 0e+00
32768 8192 float 0 13.36 2.45 2.45 0e+00 11.73 2.79 2.79 0e+00
65536 16384 float 0 17.03 3.85 3.85 0e+00 14.24 4.60 4.60 0e+00
131072 32768 float 0 22.66 5.78 5.78 0e+00 22.60 5.80 5.80 0e+00
262144 65536 float 0 28.48 9.21 9.21 0e+00 28.45 9.21 9.21 0e+00
524288 131072 float 0 40.26 13.02 13.02 0e+00 40.08 13.08 13.08 0e+00
1048576 262144 float 0 63.48 16.52 16.52 0e+00 63.19 16.59 16.59 0e+00
2097152 524288 float 0 110.1 19.04 19.04 0e+00 109.3 19.19 19.19 0e+00
4194304 1048576 float 0 205.7 20.39 20.39 0e+00 237.1 17.69 17.69 0e+00
8388608 2097152 float 0 425.1 19.73 19.73 0e+00 386.7 21.69 21.69 0e+00
16777216 4194304 float 0 815.0 20.59 20.59 0e+00 824.0 20.36 20.36 0e+00
33554432 8388608 float 0 1536.8 21.83 21.83 0e+00 1508.2 22.25 22.25 0e+00
67108864 16777216 float 0 3139.2 21.38 21.38 0e+00 3124.3 21.48 21.48 0e+00
134217728 33554432 float 0 6283.5 21.36 21.36 0e+00 5873.1 22.85 22.85 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.99748
$ ./reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 4810 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 4810 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop root time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 0 7.16 0.00 0.00 0e+00 7.35 0.00 0.00 0e+00
16 4 float sum 0 7.74 0.00 0.00 0e+00 7.67 0.00 0.00 0e+00
32 8 float sum 0 7.08 0.00 0.00 0e+00 7.07 0.00 0.00 0e+00
64 16 float sum 0 7.13 0.01 0.01 0e+00 7.14 0.01 0.01 0e+00
128 32 float sum 0 7.15 0.02 0.02 0e+00 7.06 0.02 0.02 0e+00
256 64 float sum 0 7.14 0.04 0.04 0e+00 7.12 0.04 0.04 0e+00
512 128 float sum 0 7.14 0.07 0.07 0e+00 7.11 0.07 0.07 0e+00
1024 256 float sum 0 7.09 0.14 0.14 0e+00 7.09 0.14 0.14 0e+00
2048 512 float sum 0 7.11 0.29 0.29 0e+00 7.12 0.29 0.29 0e+00
4096 1024 float sum 0 7.28 0.56 0.56 0e+00 7.20 0.57 0.57 0e+00
8192 2048 float sum 0 8.72 0.94 0.94 0e+00 8.59 0.95 0.95 0e+00
16384 4096 float sum 0 10.80 1.52 1.52 0e+00 10.78 1.52 1.52 0e+00
32768 8192 float sum 0 12.89 2.54 2.54 0e+00 12.64 2.59 2.59 0e+00
65536 16384 float sum 0 16.42 3.99 3.99 0e+00 15.88 4.13 4.13 0e+00
131072 32768 float sum 0 23.17 5.66 5.66 0e+00 23.27 5.63 5.63 0e+00
262144 65536 float sum 0 29.13 9.00 9.00 0e+00 28.88 9.08 9.08 0e+00
524288 131072 float sum 0 40.93 12.81 12.81 0e+00 40.93 12.81 12.81 0e+00
1048576 262144 float sum 0 64.30 16.31 16.31 0e+00 64.25 16.32 16.32 0e+00
2097152 524288 float sum 0 110.5 18.98 18.98 0e+00 110.6 18.97 18.97 0e+00
4194304 1048576 float sum 0 202.1 20.76 20.76 0e+00 202.1 20.76 20.76 0e+00
8388608 2097152 float sum 0 386.5 21.70 21.70 0e+00 386.3 21.71 21.71 0e+00
16777216 4194304 float sum 0 752.6 22.29 22.29 0e+00 752.5 22.30 22.30 0e+00
33554432 8388608 float sum 0 1485.2 22.59 22.59 0e+00 1529.3 21.94 21.94 0e+00
67108864 16777216 float sum 0 2947.4 22.77 22.77 0e+00 2945.2 22.79 22.79 0e+00
134217728 33554432 float sum 0 5873.8 22.85 22.85 0e+00 5873.8 22.85 22.85 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.22671
$ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 5435 on w-system device 0 [0x01] GeForce RTX 2080 Ti
# Rank 1 Pid 5435 on w-system device 1 [0x02] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 1 float sum 7.21 0.00 0.00 0e+00 7.28 0.00 0.00 0e+00
16 2 float sum 7.12 0.00 0.00 0e+00 7.18 0.00 0.00 0e+00
32 4 float sum 7.14 0.00 0.00 0e+00 7.22 0.00 0.00 0e+00
64 8 float sum 7.20 0.00 0.00 0e+00 7.15 0.00 0.00 0e+00
128 16 float sum 7.14 0.01 0.01 0e+00 7.12 0.01 0.01 0e+00
256 32 float sum 7.16 0.02 0.02 0e+00 7.12 0.02 0.02 0e+00
512 64 float sum 7.18 0.04 0.04 0e+00 7.12 0.04 0.04 0e+00
1024 128 float sum 7.53 0.07 0.07 0e+00 7.27 0.07 0.07 0e+00
2048 256 float sum 7.28 0.14 0.14 0e+00 7.23 0.14 0.14 0e+00
4096 512 float sum 7.64 0.27 0.27 0e+00 7.57 0.27 0.27 0e+00
8192 1024 float sum 9.35 0.44 0.44 0e+00 9.24 0.44 0.44 0e+00
16384 2048 float sum 11.33 0.72 0.72 0e+00 11.23 0.73 0.73 0e+00
32768 4096 float sum 12.66 1.29 1.29 0e+00 12.62 1.30 1.30 0e+00
65536 8192 float sum 15.39 2.13 2.13 0e+00 15.31 2.14 2.14 0e+00
131072 16384 float sum 21.02 3.12 3.12 0e+00 21.35 3.07 3.07 0e+00
262144 32768 float sum 32.36 4.05 4.05 0e+00 31.98 4.10 4.10 0e+00
524288 65536 float sum 39.63 6.61 6.61 0e+00 39.76 6.59 6.59 0e+00
1048576 131072 float sum 57.11 9.18 9.18 0e+00 56.88 9.22 9.22 0e+00
2097152 262144 float sum 92.96 11.28 11.28 0e+00 92.54 11.33 11.33 0e+00
4194304 524288 float sum 166.4 12.60 12.60 0e+00 165.9 12.64 12.64 0e+00
8388608 1048576 float sum 308.5 13.59 13.59 0e+00 504.4 8.32 8.32 0e+00
16777216 2097152 float sum 1050.1 7.99 7.99 0e+00 693.5 12.10 12.10 0e+00
33554432 4194304 float sum 1533.4 10.94 10.94 0e+00 1414.8 11.86 11.86 0e+00
67108864 8388608 float sum 2529.2 13.27 13.27 0e+00 2314.2 14.50 14.50 0e+00
134217728 16777216 float sum 5619.2 11.94 11.94 0e+00 4905.4 13.68 13.68 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 4.44552
I first submitted this bug to TensorFlow; here is the link: tensorflow/tensorflow#40027
It shows that when I remove the NVLink bridge, the TF code runs well,
and when I use the NVLink bridge but not NCCL, the TF code runs well too.
But when I use NCCL and the NVLink bridge, the system halts, forcing me to reboot.
When compiling in a container without CUDA 11...
Step 11/28 : RUN make MPI=1 MPI_HOME=/usr/local/mpi
---> Running in bc5f12a8f538
make -C src build
make[1]: Entering directory '/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
Compiling all_gather.cu > ../build/all_gather.o
Linking ../build/all_gather.o > ../build/all_gather_perf
Compiling broadcast.cu > ../build/broadcast.o
Linking ../build/broadcast.o > ../build/broadcast_perf
Compiling reduce_scatter.cu > ../build/reduce_scatter.o
Linking ../build/reduce_scatter.o > ../build/reduce_scatter_perf
Compiling reduce.cu > ../build/reduce.o
Linking ../build/reduce.o > ../build/reduce_perf
Compiling alltoall.cu > ../build/alltoall.o
alltoall.cu(69): error: identifier "ncclSend" is undefined
alltoall.cu(70): error: identifier "ncclRecv" is undefined
2 errors detected in the compilation of "/tmp/tmpxft_00000461_00000000-10_alltoall.compute_70.cpp1.ii".
Makefile:71: recipe for target '../build/alltoall.o' failed
make[1]: *** [../build/alltoall.o] Error 1
make[1]: Leaving directory '/nccl-tests/src'
make: *** [src.build] Error 2
Makefile:17: recipe for target 'src.build' failed
The command '/bin/sh -c make MPI=1 MPI_HOME=/usr/local/mpi' returned a non-zero code: 2
Is it expected that we won't be able to compile this code without CUDA 11?
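For context: alltoall.cu uses the point-to-point API (ncclSend/ncclRecv) that first appeared in NCCL 2.7, so the failure tracks the NCCL headers available in the container rather than CUDA 11 as such. A sketch of the pattern that cannot compile against older nccl.h (buffer names here are illustrative):

#include <nccl.h>
#include <cuda_runtime.h>

// One chunk sent to and received from every peer. ncclSend/ncclRecv
// do not exist before NCCL 2.7, hence the 'identifier is undefined'
// errors when building against an older nccl.h.
ncclResult_t allToAllSketch(const float* sendbuf, float* recvbuf,
                            size_t chunk, int nranks,
                            ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < nranks; peer++) {
    ncclSend(sendbuf + peer * chunk, chunk, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * chunk, chunk, ncclFloat, peer, comm, stream);
  }
  return ncclGroupEnd();
}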
Hello everyone! I have two questions regarding nccl-tests:
1. Why is a shift made in common.cu:375?
2. Is it enough to check the correctness of NCCL with nccl-tests when the number of iters and the number of warm-up iters are set to 0?
Best,
I've compiled nccl, then tried with the following command
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
Then I see the following error. What's the problem?
[kyoungrok-ryzen:12576] *** Process received signal ***
[kyoungrok-ryzen:12576] Signal: Segmentation fault (11)
[kyoungrok-ryzen:12576] Signal code: Address not mapped (1)
[kyoungrok-ryzen:12576] Failing at address: 0x44000098
[kyoungrok-ryzen:12576] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f339f39c890]
[kyoungrok-ryzen:12576] [ 1] /usr/lib/x86_64-linux-gnu/libmpi.so.20(MPI_Comm_size+0x42)[0x7f33a4d353b2]
[kyoungrok-ryzen:12576] [ 2] ./build/all_reduce_perf[0x402101]
[kyoungrok-ryzen:12576] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f339e010b97]
[kyoungrok-ryzen:12576] [ 4] ./build/all_reduce_perf[0x40398a]
[kyoungrok-ryzen:12576] *** End of error message ***
[1] 12576 segmentation fault (core dumped) ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
OS Platform: ubuntu 16.04
Cuda: 8.0
NCCL: 1.0
Driver Version: 384.66
Build
root@node03:~/nvidia/nccl/nccl-tests/src# make CUDA_HOME=/usr/local/cuda-8.0 NCCL_HOME=/root/nvidia/nccl/nccl-1.3.4-1/build
When I run all_reduce_perf on 8 GPUs, it fails with an error:
root@node03:~/nvidia/nccl/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 1.0
# Using devices
# Rank 0 on node03 device 0 [0x00] Tesla P100-PCIE-16GB
# Rank 1 on node03 device 1 [0x00] Tesla P100-PCIE-16GB
# Rank 2 on node03 device 2 [0x00] Tesla P100-PCIE-16GB
# Rank 3 on node03 device 3 [0x00] Tesla P100-PCIE-16GB
# Rank 4 on node03 device 4 [0x00] Tesla P100-PCIE-16GB
# Rank 5 on node03 device 5 [0x00] Tesla P100-PCIE-16GB
# Rank 6 on node03 device 6 [0x00] Tesla P100-PCIE-16GB
# Rank 7 on node03 device 7 [0x00] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
NCCL failure all_reduce.cu:95 'unhandled cuda error'
Hi,
I am trying to run the distributed tests with MPI. I am using OpenMPI.
First problem: I am able to run across nodes but not within a node using MPI. Is this expected behavior?
Second problem: Across nodes, I run with -n 2 but I see Rank 0, 1, and 2, so my question is: why are there 3 processes when I am launching only 2?
The nodes have 1 K80 each, so we should be able to use 2 CUDA devices within a node.
Any thoughts?
Within a Node:
mpirun -n 2 -npernode 2 -hostfile ./hosts -x LD_LIBRARY_PATH=/opt/nccl2/lib:$LD_LIBRARY_PATH ./build/all_reduce_perf -g 2 -c 0
nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
Cuda failure common.cu:891 'invalid device ordinal'
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[61045,1],1]
Exit code: 1
--------------------------------------------------------------------------
Across Nodes:
mpirun -n 2 -npernode 1 -hostfile ./hosts -x LD_LIBRARY_PATH=/opt/nccl2/lib:$LD_LIBRARY_PATH ./build/all_reduce_perf -g 2 -c 0
nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
# Using devices
# Rank 0 on gpu07 device 0 [0x05] Tesla K80
# Rank 1 on gpu07 device 1 [0x06] Tesla K80
# Rank 2 on gpu08 device 0 [0x05] Tesla K80
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 3 on gpu08 device 1 [0x06] Tesla K80
33554432 8388608 float sum 8.642 3.88 5.82 N/A 8.153 4.12 6.17 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 5.99874
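A detail that likely explains the rank count above: nccl-tests creates one NCCL rank per GPU per thread per process, so the totals multiply rather than matching -n alone. A tiny illustration of the arithmetic (following the nProcs x nThreads x nGpus layout the test banner prints):

// Total NCCL ranks in an nccl-tests run:
// ranks = (MPI processes) * (-t threads) * (-g GPUs per thread).
// e.g. mpirun -n 2 ... -g 2  ->  2 * 1 * 2 = 4 ranks (0..3 above).
int totalRanks(int nProcs, int nThreads, int nGpus) {
  return nProcs * nThreads * nGpus;
}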
I see that NCCL has released send recv API. When will nccl-test include send/recv benchmarking code? Thanks!
I was trying this code using 2 nodes, each with 8 GPUs. The code ran well on a single node, but when I tried two nodes, the code either hangs or runs indefinitely. I tried three different cases, as follows, but none were successful.
nThread 1 nGpus 1 minBytes 33554432 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
Cuda failure common.cu:891 'invalid device ordinal'
nThread 1 nGpus 8 minBytes 33554432 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
nThread 8 nGpus 1 minBytes 33554432 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
I used SLURM to run the code using the following. In all the above cases, I set ntasks equal to np.
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --gres=gpu:8
I set the following in the sbatch script.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/xyz/nccl/build/lib/
export LD_LIBRARY_PATH=/home/xyz/openmpi-2.0.1-sm-gcc48-cuda-8.0-slurm-14.11.7/lib:/home/xyz/cuda-8.0.61/lib64:$LD_LIBRARY_PATH
export PATH=/home/xyz/openmpi-2.0.1-sm-gcc48-cuda-8.0-slurm-14.11.7/bin:/home/xyz/cuda-8.0.61/bin:$PATH
I used openmpi-2.0.1 and cuda-8.0.61 during compilation and at run time. NCCL 2.0 was used. Thanks for your help.
Today I ran all_reduce_perf with Open MPI and GPUDirect RDMA support. I have two GPU servers; each server has 4 NVIDIA V100 GPUs and a Mellanox MT27800 NIC, so the hardware certainly supports the GPUDirect RDMA technique. The error message is:
gpu5:35917:35994 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu5:35917:35994 [0] NCCL INFO include/net.h:34 -> 2
gpu5:35917:35994 [0] NCCL INFO transport/net.cu:537 -> 2
gpu5:35917:35994 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu4:35902:35977 [0] transport/net_ib.cu:788 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
gpu4:35902:35977 [0] NCCL INFO include/net.h:34 -> 2
gpu4:35902:35977 [0] NCCL INFO transport/net.cu:537 -> 2
gpu4:35902:35977 [0] NCCL INFO transport.cu:163 -> 2 [Proxy Thread]
gpu5:35917:35917 [0] NCCL INFO Destroyed comm 0x7f24c4001af0 rank 4
gpu4:35902:35902 [0] NCCL INFO Destroyed comm 0x7f1de8001af0 rank 0
gpu5:35917:35917 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu5:35917:35917 [1] NCCL INFO init.cu:1143 -> 1
gpu5: Test NCCL failure common.cu:343 'unhandled cuda error'
.. gpu5: Test failure common.cu:393
.. gpu5: Test failure common.cu:492
.. gpu5: Test failure all_reduce.cu:103
.. gpu5: Test failure common.cu:518
.. gpu5: Test failure common.cu:839
gpu4:35902:35902 [1] init.cu:194 NCCL WARN Cuda failure 'an illegal memory access was encountered'
gpu4:35902:35902 [1] NCCL INFO init.cu:1143 -> 1
gpu4: Test NCCL failure common.cu:343 'unhandled cuda error'
.. gpu4: Test failure common.cu:393
.. gpu4: Test failure common.cu:492
.. gpu4: Test failure all_reduce.cu:103
.. gpu4: Test failure common.cu:518
.. gpu4: Test failure common.cu:839
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[36982,1],1]
Exit code: 3
--------------------------------------------------------------------------
I checked line 788 in transport/net_ib.cu; it seems that the return value of ibv_poll_cq is bad.
command to run all_reduce_perf:
/usr/local/openmpi/bin/mpirun -np 2 -host 10.0.21.2:1,10.0.23.4:1 \
-x NCCL_SOCKET_IFNAME=rdma0 -x NCCL_DEBUG=INFO \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
-mca orte_base_help_aggregate 0 \
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0
I have searched for this error on Google and few people have met this problem, so could you help me solve it?
environment:
GPU: V100-SXM2-32GB x 8 + NVLink
CUDA driver version: 418.87.01
CUDA version: 10.0
NCCL version: 2.4.2, 2.5.6
test case:
all_gather_perf -g 8 -b 4294967296 -e 4294967296 -n 100
out-of-place in-place
size count type time algbw busbw error time algbw busbw error
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
4294967296 134217728 float 29122 129.05 129.05 1e+00 29169 128.84 128.84 1e+00
Out of bounds values : 2 FAILED
Avg bus bandwidth : 128.942
#nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X NV1 NV1 NV2 SYS SYS NV2 SYS 0-95
GPU1 NV1 X NV2 NV1 SYS SYS SYS NV2 0-95
GPU2 NV1 NV2 X NV2 NV1 SYS SYS SYS 0-95
GPU3 NV2 NV1 NV2 X SYS NV1 SYS SYS 0-95
GPU4 SYS SYS NV1 SYS X NV2 NV1 NV2 0-95
GPU5 SYS SYS SYS NV1 NV2 X NV2 NV1 0-95
GPU6 NV2 SYS SYS SYS NV1 NV2 X NV1 0-95
GPU7 SYS NV2 SYS SYS NV2 NV1 NV1 X 0-95
# make
make -C src build
make[1]: Entering directory '/root/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
In file included from all_reduce.cu:8:0:
common.h:9:18: fatal error: nccl.h: No such file or directory
 #include "nccl.h"
          ^
compilation terminated.
make[1]: *** [../build/all_reduce.o] Error 1
make[1]: Leaving directory '/root/nccl-tests/src'
make: *** [src.build] Error 2
make MPI=1 MPI_HOME=/nasa/hpcx/2.5.0/ompi-pgi PREFIX=/nobackupp16/swbuild/dkokron/cuda/nccl-tests/install CUDA_HOME=/nasa/cuda/10.2 NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70" NCCL_HOME=/nobackupp16/swbuild/dkokron/cuda/nccl/build
make -C src build
make[1]: Entering directory '/nobackupp16/swbuild/dkokron/cuda/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
/nasa/pgi/19.10/linux86-64-llvm/19.10/include/edg/xmmintrin.h(2514): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r440/TC440_70/drivers/compiler/edg/EDG_5.0/src/sys_predef.c", line 574
1 catastrophic error detected in the compilation of "/tmp/pbs.8178378.pbspl1.nas.nasa.gov/tmpxft_00005bae_00000000-4_all_reduce.cpp4.ii".
Compilation aborted.
nvcc error : 'cudafe++' died due to signal 6
On CentOS 7, openmpi-devel has libmpi.so at /usr/lib64/openmpi/lib/libmpi.so, and mpi.h resides at /usr/include/openmpi-x86_64/mpi.h.
Hello. It hangs when I try to test MPI on a Ryzen 2700X system. Please refer to the comment below.
On two different GPU clusters, nv_peer_mem + NCCL2 failed to pass the NCCL sanity tests.
MVAPICH2-GDR + gdrcopy passed the tests with the same HW/SW.
This is a mirror issue of https://github.com/Mellanox/nv_peer_memory/issues/38, and related to #7.
Can anyone help?
Thanks.
MPI: OpenMPI 1.8.8/2.1.3/3.0.1
CUDA lib: CUDA 8.0/9.0/9.1
NCCL lib: NCCL 2.0.5/2.1.15
GDR lib: nv_peer_memory master
OFED: MLNX_OFED_LINUX-4.2-1
OS: Ubuntu1604/CentOS7.4
GPU: Kepler K80/Pascal P100
Server: Supermicro 4028-TR/4028-TR2
Topo interconnect: PIX
Driver Version: 390.30
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1
[15:37:58](root):~ # /root/mpi/cuda-9.0/ompi3-cuda/bin/mpirun -v --allow-run-as-root \
-x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=1 \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1 \
-x LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/lib:/usr/local/cuda-9.0/lib64 \
-mca btl_openib_if_include mlx5_3:1 \
-np 2 -host clx-mld-45,clx-mld-46 -pernode --oversubscribe \
/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/ompi3tests/all_reduce_perf -b 9 -e 4M -g 1 -c 1 -z 0
nThread 1 nGpus 1 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on clx-mld-45 device 0 [0x04] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 1 on clx-mld-46 device 0 [0x04] Tesla P100-PCIE-16GB
8 2 float sum 0.144 0.00 0.00 0e+00 0.015 0.00 0.00 0e+00
1048584 262146 float sum 0.212 4.95 4.95 2e+00 0.209 5.02 5.02 2e+00
2097160 524290 float sum 0.379 5.53 5.53 2e+00 0.379 5.53 5.53 2e+00
3145736 786434 float sum 0.549 5.73 5.73 2e+00 0.548 5.74 5.74 2e+00
Out of bounds values : 24 FAILED
Avg bus bandwidth : 4.06216
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[2940,1],0]
Exit code: 1
--------------------------------------------------------------------------
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1
[15:50:24](root):~/mpi # /root/mpi/cuda-9.0/ompi3-cuda/bin/mpirun -v --allow-run-as-root \
-x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=1 \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1 \
-x LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/lib:/usr/local/cuda-9.0/lib64 \
-mca btl_openib_if_include mlx5_3:1 \
-np 2 -host clx-mld-45,clx-mld-46 -pernode --oversubscribe \
/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/ompi3tests/all_reduce_perf -b 9 -e 4M -g 1 -c 1 -z 0
nThread 1 nGpus 1 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on clx-mld-45 device 0 [0x04] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 1 on clx-mld-46 device 0 [0x04] Tesla P100-PCIE-16GB
8 2 float sum 0.087 0.00 0.00 0e+00 0.018 0.00 0.00 0e+00
1048584 262146 float sum 0.396 2.65 2.65 0e+00 0.394 2.66 2.66 0e+00
2097160 524290 float sum 0.772 2.72 2.72 0e+00 25.292 0.08 0.08 0e+00
3145736 786434 float sum 27.539 0.11 0.11 0e+00 69.042 0.05 0.05 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 1.03398
cd /root/mpi/cuda-8.0/ompi3.0.1 && \
rm -fr /root/mpi/cuda-8.0/ompi3.0.1/* && git checkout v3.0.1 && git reset --hard && \
./autogen.pl && \
CC=/usr/bin/gcc CXX=/usr/bin/g++ FC=/usr/bin/gfortran ./configure --with-verbs --with-cuda=/usr/local/cuda-8.0 --prefix=/root/mpi/cuda-8.0/ompi3-cuda && \
time make -j $(nproc) install
cd /root/mpi/cuda-9.1/git/nccl-tests && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi1-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi1tests -j $(nproc) && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi2-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi2tests -j $(nproc) && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi3-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi3tests -j $(nproc)
-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1
[16:06:47](root):~/mpi # /opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/bin/mpirun \
-genv LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.0.5-3+cuda9.0_amd64/lib:/usr/local/cuda-9.0/lib64:/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/lib64 \
-genv MV2_GPUDIRECT_GDRCOPY_LIB=/root/mpi/cuda-9.0/gdr/lib64/libgdrapi.so \
-genv GDRCOPY_ENABLE_LOGGING=1 -genv GDRCOPY_LOG_LEVEL=5 -genv MV2_USE_GPUDIRECT=1 \
-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1 -genv NCCL_DEBUG=0 -genv NCCL_SOCKET_IFNAME=enp5s0f0 \
-np 2 -host clx-mld-45,clx-mld-46 /root/mpi/cuda-9.0/nccl_2.0.5-3+cuda9.0_amd64/mvapich2tests/all_reduce_perf -b 9 -e 4M -g 4 -c 1 -z 0
nThread 1 nGpus 4 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
# NCCL Tests compiled with NCCL 2.0
# Using devices
# Rank 0 on clx-mld-45 device 0 [0x04] Tesla P100-PCIE-16GB
# Rank 1 on clx-mld-45 device 1 [0x06] Tesla P100-PCIE-16GB
# Rank 2 on clx-mld-45 device 2 [0x07] Tesla P100-PCIE-16GB
# Rank 3 on clx-mld-45 device 3 [0x08] Tesla P100-PCIE-16GB
# Rank 4 on clx-mld-46 device 0 [0x04] Tesla P100-PCIE-16GB
# Rank 5 on clx-mld-46 device 1 [0x06] Tesla P100-PCIE-16GB
# Rank 6 on clx-mld-46 device 2 [0x07] Tesla P100-PCIE-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
# Rank 7 on clx-mld-46 device 3 [0x08] Tesla P100-PCIE-16GB
8 2 float sum 0.149 0.00 0.00 0e+00 0.151 0.00 0.00 0e+00
1048584 262146 float sum 0.308 3.41 5.96 1e-06 0.304 3.45 6.04 1e-06
2097160 524290 float sum 0.491 4.27 7.48 1e-06 0.486 4.32 7.56 1e-06
3145736 786434 float sum 0.678 4.64 8.12 1e-06 0.678 4.64 8.12 1e-06
Out of bounds values : 0 OK
Avg bus bandwidth : 5.40981
cd /root/mpi/cuda-8.0/git/gdrcopy && \
make PREFIX=/root/mpi/cuda-8.0/gdr CUDA=/usr/local/cuda-8.0 -j $(nproc) all install
cd /root/mpi/cuda-9.0/git/gdrcopy && \
make PREFIX=/root/mpi/cuda-9.0/gdr CUDA=/usr/local/cuda-9.0 -j $(nproc) all install
cd /root/mpi/cuda-9.0/git/nccl-tests && \
make MPI=1 NCCL_HOME=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64 CUDA_HOME=/usr/local/cuda-9.0 MPI_HOME=/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5 LIBRARY_PATH=/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/lib64 DST_DIR=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/mvapich2tests -j $(nproc)
Hi, when I use the command "mpirun -v --allow-run-as-root -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_tcp_if_include ib0 -np 2 -host 192.168.1.14,192.168.1.13 -pernode ./build/all_reduce_perf -b 9 -e 32M -g 8 -c 1 -z 0", the result shows "Out of bounds values : 248 FAILED". I have no idea about this error, so I would be very grateful for any information you can give me about it.
The result:
WARNING: There are more than one active ports on host 'seedsmed-f02hpc02', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[62619,1],1] (PID 3535)
[seedsmed-f02hpc01:03622] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[seedsmed-f02hpc01:03622] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[seedsmed-f02hpc01:03622] 1 more process has sent help message help-opal-runtime.txt / opal_init:warn-fork
8 2 float sum 0.445 0.00 0.00 1e-06 0.106 0.00 0.00 0e+00
1048584 262146 float sum 0.624 1.68 3.15 1e+01 0.624 1.68 3.15 1e+01
2097160 524290 float sum 1.026 2.04 3.83 1e+01 1.019 2.06 3.86 1e+01
3145736 786434 float sum 1.359 2.31 4.34 1e+01 1.358 2.32 4.34 1e+01
4194312 1048578 float sum 1.732 2.42 4.54 1e+01 1.728 2.43 4.55 1e+01
5242888 1310722 float sum 2.070 2.53 4.75 1e+01 2.078 2.52 4.73 1e+01
6291464 1572866 float sum 2.447 2.57 4.82 1e+01 2.445 2.57 4.82 1e+01
7340040 1835010 float sum 2.794 2.63 4.93 1e+01 2.798 2.62 4.92 1e+01
8388616 2097154 float sum 3.165 2.65 4.97 1e+01 3.168 2.65 4.97 1e+01
9437192 2359298 float sum 3.529 2.67 5.01 1e+01 3.526 2.68 5.02 1e+01
10485768 2621442 float sum 3.918 2.68 5.02 1e+01 3.938 2.66 4.99 1e+01
11534344 2883586 float sum 4.308 2.68 5.02 1e+01 4.308 2.68 5.02 1e+01
12582920 3145730 float sum 4.682 2.69 5.04 1e+01 4.718 2.67 5.00 1e+01
13631496 3407874 float sum 5.051 2.70 5.06 1e+01 5.065 2.69 5.05 1e+01
14680072 3670018 float sum 5.455 2.69 5.05 1e+01 5.462 2.69 5.04 1e+01
15728648 3932162 float sum 5.864 2.68 5.03 1e+01 5.841 2.69 5.05 1e+01
16777224 4194306 float sum 6.239 2.69 5.04 1e+01 6.228 2.69 5.05 1e+01
17825800 4456450 float sum 6.612 2.70 5.06 1e+01 6.631 2.69 5.04 1e+01
18874376 4718594 float sum 7.052 2.68 5.02 1e+01 7.022 2.69 5.04 1e+01
19922952 4980738 float sum 7.419 2.69 5.03 1e+01 7.420 2.69 5.03 1e+01
20971528 5242882 float sum 7.826 2.68 5.02 1e+01 7.811 2.68 5.03 1e+01
22020104 5505026 float sum 8.214 2.68 5.03 1e+01 8.182 2.69 5.05 1e+01
23068680 5767170 float sum 8.629 2.67 5.01 1e+01 8.590 2.69 5.04 1e+01
24117256 6029314 float sum 9.007 2.68 5.02 1e+01 8.979 2.69 5.04 1e+01
25165832 6291458 float sum 9.368 2.69 5.04 1e+01 9.441 2.67 5.00 1e+01
26214408 6553602 float sum 9.750 2.69 5.04 1e+01 9.749 2.69 5.04 1e+01
27262984 6815746 float sum 10.181 2.68 5.02 1e+01 10.150 2.69 5.04 1e+01
28311560 7077890 float sum 10.546 2.68 5.03 1e+01 10.535 2.69 5.04 1e+01
29360136 7340034 float sum 10.940 2.68 5.03 1e+01 10.944 2.68 5.03 1e+01
30408712 7602178 float sum 11.317 2.69 5.04 1e+01 11.296 2.69 5.05 1e+01
31457288 7864322 float sum 11.695 2.69 5.04 1e+01 11.688 2.69 5.05 1e+01
32505864 8126466 float sum 12.052 2.70 5.06 1e+01 12.092 2.69 5.04 1e+01
Out of bounds values : 248 FAILED
Avg bus bandwidth : 4.72189
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
I use NCCL 1.3 and CUDA 9.2 and followed the instructions from the README, but I cannot run the test. Is there any additional advice someone can give me?
$ make MPI=1 MPI_HOME=/usr/local/mvapich2 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/cuda/3rdparty/nccl1 -j64
make -C src build
make[1]: Entering directory '/home/somebody/source/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Compiling all_gather.cu > ../build/all_gather.o
Compiling broadcast.cu > ../build/broadcast.o
Compiling reduce_scatter.cu > ../build/reduce_scatter.o
Compiling reduce.cu > ../build/reduce.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
Linking ../build/all_gather.o > ../build/all_gather_perf
Linking ../build/broadcast.o > ../build/broadcast_perf
Linking ../build/reduce_scatter.o > ../build/reduce_scatter_perf
Linking ../build/reduce.o > ../build/reduce_perf
$ export LD_LIBRARY_PATH=/usr/local/mvapich2/lib:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/3rdparty/nccl1/lib
$ NCCL_DEBUG=WARN ./build/all_reduce_perf -b 8 -e 128M -f2 -g4
nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL version 1.3.5 compiled with CUDA 9.2
# NCCL Tests compiled with NCCL 1.0
# Using devices
# Rank 0 on dell-gpu142 device 0 [0x1a] Tesla V100-SXM2-16GB
# Rank 1 on dell-gpu142 device 1 [0x1c] Tesla V100-SXM2-16GB
# Rank 2 on dell-gpu142 device 2 [0x1d] Tesla V100-SXM2-16GB
# Rank 3 on dell-gpu142 device 3 [0x1e] Tesla V100-SXM2-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
WARN src/all_reduce.cu:212 Cuda failure 'invalid resource handle'
NCCL failure all_reduce.cu:95 'unhandled cuda error'
Is there a special way of building the tests if I want to run them over Ethernet? I am trying to run them on the AWS 100 Gbps network and getting the following error:
ubuntu@ip-172-31-15-234:~$ /usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 2 -N 1 -x LD_LIBRARY_PATH=~/nccl/nccl-2.3.7/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.3.7/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
--------------------------------------------------------------------------
[[33448,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: ip-172-31-3-83
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[ip-172-31-15-234:73547] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-172-31-15-234:73547] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I compiled with MPI=1, but it hangs when I run the following:
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 2 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# hang here
I see that the process is busy all the time from top:
12405 ubuntu 20 0 540652 18376 10436 R 100.5 0.0 0:04.27 all_reduce_perf
12404 ubuntu 20 0 540652 18292 10352 R 100.0 0.0 0:04.27 all_reduce_perf
It does not hang if I run:
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 12275 on ip-172-32-45-72 device 0 [0x00] Tesla V100-SXM2-32GB
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 4.43 0.00 0.00 0e+00 0.31 0.03 0.00 0e+00
16 4 float sum 4.34 0.00 0.00 0e+00 0.31 0.05 0.00 0e+00
32 8 float sum 4.30 0.01 0.00 0e+00 0.30 0.11 0.00 0e+00
64 16 float sum 4.36 0.01 0.00 0e+00 0.30 0.21 0.00 0e+00
128 32 float sum 4.27 0.03 0.00 0e+00 0.30 0.42 0.00 0e+00
256 64 float sum 4.26 0.06 0.00 0e+00 0.30 0.85 0.00 0e+00
512 128 float sum 4.36 0.12 0.00 0e+00 0.30 1.70 0.00 0e+00
1024 256 float sum 4.33 0.24 0.00 0e+00 0.30 3.38 0.00 0e+00
2048 512 float sum 4.32 0.47 0.00 0e+00 0.31 6.66 0.00 0e+00
4096 1024 float sum 4.27 0.96 0.00 0e+00 0.30 13.47 0.00 0e+00
8192 2048 float sum 4.36 1.88 0.00 0e+00 0.30 26.96 0.00 0e+00
16384 4096 float sum 4.30 3.81 0.00 0e+00 0.30 53.84 0.00 0e+00
32768 8192 float sum 4.27 7.67 0.00 0e+00 0.30 108.23 0.00 0e+00
65536 16384 float sum 4.36 15.04 0.00 0e+00 0.30 215.47 0.00 0e+00
131072 32768 float sum 4.35 30.13 0.00 0e+00 0.30 430.24 0.00 0e+00
262144 65536 float sum 4.37 59.96 0.00 0e+00 0.30 864.73 0.00 0e+00
524288 131072 float sum 4.34 120.92 0.00 0e+00 0.30 1728.04 0.00 0e+00
1048576 262144 float sum 6.10 171.98 0.00 0e+00 0.30 3444.73 0.00 0e+00
2097152 524288 float sum 8.78 238.96 0.00 0e+00 0.31 6866.90 0.00 0e+00
4194304 1048576 float sum 14.64 286.48 0.00 0e+00 0.30 13751.82 0.00 0e+00
8388608 2097152 float sum 25.98 322.90 0.00 0e+00 0.30 27786.05 0.00 0e+00
16777216 4194304 float sum 47.48 353.35 0.00 0e+00 0.30 55942.70 0.00 0e+00
33554432 8388608 float sum 90.58 370.45 0.00 0e+00 0.30 111107.39 0.00 0e+00
67108864 16777216 float sum 176.7 379.73 0.00 0e+00 0.30 221517.95 0.00 0e+00
134217728 33554432 float sum 349.2 384.30 0.00 0e+00 0.30 443841.69 0.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
I am able to ssh to localhost. How do I fix it?
The tests rely on MPI to work with multiple processes. Must multiple processes rely on MPI? Can NCCL2 work with RPC when using two or more machines?
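For what it's worth, only the test harness depends on MPI; NCCL itself just needs some out-of-band channel to distribute one ncclUniqueId before the ranks join a communicator. A hedged sketch of that bootstrap, where exchangeId is a placeholder you could back with MPI, an RPC framework, or even a shared file:

#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical out-of-band exchange: rank 0 generates the id and every
// other rank receives it over whatever transport is available.
extern void exchangeId(ncclUniqueId* id, int rank);

ncclComm_t initComm(int rank, int nranks, int device) {
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);  // created once per job
  exchangeId(&id, rank);                // placeholder transport (RPC, file, MPI...)
  cudaSetDevice(device);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}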
I tested nccl-tests in the nvidia-pytorch docker image (downloaded from the NVIDIA cloud). The compilation went through. However, it failed when I tested "./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8".
The error message is:
"nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL failure common.cu:874 'unhandled system error'"
It seems that the problem is caused by "NCCLCHECK(ncclGetUniqueId(&ncclId));"
Thanks for the help.
Apologies if this is not the correct place to post this.
Sat in on the excellent Connect with the Experts NCCL session this morning. I may not have heard correctly but I was under the impression there was a page somewhere with expected performance numbers for comparison. What I thought I heard was to run the Perf tests from this github link and compare with the numbers at ????? "to check your hardware is working at the speed it should be". I was expecting to find some table with expected numbers for typical platforms but couldn't immediately find anything. Did I misunderstand something?
When I run the following command:
make MPI=1 MPI_HOME=/usr/include/mpi NCCL_HOME=/usr/local
I get the following error:
common.h:14:17: fatal error: mpi.h: No such file or directory
# include "mpi.h"
However, I am able to verify that mpi.h is in that directory.
Hi, I'm trying to build the NCCL benchmarks and I want to run some tests on multiple nodes, but the present tests don't seem to support that. How can I change that?
I'm trying to build nccl-tests + NCCL 2.7 and for some reason it ends up linking against libmpi.so.12, whereas NCCL 2.6 gets linked to libmpi.so.40. My build procedure is here, and I'm running it in an identical way, except for specifying a different checkout command for NCCL.
Any suggestions on how to debug this? What is bringing in libmpi.so.12?
NCCL version 2.6
[ec2-user@ip-172-31-24-194 ~]$ ldd $FOLDER_ROOT/nccl-tests/build/all_reduce_perf
linux-vdso.so.1 => (0x00007ffc9a35f000)
libcudart.so.10.0 => /usr/local/cuda-10.0/lib64/libcudart.so.10.0 (0x00007f6ba3f49000)
librt.so.1 => /lib64/librt.so.1 (0x00007f6ba3d41000)
libmpi.so.40 => /opt/amazon/efa/lib64/libmpi.so.40 (0x00007f6ba3a46000)
libcurand.so.10.0 => /usr/local/cuda-10.0/lib64/libcurand.so.10.0 (0x00007f6b9f8df000)
libnccl.so.2 => /home/ec2-user/nccl/nccl-2.4.6/nccl/build/lib/libnccl.so.2 (0x00007f6b9ae6a000)
libnvToolsExt.so.1 => /usr/local/cuda-10.0/lib64/libnvToolsExt.so.1 (0x00007f6b9ac61000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6b9aa45000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f6b9a841000)
libstdc++.so.6 => /home/ec2-user/anaconda3/lib/libstdc++.so.6 (0x00007f6b9a507000)
libm.so.6 => /lib64/libm.so.6 (0x00007f6b9a205000)
libgcc_s.so.1 => /home/ec2-user/anaconda3/lib/libgcc_s.so.1 (0x00007f6ba43bf000)
libc.so.6 => /lib64/libc.so.6 (0x00007f6b99e38000)
libopen-rte.so.40 => /opt/amazon/efa/lib64/libopen-rte.so.40 (0x00007f6b99b85000)
libopen-pal.so.40 => /opt/amazon/efa/lib64/libopen-pal.so.40 (0x00007f6b9987f000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f6b9967c000)
libz.so.1 => /home/ec2-user/anaconda3/lib/libz.so.1 (0x00007f6b99465000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6ba41c3000)
NCCL version 2.7
[ec2-user@ip-172-31-24-194 ~]$ ldd $FOLDER_ROOT/nccl-tests/build/all_reduce_perf
linux-vdso.so.1 => (0x00007ffcf057d000)
libcudart.so.10.0 => /usr/local/cuda-10.0/lib64/libcudart.so.10.0 (0x00007f119bc4b000)
librt.so.1 => /lib64/librt.so.1 (0x00007f119ba43000)
libmpi.so.12 => not found
libcurand.so.10.0 => /usr/local/cuda-10.0/lib64/libcurand.so.10.0 (0x00007f11978dc000)
libnccl.so.2 => /home/ec2-user/nccl/nccl-2.4.6/nccl/build/lib/libnccl.so.2 (0x00007f1192e67000)
libnvToolsExt.so.1 => /usr/local/cuda-10.0/lib64/libnvToolsExt.so.1 (0x00007f1192c5e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f1192a42000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f119283e000)
libstdc++.so.6 => /home/ec2-user/anaconda3/lib/libstdc++.so.6 (0x00007f1192504000)
libm.so.6 => /lib64/libm.so.6 (0x00007f1192202000)
libgcc_s.so.1 => /home/ec2-user/anaconda3/lib/libgcc_s.so.1 (0x00007f119c0c1000)
libc.so.6 => /lib64/libc.so.6 (0x00007f1191e35000)
/lib64/ld-linux-x86-64.so.2 (0x00007f119bec5000)
[ec2-user@ip-172-31-24-194 ~]$ export FOLDER_ROOT=/home/ec2-user/nccl/nccl-2.4.6
After a build, build/ will show up in git:
$ git status
HEAD detached at c864b73
Untracked files:
(use "git add <file>..." to include in what will be committed)
build/
nothing added to commit but untracked files present (use "git add" to track)
This is particularly annoying when using nccl-tests as a submodule, as the submodule will be marked as "modified" in parent repo:
modified: nccl-tests (untracked content)
I have installed CUDA 8.0.61, cuDNN 5.1.10, and NCCL 2.1.15 on Ubuntu 14.04. I have successfully verified CUDA and cuDNN using the official examples.
However, I run into errors using nccl-tests:
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
NCCL failure common.cu:908 'unhandled cuda error'.
I have tried installing NCCL2 both locally and using the network repo, but both ways failed.
Can anyone help?
[barajasc@gpunode102 nccl-tests]$ make
make -C src build
make[1]: Entering directory `/users/barajasc/tests/nccl/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
ld: ../build/common.o: in function `threadRunTests(threadArgs*)':
/users/barajasc/tests/nccl/nccl-tests/src/common.cu:520: undefined reference to `ncclTestEngine'
ld: /users/barajasc/tests/nccl/nccl-tests/src/common.cu:520: undefined reference to `ncclTestEngine'
ld: ../build/common.o: in function `run()':
/users/barajasc/tests/nccl/nccl-tests/src/common.cu:763: undefined reference to `ncclTestEngine'
make[1]: *** [../build/all_reduce_perf] Error 1
make[1]: Leaving directory `/users/barajasc/tests/nccl/nccl-tests/src'
make: *** [src.build] Error 2
I have both CUDA_HOME and NCCL_HOME set. I'm currently using CUDA 10.1.243 and NCCL 2.7.6. I've tried both icc 2019a and gcc 8.2.0, and it just will not compile. I cannot seem to find any other information related to this. I've tried both GNU Make 3.0 and 3.4 just in case, but nothing worked.
I'm hoping it is something simple, but I cannot imagine what. I should also say that I compiled NCCL from source and had no errors at compile time.
I ran the test after installing NCCL from RPM:
sudo yum install libnccl-2.4.8-1+cuda10.0 libnccl-devel-2.4.8-1+cuda10.0 libnccl-static-2.4.8-1+cuda10.0
The error does not tell me anything I understand:
nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
Using devices
458398: Test CUDA failure common.cu:730 'no CUDA-capable device is detected'
The machine I run on contains 8 Tesla T4 GPUs.
Could you help me solve this?
I'm seeing the same problem as others, on two Amazon EC2 p3.2xlarge instances with V100 GPUs. Here is the command I have:
mpirun -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME -np 2 -mca btl ^openib --hostfile hosts all_reduce_perf -b 1K -e 100M -c 0
where the hosts file is simply:
ip1 max_slots=1
ip2 max_slots=1
and NCCL_SOCKET_IFNAME=ens3
The symptom: this is all I see:
nThread 1 nGpus 1 minBytes 1024 maxBytes 104857600 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
Nothing else gets printed. Any ideas?
Docker base image: nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
Cuda: 8.0
Driver Version: 375.39
NCCL: 2.1.15-1+cuda8.0
Has anyone encountered this failure in the output log: misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
Here is the log:
root@node4:/opt/nccl-tests# mpirun --allow-run-as-root -np 2 \
> -host host_6176,host_6177\
> -byslot -v -x NCCL_SHM_DISABLE=1 \
> -mca btl_openib_want_cuda_gdr 1 \
> -mca btl_tcp_if_include ib0 -x NCCL_DEBUG=INFO \
> build/all_reduce_perf -b 16M -e 32M -g 4 -c 0 -z 0 -d float
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:
Command line options:
Deprecated: --byslot, -byslot
Replacement: --map-by slot
Equivalent MCA parameter:
Deprecated: rmaps_base_byslot
Replacement: rmaps_base_mapping_policy=slot
The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'node4', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
nThread 1 nGpus 4 minBytes 16777216 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 0
[node4:01848] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[node4:01848] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
node4:1853:1853 [0] INFO NET : Using interface ib0:192.168.56.134<0>
node4:1853:1853 [0] INFO NET/IB : Using interface ib0 for sideband communication
node4:1853:1853 [0] INFO NET/IB: [0] mlx4_1:1/IB
node4:1853:1853 [0] INFO NET/IB: [1] mlx4_0:1/IB
node4:1853:1853 [0] INFO Using internal Network IB
node4:1854:1854 [4] INFO Using NCCL Low-latency algorithm for sizes below 16384
node4:1853:1880 [0] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1880 [0] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1883 [1] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1883 [1] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1880 [0] INFO NET/IB: Dev 0 Port 1 qpn 3970 mtu 5 LID 26
node4:1853:1883 [1] INFO NET/IB: Dev 0 Port 1 qpn 3971 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3974 mtu 5 LID 26
node4:1853:1880 [0] INFO CUDA Dev 0, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3979 mtu 5 LID 26
node4:1853:1883 [1] INFO CUDA Dev 1, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1853:1885 [2] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1885 [2] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1885 [2] INFO NET/IB: Dev 0 Port 1 qpn 3982 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3985 mtu 5 LID 26
node4:1853:1885 [2] INFO CUDA Dev 2, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1854:1884 [4] INFO NET/IB: Dev 0 Port 1 qpn 3988 mtu 5 LID 26
node4:1854:1886 [5] INFO NET/IB: Dev 0 Port 1 qpn 3989 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3992 mtu 5 LID 26
node4:1854:1884 [4] INFO CUDA Dev 4, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 3997 mtu 5 LID 26
node4:1854:1886 [5] INFO CUDA Dev 5, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1853:1887 [3] misc/nvmlwrap.cu:170 WARN nvmlDeviceSetCpuAffinity() failed: Unknown Error
node4:1853:1887 [3] init.cu:524 WARN Failed to set CPU affinity
node4:1853:1887 [3] INFO NET/IB: Dev 0 Port 1 qpn 4000 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 4003 mtu 5 LID 26
node4:1853:1887 [3] INFO CUDA Dev 3, IB Ports : mlx4_1/1(SOC) mlx4_0/1(PHB)
node4:1854:1888 [6] INFO NET/IB: Dev 0 Port 1 qpn 4006 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 4009 mtu 5 LID 26
node4:1854:1888 [6] INFO CUDA Dev 6, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1854:1889 [7] INFO NET/IB: Dev 0 Port 1 qpn 4012 mtu 5 LID 26
node4:1853:1862 [0] INFO NET/IB: Dev 0 Port 1 qpn 4015 mtu 5 LID 26
node4:1854:1889 [7] INFO CUDA Dev 7, IB Ports : mlx4_1/1(PHB) mlx4_0/1(SOC)
node4:1853:1880 [0] INFO Using 256 threads
node4:1853:1880 [0] INFO Min Comp Cap 5
node4:1853:1880 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
node4:1853:1880 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
node4:1853:1880 [0] INFO Ring 01 : 0 1 2 3 4 5 6 7
node4:1853:1883 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node4:1854:1888 [6] INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node4:1853:1880 [0] INFO 7 -> 0 via NET/IB/1
node4:1853:1885 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node4:1854:1884 [4] INFO 3 -> 4 via NET/IB/0
node4:1853:1880 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node4:1854:1886 [5] INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node4:1854:1884 [4] INFO Ring 00 : 4[4] -> 5[5] via P2P/direct pointer
node4:1854:1889 [7] INFO NET/IB: Dev 0 Port 1 qpn 4018 mtu 5 LID 26
node4:1853:1887 [3] INFO NET/IB: Dev 1 Port 1 qpn 1418 mtu 5 LID 31
node4:1854:1886 [5] INFO Ring 01 : 5[5] -> 6[6] via P2P/direct pointer
node4:1853:1883 [1] INFO Ring 01 : 1[1] -> 2[2] via P2P/direct pointer
node4:1853:1885 [2] INFO Ring 01 : 2[2] -> 3[3] via P2P/direct pointer
node4:1854:1888 [6] INFO Ring 01 : 6[6] -> 7[7] via P2P/direct pointer
node4:1853:1880 [0] INFO 7 -> 0 via NET/IB/1
node4:1854:1884 [4] INFO 3 -> 4 via NET/IB/0
node4:1853:1880 [0] INFO Ring 01 : 0[0] -> 1[1] via P2P/direct pointer
node4:1854:1884 [4] INFO Ring 01 : 4[4] -> 5[5] via P2P/direct pointer
node4:1853:1887 [3] INFO NET/IB: Dev 1 Port 1 qpn 1421 mtu 5 LID 31
node4:1854:1889 [7] INFO NET/IB: Dev 0 Port 1 qpn 4021 mtu 5 LID 26
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on node4 device 0 [0x04] GeForce GTX TITAN X
# Rank 1 on node4 device 1 [0x05] GeForce GTX TITAN X
# Rank 2 on node4 device 2 [0x08] GeForce GTX TITAN X
# Rank 3 on node4 device 3 [0x09] GeForce GTX TITAN X
# Rank 4 on node4 device 4 [0x85] GeForce GTX TITAN X
# Rank 5 on node4 device 5 [0x86] GeForce GTX TITAN X
# Rank 6 on node4 device 6 [0x89] GeForce GTX TITAN X
# Rank 7 on node4 device 7 [0x8a] GeForce GTX TITAN X
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
node4:1853:1853 [0] INFO Launch mode Group
16777216 4194304 float sum 23.582 0.71 1.25 N/A 22.709 0.74 1.29 N/A
17825792 4456448 float sum 23.434 0.76 1.33 N/A 25.079 0.71 1.24 N/A
18874368 4718592 float sum 23.700 0.80 1.39 N/A 23.739 0.80 1.39 N/A
19922944 4980736 float sum 26.029 0.77 1.34 N/A 27.943 0.71 1.25 N/A
20971520 5242880 float sum 26.255 0.80 1.40 N/A 26.225 0.80 1.40 N/A
22020096 5505024 float sum 27.565 0.80 1.40 N/A 27.557 0.80 1.40 N/A
23068672 5767168 float sum 31.003 0.74 1.30 N/A 32.167 0.72 1.26 N/A
24117248 6029312 float sum 33.432 0.72 1.26 N/A 33.461 0.72 1.26 N/A
25165824 6291456 float sum 35.148 0.72 1.25 N/A 34.220 0.74 1.29 N/A
26214400 6553600 float sum 36.412 0.72 1.26 N/A 35.262 0.74 1.30 N/A
27262976 6815744 float sum 37.865 0.72 1.26 N/A 38.155 0.71 1.25 N/A
28311552 7077888 float sum 39.746 0.71 1.25 N/A 39.151 0.72 1.27 N/A
29360128 7340032 float sum 40.784 0.72 1.26 N/A 40.785 0.72 1.26 N/A
30408704 7602176 float sum 39.909 0.76 1.33 N/A 42.975 0.71 1.24 N/A
31457280 7864320 float sum 44.535 0.71 1.24 N/A 42.986 0.73 1.28 N/A
32505856 8126464 float sum 45.921 0.71 1.24 N/A 45.149 0.72 1.26 N/A
33554432 8388608 float sum 47.618 0.70 1.23 N/A 47.459 0.71 1.24 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 1.29001
Hi, I encountered this error while compiling:
/usr/bin/ld: cannot find -lmpi
I have installed Open MPI 4.0.2 to /openmpi4.0 and set MPI_HOME to point there.
Thank you!
GPU driver: 440.82
CUDA: 11.0
NCCL: 2704
One node with 4 Tesla V100-SXM2 GPUs.
I've tried running the tests with one node and 4 GPUs with no problems.
However, my test with MPI on 40 processes failed. Here's the command I ran:
mpirun -np 40 ./build/all_gather_perf -b 8 -e 128M -f 2 -g 4
The error I got was:
nThread 1 nGpus 1 minBytes 0 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20089,1],20]
Exit code: 2
However, when I tried to run the same command with 4 or fewer processes, I had no problems. I'm wondering if this is expected behavior at all. I was assuming the 4 GPUs on one node could support more than 4 processes, especially since the README only mentions that running with MPI could potentially support multiple nodes with 4 GPUs each. I would really appreciate it if anyone could help me figure this out. Thanks!
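For what it's worth, my reading of the README (an assumption worth double-checking) is that the total number of NCCL ranks is the product of MPI processes, threads per process, and GPUs per thread, and that NCCL generally expects one rank per GPU within a communicator. Worked out for the failing command:

\text{ranks} = n_\text{procs} \times n_\text{threads} \times n_\text{gpus} = 40 \times 1 \times 4 = 160

which is far more than the 4 physical GPUs on the node; the usual multi-process pattern on a single 4-GPU node would be something like mpirun -np 4 ./build/all_gather_perf ... -g 1.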
I am running nccl-tests on my bare-metal machines. I have set up RoCE and GPUDirect RDMA. As far as I know, the allreduce implementation in NCCL is ring-based, so in theory the all_reduce_perf result should be in line with reduce_scatter_perf and all_gather_perf. However, in my case, when I set NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=1, the all_reduce_perf result is far worse than either reduce_scatter_perf or all_gather_perf. How can that be, if an allreduce operation is made up of a reduce-scatter and an all-gather?
NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=1, GPU0 on node 1 <-> GPU0 on node 2
all_reduce_perf result
Interestingly, when I turn off GPUDirect RDMA by setting NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=0, the all_reduce_perf result is even better than both reduce_scatter_perf and all_gather_perf.
NCCL_IB_DISABLE=0 and NCCL_IB_CUDA_SUPPORT=0, GPU0 on node 1 <-> GPU0 on node 2
all_reduce_perf result
On top of the initial question: isn't GPUDirect RDMA supposed to speed up the communication? Is it because the GPU PCIe topology in my environment doesn't support GPUDirect RDMA at all?
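For context, the model in the repo's PERFORMANCE.md (which defines the busbw column) already encodes the ring decomposition the question relies on. Under that model, for a buffer of size S on n ranks:

t_\text{allreduce} \approx t_\text{reducescatter} + t_\text{allgather}

\text{busbw}_\text{allreduce} = \frac{2(n-1)}{n} \cdot \frac{S}{t}, \qquad \text{busbw}_\text{reducescatter} = \text{busbw}_\text{allgather} = \frac{n-1}{n} \cdot \frac{S}{t}

so with identical hardware behavior, the busbw columns of the three tests should roughly match: allreduce takes about twice the time, but its formula carries the compensating factor of 2. A markedly lower allreduce busbw therefore really is anomalous under the pure ring model, as the question suggests.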
I'm unable to get the nccl-test to run successfully. It fails with an internal error.
Singularity> ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 108039 on ip-0AB85F07 device 0 [0x00] Tesla V100-PCIE-16GB
# Rank 1 Pid 108039 on ip-0AB85F07 device 1 [0x00] Tesla V100-PCIE-16GB
# Rank 2 Pid 108039 on ip-0AB85F07 device 2 [0x00] Tesla V100-PCIE-16GB
# Rank 3 Pid 108039 on ip-0AB85F07 device 3 [0x00] Tesla V100-PCIE-16GB
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
ip-0AB85F07: Test NCCL failure common.cu:775 'internal error'
Trying to collect an nvprof profile for nccl-tests errors out with 'an illegal memory access was encountered' in multi-GPU scenarios.
nvprof -o All_reduce.4GPU.nvvp ./all_reduce_perf -g 4
nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
==102728== NVPROF is profiling process 102728, command: ./all_reduce_perf -g 4
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on paiws7 device 0 [0x04] Tesla V100-SXM2-16GB
# Rank 1 on paiws7 device 1 [0x05] Tesla V100-SXM2-16GB
# Rank 2 on paiws7 device 2 [0x03] Tesla V100-SXM2-16GB
# Rank 3 on paiws7 device 3 [0x04] Tesla V100-SXM2-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
Cuda failure common.cu:510 'an illegal memory access was encountered'
==102728== Error: Internal profiling error 4054:1.
======== Error: CUDA profiling error.
However, it works fine for 1 GPU:
nvprof -f -o All_reduce.1GPU.nvvp ./all_reduce_perf -g 1
nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
==102808== NVPROF is profiling process 102808, command: ./all_reduce_perf -g 1
# NCCL Tests compiled with NCCL 2.1
# Using devices
# Rank 0 on paiws7 device 0 [0x04] Tesla V100-SXM2-16GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
33554432 8388608 float sum 0.090 372.75 0.00 0e+00 0.003 12402.53 0.00 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 0
==102808== Generated result file: /root/nccl-tests/build/All_reduce.1GPU.nvvp
Hi everyone!
I would like to know what exactly datacheck represents in nccl-tests. In the README it is written that the -c option is for checking the correctness of results. This is a bit unclear to me, since the tests are essentially for checking correctness anyway (so what is the point of datacheck?). And I am wondering: what happens if one sets -c to 0?
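To illustrate the general idea (a minimal sketch, not the tests' actual code): the timing columns only say how fast the collective ran, while the datacheck verifies it computed the right answer, by filling the inputs deterministically, deriving the expected output, and counting elements outside a tolerance. With -c 0 that comparison is skipped, which is useful for pure benchmarking, but silent data corruption would then go unnoticed.

#include <math.h>
#include <stdio.h>

/* Sketch of a datacheck: after an allreduce(sum) over deterministic
 * per-rank inputs, every output element has a known expected value;
 * count elements whose relative error exceeds a tolerance. The real
 * tests initialize buffers and compute expectations differently. */
static size_t count_out_of_bounds(const float *out, size_t n,
                                  float expected, double tol) {
  size_t bad = 0;
  for (size_t i = 0; i < n; i++) {
    double err = fabs((double)out[i] - (double)expected);
    if (err > tol * fmax(fabs((double)expected), 1.0)) bad++;
  }
  return bad;
}

int main(void) {
  /* Pretend 4 ranks each contributed 1.0f, so allreduce(sum) must yield 4
   * everywhere; the last element is deliberately corrupted. */
  float out[8] = {4, 4, 4, 4, 4, 4, 4, 3};
  printf("Out of bounds values : %zu\n",
         count_out_of_bounds(out, 8, 4.0f, 1e-6));
  return 0;
}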
Hi. I am trying to reproduce the allreduce performance in the official document, which reports a bandwidth of about 40 GB/s, by running the test in a similar environment. Here is the environment setting:
The NCCL version is 2.4.7 and the CUDA version is 10.0
And the command I used to run the test
mpirun -np 32 -H server1:8,server2:8,server3:8,server4:8 \
-x NCCL_DEBUG=INFO \
./build/all_reduce_perf -b 8 -e 512M -f 2
But the results are far from expectation: the busbw is about 6 GB/s. I think the only difference between my setup and a DGX-1 is that I have only 2 NVLinks within each server, which I don't think is the bottleneck.
Here is what I am confused about:
Thank you for your attention.
According to https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#allgather:
algbw = S/t
busbw = algbw * (n-1)/n
According to the code: https://github.com/NVIDIA/nccl-tests/blob/master/src/all_gather.cu#L50
algbw = S/t * (n-1)/n
busbw = algbw
Am I reading the code correctly? It seems like either the code or the document is incorrect.
I think the same issue may exist for reduce_scatter.
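Working through the busbw column with nothing but the two sets of formulas quoted above:

\text{busbw}_\text{doc} = \text{algbw}_\text{doc} \cdot \frac{n-1}{n} = \frac{S}{t} \cdot \frac{n-1}{n} = \text{algbw}_\text{code} = \text{busbw}_\text{code}

so the two conventions agree on busbw and differ only in what gets reported as algbw, by the factor (n-1)/n; the discrepancy is real but confined to that one column, plausibly reflecting whether S is taken as the per-rank or the total buffer size.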
Hello everyone!
After running nccl-tests linked against my NCCL fork, I got the result you can see in this photo:
Although the error column in the attached photo has meaningful numbers rather than 0, the final 'Out of bounds values' result is 0 and it seems all tests passed. I would like to know why that happens.
Hi,
When running all_reduce_perf, NCCL sets different CPU masks when the test runs through the Slurm scheduler as opposed to running directly with OpenMPI. Both runs use the same NCCL_TOPO_FILE variable.
# Running with slurm:
[0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.0/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
[0] NCCL INFO Setting affinity for GPU 0 to ff
# Running with OpenMPI
[0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/cuda-11.0/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
[0] NCCL INFO Setting affinity for GPU 0 to ffffff
Why would this happen? Are we missing any Slurm configuration or NCCL environment variable we need to set?
Thank you!
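One way to narrow this down is to print the CPU set each launcher actually hands the process, since NCCL only reports the mask it inherits. A standalone diagnostic sketch (assuming Linux; not part of nccl-tests):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Print this process's CPU affinity mask as hex, in the same spirit as
 * NCCL's "ff" / "ffffff" log lines, so the Slurm and mpirun launches can
 * be compared directly. */
int main(void) {
  cpu_set_t mask;
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
    perror("sched_getaffinity");
    return 1;
  }
  int started = 0;
  for (int cpu = CPU_SETSIZE - 4; cpu >= 0; cpu -= 4) {
    int nibble = 0;
    for (int b = 0; b < 4; b++)
      if (CPU_ISSET(cpu + b, &mask)) nibble |= 1 << b;
    if (nibble || started) { printf("%x", nibble); started = 1; }
  }
  if (!started) printf("0");
  printf("\n");
  return 0;
}

If the masks differ, one plausible cause (an assumption, not a diagnosis) is Slurm's CPU binding: with cgroup/affinity plugins or --cpu-bind, a job step only sees the CPUs Slurm allocated to it, which would shrink ffffff down to ff.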
I ran all_reduce_perf with OpenMPI and GPUDirect RDMA on an Azure Standard_NC24rs_v3.
My command is:
/usr/bin/mpirun --hostfile HOSTFILE --mca btl tcp,self --mca btl_tcp_if_exclude docker0,lo --bind-to none -N 1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=all -x LD_LIBRARY_PATH=$HOME/nccl/build/lib:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH nccl-tests/build/all_reduce_perf --minbytes 8 --maxbytes 256M --stepfactor 2 --ngpus 1 --check 0 --nthreads 1
And the error message is:
# nThread 1 nGpus 1 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
#
# Using devices
# Rank 0 Pid 82431 on pkb-b52c0368-0 device 0 [0x00] Tesla V100-PCIE-16GB
# Rank 1 Pid 82676 on pkb-b52c0368-1 device 0 [0x00] Tesla V100-PCIE-16GB
pkb-b52c0368-0:82431:82431 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.4<0> [1]eth1:fe80::215:5dff:fe33:ff27%eth1<0>
pkb-b52c0368-0:82431:82431 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pkb-b52c0368-0:82431:82431 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.4<0>
NCCL version 2.5.6+cuda10.0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.5<0> [1]eth1:fe80::215:5dff:fe33:ff70%eth1<0>
pkb-b52c0368-1:82676:82676 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pkb-b52c0368-1:82676:82676 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.5<0>
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 545505 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 582757 mtu 5 LID 57
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/df5e6558-97d1-4698-9de8-a5f603d2bef7/pci0002:00/0002:00:02.0 -> 0/0/0/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/63787d95-43b1-4da0-adf3-de0aba056eb9/pci0002:00/0002:00:02.0 -> 0/0/0/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
pkb-b52c0368-1:82676:82696 [0] NCCL INFO CPU/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/DE
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/47505500
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - GPU/5B6C00000 (1)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/63787D95
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - NIC/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + NET[12] - NET/0 (0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO ==========================================
pkb-b52c0368-1:82676:82696 [0] NCCL INFO GPU/5B6C00000 :GPU/5B6C00000 (0/5000/0) CPU/0 (4/12/2) NET/0 (5/12/2)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/0 :GPU/5B6C00000 (5/12/2) CPU/0 (5/12/2) NET/0 (0/5000/0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 24/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO CPU/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/DE
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/47505500
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - GPU/92FB00000 (0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 12/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/DF5E6558
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - NIC/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + NET[12] - NET/0 (0)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO ==========================================
pkb-b52c0368-0:82431:82448 [0] NCCL INFO GPU/92FB00000 :GPU/92FB00000 (0/5000/0) CPU/0 (4/12/2) NET/0 (5/12/2)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/0 :GPU/92FB00000 (5/12/2) CPU/0 (5/12/2) NET/0 (0/5000/0)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 24/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 12/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Channel 00/02 : 0 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Channel 01/02 : 0 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Threads per block : 512/640/512
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Latency/AlgBw | Tree/ LL | Tree/ LL128 | Tree/Simple | Ring/ LL | Ring/ LL128 | Ring/Simple |
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Broadcast | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 3.0| 6.1/ 11.2| 15.0/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Reduce | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 3.0| 6.1/ 11.2| 15.0/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO AllGather | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 6.0| 6.1/ 22.5| 15.0/ 24.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO ReduceScatter | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 6.0| 6.1/ 22.5| 15.0/ 24.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO AllReduce | 14.4/ 3.6| 19.4/ 8.4| 100.0/ 10.8| 5.4/ 3.0| 8.6/ 11.2| 21.6/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Threads per block : 512/640/512
pkb-b52c0368-1:82676:82696 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 742314 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 159581 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 92fb00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 5b6c00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 00 : 1[5b6c00000] -> 0[92fb00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 00 : 0[92fb00000] -> 1[5b6c00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 00 : 0[92fb00000] -> 1[5b6c00000] [send] via NET/IB/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 00 : 1[5b6c00000] -> 0[92fb00000] [send] via NET/IB/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 126471 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 424932 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 470331 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 490679 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 92fb00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 5b6c00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 01 : 1[5b6c00000] -> 0[92fb00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 01 : 0[92fb00000] -> 1[5b6c00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 01 : 0[92fb00000] -> 1[5b6c00000] [send] via NET/IB/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 01 : 1[5b6c00000] -> 0[92fb00000] [send] via NET/IB/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 179520 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 542420 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO comm 0x7f7724001aa0 rank 0 nranks 2 cudaDev 0 busId 2fb00000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO Launch mode Parallel
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-1:82676:82696 [0] NCCL INFO comm 0x7fb6cc001aa0 rank 1 nranks 2 cudaDev 0 busId b6c00000 - Init COMPLETE
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-0:82431:82449 [0] transport/net_ib.cc:774 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 32631, vendor err 129
pkb-b52c0368-0:82431:82449 [0] NCCL INFO include/net.h:28 -> 2
pkb-b52c0368-0:82431:82449 [0] NCCL INFO transport/net.cc:377 -> 2
pkb-b52c0368-0:82431:82449 [0] NCCL INFO transport.cc:166 -> 2 [Proxy Thread]
pkb-b52c0368-1:82676:82697 [0] transport/net_ib.cc:774 NCCL WARN NET/IB : Got completion with error 12, opcode 671036413, len 0, vendor err 129
pkb-b52c0368-1:82676:82697 [0] NCCL INFO include/net.h:28 -> 2
pkb-b52c0368-1:82676:82697 [0] NCCL INFO transport/net.cc:377 -> 2
pkb-b52c0368-1:82676:82697 [0] NCCL INFO transport.cc:166 -> 2 [Proxy Thread]
pkb-b52c0368-1: Test NCCL failure common.cu:345 'unhandled system error'
.. pkb-b52c0368-1: Test failure common.cu:393
.. pkb-b52c0368-1: Test failure common.cu:492
.. pkb-b52c0368-1: Test failure all_reduce.cu:103
.. pkb-b52c0368-1: Test failure common.cu:518
.. pkb-b52c0368-1: Test failure common.cu:839
pkb-b52c0368-0: Test NCCL failure common.cu:345 'unhandled system error'
.. pkb-b52c0368-0: Test failure common.cu:393
.. pkb-b52c0368-0: Test failure common.cu:492
.. pkb-b52c0368-0: Test failure all_reduce.cu:103
.. pkb-b52c0368-0: Test failure common.cu:518
.. pkb-b52c0368-0: Test failure common.cu:839
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44150,1],1]
Exit code: 3
--------------------------------------------------------------------------
Could you help me solve this problem?
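For reference, the 'error 12' in those NET/IB warnings is the raw ibv_wc_status value from the completion queue, and in libibverbs status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), which typically points at connectivity or fabric trouble between the two endpoints rather than at NCCL itself. A one-liner to confirm the mapping against your own headers (assumes the libibverbs development headers are installed):

#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
  /* Print the human-readable name for work-completion status 12. */
  printf("wc status 12 = %s\n", ibv_wc_status_str((enum ibv_wc_status)12));
  return 0;
}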
user in nccl-tests at ubuntu on master
➜ make
make -C src build
make[1]: Entering directory '/home/user/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
common.cu:335:38: error: missing binary operator before token "("
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,4,0)
^
Makefile:72: recipe for target '../build/common.o' failed
make[1]: *** [../build/common.o] Error 1
make[1]: Leaving directory '/home/user/nccl-tests/src'
Makefile:17: recipe for target 'src.build' failed
make: *** [src.build] Error 2
This error occurs when I build.
Env:
user in ~ at ubuntu
➜ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
user in ~ at ubuntu
➜ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
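The 'missing binary operator before token "("' error is what the preprocessor emits when NCCL_VERSION is not defined as a function-like macro, i.e. when common.cu is compiled against an nccl.h that predates it; pointing NCCL_HOME at headers from NCCL 2.4 or newer is presumably the real fix. A guarded form of the check (a sketch) would at least preprocess cleanly against old headers:

/* Sketch: only expand the function-like NCCL_VERSION(X,Y,Z) macro when the
 * installed nccl.h actually defines it; on older headers the whole block
 * is skipped instead of failing to preprocess. */
#if defined(NCCL_VERSION) && (NCCL_VERSION_CODE >= NCCL_VERSION(2,4,0))
/* ... code that relies on NCCL >= 2.4 APIs ... */
#endif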
When I compile and run nccl-tests with
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
it runs successfully, but an error occurred when running the program under cuda-gdb.
Rank 0 Pid 12414 on 51e6e7896958 device 0 [0x1a] Tesla V100-SXM3-32GB
Rank 1 Pid 12414 on 51e6e7896958 device 1 [0x1b] Tesla V100-SXM3-32GB
Rank 2 Pid 12414 on 51e6e7896958 device 2 [0x1c] Tesla V100-SXM3-32GB
Rank 3 Pid 12414 on 51e6e7896958 device 3 [0x1e] Tesla V100-SXM3-32GB
[New Thread 0x7fffe6d73700 (LWP 12424)]
[New Thread 0x7fffe6572700 (LWP 12425)]
[New Thread 0x7fffe5a55700 (LWP 12426)]
[New Thread 0x7fffe523c700 (LWP 12427)]
[New Thread 0x7fffe4a23700 (LWP 12428)]
warning: Cuda API error detected: cudaDeviceEnablePeerAccess returned (0x2c0)
warning: Cuda API error detected: cudaGetLastError returned (0x2c0)
warning: Cuda API error detected: cudaDeviceEnablePeerAccess returned (0x2c0)
Does anyone have a similar problem, or is there any solution?
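One observation that may help: 0x2c0 is 704 decimal, which in recent CUDA runtimes is cudaErrorPeerAccessAlreadyEnabled. Code commonly treats that return as benign and clears it, and cuda-gdb simply surfaces every non-success return from the API, so these warnings are not necessarily fatal. A minimal sketch of the usual pattern (an illustration of the convention, not nccl-tests' exact code):

#include <cuda_runtime.h>
#include <stdio.h>

/* Enabling peer access twice returns cudaErrorPeerAccessAlreadyEnabled
 * (704 == 0x2c0); treat it as benign and clear the sticky error so later
 * error checks don't trip over it. */
static void enable_peer(int dev, int peer) {
  cudaSetDevice(dev);
  cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
  if (err == cudaErrorPeerAccessAlreadyEnabled) {
    cudaGetLastError(); /* clear the benign error */
  } else if (err != cudaSuccess) {
    fprintf(stderr, "peer access %d->%d failed: %s\n", dev, peer,
            cudaGetErrorString(err));
  }
}

int main(void) {
  enable_peer(0, 1);
  enable_peer(0, 1); /* second call reproduces the 0x2c0 return */
  return 0;
}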