Comments (9)

sjeaugey commented on August 20, 2024

When running with MPI, you usually want to leave -g and -t unset (keep both equal to one) and just vary the -np argument of mpirun to match the total number of GPUs.

This matches what most frameworks do.
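For example, a minimal sketch of that pattern for 2 nodes with 8 GPUs each (the hostnames and :8 slot counts are placeholders; newer Open MPI accepts the host:slots syntax):

mpirun -np 16 -N 8 --host node1:8,node2:8 ./build/all_reduce_perf -b 8 -e 128M -f 2

Each of the 16 ranks then drives one GPU, since -g and -t default to 1.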

sjeaugey commented on August 20, 2024

There is no InfiniBand on AWS; you can add -mca btl ^openib to mpirun to suppress this warning.

Usually problems are due to MPI trying to use docker0 to communicate, so I usually add -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 as well to force the usage of the ENA card.
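Put together, a sketch of how these flags combine into one launch line (the IPs, the :8 slot counts, and the ens5 interface name are specific to this setup; adjust for yours):

mpirun --host 172.31.15.234:8,172.31.3.83:8 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 ./build/all_reduce_perf -b 8 -e 128M -f 2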

yaroslavvb commented on August 20, 2024

Thanks, that fixed it!
I'm now seeing the following from a 1280 MB allreduce on 2 nodes.

/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 4 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2 -g 4

# nThread 1 nGpus 4 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   8932 on ip-172-31-15-234 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   8932 on ip-172-31-15-234 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid   8932 on ip-172-31-15-234 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid   8932 on ip-172-31-15-234 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid   8933 on ip-172-31-15-234 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid   8933 on ip-172-31-15-234 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid   8933 on ip-172-31-15-234 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid   8933 on ip-172-31-15-234 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid   8299 on ip-172-31-3-83 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid   8299 on ip-172-31-3-83 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid   8299 on ip-172-31-3-83 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid   8299 on ip-172-31-3-83 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid   8300 on ip-172-31-3-83 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid   8300 on ip-172-31-3-83 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid   8300 on ip-172-31-3-83 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid   8300 on ip-172-31-3-83 device  7 [0x00] Tesla V100-SXM2-32GB
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1342177280     335544320   float     sum   396609    3.38    6.35  4e-07   396519    3.38    6.35  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.34596 
#

Some questions remain:

  • specifying -g 4 and mpirun -np 4 gives me 8 GPUs per node... I expected 4 GPUs per node (see the notes after this list)
  • sudo nload shows my utilization at about 25 Gbps, while iperf3 reaches 93 Gbps using 5 processes with 10 connections each. AWS throttles each connection to 10 Gbps, so I need at least 10 connections between these two machines. Is there some setting I can add to increase the number of connections? I've been running NCCL from this branch

Setting NCCL_MIN_NRINGS=16 didn't have an effect.
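Two notes that may help here (general nccl-tests / Open MPI behavior, not confirmed in this thread): first, nccl-tests creates nProcs × nThreads × nGpus ranks in total, so -np 4 with -g 4 yields 16 ranks, and with only 2 hosts listed plus -oversubscribe Open MPI packs 2 processes per node, i.e. 2 × 4 = 8 GPUs per node, which matches the rank listing above. Second, environment variables such as NCCL_MIN_NRINGS only reach the remote ranks if they are exported through mpirun with -x, e.g.:

mpirun -np 16 -N 8 -x NCCL_MIN_NRINGS=16 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 1280M -e 1280M -f 2 -g 1

NCCL_DEBUG=INFO also makes NCCL log how many rings it actually built.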

AddyLaddy commented on August 20, 2024

Have you tried this PR against NCCL: NVIDIA/nccl#223?

yaroslavvb commented on August 20, 2024

Yes, that's the version I'm testing above

AddyLaddy commented on August 20, 2024

Ok, let's continue this discussion over on the NCCL project.

AddyLaddy commented on August 20, 2024

Typically we have been running with 1 process per GPU, so you can use mpirun -n $GPUS -N 8 for an 8-GPU-per-node job. You would then need to use -g 1 on the all_reduce_perf command line.

yaroslavvb commented on August 20, 2024

Thanks, it works without the -g arg now. However, I'm still confused about why I need the --oversubscribe argument. Basically, I'm wondering if I'm missing some slots configuration somewhere.

i.e.

/usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2

There are not enough slots available in the system to satisfy the 16 slots
that were requested by the application:
  /home/ubuntu/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf

Either request fewer slots for your application, or make more slots available
for use.

When I add -oversubscribe, it works:

ubuntu@ip-172-31-34-74:~$ /usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2;

# nThread 1 nGpus 1 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  84278 on ip-172-31-34-74 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  84279 on ip-172-31-34-74 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  84280 on ip-172-31-34-74 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  84281 on ip-172-31-34-74 device  3 [0x00] Tesla V100-SXM2-32GB
...

yaroslavvb commented on August 20, 2024

Closing since the issue was solved.
The solution to the oversubscription error was to use a hostfile with slots entries (e.g. hosts.slots) instead of a host string for mpirun, as sketched below.
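A minimal sketch of such a hostfile (the slot counts are assumed to be 8 per node to match the GPUs; the IPs are the ones from this thread):

172.31.34.74 slots=8
172.31.45.216 slots=8

Launching with --hostfile then provides the 16 slots up front, so -oversubscribe is no longer needed:

/usr/local/mpi/bin/mpirun --hostfile hosts.slots -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2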
