Comments (9)

sjeaugey commented on August 20, 2024

When running with MPI, you usually want to leave -g and -t unset (keep both equal to one) and just vary the -np argument of mpirun to match the total number of GPUs.

This matches what most frameworks do.
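For example, a minimal sketch of that pattern for 2 nodes with 8 GPUs each (the hostnames and :8 slot counts are placeholders; newer Open MPI accepts the host:slots syntax):

mpirun -np 16 -N 8 --host node1:8,node2:8 ./build/all_reduce_perf -b 8 -e 128M -f 2

Each of the 16 ranks then drives one GPU, since -g and -t default to 1.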

sjeaugey commented on August 20, 2024

There is no InfiniBand on AWS; you can add -mca btl ^openib to mpirun to suppress this warning.

Usually problems are due to MPI trying to use docker0 to communicate, so I usually add -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 as well to force the usage of the ENA card.
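Put together, a sketch of how these flags combine into one launch line (the IPs, the :8 slot counts, and the ens5 interface name are specific to this setup; adjust for yours):

mpirun --host 172.31.15.234:8,172.31.3.83:8 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 ./build/all_reduce_perf -b 8 -e 128M -f 2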

yaroslavvb commented on August 20, 2024

Thanks, that fixed it!
I'm now seeing the following from a 1280 MB allreduce on 2 nodes.

/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 4 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2 -g 4

# nThread 1 nGpus 4 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   8932 on ip-172-31-15-234 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   8932 on ip-172-31-15-234 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid   8932 on ip-172-31-15-234 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid   8932 on ip-172-31-15-234 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid   8933 on ip-172-31-15-234 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid   8933 on ip-172-31-15-234 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid   8933 on ip-172-31-15-234 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid   8933 on ip-172-31-15-234 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid   8299 on ip-172-31-3-83 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid   8299 on ip-172-31-3-83 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid   8299 on ip-172-31-3-83 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid   8299 on ip-172-31-3-83 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid   8300 on ip-172-31-3-83 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid   8300 on ip-172-31-3-83 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid   8300 on ip-172-31-3-83 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid   8300 on ip-172-31-3-83 device  7 [0x00] Tesla V100-SXM2-32GB
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1342177280     335544320   float     sum   396609    3.38    6.35  4e-07   396519    3.38    6.35  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.34596 
#

Some questions remain:

  • specifying -g 4 and mpirun -np 4 gives me 8 GPUs per node... I expected 4 GPUs per node (see the notes after this list)
  • sudo nload shows my utilization at about 25 Gbps, while iperf3 reaches 93 Gbps using 5 processes with 10 connections each. AWS throttles each connection to 10 Gbps, so I need at least 10 connections between these two machines. Is there some setting I can add to increase the number of connections? I've been running NCCL from this branch

Setting NCCL_MIN_NRINGS=16 didn't have an effect.
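Two notes that may help here (general nccl-tests / Open MPI behavior, not confirmed in this thread): first, nccl-tests creates nProcs × nThreads × nGpus ranks in total, so -np 4 with -g 4 yields 16 ranks, and with only 2 hosts listed plus -oversubscribe Open MPI packs 2 processes per node, i.e. 2 × 4 = 8 GPUs per node, which matches the rank listing above. Second, environment variables such as NCCL_MIN_NRINGS only reach the remote ranks if they are exported through mpirun with -x, e.g.:

mpirun -np 16 -N 8 -x NCCL_MIN_NRINGS=16 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 1280M -e 1280M -f 2 -g 1

NCCL_DEBUG=INFO also makes NCCL log how many rings it actually built.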

AddyLaddy commented on August 20, 2024

Have you tried this PR against NCCL: NVIDIA/nccl#223?

yaroslavvb commented on August 20, 2024

Yes, that's the version I'm testing above

AddyLaddy commented on August 20, 2024

Ok, let's continue this discussion over on the NCCL project.

AddyLaddy commented on August 20, 2024

Typically we have been running with 1 process per GPU, so you can use mpirun -n $GPUS -N 8 for an 8-GPU-per-node job. You would then need to use -g 1 on the all_reduce_perf command line.

yaroslavvb commented on August 20, 2024

Thanks, it works without the -g arg now. However, I'm still confused about why I need the --oversubscribe argument. Basically, I'm wondering if I'm missing some slots configuration somewhere.

i.e.

/usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2

There are not enough slots available in the system to satisfy the 16 slots
that were requested by the application:
  /home/ubuntu/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf

Either request fewer slots for your application, or make more slots available
for use.

When I add -oversubscribe, it works:

ubuntu@ip-172-31-34-74:~$ /usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2;

# nThread 1 nGpus 1 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  84278 on ip-172-31-34-74 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  84279 on ip-172-31-34-74 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  84280 on ip-172-31-34-74 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  84281 on ip-172-31-34-74 device  3 [0x00] Tesla V100-SXM2-32GB
...

yaroslavvb commented on August 20, 2024

Closing since the issue was solved.
The solution to the oversubscription error was to use a hostfile with slots entries (e.g. hosts.slots) instead of a host string for mpirun, as sketched below.
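A minimal sketch of such a hostfile (the slot counts are assumed to be 8 per node to match the GPUs; the IPs are the ones from this thread):

172.31.34.74 slots=8
172.31.45.216 slots=8

Launching with --hostfile then provides the 16 slots up front, so -oversubscribe is no longer needed:

/usr/local/mpi/bin/mpirun --hostfile hosts.slots -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2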
