Comments (5)

rashikakheria commented on July 28, 2024

@yaroslavvb Thanks for summarizing your efforts. Have you tried measuring busbw with nccl-tests while scaling the number of nodes? If so, could you please post some numbers here, along with the command line you used to invoke the tests?

from aws-ofi-nccl.

yaroslavvb commented on July 28, 2024

The links to the nccl-tests logs and commands are summarized on this page:
NVIDIA/nccl#235 (comment)

Note that nccl-tests reports busbw incorrectly for 2 machines, so I rely on algbw instead and use the formula in the doc above to convert between busbw and algbw.
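As a rough sketch of that conversion, assuming the doc uses the nccl-tests convention for all-reduce, where busbw = algbw * 2*(n-1)/n for n ranks (the function names below are illustrative and not part of nccl-tests):

```python
def allreduce_busbw(algbw_gbps: float, nranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth.

    Assumes the nccl-tests all-reduce correction factor 2*(n-1)/n.
    """
    if nranks < 2:
        raise ValueError("need at least 2 ranks")
    return algbw_gbps * 2 * (nranks - 1) / nranks


def allreduce_algbw(busbw_gbps: float, nranks: int) -> float:
    """Inverse conversion: bus bandwidth back to algorithm bandwidth."""
    if nranks < 2:
        raise ValueError("need at least 2 ranks")
    return busbw_gbps * nranks / (2 * (nranks - 1))


if __name__ == "__main__":
    # 32 machines x 8 GPUs = 256 ranks, algbw taken from the largest message size
    print(f"{allreduce_busbw(3.50, 256):.2f} GB/s")
```

For 256 ranks the factor is about 1.99, so busbw is almost exactly twice algbw; for 2 ranks the factor is 1, so the two values should coincide.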

For instance, copying from the 32-machine run log page:

Running command:
/opt/amazon/efa/bin/mpirun -n 256 -N 8 --hostfile hosts.slots \
  -x FI_EFA_MR_CACHE_ENABLE="1" \
  -x FI_OFI_RXR_INLINE_MR_ENABLE="1" \
  -x FI_OFI_RXR_RX_COPY_OOO="1" \
  -x FI_OFI_RXR_RX_COPY_UNEXP="1" \
  -x FI_PROVIDER="efa" \
  -x LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib64" \
  -x NCCL_DEBUG="INFO" \
  -x NCCL_TREE_THRESHOLD="42949672960" \
  --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
  -mca orte_base_help_aggregate 0 \
  --bind-to none \
  $HOME/packages/nccl-tests/build/all_reduce_perf -b 8 -e 4096M -f 2 -g 1 -c 1 -n 100

#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       


ip-172-31-43-246:32158:32255 [6] NCCL INFO comm 0x7f7950002590 rank 86 nranks 256 cudaDev 6 nvmlDev 6 - Init COMPLETE
           8             2   float     sum    251.9    0.00    0.00  1e-06    247.8    0.00    0.00  1e-06
          16             4   float     sum    246.8    0.00    0.00  5e-07    246.9    0.00    0.00  5e-07
          32             8   float     sum    247.3    0.00    0.00  1e-06    249.0    0.00    0.00  1e-06
          64            16   float     sum    250.0    0.00    0.00  1e-06    249.3    0.00    0.00  1e-06
         128            32   float     sum    251.6    0.00    0.00  1e-06    250.6    0.00    0.00  1e-06
         256            64   float     sum    256.5    0.00    0.00  1e-06    255.7    0.00    0.00  1e-06
         512           128   float     sum    265.4    0.00    0.00  1e-06    264.0    0.00    0.00  1e-06
        1024           256   float     sum    273.1    0.00    0.01  1e-06    273.6    0.00    0.01  1e-06
        2048           512   float     sum    291.2    0.01    0.01  1e-06    291.5    0.01    0.01  1e-06
        4096          1024   float     sum    355.4    0.01    0.02  1e-06    353.9    0.01    0.02  1e-06
        8192          2048   float     sum    896.9    0.01    0.02  1e-06    828.2    0.01    0.02  1e-06
       16384          4096   float     sum   1222.0    0.01    0.03  1e-06   1065.7    0.02    0.03  1e-06
       32768          8192   float     sum   1197.6    0.03    0.05  1e-06   1092.5    0.03    0.06  1e-06
       65536         16384   float     sum   1308.3    0.05    0.10  1e-06   1159.2    0.06    0.11  1e-06
      131072         32768   float     sum   1388.7    0.09    0.19  2e-06   1190.2    0.11    0.22  2e-06
      262144         65536   float     sum   1357.0    0.19    0.38  2e-06   1356.2    0.19    0.39  2e-06
      524288        131072   float     sum   1620.1    0.32    0.64  2e-06   1612.9    0.33    0.65  2e-06
     1048576        262144   float     sum   2157.4    0.49    0.97  2e-06   2161.8    0.49    0.97  2e-06
     2097152        524288   float     sum   2146.4    0.98    1.95  2e-06   2147.8    0.98    1.95  2e-06
     4194304       1048576   float     sum   2928.8    1.43    2.85  2e-06   2937.5    1.43    2.84  2e-06
     8388608       2097152   float     sum   4641.1    1.81    3.60  2e-06   4636.5    1.81    3.60  2e-06
    16777216       4194304   float     sum   7001.4    2.40    4.77  2e-06   6991.1    2.40    4.78  2e-06
    33554432       8388608   float     sum    11609    2.89    5.76  2e-06    11626    2.89    5.75  2e-06
    67108864      16777216   float     sum    21008    3.19    6.36  2e-06    21036    3.19    6.36  2e-06
   134217728      33554432   float     sum    39490    3.40    6.77  2e-06    39524    3.40    6.77  2e-06
   268435456      67108864   float     sum    79617    3.37    6.72  2e-06    79562    3.37    6.72  2e-06
   536870912     134217728   float     sum   166922    3.22    6.41  2e-06   160034    3.35    6.68  2e-06
  1073741824     268435456   float     sum   311080    3.45    6.88  2e-06   311824    3.44    6.86  2e-06
  2147483648     536870912   float     sum   633049    3.39    6.76  2e-06   671957    3.20    6.37  2e-06
  4294967296    1073741824   float     sum  1226477    3.50    6.98  2e-06  1215340    3.53    7.04  2e-06
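
The last row can be sanity-checked by hand. A minimal sketch, assuming algbw is the message size divided by the elapsed time (in units of 10^9 bytes/s) and that busbw follows the all-reduce factor 2*(n-1)/n; the helper names are hypothetical:

```python
def algbw_gb_per_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes) from size and time."""
    # bytes per microsecond equals 1e6 bytes/s, so divide by 1e3 to get GB/s
    return size_bytes / time_us / 1e3


def allreduce_busbw(algbw: float, nranks: int) -> float:
    """Bus bandwidth for all-reduce, assuming the 2*(n-1)/n correction."""
    return algbw * 2 * (nranks - 1) / nranks


if __name__ == "__main__":
    # Out-of-place values from the 4294967296-byte row, 256 ranks
    algbw = algbw_gb_per_s(4294967296, 1226477)
    busbw = allreduce_busbw(algbw, 256)
    print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Rounded to two decimals this gives 3.50 and 6.98 GB/s, matching the out-of-place columns of that row.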

rashikakheria commented on July 28, 2024

Please re-open the issue if this still happens.

yaroslavvb commented on July 28, 2024

Is this closed because the numbers reported above are what is expected on EFA?

rashikakheria commented on July 28, 2024

We expect a busbw of ~10 GB/s on AWS EC2 when instances are launched within the same placement group.
